audio_pipeline
1
.gitignore
vendored
Normal file
@@ -0,0 +1 @@
.env
661
README.md
Normal file
@@ -0,0 +1,661 @@
# Audio Pipeline

A pipeline for processing call recordings: from the moment a file appears on disk to classification, prompt-based analysis, and publication of the final result in the `final` queue.

The system is built on a **file-based trigger** + **RabbitMQ** + **PostgreSQL** + external APIs (Nexara STT, Yandex LLM).

---
## Contents

1. [Overview](#overview)
2. [Project structure](#project-structure)
3. [Infrastructure](#infrastructure)
4. [Processing stages](#processing-stages)
5. [File storage](#file-storage)
6. [RabbitMQ](#rabbitmq)
7. [PostgreSQL](#postgresql)
8. [Workers](#workers)
9. [Message formats](#message-formats)
10. [Analysis prompts](#analysis-prompts)
11. [Aggregation and the final queue](#aggregation-and-the-final-queue)
12. [Configuration (.env)](#configuration-env)
13. [Running and managing](#running-and-managing)
14. [Logging](#logging)
15. [Error handling](#error-handling)
16. [Switching prompts to an API](#switching-prompts-to-an-api)
17. [What is not implemented](#what-is-not-implemented)

---
## Overview

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  incoming/  │────▶│   watcher    │────▶│  RabbitMQ   │
│   (files)   │     │  (scanner)   │     │  audio.new  │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                    ┌──────────────┐            ▼
                    │ processing/  │     ┌─────────────┐
                    │   (audio)    │◀────│ transcribe  │── Nexara API (STT)
                    └──────────────┘     └──────┬──────┘
                                                │
                                    fanout transcription_done
                                    ┌───────────┴───────────┐
                                    ▼                       ▼
                             ┌─────────────┐         ┌─────────────┐
                             │   analyse   │         │   tagging   │
                             │ Yandex LLM  │         │ Yandex LLM  │
                             └──────┬──────┘         └──────┬──────┘
                                    │                       │
                                    └───────────┬───────────┘
                                                ▼
                                         ┌─────────────┐
                                         │ PostgreSQL  │
                                         │   results   │
                                         └──────┬──────┘
                                                │
                              both ready ───────┤
                                                ▼
                                         ┌─────────────┐
                                         │    final    │ (RabbitMQ)
                                         │   + file    │
                                         │  deletion   │
                                         └─────────────┘
```

**Key principles:**

- Each stage is a separate Docker service (worker).
- Stages communicate through RabbitMQ (asynchronously).
- Intermediate and final LLM results are stored in PostgreSQL.
- `analyse` and `tagging` run **in parallel** after transcription.
- The worker that finishes **last** publishes to `final`.

---
## Project structure

```
audio-pipeline/
├── docker-compose.yml          # infrastructure and all services
├── .env                        # configuration (not in git)
├── db/
│   └── init.sql                # PostgreSQL schema
├── storage/                    # audio files on the host (mounted into containers)
│   ├── incoming/               # new files are dropped here
│   ├── processing/             # files being processed
│   └── failed/                 # files moved here on critical watcher errors
├── watcher/                    # file system scanner
│   ├── cmd/watcher/main.go
│   └── internal/
│       ├── config/
│       ├── scanner/
│       └── publisher/
└── workers/
    ├── transcribe/             # STT + prompt loading + fanout
    │   ├── cmd/transcribe/main.go
    │   ├── configs/prompts.json
    │   └── internal/
    │       ├── consumer/
    │       ├── nexara/
    │       ├── prompts/
    │       └── models/
    ├── analyse/                # prompt-based analysis (Yandex LLM)
    │   └── cmd/analyse/main.go
    └── tagging/                # dialogue classification (Yandex LLM)
        └── cmd/tagging/main.go
```

---
## Infrastructure

| Service      | Container  | Ports       | Purpose                   |
|--------------|------------|-------------|---------------------------|
| `rabbit`     | rabbit     | 5672, 15672 | RabbitMQ + Management UI  |
| `postgres`   | postgres   | 5432        | Result storage            |
| `watcher`    | watcher    | -           | Monitoring of `incoming/` |
| `transcribe` | transcribe | -           | Transcription + fanout    |
| `analyse`    | analyse    | -           | Prompt-based analysis     |
| `tagging`    | tagging    | -           | Dialogue classification   |

**RabbitMQ UI:** http://localhost:15672 (username and password from `.env`)

---
## Processing stages

### 1. Watcher: file discovery

**Trigger:** an audio file appears in `{STORAGE_ROOT}/incoming/`.

**Algorithm:**

1. Scans `incoming/` every `POLL_INTERVAL` (5 s by default).
2. Skips hidden files (names starting with `.`) and temporary files (`.tmp`).
3. Checks the extension: `.mp3`, `.wav`, `.m4a`, `.ogg`, `.flac`, `.webm`.
4. Waits for the file size to stabilize (`STABLE_WINDOW` / `STABLE_CHECKS`), which guards against picking up a partially uploaded file.
5. Generates a ULID `task_id`.
6. Atomically renames `incoming/name.wav` to `processing/{task_id}.wav`.
7. Publishes the task to RabbitMQ (`audio_pipeline` / `audio.new`).

**On publish failure:** the file is moved back to `incoming/`. If the rollback fails, it is moved to `failed/`.

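
Steps 2 to 6 above could look roughly like the following Go sketch. It is not the watcher's actual `scanner` package: the names `isCandidate`, `isStable`, and `claim` are hypothetical, the extension check is assumed to be case-insensitive, and the task ID is passed in rather than generated.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// allowedExt lists the extensions from step 3.
var allowedExt = map[string]bool{
	".mp3": true, ".wav": true, ".m4a": true,
	".ogg": true, ".flac": true, ".webm": true,
}

// isCandidate applies steps 2 and 3: skip hidden and .tmp files,
// accept only known audio extensions (case-insensitive by assumption).
func isCandidate(name string) bool {
	if strings.HasPrefix(name, ".") || strings.HasSuffix(name, ".tmp") {
		return false
	}
	return allowedExt[strings.ToLower(filepath.Ext(name))]
}

// isStable reports whether the file size stays unchanged across
// `checks` consecutive observations taken `window` apart (step 4).
func isStable(path string, window time.Duration, checks int) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	last := fi.Size()
	for i := 0; i < checks; i++ {
		time.Sleep(window)
		if fi, err = os.Stat(path); err != nil {
			return false, err
		}
		if fi.Size() != last {
			return false, nil // still being written
		}
	}
	return true, nil
}

// claim moves the file into processing/ under the task ID (step 6).
// os.Rename is atomic only when both paths are on the same filesystem.
func claim(root, name, taskID string) (string, error) {
	dst := filepath.Join(root, "processing", taskID+filepath.Ext(name))
	return dst, os.Rename(filepath.Join(root, "incoming", name), dst)
}

func main() {
	fmt.Println(isCandidate("call.WAV"), isCandidate(".hidden.wav"), isCandidate("upload.wav.tmp"))
}
```

Because the claim is a rename rather than a copy, a second scan can no longer see the file in `incoming/` once the first one has claimed it.
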
### 2. Transcribe: transcription

**Input:** the `transcribe.tasks` queue.

**Algorithm:**

1. Reads the audio file at the `file_path` given in the message.
2. Sends it to the **Nexara API** (speech-to-text).
3. Loads the prompts (`prompts.json` or the HTTP API).
4. Builds a `TranscriptionResult` and publishes it to the `transcription_done` fanout exchange.
5. The message is delivered to **both** the `analyse` and `tagging` queues.

### 3. Tagging: dialogue classification

**Input:** the `tagging` queue.

**Algorithm:**

1. Takes the transcription from the message.
2. Sends **one** request to **Yandex LLM** with the classification prompt (L1/L2/L3, risk_level, and so on).
3. Saves the result to `results.tagging` (PostgreSQL).
4. If `results.analysis` is already filled, publishes to `final` and deletes the file.

### 4. Analyse: prompt-based analysis

**Input:** the `analyse` queue.

**Algorithm:**

1. Takes the transcription and the `prompts` array from the message.
2. Sends a separate request to **Yandex LLM** for **each** prompt (currently 3 prompts: behavioral, client_data, cargo_data).
3. Saves the result to `results.analysis` and the metadata to `results.metadata`.
4. If `results.tagging` is already filled, publishes to `final` and deletes the file.

### 5. Final: final message

**Output:** the `final` queue (default exchange, routing key = `final`).

The **full JSON** from the `results` table is published: transcription, analysis, tagging, metadata, status, timestamps.

After a successful publish, the audio file is deleted from `processing/`.

> **A consumer for the `final` queue is not implemented yet**: messages accumulate in the queue until a handler is connected.

---
## File storage

The path on the host is set by `STORAGE_ROOT` (`./storage` by default).

| Directory     | Purpose                                          |
|---------------|--------------------------------------------------|
| `incoming/`   | New files waiting to be processed                |
| `processing/` | Files in progress (after the watcher claims them) |
| `failed/`     | Files moved here on unrecoverable watcher errors |

**File lifecycle:**

```
incoming/recording.wav
  → processing/01KTN....wav   (watcher)
  → stays until completion    (transcribe reads it by path)
  → deleted                   (after publishing to final)
```

Inside the containers the path is `/data/storage/...` (volume mount).

---
## RabbitMQ

### Topology

```
exchange: audio_pipeline (direct)
  └── queue: transcribe.tasks  ← routing key: audio.new

exchange: transcription_done (fanout)
  ├── queue: analyse
  └── queue: tagging

queue: final (default exchange, no binding)
```
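
To make the difference between the three routing modes concrete, here is a toy model in Go. It does not talk to RabbitMQ: `route`, `bindings`, and `exchangeKind` are illustrative names that only mimic how a direct exchange, a fanout exchange, and the default exchange select queues.

```go
package main

import "fmt"

// binding ties a queue to an exchange; Key is ignored by fanout exchanges.
type binding struct {
	Exchange, Queue, Key string
}

// exchangeKind maps exchange names to their types, as in the topology above.
var exchangeKind = map[string]string{
	"audio_pipeline":     "direct",
	"transcription_done": "fanout",
}

var bindings = []binding{
	{"audio_pipeline", "transcribe.tasks", "audio.new"},
	{"transcription_done", "analyse", ""},
	{"transcription_done", "tagging", ""},
}

// route returns the queues that would receive a message published to
// the given exchange with the given routing key.
func route(exchange, key string) []string {
	var out []string
	if exchange == "" {
		// Default exchange: the routing key is the queue name.
		return append(out, key)
	}
	for _, b := range bindings {
		if b.Exchange != exchange {
			continue
		}
		// Fanout copies to every bound queue; direct needs an exact key match.
		if exchangeKind[exchange] == "fanout" || b.Key == key {
			out = append(out, b.Queue)
		}
	}
	return out
}

func main() {
	fmt.Println(route("audio_pipeline", "audio.new"))
	fmt.Println(route("transcription_done", ""))
	fmt.Println(route("", "final"))
}
```

The fanout case is what lets `analyse` and `tagging` each receive their own copy of the same transcription message.
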

### Queues

| Queue              | Producer        | Consumer   | Description          |
|--------------------|-----------------|------------|----------------------|
| `transcribe.tasks` | watcher         | transcribe | New audio tasks      |
| `analyse`          | transcribe      | analyse    | Transcription result |
| `tagging`          | transcribe      | tagging    | Transcription result |
| `final`            | analyse/tagging | -          | Final result         |

### Dead Letter

The `transcribe.tasks` queue is configured with a DLX (`dlx`). Messages rejected with `Nack(requeue=false)` are routed to the dead-letter exchange. A dedicated DLQ queue may require additional setup.

---
## PostgreSQL

### The `results` table

```sql
CREATE TABLE results (
    task_id       TEXT PRIMARY KEY,
    filename      TEXT,
    transcription TEXT,
    analysis      JSONB,                  -- analyse result (Yandex LLM)
    tagging       JSONB,                  -- tagging result (Yandex LLM)
    metadata      JSONB,                  -- file_path, segments, prompts, language, transcribed_at
    status        TEXT DEFAULT 'pending', -- pending | done
    created_at    TIMESTAMPTZ,
    updated_at    TIMESTAMPTZ
);
```

### Who writes what

| Field           | Written by | When                               |
|-----------------|------------|------------------------------------|
| `tagging`       | tagging    | After classification               |
| `analysis`      | analyse    | After analysis                     |
| `filename`      | both       | When saving their own part         |
| `transcription` | analyse    | When saving analysis               |
| `metadata`      | analyse    | file_path, segments, prompts…      |
| `status`        | both       | `done` once both fields are filled |

### Viewing data

```bash
docker exec -it postgres psql -U pipeline -d pipeline
```

```sql
SELECT task_id, filename, status, updated_at FROM results ORDER BY updated_at DESC;

SELECT task_id, jsonb_pretty(analysis), jsonb_pretty(tagging)
FROM results WHERE task_id = 'YOUR_TASK_ID';
```

---
## Workers

### Watcher

- **Language:** Go
- **Dependencies:** RabbitMQ
- **Volume:** `${STORAGE_ROOT}:/data/storage`
- **Does not use:** Postgres, Yandex, Nexara

### Transcribe

- **Language:** Go
- **API:** Nexara (STT)
- **Prompts:** static file or HTTP
- **Volume:** storage + `configs/prompts.json`

### Tagging

- **Language:** Go
- **API:** Yandex Cloud LLM (`YANDEX_API_KEY`, `YANDEX_MODEL`)
- **At startup:** sends a test request to the Yandex API (connectivity check)
- **Volume:** storage (for deleting files) + `.env` (token hot-reload)

### Analyse

- **Language:** Go
- **API:** Yandex Cloud LLM
- **LLM calls per task:** one per prompt (currently 3)
- **No retries:** one call per prompt; on error the message is discarded
- **Volume:** storage + `.env`

---
## Message formats

### 1. Watcher → Transcribe

**Exchange:** `audio_pipeline` (direct)
**Routing key:** `audio.new`
**Queue:** `transcribe.tasks`

```json
{
  "task_id": "01KTNVA3EKW8CY2QDAYKKVZ40W",
  "file_path": "/data/storage/processing/01KTNVA3EKW8CY2QDAYKKVZ40W.wav",
  "filename": "1.wav",
  "size": 1234567,
  "created_at": 1780914907
}
```
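
On the consumer side, the message above could be decoded along these lines. This is a sketch rather than the project's code: the struct fields follow how `publisher.AudioTask` is filled in `watcher/cmd/watcher/main.go`, while the JSON tags, the `CreatedAt` field, and the `decodeTask` helper with its validation rule are assumptions inferred from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AudioTask mirrors the JSON body of the Watcher → Transcribe message.
type AudioTask struct {
	TaskID    string `json:"task_id"`
	FilePath  string `json:"file_path"`
	Filename  string `json:"filename"`
	Size      int64  `json:"size"`
	CreatedAt int64  `json:"created_at"` // Unix seconds, per the example value
}

// decodeTask parses a message body and rejects tasks that lack
// the two fields transcribe cannot work without.
func decodeTask(body []byte) (AudioTask, error) {
	var task AudioTask
	if err := json.Unmarshal(body, &task); err != nil {
		return task, fmt.Errorf("malformed task: %w", err)
	}
	if task.TaskID == "" || task.FilePath == "" {
		return task, fmt.Errorf("task_id and file_path are required")
	}
	return task, nil
}

func main() {
	body := []byte(`{"task_id":"01KTNVA3EKW8CY2QDAYKKVZ40W","file_path":"/data/storage/processing/01KTNVA3EKW8CY2QDAYKKVZ40W.wav","filename":"1.wav","size":1234567,"created_at":1780914907}`)
	task, err := decodeTask(body)
	fmt.Println(task.Filename, task.Size, err)
}
```

A body that fails this decoding corresponds to the "malformed JSON" row of the error-handling table: it is rejected without requeueing.
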

### 2. Transcribe → Analyse + Tagging

**Exchange:** `transcription_done` (fanout)
**Queues:** `analyse`, `tagging` (identical body)

```json
{
  "task_id": "01KTNVA3EKW8CY2QDAYKKVZ40W",
  "filename": "1.wav",
  "file_path": "/data/storage/processing/01KTNVA3EKW8CY2QDAYKKVZ40W.wav",
  "transcription": "full transcription text...",
  "language": "ru",
  "segments": [
    {"start": 0.0, "end": 27.96, "text": "..."}
  ],
  "prompts": [
    {
      "id": 1,
      "id_section": 1,
      "name": "behavioral",
      "prompt": "Ты — строгий классификатор звонков...",
      "dt_create": "2026-06-09T09:00:00"
    }
  ],
  "transcribed_at": 1780914907
}
```

### 3. Final: final message

**Queue:** `final`

```json
{
  "task_id": "01KTNVA3EKW8CY2QDAYKKVZ40W",
  "filename": "1.wav",
  "transcription": "full text...",
  "analysis": {
    "behavioral": {
      "greeting": {"value": true, "evidence": "Здравствуйте", "confidence": 0.95},
      "initiative": {"value": true, "evidence": "...", "confidence": 0.8},
      "questions_check": {"value": false, "evidence": null, "confidence": 0.0},
      "closing": {"value": true, "evidence": "всего доброго", "confidence": 0.9}
    },
    "client_data": { "...": "..." },
    "cargo_data": { "...": "..." }
  },
  "tagging": {
    "L1": "tracking",
    "L2": "location_request",
    "L3": "delay",
    "risk_level": "medium",
    "has_action_items": true,
    "has_deadline": false
  },
  "file_path": "/data/storage/processing/01KTNVA3EKW8CY2QDAYKKVZ40W.wav",
  "language": "ru",
  "segments": [...],
  "prompts": [...],
  "transcribed_at": 1780914907,
  "status": "done",
  "created_at": "2026-06-09T09:00:00Z",
  "updated_at": "2026-06-09T09:05:00Z"
}
```

---
## Analysis prompts

### Source

File: `workers/transcribe/configs/prompts.json`
Or an HTTP API (see [switching prompts to an API](#switching-prompts-to-an-api)).

Transcribe loads the prompts and **embeds them in the message** for analyse. Analyse does not know where they came from.

### Current prompts (3)

| name          | Purpose                                                    |
|---------------|------------------------------------------------------------|
| `behavioral`  | Greeting, initiative, questions, closing                   |
| `client_data` | First-time caller, city, client type, contacts, lead source |
| `cargo_data`  | Nature of the cargo, parameters, value                     |

Each prompt is the **full instruction text**, including the expected JSON response format. Analyse sends:

```
<prompt text from the config>

=== ТРАНСКРИПЦИЯ ===
"""
<call text>
"""
```

The LLM response is stored as a whole in `analysis`, under the key given by the prompt's `name`.

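
The request layout and the keying by `name` can be sketched in Go as follows. The names `buildLLMInput` and `runPrompts` are hypothetical, the LLM client is replaced by a plain function argument, and the JSON validity check is an added assumption rather than documented behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Prompt holds the two fields of a prompt entry that analyse needs here.
type Prompt struct {
	Name   string `json:"name"`
	Prompt string `json:"prompt"`
}

// buildLLMInput appends the transcription to the prompt text using
// the delimiter layout shown above.
func buildLLMInput(prompt, transcription string) string {
	return fmt.Sprintf("%s\n\n=== ТРАНСКРИПЦИЯ ===\n\"\"\"\n%s\n\"\"\"", prompt, transcription)
}

// runPrompts makes exactly one call per prompt and keys each raw
// answer by prompt name. The first error aborts the whole task,
// matching the no-retry policy.
func runPrompts(prompts []Prompt, transcription string, call func(string) (string, error)) (map[string]json.RawMessage, error) {
	out := make(map[string]json.RawMessage, len(prompts))
	for _, p := range prompts {
		answer, err := call(buildLLMInput(p.Prompt, transcription))
		if err != nil {
			return nil, fmt.Errorf("prompt %q: %w", p.Name, err)
		}
		if !json.Valid([]byte(answer)) {
			return nil, fmt.Errorf("prompt %q: answer is not valid JSON", p.Name)
		}
		out[p.Name] = json.RawMessage(answer)
	}
	return out, nil
}

func main() {
	// fake stands in for the Yandex LLM client.
	fake := func(string) (string, error) { return `{"ok":true}`, nil }
	res, err := runPrompts([]Prompt{{Name: "behavioral", Prompt: "..."}}, "hello", fake)
	b, _ := json.Marshal(res)
	fmt.Println(string(b), err)
}
```

Keeping the answers as `json.RawMessage` means the worker never needs to know the schema of an individual prompt's response.
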

### Tagging: a separate prompt

Tagging **does not use** `prompts.json`. It has a built-in prompt for classifying logistics dialogues (L1/L2/L3, risk_level, has_action_items, has_deadline).

---
## Aggregation and the final queue

Both workers (`analyse`, `tagging`) write to the same `results` row, keyed by `task_id`.

**Atomic readiness check** (SQL):

```sql
UPDATE results SET ...
RETURNING (analysis IS NOT NULL AND tagging IS NOT NULL)
```

- If `RETURNING = false`, the other worker has not finished yet, so nothing is published.
- If `RETURNING = true`, this worker:
  1. Reads the full row from the database
  2. Publishes the JSON to the `final` queue
  3. Deletes the file from `processing/` (only paths containing `/processing/`)
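
A minimal Go sketch of this hand-off, under stated assumptions: the README elides the `SET` clause, so the upsert in `saveAnalysisSQL` is a guess at one way to write it, and `saveAnalysis`, `finalize`, and `safeToDelete` are hypothetical helpers, not the workers' real functions. Only `database/sql` from the standard library is used; a PostgreSQL driver would have to be registered separately.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"strings"
)

// saveAnalysisSQL writes this worker's part and returns, in the same
// statement, whether both parts are now present in the row.
const saveAnalysisSQL = `
INSERT INTO results (task_id, filename, analysis)
VALUES ($1, $2, $3::jsonb)
ON CONFLICT (task_id) DO UPDATE
SET analysis = EXCLUDED.analysis, updated_at = now()
RETURNING (analysis IS NOT NULL AND tagging IS NOT NULL)`

var errPublish = errors.New("publish to final failed") // used by the demo in main

// saveAnalysis runs the upsert and reports readiness.
func saveAnalysis(ctx context.Context, db *sql.DB, taskID, filename string, analysis []byte) (bool, error) {
	var ready bool
	err := db.QueryRowContext(ctx, saveAnalysisSQL, taskID, filename, string(analysis)).Scan(&ready)
	return ready, err
}

// safeToDelete allows deletion only for paths under processing/.
func safeToDelete(path string) bool {
	return strings.Contains(path, "/processing/")
}

// finalize runs the publish and delete steps for the worker that
// finished last. The file is removed only after a successful publish.
func finalize(ready bool, publish, remove func() error) (bool, error) {
	if !ready {
		return false, nil // the other worker will finalize
	}
	if err := publish(); err != nil {
		return false, err
	}
	return true, remove()
}

func main() {
	removed := false
	ok, err := finalize(true,
		func() error { return errPublish },
		func() error { removed = true; return nil })
	fmt.Println(ok, err, removed, safeToDelete("/data/storage/processing/x.wav"))
}
```
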

---

## Configuration (.env)

Example of the key variables:

```env
# Storage
STORAGE_ROOT=./storage

# Watcher
POLL_INTERVAL=5s
STABLE_WINDOW=2s
STABLE_CHECKS=3

# RabbitMQ
RABBITMQ_URL=amqp://admin:secret123@rabbit:5672/
RABBITMQ_EXCHANGE=audio_pipeline
RABBITMQ_ROUTING_KEY=audio.new

# Transcribe
INPUT_QUEUE=transcribe.tasks
OUTPUT_EXCHANGE=transcription_done
ANALYSE_QUEUE=analyse
TAGGING_QUEUE=tagging
FINAL_QUEUE=final
PREFETCH=1

# Nexara (STT)
NEXARA_BASE_URL=https://api.nexara.ru
NEXARA_API_KEY=...
NEXARA_MODEL=whisper-1
NEXARA_TIMEOUT=10m

# Prompts
PROMPTS_SOURCE=static
PROMPTS_FILE=/app/configs/prompts.json
PROMPTS_SECTION=1

# Postgres
POSTGRES_USER=pipeline
POSTGRES_PASSWORD=pipeline_secret
POSTGRES_DB=pipeline
DATABASE_URL=postgres://pipeline:pipeline_secret@postgres:5432/pipeline?sslmode=disable

# Yandex LLM (tagging + analyse)
YANDEX_API_KEY=t1....
YANDEX_MODEL=gpt://folder_id/model_name
YANDEX_API_URL=https://ai.api.cloud.yandex.net/v1/chat/completions
```

### Hot-reload of the Yandex token

Variables injected through `env_file` are fixed when the container is created, so the `tagging` and `analyse` workers also mount `.env` as `/config/.env` and re-read it on **every start** of the container:

```bash
docker compose restart tagging analyse
```

> `docker compose restart` does not recreate the container, but it restarts the process, which then reads the fresh `.env`.

---
## Running and managing

### First run

```bash
cd audio-pipeline
docker compose up -d --build
```

### Full reset (queues + database)

```bash
docker compose down -v
docker compose up -d --build
```

### Rebuilding individual workers

```bash
docker compose up -d --build transcribe analyse tagging
```

### Processing a new file

```bash
cp recording.wav storage/incoming/
```

### Checking status

```bash
docker compose ps
docker compose logs -f watcher transcribe analyse tagging
```

### RabbitMQ: inspecting the final queue

UI: http://localhost:15672 → Queues → `final`

---
## Logging

All workers write **structured JSON logs** to stdout.

### Key events

| Event                     | Worker          | Description                       |
|---------------------------|-----------------|-----------------------------------|
| `claimed file`            | watcher         | File taken for processing         |
| `transcribed`             | transcribe      | STT finished                      |
| `llm call ok`             | analyse/tagging | Yandex API call succeeded         |
| `task complete`           | analyse/tagging | Both results are ready            |
| `published final`         | analyse/tagging | Message sent to final             |
| `processing file deleted` | analyse/tagging | File removed from processing      |
| `yandex api check ok`     | analyse/tagging | API check at startup              |

### Searching the logs

```bash
# all LLM calls
docker compose logs analyse 2>&1 | grep '"llm call'

# completed tasks
docker compose logs 2>&1 | grep '"task complete"'

# warnings and errors
docker compose logs 2>&1 | grep -E '"level":"(WARN|ERROR)"'
```

---
## Error handling

| Situation                         | Behaviour                                         |
|-----------------------------------|---------------------------------------------------|
| Malformed JSON in the queue       | `Nack(requeue=false)`, goes to the DLQ            |
| Nexara STT error                  | `Nack(requeue=false)`, goes to the DLQ            |
| Yandex LLM error                  | `Nack(requeue=false)`, the message is discarded   |
| Error saving to Postgres          | `Nack(requeue=false)`, the message is discarded   |
| Redelivered message               | Skipped (no repeated LLM call)                    |
| Failure publishing to final       | The file is **not** deleted                       |
| Yandex API unavailable at startup | The worker does not start; the container restarts |

**No-retry policy:** each prompt or classification gets exactly one LLM call. RabbitMQ redeliveries are ignored.

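
The policy can be condensed into a small decision function. This Go sketch is illustrative only: `handle`, `action`, and `errLLM` are made-up names, and acknowledging a skipped redelivery (rather than rejecting it) is an assumption, since the table only says such messages are skipped.

```go
package main

import (
	"errors"
	"fmt"
)

// action is what the consumer tells RabbitMQ about a delivery.
type action string

const (
	ack         action = "ack"
	nackDiscard action = "nack(requeue=false)"
)

var errLLM = errors.New("llm call failed") // stand-in for any processing error

// handle applies the no-retry policy: a redelivered message is never
// processed again, and any processing error rejects the message
// without putting it back on the queue.
func handle(redelivered bool, process func() error) action {
	if redelivered {
		return ack // assumption: skipped messages are acknowledged
	}
	if err := process(); err != nil {
		return nackDiscard
	}
	return ack
}

func main() {
	fmt.Println(handle(false, func() error { return nil }))
	fmt.Println(handle(false, func() error { return errLLM }))
	fmt.Println(handle(true, func() error { return errLLM }))
}
```
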

---
## Switching prompts to an API

This is already implemented in transcribe. Changing `.env` is enough:

```env
PROMPTS_SOURCE=http
PROMPTS_BASE_URL=https://your-api.example.com
PROMPTS_API_KEY=your_token
PROMPTS_SECTION=1
```

**Request:** `GET {PROMPTS_BASE_URL}/metrics/?id_section=1`

**Expected response:** an array in the same format as `prompts.json`:

```json
[
  {
    "id": 1,
    "id_section": 1,
    "name": "behavioral",
    "prompt": "full prompt text...",
    "dt_create": "2026-06-09T09:00:00"
  }
]
```
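
A client for this endpoint needs little more than a URL builder and a decoder. The Go sketch below covers those two pieces and leaves out the HTTP call itself, because the README does not say how `PROMPTS_API_KEY` is sent. `promptsURL` and `parsePrompts` are hypothetical names, and rejecting entries without `name` or `prompt` is an added assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// Prompt mirrors one element of the expected response.
type Prompt struct {
	ID        int    `json:"id"`
	IDSection int    `json:"id_section"`
	Name      string `json:"name"`
	Prompt    string `json:"prompt"`
	DtCreate  string `json:"dt_create"` // kept as a string: the example has no time zone
}

// promptsURL builds {PROMPTS_BASE_URL}/metrics/?id_section=N.
func promptsURL(baseURL string, section int) string {
	q := url.Values{"id_section": {strconv.Itoa(section)}}
	return strings.TrimRight(baseURL, "/") + "/metrics/?" + q.Encode()
}

// parsePrompts decodes the response body and rejects entries that
// analyse could not use (no name to key the result by, or no text).
func parsePrompts(body []byte) ([]Prompt, error) {
	var ps []Prompt
	if err := json.Unmarshal(body, &ps); err != nil {
		return nil, fmt.Errorf("decode prompts: %w", err)
	}
	for _, p := range ps {
		if p.Name == "" || p.Prompt == "" {
			return nil, fmt.Errorf("prompt id=%d: name and prompt are required", p.ID)
		}
	}
	return ps, nil
}

func main() {
	fmt.Println(promptsURL("https://your-api.example.com", 1))
	ps, err := parsePrompts([]byte(`[{"id":1,"id_section":1,"name":"behavioral","prompt":"...","dt_create":"2026-06-09T09:00:00"}]`))
	fmt.Println(len(ps), err)
}
```
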

Analyse does not need to change: the prompts arrive in the RabbitMQ message.

---
## What is not implemented

- **Consumer for the `final` queue**: no worker reads the final messages
- **DLQ queue** `transcribe.tasks.failed`: the `dlx` exchange is declared, but a dedicated queue may not be bound to it
- **Reprocessing** after LLM errors: intentionally disabled
- **Archiving** of deleted files: files are deleted without a backup

---
## Quick troubleshooting

| Problem                                | Solution                                                        |
|----------------------------------------|-----------------------------------------------------------------|
| File is not processed                  | Check `incoming/` and the watcher logs                          |
| TLS timeout to Yandex from a container | Check Docker MTU and VPN; try `docker compose restart`          |
| Stale Yandex token                     | Update `.env`, then `docker compose restart tagging analyse`    |
| `context canceled` in final            | Fixed; rebuild analyse/tagging                                  |
| `metadata` column does not exist       | `ALTER TABLE results ADD COLUMN IF NOT EXISTS metadata JSONB;`  |
| File stays in processing               | Check whether both workers finished and one logged `published final` |
13
db/init.sql
Normal file
@@ -0,0 +1,13 @@
CREATE TABLE IF NOT EXISTS results (
    task_id       TEXT PRIMARY KEY,
    filename      TEXT,
    transcription TEXT,
    analysis      JSONB,
    tagging       JSONB,
    metadata      JSONB,
    status        TEXT NOT NULL DEFAULT 'pending',
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

ALTER TABLE results ADD COLUMN IF NOT EXISTS metadata JSONB;
100
docker-compose.yml
Normal file
@@ -0,0 +1,100 @@
services:
  rabbit:
    image: rabbitmq:3-management-alpine
    container_name: rabbit
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      RABBITMQ_DEFAULT_USER: ${RABBITMQ_DEFAULT_USER:-admin}
      RABBITMQ_DEFAULT_PASS: ${RABBITMQ_DEFAULT_PASS:-secret123}
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  postgres:
    image: postgres:16-alpine
    container_name: postgres
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-pipeline}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-pipeline_secret}
      POSTGRES_DB: ${POSTGRES_DB:-pipeline}
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./db/init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-pipeline} -d ${POSTGRES_DB:-pipeline}"]
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 15s
    restart: unless-stopped

  watcher:
    build: ./watcher
    container_name: watcher
    env_file: .env
    volumes:
      - ${STORAGE_ROOT:-./storage}:/data/storage
    environment:
      STORAGE_ROOT: /data/storage
    depends_on:
      rabbit:
        condition: service_healthy
    restart: unless-stopped

  transcribe:
    build: ./workers/transcribe
    container_name: transcribe
    env_file: .env
    volumes:
      - ${STORAGE_ROOT:-./storage}:/data/storage
      - ./workers/transcribe/configs:/app/configs:ro
    environment:
      STORAGE_ROOT: /data/storage
    depends_on:
      rabbit:
        condition: service_healthy
    restart: unless-stopped

  tagging:
    build: ./workers/tagging
    container_name: tagging
    env_file: .env
    volumes:
      - ${STORAGE_ROOT:-./storage}:/data/storage
      - ./.env:/config/.env:ro
    environment:
      STORAGE_ROOT: /data/storage
      DOTENV_PATH: /config/.env
    depends_on:
      rabbit:
        condition: service_healthy
      postgres:
        condition: service_healthy
    restart: unless-stopped

  analyse:
    build: ./workers/analyse
    container_name: analyse
    env_file: .env
    volumes:
      - ${STORAGE_ROOT:-./storage}:/data/storage
      - ./.env:/config/.env:ro
    environment:
      STORAGE_ROOT: /data/storage
      DOTENV_PATH: /config/.env
    depends_on:
      rabbit:
        condition: service_healthy
      postgres:
        condition: service_healthy
    restart: unless-stopped

volumes:
  postgres_data:
11
watcher/Dockerfile
Normal file
@@ -0,0 +1,11 @@
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /watcher ./cmd/watcher

FROM alpine:3.19
RUN apk add --no-cache ca-certificates
COPY --from=build /watcher /watcher
ENTRYPOINT ["/watcher"]
115
watcher/cmd/watcher/main.go
Normal file
@@ -0,0 +1,115 @@
package main

import (
	"context"
	"log/slog"
	"os"
	"os/signal"
	"syscall"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"

	"github.com/yourorg/watcher/internal/config"
	"github.com/yourorg/watcher/internal/publisher"
	"github.com/yourorg/watcher/internal/scanner"
)

func main() {
	// Structured JSON logs on stdout.
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})))

	cfg := config.Load()
	sc := scanner.New(scanner.Config{
		StorageRoot:   cfg.StorageRoot,
		IncomingDir:   cfg.IncomingDir,
		ProcessingDir: cfg.ProcessingDir,
		FailedDir:     cfg.FailedDir,
		StableWindow:  cfg.StableWindow,
		StableChecks:  cfg.StableChecks,
	})
	if err := sc.EnsureDirs(); err != nil {
		slog.Error("ensure dirs failed", "error", err)
		os.Exit(1)
	}

	// Connect to RabbitMQ and declare the durable direct exchange.
	ch := mustRabbit(cfg.RabbitURL)
	if err := ch.ExchangeDeclare(cfg.Exchange, "direct", true, false, false, false, nil); err != nil {
		slog.Error("declare exchange failed", "error", err)
		os.Exit(1)
	}
	pub, err := publisher.New(ch, cfg.Exchange, cfg.RoutingKey)
	if err != nil {
		slog.Error("publisher init failed", "error", err)
		os.Exit(1)
	}

	// Stop cleanly on SIGINT/SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	slog.Info("watcher started",
		"storage_root", cfg.StorageRoot,
		"poll_interval", cfg.PollInterval.String(),
		"exchange", cfg.Exchange,
		"routing_key", cfg.RoutingKey,
	)

	ticker := time.NewTicker(cfg.PollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			slog.Info("watcher stopping")
			return
		case <-ticker.C:
			claimed, err := sc.ScanOnce()
			if err != nil {
				slog.Warn("scan failed", "error", err)
				continue
			}
			for _, cf := range claimed {
				task := publisher.AudioTask{
					TaskID:   cf.TaskID,
					FilePath: cf.FilePath,
					Filename: cf.Filename,
					Size:     cf.Size,
				}
				pubCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
				err := pub.Publish(pubCtx, task)
				cancel()
				if err != nil {
					// Return the file to incoming/; fall back to failed/ if that is impossible.
					slog.Warn("publish failed, rolling back", "task_id", cf.TaskID, "error", err)
					if rbErr := sc.RollbackToIncoming(cf.FilePath, cf.Filename); rbErr != nil {
						slog.Error("rollback failed, moving to failed", "task_id", cf.TaskID, "error", rbErr)
						_ = sc.MoveToFailed(cf.FilePath, cf.Filename)
					}
					continue
				}
				slog.Info("task published", "task_id", cf.TaskID, "filename", cf.Filename)
			}
		}
	}
}

// mustRabbit dials RabbitMQ, retrying up to 30 times with a 2 s pause,
// and exits the process if no connection or channel can be opened.
func mustRabbit(url string) *amqp.Channel {
	var conn *amqp.Connection
	var err error
	for i := 0; i < 30; i++ {
		conn, err = amqp.Dial(url)
		if err == nil {
			break
		}
		slog.Info("waiting for rabbit", "attempt", i+1, "error", err)
		time.Sleep(2 * time.Second)
	}
	if err != nil {
		slog.Error("rabbit unreachable", "error", err)
		os.Exit(1)
	}
	ch, err := conn.Channel()
	if err != nil {
		slog.Error("rabbit channel failed", "error", err)
		os.Exit(1)
	}
	return ch
}
8
watcher/go.mod
Normal file
@@ -0,0 +1,8 @@
module github.com/yourorg/watcher

go 1.22

require (
	github.com/oklog/ulid/v2 v2.1.0
	github.com/rabbitmq/amqp091-go v1.9.0
)
21
watcher/go.sum
Normal file
@@ -0,0 +1,21 @@
|
||||
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
|
||||
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
|
||||
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
|
||||
github.com/oklog/ulid/v2 v2.1.0 h1:+9lhoxAP56we25tyYETBBY1YLA2SaoLvUFgrP2miPJU=
|
||||
github.com/oklog/ulid/v2 v2.1.0/go.mod h1:rcEKHmBBKfef9DhnvX7y1HZBYxjXb0cP5ExxNsTT1QQ=
|
||||
github.com/pborman/getopt v0.0.0-20170112200414-7148bc3a4c30/go.mod h1:85jBQOZwpVEaDAr341tbn15RS4fCAsIst0qp7i8ex1o=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/rabbitmq/amqp091-go v1.9.0 h1:qrQtyzB4H8BQgEuJwhmVQqVHB9O4+MNDJCCAcpc3Aoo=
|
||||
github.com/rabbitmq/amqp091-go v1.9.0/go.mod h1:+jPrT9iY2eLjRaMSRHUhc3z14E/l85kv/f+6luSD3pc=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
|
||||
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
|
||||
go.uber.org/goleak v1.2.1 h1:NBol2c7O1ZokfZ0LEU9K6Whx/KnwvepVetCUhtKja4A=
|
||||
go.uber.org/goleak v1.2.1/go.mod h1:qlT2yGI9QafXHhZZLxlSuNsMw3FFLxBr+tBRlmO1xH4=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
60
watcher/internal/config/config.go
Normal file
@@ -0,0 +1,60 @@
package config

import (
	"os"
	"strconv"
	"time"
)

type Config struct {
	StorageRoot   string
	IncomingDir   string
	ProcessingDir string
	FailedDir     string
	PollInterval  time.Duration
	StableWindow  time.Duration
	StableChecks  int
	RabbitURL     string
	Exchange      string
	RoutingKey    string
}

func Load() Config {
	return Config{
		StorageRoot:   getEnv("STORAGE_ROOT", "/data/storage"),
		IncomingDir:   getEnv("INCOMING_DIR", "incoming"),
		ProcessingDir: getEnv("PROCESSING_DIR", "processing"),
		FailedDir:     getEnv("FAILED_DIR", "failed"),
		PollInterval:  getDuration("POLL_INTERVAL", 5*time.Second),
		StableWindow:  getDuration("STABLE_WINDOW", 2*time.Second),
		StableChecks:  getInt("STABLE_CHECKS", 3),
		RabbitURL:     getEnv("RABBITMQ_URL", "amqp://guest:guest@localhost:5672/"),
		Exchange:      getEnv("RABBITMQ_EXCHANGE", "audio_pipeline"),
		RoutingKey:    getEnv("RABBITMQ_ROUTING_KEY", "audio.new"),
	}
}

func getEnv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func getInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if i, err := strconv.Atoi(v); err == nil {
			return i
		}
	}
	return def
}

func getDuration(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
58
watcher/internal/publisher/publisher.go
Normal file
@@ -0,0 +1,58 @@
package publisher

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"
)

type AudioTask struct {
	TaskID    string `json:"task_id"`
	FilePath  string `json:"file_path"`
	Filename  string `json:"filename"`
	Size      int64  `json:"size"`
	CreatedAt int64  `json:"created_at"`
}

type Publisher struct {
	ch         *amqp.Channel
	exchange   string
	routingKey string
}

func New(ch *amqp.Channel, exchange, routingKey string) (*Publisher, error) {
	if err := ch.Confirm(false); err != nil {
		return nil, fmt.Errorf("confirm mode: %w", err)
	}
	return &Publisher{ch: ch, exchange: exchange, routingKey: routingKey}, nil
}

func (p *Publisher) Publish(ctx context.Context, task AudioTask) error {
	if task.CreatedAt == 0 {
		task.CreatedAt = time.Now().Unix()
	}
	body, err := json.Marshal(task)
	if err != nil {
		return err
	}
	// Wait on a per-message deferred confirmation instead of calling
	// NotifyPublish on every Publish: each NotifyPublish call registers one
	// more listener on the channel, and listeners that are never drained
	// eventually block confirm delivery.
	dc, err := p.ch.PublishWithDeferredConfirmWithContext(ctx, p.exchange, p.routingKey, false, false, amqp.Publishing{
		ContentType:  "application/json",
		Body:         body,
		DeliveryMode: amqp.Persistent,
	})
	if err != nil {
		return err
	}
	acked, err := dc.WaitContext(ctx)
	if err != nil {
		return err
	}
	if !acked {
		return fmt.Errorf("publish not confirmed")
	}
	return nil
}
144
watcher/internal/scanner/scanner.go
Normal file
@@ -0,0 +1,144 @@
package scanner

import (
	"fmt"
	"log/slog"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/oklog/ulid/v2"
)

var allowedExts = map[string]bool{
	".mp3": true, ".wav": true, ".m4a": true,
	".ogg": true, ".flac": true, ".webm": true,
}

type ClaimedFile struct {
	TaskID   string
	FilePath string
	Filename string
	Size     int64
}

type Config struct {
	StorageRoot   string
	IncomingDir   string
	ProcessingDir string
	FailedDir     string
	StableWindow  time.Duration
	StableChecks  int
}

type Scanner struct {
	cfg Config
}

func New(cfg Config) *Scanner {
	return &Scanner{cfg: cfg}
}

func (s *Scanner) EnsureDirs() error {
	for _, dir := range []string{s.cfg.IncomingDir, s.cfg.ProcessingDir, s.cfg.FailedDir} {
		if err := os.MkdirAll(filepath.Join(s.cfg.StorageRoot, dir), 0o755); err != nil {
			return err
		}
	}
	return nil
}

func (s *Scanner) ScanOnce() ([]ClaimedFile, error) {
	incoming := filepath.Join(s.cfg.StorageRoot, s.cfg.IncomingDir)
	entries, err := os.ReadDir(incoming)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, err
	}

	var claimed []ClaimedFile
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		if strings.HasPrefix(name, ".") || strings.HasSuffix(strings.ToLower(name), ".tmp") {
			continue
		}
		ext := strings.ToLower(filepath.Ext(name))
		if !allowedExts[ext] {
			continue
		}
		src := filepath.Join(incoming, name)
		if !s.isStable(src) {
			continue
		}
		cf, err := s.claim(src, name, ext)
		if err != nil {
			slog.Warn("claim failed", "file", name, "error", err)
			continue
		}
		claimed = append(claimed, cf)
	}
	return claimed, nil
}

func (s *Scanner) isStable(path string) bool {
	var lastSize int64 = -1
	for i := 0; i < s.cfg.StableChecks; i++ {
		info, err := os.Stat(path)
		if err != nil {
			return false
		}
		size := info.Size()
		if lastSize >= 0 && size != lastSize {
			return false
		}
		lastSize = size
		if i < s.cfg.StableChecks-1 {
			time.Sleep(s.cfg.StableWindow)
		}
	}
	return true
}

func (s *Scanner) claim(src, originalName, ext string) (ClaimedFile, error) {
	info, err := os.Stat(src)
	if err != nil {
		return ClaimedFile{}, err
	}
	taskID := ulid.Make().String()
	processing := filepath.Join(s.cfg.StorageRoot, s.cfg.ProcessingDir)
	dst := filepath.Join(processing, taskID+ext)
	if err := os.Rename(src, dst); err != nil {
		return ClaimedFile{}, fmt.Errorf("rename: %w", err)
	}
	slog.Info("claimed file", "task_id", taskID, "filename", originalName, "path", dst, "size", info.Size())
	return ClaimedFile{
		TaskID:   taskID,
		FilePath: dst,
		Filename: originalName,
		Size:     info.Size(),
	}, nil
}

func (s *Scanner) RollbackToIncoming(filePath, originalName string) error {
	incoming := filepath.Join(s.cfg.StorageRoot, s.cfg.IncomingDir)
	dst := filepath.Join(incoming, originalName)
	if err := os.Rename(filePath, dst); err != nil {
		return s.MoveToFailed(filePath, originalName)
	}
	return nil
}

func (s *Scanner) MoveToFailed(filePath, originalName string) error {
	failed := filepath.Join(s.cfg.StorageRoot, s.cfg.FailedDir)
	if err := os.MkdirAll(failed, 0o755); err != nil {
		return err
	}
	dst := filepath.Join(failed, originalName)
	return os.Rename(filePath, dst)
}
12
workers/analyse/Dockerfile
Normal file
@@ -0,0 +1,12 @@
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum* ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /analyse ./cmd/analyse

FROM alpine:3.20
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=build /analyse /app/analyse
ENTRYPOINT ["/app/analyse"]
657
workers/analyse/cmd/analyse/main.go
Normal file
@@ -0,0 +1,657 @@
package main

import (
	"bytes"
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"net"
	"net/http"
	"os"
	"strings"
	"time"
	"unicode/utf8"

	"github.com/joho/godotenv"
	_ "github.com/jackc/pgx/v5/stdlib"
	amqp "github.com/rabbitmq/amqp091-go"
)

func init() {
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})))
}

// ── incoming message from the analyse queue (TranscriptionResult from transcribe) ──

type WorkerMessage struct {
	TaskID        string    `json:"task_id"`
	Filename      string    `json:"filename"`
	FilePath      string    `json:"file_path"`
	Transcription string    `json:"transcription"`
	Language      string    `json:"language"`
	Segments      []Segment `json:"segments,omitempty"`
	Prompts       []Prompt  `json:"prompts"`
	TranscribedAt int64     `json:"transcribed_at"`
}

type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

type Prompt struct {
	ID        int    `json:"id"`
	IDSection int    `json:"id_section"`
	Name      string `json:"name"`
	Prompt    string `json:"prompt"`
	DtCreate  string `json:"dt_create"`
}

// AnalysisResult maps each prompt's name to the full JSON response from the LLM.
type AnalysisResult map[string]any

// ── LLM request/response ──

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}
type chatRequest struct {
	Model          string  `json:"model"`
	Temperature    float64 `json:"temperature"`
	ResponseFormat struct {
		Type string `json:"type"`
	} `json:"response_format"`
	Messages []chatMessage `json:"messages"`
}
type tokenUsage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}
type chatResponse struct {
	Choices []struct {
		Message struct {
			Content string `json:"content"`
		} `json:"message"`
	} `json:"choices"`
	Usage *tokenUsage `json:"usage"`
}

type llmCallResult struct {
	Content       string
	RequestBytes  int
	ResponseBytes int
	Usage         *tokenUsage
	Duration      time.Duration
}

type analysisStats struct {
	LLMCalls     int
	TotalTokens  int
	PromptTokens int
	OutputTokens int
}

// ===================== LLM =====================

var llmHTTPClient = newLLMHTTPClient(150 * time.Second)

func newLLMHTTPClient(totalTimeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: totalTimeout,
		Transport: &http.Transport{
			Proxy: http.ProxyFromEnvironment,
			DialContext: (&net.Dialer{
				Timeout:   30 * time.Second,
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   60 * time.Second,
			ResponseHeaderTimeout: 90 * time.Second,
			ExpectContinueTimeout: 5 * time.Second,
			IdleConnTimeout:       90 * time.Second,
		},
	}
}

func callLLM(ctx context.Context, apiURL, model, prompt string) (*llmCallResult, error) {
	const systemPrompt = "Ты — строгий классификатор звонков. Отвечай только JSON, без пояснений."

	reqBody := chatRequest{
		Model:       model,
		Temperature: 0.1,
		Messages: []chatMessage{
			{Role: "system", Content: systemPrompt},
			{Role: "user", Content: prompt},
		},
	}
	reqBody.ResponseFormat.Type = "json_object"

	jsonData, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}

	req, err := http.NewRequestWithContext(ctx, "POST", apiURL, bytes.NewBuffer(jsonData))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("YANDEX_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	start := time.Now()
	resp, err := llmHTTPClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	duration := time.Since(start)

	if resp.StatusCode != http.StatusOK {
		return &llmCallResult{
			RequestBytes:  len(jsonData),
			ResponseBytes: len(body),
			Duration:      duration,
		}, fmt.Errorf("status %d: %s", resp.StatusCode, truncate(string(body), 500))
	}

	var result chatResponse
	if err := json.Unmarshal(body, &result); err != nil {
		return &llmCallResult{
			RequestBytes:  len(jsonData),
			ResponseBytes: len(body),
			Duration:      duration,
		}, err
	}
	if len(result.Choices) == 0 {
		return &llmCallResult{
			RequestBytes:  len(jsonData),
			ResponseBytes: len(body),
			Duration:      duration,
		}, fmt.Errorf("empty response")
	}

	return &llmCallResult{
		Content:       result.Choices[0].Message.Content,
		RequestBytes:  len(jsonData),
		ResponseBytes: len(body),
		Usage:         result.Usage,
		Duration:      duration,
	}, nil
}

func checkYandexAPI(ctx context.Context, apiURL, model string) error {
	slog.Info("yandex api check started", "worker", "analyse", "url", apiURL, "model", model)

	res, err := callLLM(ctx, apiURL, model, `Ответь только JSON: {"ok":true}`)
	if err != nil {
		return err
	}

	attrs := []any{
		"worker", "analyse",
		"duration_ms", res.Duration.Milliseconds(),
		"response_chars", utf8.RuneCountInString(res.Content),
	}
	if res.Usage != nil {
		attrs = append(attrs,
			"prompt_tokens", res.Usage.PromptTokens,
			"completion_tokens", res.Usage.CompletionTokens,
			"total_tokens", res.Usage.TotalTokens,
		)
	}
	slog.Info("yandex api check ok", attrs...)
	return nil
}

func logLLMCall(taskID, model, promptName string, promptIndex, promptTotal, attempt, inputChars int, res *llmCallResult, err error) {
	attrs := []any{
		"worker", "analyse",
		"task_id", taskID,
		"model", model,
		"call_type", "analyse_prompt",
		"prompt_name", promptName,
		"prompt_index", promptIndex,
		"prompt_total", promptTotal,
		"attempt", attempt,
		"input_chars", inputChars,
	}
	if res != nil {
		attrs = append(attrs,
			"duration_ms", res.Duration.Milliseconds(),
			"request_bytes", res.RequestBytes,
			"response_bytes", res.ResponseBytes,
			"response_chars", utf8.RuneCountInString(res.Content),
		)
		if res.Usage != nil {
			attrs = append(attrs,
				"prompt_tokens", res.Usage.PromptTokens,
				"completion_tokens", res.Usage.CompletionTokens,
				"total_tokens", res.Usage.TotalTokens,
			)
		}
	}
	if err != nil {
		slog.Warn("llm call failed", append(attrs, "error", err)...)
		return
	}
	slog.Info("llm call ok", attrs...)
}

func accumulateUsage(stats *analysisStats, res *llmCallResult) {
	stats.LLMCalls++
	if res != nil && res.Usage != nil {
		stats.TotalTokens += res.Usage.TotalTokens
		stats.PromptTokens += res.Usage.PromptTokens
		stats.OutputTokens += res.Usage.CompletionTokens
	}
}

func buildPromptQuery(transcription string, p Prompt) string {
	var b strings.Builder
	b.WriteString(p.Prompt)
	b.WriteString("\n\n=== ТРАНСКРИПЦИЯ ===\n\"\"\"\n")
	b.WriteString(transcription)
	b.WriteString("\n\"\"\"")
	return b.String()
}

func analysePrompt(ctx context.Context, apiURL, model, transcription string, p Prompt, index, total int, taskID string, stats *analysisStats) (any, error) {
	query := buildPromptQuery(transcription, p)
	inputChars := utf8.RuneCountInString(query)

	res, err := callLLM(ctx, apiURL, model, query)
	logLLMCall(taskID, model, p.Name, index, total, 1, inputChars, res, err)
	accumulateUsage(stats, res)
	if err != nil {
		return nil, err
	}

	var parsed any
	if err := json.Unmarshal([]byte(res.Content), &parsed); err != nil {
		return nil, fmt.Errorf("parse: %w, resp: %s", err, truncate(res.Content, 300))
	}
	return parsed, nil
}

func runAnalysis(ctx context.Context, apiURL, model, taskID, transcription string, prompts []Prompt) (AnalysisResult, analysisStats, error) {
	stats := analysisStats{}
	result := make(AnalysisResult, len(prompts))

	valid := make([]Prompt, 0, len(prompts))
	for _, p := range prompts {
		if p.Name != "" {
			valid = append(valid, p)
		}
	}
	total := len(valid)

	for i, p := range valid {
		value, err := analysePrompt(ctx, apiURL, model, transcription, p, i+1, total, taskID, &stats)
		if err != nil {
			return nil, stats, fmt.Errorf("%s: %w", p.Name, err)
		}
		result[p.Name] = value
	}
	return result, stats, nil
}

// ===================== DB =====================

func saveAnalysis(ctx context.Context, db *sql.DB, task WorkerMessage, analysis []byte) (complete bool, err error) {
	metadata, _ := json.Marshal(map[string]any{
		"file_path":      task.FilePath,
		"language":       task.Language,
		"segments":       task.Segments,
		"prompts":        task.Prompts,
		"transcribed_at": task.TranscribedAt,
	})

	_, err = db.ExecContext(ctx,
		`INSERT INTO results (task_id) VALUES ($1) ON CONFLICT (task_id) DO NOTHING`, task.TaskID)
	if err != nil {
		return false, fmt.Errorf("ensure row: %w", err)
	}

	err = db.QueryRowContext(ctx, `
		UPDATE results
		SET analysis = $2::jsonb,
		    filename = COALESCE(NULLIF($3, ''), filename),
		    transcription = COALESCE(NULLIF($4, ''), transcription),
		    metadata = COALESCE($5::jsonb, metadata),
		    updated_at = now(),
		    status = CASE WHEN tagging IS NOT NULL THEN 'done' ELSE status END
		WHERE task_id = $1
		RETURNING (analysis IS NOT NULL AND tagging IS NOT NULL)
	`, task.TaskID, string(analysis), task.Filename, task.Transcription, string(metadata)).Scan(&complete)
	if err != nil {
		return false, fmt.Errorf("update analysis: %w", err)
	}
	return complete, nil
}

// ===================== MAIN =====================

func loadDotenv() {
	path := os.Getenv("DOTENV_PATH")
	if path == "" {
		return
	}
	if err := godotenv.Overload(path); err != nil {
		slog.Warn("dotenv load failed", "path", path, "error", err)
		return
	}
	slog.Info("dotenv loaded", "path", path)
}

func main() {
	loadDotenv()

	amqpURL := getEnv("RABBITMQ_URL", "amqp://guest:guest@localhost:5672/")
	dbURL := getEnv("DATABASE_URL", "")
	token := os.Getenv("YANDEX_API_KEY")
	model := os.Getenv("YANDEX_MODEL")
	apiURL := getEnv("YANDEX_API_URL", "https://ai.api.cloud.yandex.net/v1/chat/completions")
	inputQueue := getEnv("ANALYSE_QUEUE", "analyse")
	finalQueue := getEnv("FINAL_QUEUE", "final")

	if token == "" {
		slog.Error("YANDEX_API_KEY is required")
		os.Exit(1)
	}
	if model == "" {
		slog.Error("YANDEX_MODEL is required")
		os.Exit(1)
	}
	if dbURL == "" {
		slog.Error("DATABASE_URL is required")
		os.Exit(1)
	}
	slog.Info("config loaded", "worker", "analyse",
		"yandex_token", tokenFingerprint(token), "model", model, "api_url", apiURL)

	db := mustDB(dbURL)
	defer db.Close()

	checkCtx, checkCancel := context.WithTimeout(context.Background(), 90*time.Second)
	if err := checkYandexAPI(checkCtx, apiURL, model); err != nil {
		checkCancel()
		slog.Error("yandex api check failed — worker will not start", "worker", "analyse", "error", err)
		os.Exit(1)
	}
	checkCancel()

	ch := mustRabbit(amqpURL)

	if _, err := ch.QueueDeclare(inputQueue, true, false, false, false, nil); err != nil {
		slog.Error("declare queue failed", "queue", inputQueue, "error", err)
		os.Exit(1)
	}
	if _, err := ch.QueueDeclare(finalQueue, true, false, false, false, nil); err != nil {
		slog.Error("declare queue failed", "queue", finalQueue, "error", err)
		os.Exit(1)
	}
	ch.Qos(1, 0, false)

	msgs, err := ch.Consume(inputQueue, "", false, false, false, false, nil)
	if err != nil {
		slog.Error("consume failed", "error", err)
		os.Exit(1)
	}
	slog.Info("worker started", "worker", "analyse", "queue", inputQueue, "model", model)

	for d := range msgs {
		taskStart := time.Now()
		var task WorkerMessage
		if err := json.Unmarshal(d.Body, &task); err != nil {
			slog.Warn("bad message", "worker", "analyse", "delivery_tag", d.DeliveryTag,
				"body_bytes", len(d.Body), "error", err)
			d.Nack(false, false)
			continue
		}

		promptNames := make([]string, 0, len(task.Prompts))
		promptTextChars := 0
		for _, p := range task.Prompts {
			if p.Name != "" {
				promptNames = append(promptNames, p.Name)
				promptTextChars += utf8.RuneCountInString(p.Prompt)
			}
		}
		transcriptionChars := utf8.RuneCountInString(task.Transcription)

		slog.Info("message received", "worker", "analyse",
			"task_id", task.TaskID,
			"filename", task.Filename,
			"delivery_tag", d.DeliveryTag,
			"redelivered", d.Redelivered,
			"body_bytes", len(d.Body),
			"transcription_chars", transcriptionChars,
			"segments", len(task.Segments),
			"prompts", len(promptNames),
			"prompt_names", promptNames,
			"prompt_text_chars", promptTextChars,
			"llm_calls_expected", len(promptNames),
		)
		if d.Redelivered {
			slog.Warn("redelivered message skipped — no llm call",
				"worker", "analyse", "task_id", task.TaskID,
				"delivery_tag", d.DeliveryTag, "prompts", len(promptNames))
			d.Nack(false, false)
			continue
		}

		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)

		result, stats, err := runAnalysis(ctx, apiURL, model, task.TaskID, task.Transcription, task.Prompts)
		if err != nil {
			cancel()
			slog.Warn("task failed, discarded",
				"worker", "analyse", "task_id", task.TaskID,
				"llm_calls_done", stats.LLMCalls,
				"total_tokens_so_far", stats.TotalTokens,
				"error", err)
			d.Nack(false, false)
			continue
		}

		analysisJSON, _ := json.Marshal(result)
		complete, err := saveAnalysis(ctx, db, task, analysisJSON)
		if err != nil {
			cancel()
			slog.Warn("db save failed, discarded",
				"worker", "analyse", "task_id", task.TaskID, "error", err)
			d.Nack(false, false)
			continue
		}

		taskAttrs := []any{
			"worker", "analyse",
			"task_id", task.TaskID,
			"llm_calls", stats.LLMCalls,
			"total_tokens", stats.TotalTokens,
			"prompt_tokens", stats.PromptTokens,
			"completion_tokens", stats.OutputTokens,
			"duration_ms", time.Since(taskStart).Milliseconds(),
		}

		if complete {
			notifyFinal(ctx, ch, db, finalQueue, task.TaskID, "analyse")
			slog.Info("task complete", append(taskAttrs, "was_last", "analyse")...)
		} else {
			slog.Info("task partial", append(taskAttrs, "waiting_for", "tagging")...)
		}
		cancel()

		d.Ack(false)
	}
}
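
// truncate shortens s to at most max bytes and appends "...".
func truncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Back up to a rune boundary so multi-byte UTF-8 text (for example
	// Cyrillic) is never cut in the middle of a character.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max] + "..."
}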

func loadFinalPayload(ctx context.Context, db *sql.DB, taskID string) ([]byte, error) {
	var (
		filename, transcription, status sql.NullString
		analysis, tagging, metadata     []byte
		createdAt, updatedAt            time.Time
	)
	err := db.QueryRowContext(ctx, `
		SELECT filename, transcription, analysis, tagging, metadata, status, created_at, updated_at
		FROM results WHERE task_id = $1
	`, taskID).Scan(&filename, &transcription, &analysis, &tagging, &metadata, &status, &createdAt, &updatedAt)
	if err != nil {
		return nil, fmt.Errorf("load result: %w", err)
	}

	msg := map[string]any{
		"task_id":    taskID,
		"status":     status.String,
		"created_at": createdAt,
		"updated_at": updatedAt,
	}
	if filename.Valid {
		msg["filename"] = filename.String
	}
	if transcription.Valid {
		msg["transcription"] = transcription.String
	}
	if len(analysis) > 0 {
		var v any
		if err := json.Unmarshal(analysis, &v); err == nil {
			msg["analysis"] = v
		}
	}
	if len(tagging) > 0 {
		var v any
		if err := json.Unmarshal(tagging, &v); err == nil {
			msg["tagging"] = v
		}
	}
	if len(metadata) > 0 {
		var meta map[string]any
		if err := json.Unmarshal(metadata, &meta); err == nil {
			for k, v := range meta {
				msg[k] = v
			}
		}
	}
	return json.Marshal(msg)
}

func notifyFinal(ctx context.Context, ch *amqp.Channel, db *sql.DB, queue, taskID, worker string) {
	body, err := loadFinalPayload(ctx, db, taskID)
	if err != nil {
		slog.Warn("load final payload failed", "worker", worker, "task_id", taskID, "error", err)
		return
	}
	if err := ch.PublishWithContext(ctx, "", queue, false, false,
		amqp.Publishing{
			ContentType:  "application/json",
			Body:         body,
			DeliveryMode: amqp.Persistent,
		}); err != nil {
		slog.Warn("publish final failed", "worker", worker, "task_id", taskID, "error", err)
		return
	}
	slog.Info("published final", "worker", worker, "task_id", taskID, "queue", queue, "body_bytes", len(body))
	deleteProcessingFile(extractFilePath(body), taskID, worker)
}

func extractFilePath(body []byte) string {
	var msg map[string]any
	if err := json.Unmarshal(body, &msg); err != nil {
		return ""
	}
	fp, _ := msg["file_path"].(string)
	return fp
}

func deleteProcessingFile(filePath, taskID, worker string) {
	if filePath == "" {
		slog.Warn("processing file not deleted: no file_path", "worker", worker, "task_id", taskID)
		return
	}
	if !strings.Contains(filePath, "/processing/") {
		slog.Warn("processing file not deleted: path outside processing", "worker", worker, "task_id", taskID, "path", filePath)
		return
	}
	if err := os.Remove(filePath); err != nil {
		if os.IsNotExist(err) {
			slog.Info("processing file already removed", "worker", worker, "task_id", taskID, "path", filePath)
			return
		}
		slog.Warn("processing file delete failed", "worker", worker, "task_id", taskID, "path", filePath, "error", err)
		return
	}
	slog.Info("processing file deleted", "worker", worker, "task_id", taskID, "path", filePath)
}

func getEnv(k, d string) string {
	if v := os.Getenv(k); v != "" {
		return v
	}
	return d
}

func tokenFingerprint(token string) string {
	if len(token) <= 12 {
		return "***"
	}
	return token[:8] + "..." + token[len(token)-4:]
}

func mustDB(url string) *sql.DB {
	db, err := sql.Open("pgx", url)
	if err != nil {
		slog.Error("db open failed", "error", err)
		os.Exit(1)
	}
	db.SetMaxOpenConns(5)
	time.Sleep(2 * time.Second) // give Docker DNS time to register postgres
	for i := 0; i < 60; i++ {
		if err = db.Ping(); err == nil {
			return db
		}
		if i < 5 || (i+1)%10 == 0 {
			slog.Info("waiting for db", "attempt", i+1, "error", err)
		}
		time.Sleep(3 * time.Second)
	}
	slog.Error("db unreachable", "error", err)
	os.Exit(1)
	return nil
}

func mustRabbit(url string) *amqp.Channel {
	var conn *amqp.Connection
	var err error
	for i := 0; i < 30; i++ {
		conn, err = amqp.Dial(url)
		if err == nil {
			break
		}
		slog.Info("waiting for rabbit", "attempt", i+1, "error", err)
		time.Sleep(2 * time.Second)
	}
	if err != nil {
		slog.Error("rabbit unreachable", "error", err)
		os.Exit(1)
	}
	ch, err := conn.Channel()
	if err != nil {
		slog.Error("rabbit channel failed", "error", err)
		os.Exit(1)
	}
	return ch
}
18
workers/analyse/go.mod
Normal file
@@ -0,0 +1,18 @@
module github.com/yourorg/analyse

go 1.22

require (
	github.com/jackc/pgx/v5 v5.5.5
	github.com/joho/godotenv v1.5.1
	github.com/rabbitmq/amqp091-go v1.9.0
)

require (
	github.com/jackc/pgpassfile v1.0.0 // indirect
	github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a // indirect
	github.com/jackc/puddle/v2 v2.2.1 // indirect
	golang.org/x/crypto v0.17.0 // indirect
	golang.org/x/sync v0.1.0 // indirect
	golang.org/x/text v0.14.0 // indirect
)
41
workers/analyse/go.sum
Normal file
@@ -0,0 +1,41 @@
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg=
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a h1:bbPeKD0xmW/Y25WS6cokEszi5g+S0QxI/d45PkRi7Nk=
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
github.com/jackc/pgx/v5 v5.5.5 h1:amBjrZVmksIdNjxGW/IiIMzxMKZFelXbUoPNb+8sjQw=
github.com/jackc/pgx/v5 v5.5.5/go.mod h1:ez9gk+OAat140fv9ErkZDYFWmXLfV+++K0uAOiwgm1A=
github.com/jackc/puddle/v2 v2.2.1 h1:RhxXJtFG022u4ibrCSMSiu5aOq1i77R3OHKNJj77OAk=
github.com/jackc/puddle/v2 v2.2.1/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4=
github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0=
github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rabbitmq/amqp091-go v1.9.0 h1:qrQtyzB4H8BQgEuJwhmVQqVHB9O4+MNDJCCAcpc3Aoo=
github.com/rabbitmq/amqp091-go v1.9.0/go.mod h1:+jPrT9iY2eLjRaMSRHUhc3z14E/l85kv/f+6luSD3pc=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk=
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
go.uber.org/goleak v1.2.1 h1:NBol2c7O1ZokfZ0LEU9K6Whx/KnwvepVetCUhtKja4A=
go.uber.org/goleak v1.2.1/go.mod h1:qlT2yGI9QafXHhZZLxlSuNsMw3FFLxBr+tBRlmO1xH4=
golang.org/x/crypto v0.17.0 h1:r8bRNjWL3GshPW3gkd+RpvzWrZAwPS49OmTGZ/uhM4k=
golang.org/x/crypto v0.17.0/go.mod h1:gCAAfMLgwOJRpTjQ2zCCt2OcSfYMTeZVSRtQlPC7Nq4=
golang.org/x/sync v0.1.0 h1:wsuoTGHzEhffawBOhz5CYhcrV4IdKZbEyZjBMuTp12o=
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ=
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
11
workers/tagging/Dockerfile
Normal file
@@ -0,0 +1,11 @@
|
||||
FROM golang:1.22-alpine AS build
|
||||
WORKDIR /src
|
||||
COPY go.mod go.sum* ./
|
||||
RUN go mod download
|
||||
COPY . .
|
||||
RUN CGO_ENABLED=0 go build -o /tagging ./cmd/tagging
|
||||
|
||||
FROM alpine:3.19
|
||||
RUN apk add --no-cache ca-certificates
|
||||
COPY --from=build /tagging /tagging
|
||||
ENTRYPOINT ["/tagging"]
|
||||
685
workers/tagging/cmd/tagging/main.go
Normal file
@@ -0,0 +1,685 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"database/sql"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"log/slog"
|
||||
"net"
|
||||
"net/http"
|
||||
"os"
|
||||
"strings"
|
||||
"time"
|
||||
"unicode/utf8"
|
||||
|
||||
"github.com/joho/godotenv"
|
||||
_ "github.com/jackc/pgx/v5/stdlib"
|
||||
amqp "github.com/rabbitmq/amqp091-go"
|
||||
)
|
||||
|
||||
func init() {
|
||||
slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})))
|
||||
}
|
||||
|
||||
func apiURL() string {
|
||||
if u := os.Getenv("YANDEX_API_URL"); u != "" {
|
||||
return u
|
||||
}
|
||||
return "https://ai.api.cloud.yandex.net/v1/chat/completions"
|
||||
}
|
||||
|
||||
// ── incoming message from the tagging queue ──
|
||||
type WorkerMessage struct {
|
||||
TaskID string `json:"task_id"`
|
||||
Filename string `json:"filename"`
|
||||
Transcription string `json:"transcription"`
|
||||
}
|
||||
|
||||
// ── classification result ──
|
||||
type ClassificationResult struct {
|
||||
L1 string `json:"L1"`
|
||||
L2 string `json:"L2"`
|
||||
L3 string `json:"L3"`
|
||||
RiskLevel string `json:"risk_level"`
|
||||
HasActionItems bool `json:"has_action_items"`
|
||||
HasDeadline bool `json:"has_deadline"`
|
||||
}
|
||||
|
||||
// ── LLM request/response ──
|
||||
type chatMessage struct {
|
||||
Role string `json:"role"`
|
||||
Content string `json:"content"`
|
||||
}
|
||||
type chatRequest struct {
|
||||
Model string `json:"model"`
|
||||
Temperature float64 `json:"temperature"`
|
||||
ResponseFormat struct {
|
||||
Type string `json:"type"`
|
||||
} `json:"response_format"`
|
||||
Messages []chatMessage `json:"messages"`
|
||||
}
|
||||
type tokenUsage struct {
|
||||
PromptTokens int `json:"prompt_tokens"`
|
||||
CompletionTokens int `json:"completion_tokens"`
|
||||
TotalTokens int `json:"total_tokens"`
|
||||
}
|
||||
type chatResponse struct {
|
||||
Choices []struct {
|
||||
Message struct {
|
||||
Content string `json:"content"`
|
||||
} `json:"message"`
|
||||
} `json:"choices"`
|
||||
Usage *tokenUsage `json:"usage"`
|
||||
}
|
||||
|
||||
type llmCallResult struct {
|
||||
Content string
|
||||
RequestBytes int
|
||||
ResponseBytes int
|
||||
Usage *tokenUsage
|
||||
Duration time.Duration
|
||||
}
|
||||
|
||||
// ===================== LLM =====================
|
||||
|
||||
var llmHTTPClient = newLLMHTTPClient(90 * time.Second)
|
||||
|
||||
func newLLMHTTPClient(totalTimeout time.Duration) *http.Client {
|
||||
return &http.Client{
|
||||
Timeout: totalTimeout,
|
||||
Transport: &http.Transport{
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: 30 * time.Second,
|
||||
KeepAlive: 30 * time.Second,
|
||||
}).DialContext,
|
||||
TLSHandshakeTimeout: 60 * time.Second,
|
||||
ResponseHeaderTimeout: 60 * time.Second,
|
||||
ExpectContinueTimeout: 5 * time.Second,
|
||||
IdleConnTimeout: 90 * time.Second,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
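// callLLM sends one chat-completion request and returns the first choice's
// content together with request/response sizes, token usage and duration.
// On a non-200 status, an undecodable body, or an empty choices list it
// returns the collected metrics alongside the error.
func callLLM(ctx context.Context, model, prompt string) (*llmCallResult, error) {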
|
||||
const systemPrompt = "Ты — классификатор диалогов в логистике. Отвечай только JSON, без пояснений."
|
||||
|
||||
reqBody := chatRequest{
|
||||
Model: model,
|
||||
Temperature: 0.1,
|
||||
Messages: []chatMessage{
|
||||
{Role: "system", Content: systemPrompt},
|
||||
{Role: "user", Content: prompt},
|
||||
},
|
||||
}
|
||||
reqBody.ResponseFormat.Type = "json_object"
|
||||
|
||||
jsonData, err := json.Marshal(reqBody)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
req, err := http.NewRequestWithContext(ctx, http.MethodPost, apiURL(), bytes.NewReader(jsonData))
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
req.Header.Set("Authorization", "Bearer "+os.Getenv("YANDEX_API_KEY"))
|
||||
req.Header.Set("Content-Type", "application/json")
|
||||
|
||||
start := time.Now()
|
||||
resp, err := llmHTTPClient.Do(req)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
defer resp.Body.Close()
|
||||
|
||||
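body, err := io.ReadAll(resp.Body)
if err != nil {
// A partially read body would otherwise surface later as a misleading JSON parse error.
return nil, fmt.Errorf("read response body: %w", err)
}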
|
||||
duration := time.Since(start)
|
||||
|
||||
if resp.StatusCode != http.StatusOK {
|
||||
return &llmCallResult{
|
||||
RequestBytes: len(jsonData),
|
||||
ResponseBytes: len(body),
|
||||
Duration: duration,
|
||||
}, fmt.Errorf("status %d: %s", resp.StatusCode, truncate(string(body), 500))
|
||||
}
|
||||
|
||||
var result chatResponse
|
||||
if err := json.Unmarshal(body, &result); err != nil {
|
||||
return &llmCallResult{
|
||||
RequestBytes: len(jsonData),
|
||||
ResponseBytes: len(body),
|
||||
Duration: duration,
|
||||
}, err
|
||||
}
|
||||
if len(result.Choices) == 0 {
|
||||
return &llmCallResult{
|
||||
RequestBytes: len(jsonData),
|
||||
ResponseBytes: len(body),
|
||||
Duration: duration,
|
||||
}, fmt.Errorf("empty response")
|
||||
}
|
||||
|
||||
return &llmCallResult{
|
||||
Content: result.Choices[0].Message.Content,
|
||||
RequestBytes: len(jsonData),
|
||||
ResponseBytes: len(body),
|
||||
Usage: result.Usage,
|
||||
Duration: duration,
|
||||
}, nil
|
||||
}
|
||||
|
||||
func checkYandexAPI(ctx context.Context, model string) error {
|
||||
slog.Info("yandex api check started", "worker", "tagging", "url", apiURL(), "model", model)
|
||||
|
||||
res, err := callLLM(ctx, model, `Ответь только JSON: {"ok":true}`)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
attrs := []any{
|
||||
"worker", "tagging",
|
||||
"duration_ms", res.Duration.Milliseconds(),
|
||||
"response_chars", utf8.RuneCountInString(res.Content),
|
||||
}
|
||||
if res.Usage != nil {
|
||||
attrs = append(attrs,
|
||||
"prompt_tokens", res.Usage.PromptTokens,
|
||||
"completion_tokens", res.Usage.CompletionTokens,
|
||||
"total_tokens", res.Usage.TotalTokens,
|
||||
)
|
||||
}
|
||||
slog.Info("yandex api check ok", attrs...)
|
||||
return nil
|
||||
}
|
||||
|
||||
func logLLMCall(worker, taskID, model, callType string, attempt int, inputChars int, res *llmCallResult, err error) {
|
||||
attrs := []any{
|
||||
"worker", worker,
|
||||
"task_id", taskID,
|
||||
"model", model,
|
||||
"call_type", callType,
|
||||
"attempt", attempt,
|
||||
"input_chars", inputChars,
|
||||
}
|
||||
if res != nil {
|
||||
attrs = append(attrs,
|
||||
"duration_ms", res.Duration.Milliseconds(),
|
||||
"request_bytes", res.RequestBytes,
|
||||
"response_bytes", res.ResponseBytes,
|
||||
"response_chars", utf8.RuneCountInString(res.Content),
|
||||
)
|
||||
if res.Usage != nil {
|
||||
attrs = append(attrs,
|
||||
"prompt_tokens", res.Usage.PromptTokens,
|
||||
"completion_tokens", res.Usage.CompletionTokens,
|
||||
"total_tokens", res.Usage.TotalTokens,
|
||||
)
|
||||
}
|
||||
}
|
||||
if err != nil {
|
||||
slog.Warn("llm call failed", append(attrs, "error", err)...)
|
||||
return
|
||||
}
|
||||
slog.Info("llm call ok", attrs...)
|
||||
}
|
||||
|
||||
func buildPrompt(text string) string {
|
||||
return fmt.Sprintf(`Ты — классификатор диалогов в логистике.
|
||||
|
||||
Тебе даётся НЕструктурированный текст диалога (разговор, звонок, переписка).
|
||||
Текст может быть неаккуратным, с ошибками, без структуры.
|
||||
|
||||
Твоя задача:
|
||||
1. Понять смысл диалога
|
||||
2. Выделить ключевую цель разговора
|
||||
3. Определить наличие проблемы
|
||||
4. Классифицировать диалог по правилам ниже
|
||||
|
||||
=== ИЕРАРХИЯ КЛАССОВ ===
|
||||
|
||||
L1:
|
||||
- new_order
|
||||
- order_change
|
||||
- tracking
|
||||
- delivery_coordination
|
||||
- problem
|
||||
- claim
|
||||
- information_request
|
||||
- internal_communication
|
||||
- other
|
||||
|
||||
L2:
|
||||
|
||||
Для problem:
|
||||
- delivery_issue
|
||||
- cargo_issue
|
||||
- data_issue
|
||||
- communication_issue
|
||||
|
||||
Для delivery_coordination:
|
||||
- delivery_time
|
||||
- unloading_conditions
|
||||
- warehouse_rules
|
||||
- access
|
||||
- scheduling
|
||||
|
||||
Для tracking:
|
||||
- location_request
|
||||
- status_update
|
||||
- eta
|
||||
|
||||
L3 (опционально):
|
||||
- wrong_contact
|
||||
- wrong_address
|
||||
- missing_info
|
||||
- delay
|
||||
- lost
|
||||
- damage
|
||||
- cannot_reach
|
||||
- no_response
|
||||
|
||||
=== ДОПОЛНИТЕЛЬНЫЕ ПОЛЯ ===
|
||||
|
||||
risk_level:
|
||||
- none
|
||||
- low
|
||||
- medium
|
||||
- high
|
||||
|
||||
has_action_items:
|
||||
- true / false
|
||||
|
||||
has_deadline:
|
||||
- true / false
|
||||
|
||||
=== ПРАВИЛА ===
|
||||
|
||||
1. Определи основную цель разговора:
|
||||
- заказ → new_order
|
||||
- изменение → order_change
|
||||
- узнать статус → tracking
|
||||
- согласование → delivery_coordination
|
||||
- ошибка / проблема → problem
|
||||
|
||||
2. Если есть любая ошибка или сбой → ВСЕГДА L1 = problem
|
||||
|
||||
3. Ошибки в email / телефоне / адресе → L2 = data_issue
|
||||
|
||||
4. Если обсуждают условия (время, склад, разгрузка) без проблемы → delivery_coordination
|
||||
|
||||
5. Если спрашивают "где груз?" → tracking
|
||||
|
||||
6. Определи risk_level:
|
||||
- low → проблема не влияет на доставку
|
||||
- medium → возможна задержка
|
||||
- high → срыв сроков / потеря
|
||||
|
||||
7. has_action_items = true если:
|
||||
- есть договорённости ("перезвоню", "свяжется", "отправлю")
|
||||
|
||||
8. has_deadline = true если:
|
||||
- есть конкретное время ("в 18:00", "через 10 минут", "завтра")
|
||||
|
||||
---
|
||||
|
||||
=== ФОРМАТ ОТВЕТА ===
|
||||
|
||||
Ответ только JSON, без пояснений:
|
||||
|
||||
{
|
||||
"L1": "...",
|
||||
"L2": "...",
|
||||
"L3": "...",
|
||||
"risk_level": "...",
|
||||
"has_action_items": true/false,
|
||||
"has_deadline": true/false
|
||||
}
|
||||
|
||||
---
|
||||
|
||||
=== ДИАЛОГ ===
|
||||
|
||||
Текст:
|
||||
"""
|
||||
%s
|
||||
"""`, text)
|
||||
}
|
||||
|
||||
func classify(ctx context.Context, taskID, model, text string) (ClassificationResult, error) {
|
||||
prompt := buildPrompt(text)
|
||||
inputChars := utf8.RuneCountInString(prompt)
|
||||
|
||||
res, err := callLLM(ctx, model, prompt)
|
||||
logLLMCall("tagging", taskID, model, "classify", 1, inputChars, res, err)
|
||||
if err != nil {
|
||||
return ClassificationResult{}, err
|
||||
}
|
||||
|
||||
var result ClassificationResult
|
||||
if err := json.Unmarshal([]byte(res.Content), &result); err != nil {
|
||||
return ClassificationResult{}, fmt.Errorf("parse: %w, resp: %s", err, truncate(res.Content, 300))
|
||||
}
|
||||
return result, nil
|
||||
}
|
||||
|
||||
// ===================== DB =====================
|
||||
|
||||
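// saveTagging ensures a results row exists for taskID, stores the tagging JSON,
// and reports whether both analysis and tagging are now present for the task.
func saveTagging(ctx context.Context, db *sql.DB, taskID, filename, transcription string, tagging []byte) (complete bool, err error) {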
|
||||
_, err = db.ExecContext(ctx,
|
||||
`INSERT INTO results (task_id) VALUES ($1) ON CONFLICT (task_id) DO NOTHING`, taskID)
|
||||
if err != nil {
|
||||
return false, fmt.Errorf("ensure row: %w", err)
|
||||
}
|
||||
|
||||
err = db.QueryRowContext(ctx, `
|
||||
UPDATE results
|
||||
SET tagging = $2::jsonb,
|
||||
filename = COALESCE(NULLIF($3, ''), filename),
|
||||
transcription = COALESCE(NULLIF($4, ''), transcription),
|
||||
updated_at = now(),
|
||||
status = CASE WHEN analysis IS NOT NULL THEN 'done' ELSE status END
|
||||
WHERE task_id = $1
|
||||
RETURNING (analysis IS NOT NULL AND tagging IS NOT NULL)
|
||||
`, taskID, string(tagging), filename, transcription).Scan(&complete)
|
||||
if err != nil {
|
||||
return false, fmt.Errorf("update tagging: %w", err)
|
||||
}
|
||||
return complete, nil
|
||||
}
|
||||
|
||||
// ===================== MAIN =====================
|
||||
|
||||
func loadDotenv() {
|
||||
path := os.Getenv("DOTENV_PATH")
|
||||
if path == "" {
|
||||
return
|
||||
}
|
||||
if err := godotenv.Overload(path); err != nil {
|
||||
slog.Warn("dotenv load failed", "path", path, "error", err)
|
||||
return
|
||||
}
|
||||
slog.Info("dotenv loaded", "path", path)
|
||||
}
|
||||
|
||||
func main() {
|
||||
loadDotenv()
|
||||
|
||||
amqpURL := getenv("RABBITMQ_URL", "amqp://guest:guest@localhost:5672/")
|
||||
dbURL := os.Getenv("DATABASE_URL")
|
||||
token := os.Getenv("YANDEX_API_KEY")
|
||||
model := os.Getenv("YANDEX_MODEL")
|
||||
inputQueue := getenv("TAGGING_QUEUE", "tagging")
|
||||
finalQueue := getenv("FINAL_QUEUE", "final")
|
||||
|
||||
if token == "" {
|
||||
slog.Error("YANDEX_API_KEY is required")
|
||||
os.Exit(1)
|
||||
}
|
||||
if model == "" {
|
||||
slog.Error("YANDEX_MODEL is required")
|
||||
os.Exit(1)
|
||||
}
|
||||
slog.Info("config loaded", "worker", "tagging",
|
||||
"yandex_token", tokenFingerprint(token), "model", model, "api_url", apiURL())
|
||||
|
||||
db := mustDB(dbURL)
|
||||
defer db.Close()
|
||||
|
||||
checkCtx, checkCancel := context.WithTimeout(context.Background(), 90*time.Second)
|
||||
if err := checkYandexAPI(checkCtx, model); err != nil {
|
||||
checkCancel()
|
||||
slog.Error("yandex api check failed — worker will not start", "worker", "tagging", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
checkCancel()
|
||||
|
||||
ch := mustRabbit(amqpURL)
|
||||
|
||||
if _, err := ch.QueueDeclare(inputQueue, true, false, false, false, nil); err != nil {
|
||||
slog.Error("declare queue failed", "queue", inputQueue, "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
if _, err := ch.QueueDeclare(finalQueue, true, false, false, false, nil); err != nil {
|
||||
slog.Error("declare queue failed", "queue", finalQueue, "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
if err := ch.Qos(1, 0, false); err != nil {
slog.Error("set qos failed", "error", err)
os.Exit(1)
}
|
||||
|
||||
msgs, err := ch.Consume(inputQueue, "", false, false, false, false, nil)
|
||||
if err != nil {
|
||||
slog.Error("consume failed", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
slog.Info("worker started", "worker", "tagging", "queue", inputQueue, "model", model)
|
||||
|
||||
for d := range msgs {
|
||||
taskStart := time.Now()
|
||||
var task WorkerMessage
|
||||
if err := json.Unmarshal(d.Body, &task); err != nil {
|
||||
slog.Warn("bad message", "worker", "tagging", "delivery_tag", d.DeliveryTag,
|
||||
"body_bytes", len(d.Body), "error", err)
|
||||
d.Nack(false, false)
|
||||
continue
|
||||
}
|
||||
|
||||
transcriptionChars := utf8.RuneCountInString(task.Transcription)
|
||||
slog.Info("message received", "worker", "tagging",
|
||||
"task_id", task.TaskID,
|
||||
"filename", task.Filename,
|
||||
"delivery_tag", d.DeliveryTag,
|
||||
"redelivered", d.Redelivered,
|
||||
"body_bytes", len(d.Body),
|
||||
"transcription_chars", transcriptionChars,
|
||||
"llm_calls_expected", 1,
|
||||
)
|
||||
if d.Redelivered {
|
||||
slog.Warn("redelivered message skipped — no llm call",
|
||||
"worker", "tagging", "task_id", task.TaskID, "delivery_tag", d.DeliveryTag)
|
||||
d.Nack(false, false)
|
||||
continue
|
||||
}
|
||||
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 3*time.Minute)
|
||||
|
||||
result, err := classify(ctx, task.TaskID, model, task.Transcription)
|
||||
if err != nil {
|
||||
cancel()
|
||||
slog.Warn("task failed, discarded",
|
||||
"worker", "tagging", "task_id", task.TaskID,
|
||||
"llm_calls", 1, "error", err)
|
||||
d.Nack(false, false)
|
||||
continue
|
||||
}
|
||||
|
||||
tagJSON, _ := json.Marshal(result)
|
||||
complete, err := saveTagging(ctx, db, task.TaskID, task.Filename, task.Transcription, tagJSON)
|
||||
if err != nil {
|
||||
cancel()
|
||||
slog.Warn("db save failed, discarded",
|
||||
"worker", "tagging", "task_id", task.TaskID, "error", err)
|
||||
d.Nack(false, false)
|
||||
continue
|
||||
}
|
||||
|
||||
if complete {
|
||||
notifyFinal(ctx, ch, db, finalQueue, task.TaskID, "tagging")
|
||||
slog.Info("task complete", "worker", "tagging", "task_id", task.TaskID,
|
||||
"was_last", "tagging", "L1", result.L1,
|
||||
"llm_calls", 1, "duration_ms", time.Since(taskStart).Milliseconds())
|
||||
} else {
|
||||
slog.Info("task partial", "worker", "tagging", "task_id", task.TaskID,
|
||||
"waiting_for", "analyse", "L1", result.L1,
|
||||
"llm_calls", 1, "duration_ms", time.Since(taskStart).Milliseconds())
|
||||
}
|
||||
cancel()
|
||||
|
||||
d.Ack(false)
|
||||
}
|
||||
}
|
||||
|
||||
func getenv(k, d string) string {
|
||||
if v := os.Getenv(k); v != "" {
|
||||
return v
|
||||
}
|
||||
return d
|
||||
}
|
||||
|
||||
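// tokenFingerprint returns a shortened form of the token for logs: the first 8
// and last 4 characters, or "***" when the token is 12 bytes or shorter.
func tokenFingerprint(token string) string {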
|
||||
if len(token) <= 12 {
|
||||
return "***"
|
||||
}
|
||||
return token[:8] + "..." + token[len(token)-4:]
|
||||
}
|
||||
|
||||
func truncate(s string, max int) string {
|
||||
if len(s) <= max {
|
||||
return s
|
||||
}
|
||||
return strings.ToValidUTF8(s[:max], "") + "..." // drop a multi-byte rune split by the byte-offset cut
|
||||
}
|
||||
|
||||
func loadFinalPayload(ctx context.Context, db *sql.DB, taskID string) ([]byte, error) {
|
||||
var (
|
||||
filename, transcription, status sql.NullString
|
||||
analysis, tagging, metadata []byte
|
||||
createdAt, updatedAt time.Time
|
||||
)
|
||||
err := db.QueryRowContext(ctx, `
|
||||
SELECT filename, transcription, analysis, tagging, metadata, status, created_at, updated_at
|
||||
FROM results WHERE task_id = $1
|
||||
`, taskID).Scan(&filename, &transcription, &analysis, &tagging, &metadata, &status, &createdAt, &updatedAt)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("load result: %w", err)
|
||||
}
|
||||
|
||||
msg := map[string]any{
|
||||
"task_id": taskID,
|
||||
"status": status.String,
|
||||
"created_at": createdAt,
|
||||
"updated_at": updatedAt,
|
||||
}
|
||||
if filename.Valid {
|
||||
msg["filename"] = filename.String
|
||||
}
|
||||
if transcription.Valid {
|
||||
msg["transcription"] = transcription.String
|
||||
}
|
||||
if len(analysis) > 0 {
|
||||
var v any
|
||||
if err := json.Unmarshal(analysis, &v); err == nil {
|
||||
msg["analysis"] = v
|
||||
}
|
||||
}
|
||||
if len(tagging) > 0 {
|
||||
var v any
|
||||
if err := json.Unmarshal(tagging, &v); err == nil {
|
||||
msg["tagging"] = v
|
||||
}
|
||||
}
|
||||
if len(metadata) > 0 {
|
||||
var meta map[string]any
|
||||
if err := json.Unmarshal(metadata, &meta); err == nil {
|
||||
for k, v := range meta {
|
||||
msg[k] = v
|
||||
}
|
||||
}
|
||||
}
|
||||
return json.Marshal(msg)
|
||||
}
|
||||
|
||||
func notifyFinal(ctx context.Context, ch *amqp.Channel, db *sql.DB, queue, taskID, worker string) {
|
||||
body, err := loadFinalPayload(ctx, db, taskID)
|
||||
if err != nil {
|
||||
slog.Warn("load final payload failed", "worker", worker, "task_id", taskID, "error", err)
|
||||
return
|
||||
}
|
||||
if err := ch.PublishWithContext(ctx, "", queue, false, false,
|
||||
amqp.Publishing{
|
||||
ContentType: "application/json",
|
||||
Body: body,
|
||||
DeliveryMode: amqp.Persistent,
|
||||
}); err != nil {
|
||||
slog.Warn("publish final failed", "worker", worker, "task_id", taskID, "error", err)
|
||||
return
|
||||
}
|
||||
slog.Info("published final", "worker", worker, "task_id", taskID, "queue", queue, "body_bytes", len(body))
|
||||
deleteProcessingFile(extractFilePath(body), taskID, worker)
|
||||
}
|
||||
|
||||
func extractFilePath(body []byte) string {
|
||||
var msg map[string]any
|
||||
if err := json.Unmarshal(body, &msg); err != nil {
|
||||
return ""
|
||||
}
|
||||
fp, _ := msg["file_path"].(string)
|
||||
return fp
|
||||
}
|
||||
|
||||
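// deleteProcessingFile removes the audio file after the final message has been
// published. Empty paths and paths that do not contain "/processing/" are refused.
func deleteProcessingFile(filePath, taskID, worker string) {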
|
||||
if filePath == "" {
|
||||
slog.Warn("processing file not deleted: no file_path", "worker", worker, "task_id", taskID)
|
||||
return
|
||||
}
|
||||
if !strings.Contains(filePath, "/processing/") {
|
||||
slog.Warn("processing file not deleted: path outside processing", "worker", worker, "task_id", taskID, "path", filePath)
|
||||
return
|
||||
}
|
||||
if err := os.Remove(filePath); err != nil {
|
||||
if os.IsNotExist(err) {
|
||||
slog.Info("processing file already removed", "worker", worker, "task_id", taskID, "path", filePath)
|
||||
return
|
||||
}
|
||||
slog.Warn("processing file delete failed", "worker", worker, "task_id", taskID, "path", filePath, "error", err)
|
||||
return
|
||||
}
|
||||
slog.Info("processing file deleted", "worker", worker, "task_id", taskID, "path", filePath)
|
||||
}
|
||||
|
||||
func mustDB(url string) *sql.DB {
|
||||
db, err := sql.Open("pgx", url)
|
||||
if err != nil {
|
||||
slog.Error("db open failed", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
db.SetMaxOpenConns(5)
|
||||
time.Sleep(2 * time.Second) // give Docker DNS time to register postgres
|
||||
for i := 0; i < 60; i++ {
|
||||
if err = db.Ping(); err == nil {
|
||||
return db
|
||||
}
|
||||
if i < 5 || (i+1)%10 == 0 {
|
||||
slog.Info("waiting for db", "attempt", i+1, "error", err)
|
||||
}
|
||||
time.Sleep(3 * time.Second)
|
||||
}
|
||||
slog.Error("db unreachable", "error", err)
|
||||
os.Exit(1)
|
||||
return nil
|
||||
}
|
||||
|
||||
func mustRabbit(url string) *amqp.Channel {
|
||||
var conn *amqp.Connection
|
||||
var err error
|
||||
for i := 0; i < 30; i++ {
|
||||
conn, err = amqp.Dial(url)
|
||||
if err == nil {
|
||||
break
|
||||
}
|
||||
slog.Info("waiting for rabbit", "attempt", i+1, "error", err)
|
||||
time.Sleep(2 * time.Second)
|
||||
}
|
||||
if err != nil {
|
||||
slog.Error("rabbit unreachable", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
ch, err := conn.Channel()
|
||||
if err != nil {
|
||||
slog.Error("rabbit channel failed", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
return ch
|
||||
}
|
||||
18
workers/tagging/go.mod
Normal file
@@ -0,0 +1,18 @@
|
||||
module github.com/yourorg/tagging
|
||||
|
||||
go 1.22
|
||||
|
||||
require (
|
||||
github.com/jackc/pgx/v5 v5.5.5
|
||||
github.com/joho/godotenv v1.5.1
|
||||
github.com/rabbitmq/amqp091-go v1.9.0
|
||||
)
|
||||
|
||||
require (
|
||||
github.com/jackc/pgpassfile v1.0.0 // indirect
|
||||
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a // indirect
|
||||
github.com/jackc/puddle/v2 v2.2.1 // indirect
|
||||
golang.org/x/crypto v0.17.0 // indirect
|
||||
golang.org/x/sync v0.1.0 // indirect
|
||||
golang.org/x/text v0.14.0 // indirect
|
||||
)
|
||||
41
workers/tagging/go.sum
Normal file
@@ -0,0 +1,41 @@
|
||||
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsIM=
|
||||
github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg=
|
||||
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a h1:bbPeKD0xmW/Y25WS6cokEszi5g+S0QxI/d45PkRi7Nk=
|
||||
github.com/jackc/pgservicefile v0.0.0-20221227161230-091c0ba34f0a/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM=
|
||||
github.com/jackc/pgx/v5 v5.5.5 h1:amBjrZVmksIdNjxGW/IiIMzxMKZFelXbUoPNb+8sjQw=
|
||||
github.com/jackc/pgx/v5 v5.5.5/go.mod h1:ez9gk+OAat140fv9ErkZDYFWmXLfV+++K0uAOiwgm1A=
|
||||
github.com/jackc/puddle/v2 v2.2.1 h1:RhxXJtFG022u4ibrCSMSiu5aOq1i77R3OHKNJj77OAk=
|
||||
github.com/jackc/puddle/v2 v2.2.1/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4=
|
||||
github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0=
|
||||
github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4=
|
||||
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
|
||||
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
|
||||
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
|
||||
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/rabbitmq/amqp091-go v1.9.0 h1:qrQtyzB4H8BQgEuJwhmVQqVHB9O4+MNDJCCAcpc3Aoo=
|
||||
github.com/rabbitmq/amqp091-go v1.9.0/go.mod h1:+jPrT9iY2eLjRaMSRHUhc3z14E/l85kv/f+6luSD3pc=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
|
||||
github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=
|
||||
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
|
||||
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
|
||||
github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk=
|
||||
github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4=
|
||||
go.uber.org/goleak v1.2.1 h1:NBol2c7O1ZokfZ0LEU9K6Whx/KnwvepVetCUhtKja4A=
|
||||
go.uber.org/goleak v1.2.1/go.mod h1:qlT2yGI9QafXHhZZLxlSuNsMw3FFLxBr+tBRlmO1xH4=
|
||||
golang.org/x/crypto v0.17.0 h1:r8bRNjWL3GshPW3gkd+RpvzWrZAwPS49OmTGZ/uhM4k=
|
||||
golang.org/x/crypto v0.17.0/go.mod h1:gCAAfMLgwOJRpTjQ2zCCt2OcSfYMTeZVSRtQlPC7Nq4=
|
||||
golang.org/x/sync v0.1.0 h1:wsuoTGHzEhffawBOhz5CYhcrV4IdKZbEyZjBMuTp12o=
|
||||
golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
|
||||
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ=
|
||||
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
11
workers/transcribe/Dockerfile
Normal file
@@ -0,0 +1,11 @@
|
||||
FROM golang:1.22-alpine AS build
|
||||
WORKDIR /src
|
||||
COPY go.mod go.sum* ./
|
||||
RUN go mod download
|
||||
COPY . .
|
||||
RUN CGO_ENABLED=0 go build -o /transcribe ./cmd/transcribe
|
||||
|
||||
FROM alpine:3.19
|
||||
RUN apk add --no-cache ca-certificates
|
||||
COPY --from=build /transcribe /transcribe
|
||||
ENTRYPOINT ["/transcribe"]
|
||||
64
workers/transcribe/cmd/transcribe/main.go
Normal file
@@ -0,0 +1,64 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"context"
|
||||
"log/slog"
|
||||
"os"
|
||||
"os/signal"
|
||||
"syscall"
|
||||
"time"
|
||||
|
||||
amqp "github.com/rabbitmq/amqp091-go"
|
||||
|
||||
"github.com/yourorg/transcribe/internal/config"
|
||||
"github.com/yourorg/transcribe/internal/consumer"
|
||||
)
|
||||
|
||||
func main() {
|
||||
slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: slog.LevelInfo})))
|
||||
|
||||
cfg := config.Load()
|
||||
if cfg.NexaraAPIKey == "" {
|
||||
slog.Error("NEXARA_API_KEY is required")
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
ch := mustRabbit(cfg.RabbitURL)
|
||||
cons, err := consumer.New(cfg, ch)
|
||||
if err != nil {
|
||||
slog.Error("consumer init failed", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
|
||||
defer stop()
|
||||
|
||||
if err := cons.Run(ctx); err != nil && ctx.Err() == nil {
|
||||
slog.Error("consumer stopped", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
slog.Info("transcribe worker stopping")
|
||||
}
|
||||
|
||||
func mustRabbit(url string) *amqp.Channel {
|
||||
var conn *amqp.Connection
|
||||
var err error
|
||||
for i := 0; i < 30; i++ {
|
||||
conn, err = amqp.Dial(url)
|
||||
if err == nil {
|
||||
break
|
||||
}
|
||||
slog.Info("waiting for rabbit", "attempt", i+1, "error", err)
|
||||
time.Sleep(2 * time.Second)
|
||||
}
|
||||
if err != nil {
|
||||
slog.Error("rabbit unreachable", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
ch, err := conn.Channel()
|
||||
if err != nil {
|
||||
slog.Error("rabbit channel failed", "error", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
return ch
|
||||
}
|
||||
23
workers/transcribe/configs/prompts.json
Normal file
@@ -0,0 +1,23 @@
|
||||
[
  {
    "id": 1,
    "id_section": 1,
    "name": "behavioral",
    "prompt": "Ты — строгий классификатор звонков.\n\nЗадача:\nПроанализируй диалог и оцени поведенческие критерии.\n\nКритерии:\n1. Приветствие\n2. Инициативность (выявление цели, попытка развить разговор)\n3. Уточнил, остались ли вопросы\n4. Прощание\n\nИнструкция:\nДля каждого критерия:\n- определи наличие\n- найди ДОСЛОВНУЮ цитату\n- оцени confidence (0.0–1.0)\n\nФормат ответа (строго JSON):\n\n{\n \"greeting\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"initiative\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"questions_check\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"closing\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n }\n}\n\nЖЁСТКИЕ ПРАВИЛА:\n- каждый критерий оценивается независимо\n- не додумывать\n- если нет → value=false, evidence=null, confidence=0.0\n- evidence должен подтверждать вывод",
    "dt_create": "2026-06-09T09:00:00.000000"
  },
  {
    "id": 2,
    "id_section": 1,
    "name": "client_data",
    "prompt": "Ты — строгий классификатор звонков.\n\nЗадача:\nОпредели, какие данные о клиенте были получены.\n\nКритерии:\n1. Первый ли раз обращается\n2. Указан ли город клиента\n3. Тип клиента (физ/юр)\n4. Получены ли контакты\n5. Источник (откуда узнали)\n\nФормат ответа (строго JSON):\n\n{\n \"first_time\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"client_city\": {\n \"value\": true/false,\n \"city\": \"строка или null\",\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"client_type\": {\n \"value\": true/false,\n \"type\": \"physical|legal|null\",\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"contacts\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"source\": {\n \"value\": true/false,\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n }\n}\n\nЖЁСТКИЕ ПРАВИЛА:\n- city/type только если явно сказано\n- не додумывать\n- если нет → value=false, evidence=null, confidence=0.0",
    "dt_create": "2026-06-09T09:00:00.000000"
  },
  {
    "id": 3,
    "id_section": 1,
    "name": "cargo_data",
    "prompt": "Ты — строгий классификатор логистических данных.\n\nКритерии:\n1. Характер груза\n2. Параметры груза (вес, объем, размеры)\n3. Стоимость груза\n\nФормат ответа (строго JSON):\n\n{\n \"cargo_type\": {\n \"value\": true/false,\n \"type\": \"строка или null\",\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"cargo_params\": {\n \"value\": true/false,\n \"params\": \"строка или null\",\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n },\n \"cargo_value\": {\n \"value\": true/false,\n \"amount\": \"строка или null\",\n \"evidence\": \"цитата или null\",\n \"confidence\": number\n }\n}\n\nЖЁСТКИЕ ПРАВИЛА:\n- только явные данные\n- числа/параметры должны быть в evidence\n- если нет → value=false, evidence=null, confidence=0.0",
    "dt_create": "2026-06-09T09:00:00.000000"
  }
]
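Each prompt in `prompts.json` asks the model for a JSON object keyed by criterion, where every entry carries `value`, `evidence`, and `confidence`, and a negative finding must come back as `value=false, evidence=null, confidence=0.0`. A downstream consumer could check a reply against those rules roughly as follows. This is a minimal sketch, not code from the commit: `Criterion`, `checkCriterion`, and `checkReply` are hypothetical names, and the sample replies are made up.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Criterion mirrors one entry of the JSON reply the prompts ask for.
// Extra keys such as "city" or "type" are ignored by this sketch.
type Criterion struct {
	Value      bool    `json:"value"`
	Evidence   *string `json:"evidence"` // nil when the model returns null
	Confidence float64 `json:"confidence"`
}

// checkCriterion applies two rules stated in the prompts: confidence lies
// in [0, 1], and a negative finding has no evidence and zero confidence.
func checkCriterion(c Criterion) error {
	if c.Confidence < 0 || c.Confidence > 1 {
		return errors.New("confidence outside [0, 1]")
	}
	if !c.Value && (c.Evidence != nil || c.Confidence != 0) {
		return errors.New("value=false requires evidence=null and confidence=0.0")
	}
	return nil
}

// checkReply parses a reply keyed by criterion name and validates each entry.
func checkReply(raw []byte) error {
	var reply map[string]Criterion
	if err := json.Unmarshal(raw, &reply); err != nil {
		return fmt.Errorf("parse reply: %w", err)
	}
	for name, c := range reply {
		if err := checkCriterion(c); err != nil {
			return fmt.Errorf("%s: %w", name, err)
		}
	}
	return nil
}

func main() {
	ok := []byte(`{"greeting":{"value":false,"evidence":null,"confidence":0.0}}`)
	bad := []byte(`{"closing":{"value":false,"evidence":"до свидания","confidence":0.9}}`)
	fmt.Println(checkReply(ok))  // no rule violated
	fmt.Println(checkReply(bad)) // negative finding with evidence is rejected
}
```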
5
workers/transcribe/go.mod
Normal file
@@ -0,0 +1,5 @@
module github.com/yourorg/transcribe

go 1.22

require github.com/rabbitmq/amqp091-go v1.9.0
18
workers/transcribe/go.sum
Normal file
@@ -0,0 +1,18 @@
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/rabbitmq/amqp091-go v1.9.0 h1:qrQtyzB4H8BQgEuJwhmVQqVHB9O4+MNDJCCAcpc3Aoo=
github.com/rabbitmq/amqp091-go v1.9.0/go.mod h1:+jPrT9iY2eLjRaMSRHUhc3z14E/l85kv/f+6luSD3pc=
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
go.uber.org/goleak v1.2.1 h1:NBol2c7O1ZokfZ0LEU9K6Whx/KnwvepVetCUhtKja4A=
go.uber.org/goleak v1.2.1/go.mod h1:qlT2yGI9QafXHhZZLxlSuNsMw3FFLxBr+tBRlmO1xH4=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
78
workers/transcribe/internal/config/config.go
Normal file
@@ -0,0 +1,78 @@
package config

import (
	"os"
	"strconv"
	"time"
)

// Config holds the transcribe worker settings. Every field is read from an
// environment variable in Load.
type Config struct {
	// RabbitMQ connection and topology.
	RabbitURL       string
	InputQueue      string
	OutputExchange  string
	AnalyseQueue    string
	TaggingQueue    string
	InputExchange   string
	InputRoutingKey string
	Prefetch        int

	// Nexara speech-to-text API.
	NexaraBaseURL string
	NexaraAPIKey  string
	NexaraModel   string
	NexaraTimeout time.Duration

	// Prompt source: "http" queries PromptsBaseURL, any other value reads PromptsFile.
	PromptsSource  string
	PromptsFile    string
	PromptsBaseURL string
	PromptsAPIKey  string
	PromptsSection int
}

// Load builds a Config from the environment, applying defaults for unset variables.
func Load() Config {
	return Config{
		RabbitURL:       getEnv("RABBITMQ_URL", "amqp://guest:guest@localhost:5672/"),
		InputQueue:      getEnv("INPUT_QUEUE", "transcribe.tasks"),
		OutputExchange:  getEnv("OUTPUT_EXCHANGE", "transcription_done"),
		AnalyseQueue:    getEnv("ANALYSE_QUEUE", "analyse"),
		TaggingQueue:    getEnv("TAGGING_QUEUE", "tagging"),
		InputExchange:   getEnv("RABBITMQ_EXCHANGE", "audio_pipeline"),
		InputRoutingKey: getEnv("RABBITMQ_ROUTING_KEY", "audio.new"),
		Prefetch:        getInt("PREFETCH", 1),

		NexaraBaseURL: getEnv("NEXARA_BASE_URL", "https://api.nexara.ru"),
		NexaraAPIKey:  os.Getenv("NEXARA_API_KEY"),
		NexaraModel:   getEnv("NEXARA_MODEL", "whisper-1"),
		NexaraTimeout: getDuration("NEXARA_TIMEOUT", 10*time.Minute),

		PromptsSource:  getEnv("PROMPTS_SOURCE", "static"),
		PromptsFile:    getEnv("PROMPTS_FILE", "/app/configs/prompts.json"),
		PromptsBaseURL: os.Getenv("PROMPTS_BASE_URL"),
		PromptsAPIKey:  os.Getenv("PROMPTS_API_KEY"),
		PromptsSection: getInt("PROMPTS_SECTION", 1),
	}
}

// getEnv returns the value of key, or def when the variable is unset or empty.
func getEnv(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// getInt parses key as an integer. An unset, empty, or unparsable value
// silently falls back to def.
func getInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if i, err := strconv.Atoi(v); err == nil {
			return i
		}
	}
	return def
}

// getDuration parses key with time.ParseDuration (for example "90s" or "10m").
// An unset, empty, or unparsable value silently falls back to def.
func getDuration(key string, def time.Duration) time.Duration {
	if v := os.Getenv(key); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
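`getInt` and `getDuration` fall back to the default when a value fails to parse, so a typo such as `PREFETCH=ten` goes unnoticed. A stricter variant could report the bad value instead. The sketch below is one possible shape, not part of the commit: `lookupFunc`, `fromMap`, and `intOrDefault` are hypothetical names, and the lookup is injected so the example runs without touching the real environment (in the worker it would be `os.LookupEnv`).

```go
package main

import (
	"fmt"
	"strconv"
)

// lookupFunc has the same shape as os.LookupEnv.
type lookupFunc func(key string) (string, bool)

// fromMap returns a lookupFunc backed by a map (hypothetical helper for the demo).
func fromMap(m map[string]string) lookupFunc {
	return func(key string) (string, bool) {
		v, ok := m[key]
		return v, ok
	}
}

// intOrDefault returns def when the variable is unset or empty, the parsed
// value when it is a valid integer, and an error otherwise.
func intOrDefault(lookup lookupFunc, key string, def int) (int, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return def, nil
	}
	i, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s=%q is not an integer: %w", key, v, err)
	}
	return i, nil
}

func main() {
	env := fromMap(map[string]string{"PREFETCH": "ten"})

	_, err := intOrDefault(env, "PREFETCH", 1)
	fmt.Println(err != nil) // prints "true": the typo is reported

	n, _ := intOrDefault(env, "PROMPTS_SECTION", 1)
	fmt.Println(n) // prints "1": unset variables still get the default
}
```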
172
workers/transcribe/internal/consumer/consumer.go
Normal file
@@ -0,0 +1,172 @@
package consumer

import (
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"time"

	amqp "github.com/rabbitmq/amqp091-go"

	"github.com/yourorg/transcribe/internal/config"
	"github.com/yourorg/transcribe/internal/models"
	"github.com/yourorg/transcribe/internal/nexara"
	"github.com/yourorg/transcribe/internal/prompts"
)

// Consumer reads audio tasks from the input queue, transcribes them through
// Nexara, and publishes the result to the fanout output exchange.
type Consumer struct {
	cfg     config.Config
	ch      *amqp.Channel
	nexara  *nexara.Client
	prompts *prompts.Loader
}

// New declares the RabbitMQ topology on ch and returns a ready Consumer.
func New(cfg config.Config, ch *amqp.Channel) (*Consumer, error) {
	if err := setupTopology(ch, cfg); err != nil {
		return nil, err
	}
	return &Consumer{
		cfg:     cfg,
		ch:      ch,
		nexara:  nexara.New(cfg.NexaraBaseURL, cfg.NexaraAPIKey, cfg.NexaraModel, cfg.NexaraTimeout),
		prompts: prompts.New(cfg.PromptsSource, cfg.PromptsFile, cfg.PromptsBaseURL, cfg.PromptsAPIKey, cfg.PromptsSection),
	}, nil
}

// setupTopology declares the exchanges, queues, and bindings the worker needs:
// a dead-letter exchange with a "<input queue>.failed" queue, the direct input
// exchange, and the fanout output exchange feeding the analyse and tagging queues.
func setupTopology(ch *amqp.Channel, cfg config.Config) error {
	if err := ch.ExchangeDeclare("dlx", "direct", true, false, false, false, nil); err != nil {
		return fmt.Errorf("declare dlx: %w", err)
	}
	if err := ch.ExchangeDeclare(cfg.InputExchange, "direct", true, false, false, false, nil); err != nil {
		return fmt.Errorf("declare input exchange: %w", err)
	}
	if err := ch.ExchangeDeclare(cfg.OutputExchange, "fanout", true, false, false, false, nil); err != nil {
		return fmt.Errorf("declare output exchange: %w", err)
	}

	// Messages rejected without requeue are routed to "<input queue>.failed".
	dlqArgs := amqp.Table{
		"x-dead-letter-exchange":    "dlx",
		"x-dead-letter-routing-key": cfg.InputQueue + ".failed",
	}
	if _, err := ch.QueueDeclare(cfg.InputQueue, true, false, false, false, dlqArgs); err != nil {
		return fmt.Errorf("declare input queue: %w", err)
	}
	if _, err := ch.QueueDeclare(cfg.InputQueue+".failed", true, false, false, false, nil); err != nil {
		return fmt.Errorf("declare dlq: %w", err)
	}
	if err := ch.QueueBind(cfg.InputQueue+".failed", cfg.InputQueue+".failed", "dlx", false, nil); err != nil {
		return fmt.Errorf("bind dlq: %w", err)
	}
	if err := ch.QueueBind(cfg.InputQueue, cfg.InputRoutingKey, cfg.InputExchange, false, nil); err != nil {
		return fmt.Errorf("bind input queue: %w", err)
	}

	// Both downstream queues receive every message published to the fanout exchange.
	for _, q := range []string{cfg.AnalyseQueue, cfg.TaggingQueue} {
		if _, err := ch.QueueDeclare(q, true, false, false, false, nil); err != nil {
			return fmt.Errorf("declare queue %s: %w", q, err)
		}
		if err := ch.QueueBind(q, "", cfg.OutputExchange, false, nil); err != nil {
			return fmt.Errorf("bind queue %s: %w", q, err)
		}
	}

	return ch.Qos(cfg.Prefetch, 0, false)
}

// Run enables publisher confirms and processes deliveries one at a time until
// ctx is cancelled or the delivery channel closes.
func (c *Consumer) Run(ctx context.Context) error {
	if err := c.ch.Confirm(false); err != nil {
		return fmt.Errorf("confirm mode: %w", err)
	}

	msgs, err := c.ch.Consume(c.cfg.InputQueue, "", false, false, false, false, nil)
	if err != nil {
		return err
	}

	slog.Info("transcribe worker started", "queue", c.cfg.InputQueue, "output_exchange", c.cfg.OutputExchange)

	for {
		select {
		case <-ctx.Done():
			return nil
		case d, ok := <-msgs:
			if !ok {
				return fmt.Errorf("delivery channel closed")
			}
			c.handle(ctx, d)
		}
	}
}

// handle processes one delivery. Failures before publishing are nacked without
// requeue (dead-lettered); publish failures are nacked with requeue.
func (c *Consumer) handle(ctx context.Context, d amqp.Delivery) {
	var task models.AudioTask
	if err := json.Unmarshal(d.Body, &task); err != nil {
		slog.Warn("bad message", "delivery_tag", d.DeliveryTag, "error", err)
		_ = d.Nack(false, false)
		return
	}

	slog.Info("message received", "task_id", task.TaskID, "file_path", task.FilePath, "filename", task.Filename)

	// One deadline covers transcription, prompt loading, and publishing.
	txCtx, cancel := context.WithTimeout(ctx, c.cfg.NexaraTimeout+30*time.Second)
	defer cancel()

	text, lang, segments, err := c.nexara.TranscribeFile(txCtx, task.FilePath)
	if err != nil {
		slog.Warn("transcription failed", "task_id", task.TaskID, "error", err)
		_ = d.Nack(false, false)
		return
	}

	promptList, err := c.prompts.Load(txCtx)
	if err != nil {
		slog.Warn("prompts load failed", "task_id", task.TaskID, "error", err)
		_ = d.Nack(false, false)
		return
	}

	result := models.TranscriptionResult{
		TaskID:        task.TaskID,
		Filename:      task.Filename,
		FilePath:      task.FilePath,
		Transcription: text,
		Language:      lang,
		Segments:      segments,
		Prompts:       promptList,
		TranscribedAt: time.Now().Unix(),
	}

	body, err := json.Marshal(result)
	if err != nil {
		slog.Warn("marshal failed", "task_id", task.TaskID, "error", err)
		_ = d.Nack(false, false)
		return
	}

	// Use a per-message deferred confirmation. Calling NotifyPublish for every
	// message would register a new listener each time; the stale listeners are
	// never drained and eventually block confirm delivery on the channel.
	// The channel is in confirm mode (see Run), so dc is non-nil on success.
	dc, err := c.ch.PublishWithDeferredConfirmWithContext(txCtx, c.cfg.OutputExchange, "", false, false, amqp.Publishing{
		ContentType:  "application/json",
		Body:         body,
		DeliveryMode: amqp.Persistent,
	})
	if err != nil {
		slog.Warn("publish failed, requeue", "task_id", task.TaskID, "error", err)
		_ = d.Nack(false, true)
		return
	}
	acked, err := dc.WaitContext(txCtx)
	if err != nil {
		slog.Warn("publish timeout, requeue", "task_id", task.TaskID, "error", err)
		_ = d.Nack(false, true)
		return
	}
	if !acked {
		slog.Warn("publish not confirmed, requeue", "task_id", task.TaskID)
		_ = d.Nack(false, true)
		return
	}

	slog.Info("transcribed", "task_id", task.TaskID, "language", lang, "chars", len(text), "segments", len(segments), "prompts", len(promptList))
	_ = d.Ack(false)
}
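The acknowledgement policy of `handle` reduces to a small decision table: failures before publishing are rejected without requeue, which routes them to the dead-letter queue through the `dlx` exchange, while publish failures are requeued for another attempt. The sketch below restates that table as a pure function. `Stage`, `Action`, and `actionFor` are labels invented for this illustration and do not exist in the worker.

```go
package main

import "fmt"

// Stage names the step of handle at which processing ended.
type Stage int

const (
	StageDecode     Stage = iota // json.Unmarshal of the task failed
	StageTranscribe              // the Nexara call failed
	StagePrompts                 // prompt loading failed
	StageMarshal                 // encoding the result failed
	StagePublish                 // publish failed, was not confirmed, or timed out
	StageDone                    // result published and confirmed
)

// Action is what the worker tells RabbitMQ about the delivery.
type Action string

const (
	Ack        Action = "ack"
	DeadLetter Action = "nack, requeue=false"
	Requeue    Action = "nack, requeue=true"
)

// actionFor maps the stage where processing stopped to the delivery outcome.
func actionFor(s Stage) Action {
	switch s {
	case StageDone:
		return Ack
	case StagePublish:
		return Requeue
	default:
		return DeadLetter
	}
}

func main() {
	for s := StageDecode; s <= StageDone; s++ {
		fmt.Printf("stage %d -> %s\n", s, actionFor(s))
	}
}
```

One consequence of this policy is that a transient Nexara outage dead-letters tasks rather than retrying them, so messages in the `.failed` queue may need to be replayed by hand.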
34
workers/transcribe/internal/models/models.go
Normal file
@@ -0,0 +1,34 @@
package models

// AudioTask is the message consumed from the input queue.
type AudioTask struct {
	TaskID    string `json:"task_id"`
	FilePath  string `json:"file_path"`
	Filename  string `json:"filename"`
	Size      int64  `json:"size"`
	CreatedAt int64  `json:"created_at"`
}

// Segment is one timed fragment of the transcript.
type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

// Prompt is one analysis prompt, loaded from the static file or the prompts API.
type Prompt struct {
	ID        int    `json:"id"`
	IDSection int    `json:"id_section"`
	Name      string `json:"name"`
	Prompt    string `json:"prompt"`
	DtCreate  string `json:"dt_create"`
}

// TranscriptionResult is the message published to the output exchange.
// TranscribedAt is a Unix timestamp in seconds.
type TranscriptionResult struct {
	TaskID        string    `json:"task_id"`
	Filename      string    `json:"filename"`
	FilePath      string    `json:"file_path"`
	Transcription string    `json:"transcription"`
	Language      string    `json:"language"`
	Segments      []Segment `json:"segments,omitempty"`
	Prompts       []Prompt  `json:"prompts"`
	TranscribedAt int64     `json:"transcribed_at"`
}
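The struct tags determine what downstream workers see on the wire. The sketch below copies the three result types from `models.go` into a standalone program to show two details of `encoding/json`: `segments` disappears entirely when the slice is empty because of `omitempty`, while `prompts` is emitted as `null` for a nil slice and as `[]` for an empty non-nil one. The `encode` helper and the sample task ID are made up for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below are copied from models.go so the example is self-contained.

type Segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

type Prompt struct {
	ID        int    `json:"id"`
	IDSection int    `json:"id_section"`
	Name      string `json:"name"`
	Prompt    string `json:"prompt"`
	DtCreate  string `json:"dt_create"`
}

type TranscriptionResult struct {
	TaskID        string    `json:"task_id"`
	Filename      string    `json:"filename"`
	FilePath      string    `json:"file_path"`
	Transcription string    `json:"transcription"`
	Language      string    `json:"language"`
	Segments      []Segment `json:"segments,omitempty"`
	Prompts       []Prompt  `json:"prompts"`
	TranscribedAt int64     `json:"transcribed_at"`
}

// encode is a hypothetical helper that returns the JSON form of r.
func encode(r TranscriptionResult) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Nil Prompts: the key is present with the value null.
	fmt.Println(encode(TranscriptionResult{TaskID: "t-1"}))
	// Empty, non-nil Prompts: the key is present with the value [].
	fmt.Println(encode(TranscriptionResult{TaskID: "t-1", Prompts: []Prompt{}}))
}
```

Because `filterSection` in the prompts loader always allocates its result when a section filter is active, a section with no matching prompts reaches consumers as `"prompts":[]` rather than `null`.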
117
workers/transcribe/internal/nexara/nexara.go
Normal file
@@ -0,0 +1,117 @@
package nexara

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/yourorg/transcribe/internal/models"
)

// Client calls the Nexara speech-to-text transcription endpoint.
type Client struct {
	apiURL     string
	apiKey     string
	model      string
	httpClient *http.Client
}

// New returns a Client for the given base URL. timeout bounds each HTTP request.
func New(baseURL, apiKey, model string, timeout time.Duration) *Client {
	baseURL = strings.TrimRight(baseURL, "/")
	return &Client{
		apiURL: baseURL + "/api/v1/audio/transcriptions",
		apiKey: apiKey,
		model:  model,
		httpClient: &http.Client{
			Timeout: timeout,
		},
	}
}

// TranscribeFile uploads the file at path as multipart/form-data and returns
// the transcript text, the detected language, and any timed segments present
// in the response. The whole file is buffered in memory before it is sent.
func (c *Client) TranscribeFile(ctx context.Context, path string) (text, language string, segments []models.Segment, err error) {
	f, err := os.Open(path)
	if err != nil {
		return "", "", nil, fmt.Errorf("open file: %w", err)
	}
	defer f.Close()

	// Build the multipart body: the audio file plus the form fields.
	body := &bytes.Buffer{}
	writer := multipart.NewWriter(body)
	part, err := writer.CreateFormFile("file", filepath.Base(path))
	if err != nil {
		return "", "", nil, err
	}
	if _, err := io.Copy(part, f); err != nil {
		return "", "", nil, err
	}
	if c.model != "" {
		if err := writer.WriteField("model", c.model); err != nil {
			return "", "", nil, err
		}
	}
	if err := writer.WriteField("response_format", "json"); err != nil {
		return "", "", nil, err
	}
	if err := writer.Close(); err != nil {
		return "", "", nil, err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, c.apiURL, body)
	if err != nil {
		return "", "", nil, err
	}
	req.Header.Set("Content-Type", writer.FormDataContentType())
	req.Header.Set("Authorization", "Bearer "+c.apiKey)

	resp, err := c.httpClient.Do(req)
	if err != nil {
		return "", "", nil, fmt.Errorf("request: %w", err)
	}
	defer resp.Body.Close()

	respBody, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", "", nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return "", "", nil, fmt.Errorf("status %d: %s", resp.StatusCode, string(respBody))
	}

	// Decode loosely: fields that are missing or have an unexpected type are
	// skipped and left at their zero value.
	var raw map[string]any
	if err := json.Unmarshal(respBody, &raw); err != nil {
		return "", "", nil, fmt.Errorf("parse: %w", err)
	}
	if t, ok := raw["text"].(string); ok {
		text = t
	}
	if lang, ok := raw["language"].(string); ok {
		language = lang
	}
	if segs, ok := raw["segments"].([]any); ok {
		for _, s := range segs {
			m, ok := s.(map[string]any)
			if !ok {
				continue
			}
			var seg models.Segment
			if v, ok := m["start"].(float64); ok {
				seg.Start = v
			}
			if v, ok := m["end"].(float64); ok {
				seg.End = v
			}
			if v, ok := m["text"].(string); ok {
				seg.Text = v
			}
			segments = append(segments, seg)
		}
	}
	return text, language, segments, nil
}
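`TranscribeFile` decodes the response into `map[string]any` and picks fields out with type assertions. Decoding into a typed struct is a shorter alternative, sketched below. The `response` and `segment` types and `parseResponse` are hypothetical, the field names are only those the client code reads (`text`, `language`, `segments` with `start`, `end`, `text`), and the sample payload is invented rather than taken from the Nexara API. Note the behavioural difference: the typed version returns an error when a field has the wrong type, whereas the map-based code silently skips it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// segment and response cover only the fields the worker reads.
type segment struct {
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Text  string  `json:"text"`
}

type response struct {
	Text     string    `json:"text"`
	Language string    `json:"language"`
	Segments []segment `json:"segments"`
}

// parseResponse decodes a transcription response body. Unknown fields are
// ignored; a field with the wrong JSON type produces an error.
func parseResponse(data []byte) (response, error) {
	var r response
	if err := json.Unmarshal(data, &r); err != nil {
		return response{}, fmt.Errorf("parse: %w", err)
	}
	return r, nil
}

func main() {
	// Invented payload for illustration only.
	payload := []byte(`{"text":"hello","language":"en","segments":[{"start":0,"end":1.5,"text":"hello"}]}`)
	r, err := parseResponse(payload)
	fmt.Println(r.Text, r.Language, len(r.Segments), err) // prints "hello en 1 <nil>"
}
```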
100
workers/transcribe/internal/prompts/prompts.go
Normal file
@@ -0,0 +1,100 @@
package prompts

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"

	"github.com/yourorg/transcribe/internal/models"
)

// Loader reads analysis prompts from a static JSON file or from an HTTP API.
type Loader struct {
	source    string
	filePath  string
	baseURL   string
	apiKey    string
	sectionID int
	client    *http.Client
}

// New returns a Loader. source selects the backend: "http" queries baseURL,
// any other value reads filePath.
func New(source, filePath, baseURL, apiKey string, sectionID int) *Loader {
	return &Loader{
		source:    source,
		filePath:  filePath,
		baseURL:   strings.TrimRight(baseURL, "/"),
		apiKey:    apiKey,
		sectionID: sectionID,
		client:    &http.Client{Timeout: 30 * time.Second},
	}
}

// Load returns the prompts for the configured section. It reads the source on
// every call; nothing is cached.
func (l *Loader) Load(ctx context.Context) ([]models.Prompt, error) {
	switch strings.ToLower(l.source) {
	case "http":
		return l.loadHTTP(ctx)
	default:
		return l.loadStatic()
	}
}

// loadStatic reads and parses the prompts file, then filters by section.
func (l *Loader) loadStatic() ([]models.Prompt, error) {
	data, err := os.ReadFile(l.filePath)
	if err != nil {
		return nil, fmt.Errorf("read prompts file: %w", err)
	}
	var prompts []models.Prompt
	if err := json.Unmarshal(data, &prompts); err != nil {
		return nil, fmt.Errorf("parse prompts file: %w", err)
	}
	return filterSection(prompts, l.sectionID), nil
}

// loadHTTP fetches prompts from "<baseURL>/metrics/?id_section=<id>" and
// filters the response by section again on the client side.
func (l *Loader) loadHTTP(ctx context.Context) ([]models.Prompt, error) {
	if l.baseURL == "" {
		return nil, fmt.Errorf("PROMPTS_BASE_URL is required for http source")
	}
	url := fmt.Sprintf("%s/metrics/?id_section=%d", l.baseURL, l.sectionID)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if l.apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+l.apiKey)
	}
	resp, err := l.client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("prompts api status %d: %s", resp.StatusCode, string(body))
	}
	var prompts []models.Prompt
	if err := json.Unmarshal(body, &prompts); err != nil {
		return nil, fmt.Errorf("parse prompts response: %w", err)
	}
	return filterSection(prompts, l.sectionID), nil
}

// filterSection keeps the prompts whose IDSection equals sectionID.
// A sectionID of zero or less disables filtering.
func filterSection(prompts []models.Prompt, sectionID int) []models.Prompt {
	if sectionID <= 0 {
		return prompts
	}
	out := make([]models.Prompt, 0, len(prompts))
	for _, p := range prompts {
		if p.IDSection == sectionID {
			out = append(out, p)
		}
	}
	return out
}
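The section filter has two behaviours worth seeing side by side: a positive `sectionID` keeps only matching prompts, and a value of zero or less returns the input unchanged, so `PROMPTS_SECTION=0` effectively disables filtering. The sketch below copies `filterSection` into a standalone program with a trimmed `Prompt` type; the sample prompts (including the `"other"` entry in section 2) are invented for the demonstration.

```go
package main

import "fmt"

// Prompt is a trimmed copy of models.Prompt with only the fields used here.
type Prompt struct {
	ID        int
	IDSection int
	Name      string
}

// filterSection is copied from prompts.go: it keeps prompts whose IDSection
// equals sectionID, and returns the input unchanged when sectionID <= 0.
func filterSection(prompts []Prompt, sectionID int) []Prompt {
	if sectionID <= 0 {
		return prompts
	}
	out := make([]Prompt, 0, len(prompts))
	for _, p := range prompts {
		if p.IDSection == sectionID {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	all := []Prompt{
		{ID: 1, IDSection: 1, Name: "behavioral"},
		{ID: 2, IDSection: 1, Name: "client_data"},
		{ID: 3, IDSection: 2, Name: "other"}, // invented entry in another section
	}
	fmt.Println(len(filterSection(all, 1))) // prints "2"
	fmt.Println(len(filterSection(all, 0))) // prints "3": filtering disabled
	fmt.Println(len(filterSection(all, 9))) // prints "0": no match, empty slice
}
```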