Deployment
Docker image build, RunPod serverless configuration, and production deployment.
The worker is deployed as a RunPod serverless endpoint running on GPU instances.
Docker Image
The Dockerfile uses RunPod's PyTorch base image with CUDA support:
FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY pyproject.toml .python-version ./
COPY uv.loc[k] ./
RUN uv sync --frozen --no-dev --no-install-project || uv sync --no-dev --no-install-project
# Bake LaBSE into image (~1.8 GB) to avoid cold-start download
RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/LaBSE')"
COPY src/ src/
CMD ["uv", "run", "python", "-m", "src.handler"]
Key Build Decisions
- LaBSE baked in: the model is downloaded during build (~1.8 GB). This eliminates cold-start latency from model downloads on fresh containers.
- uv for dependency management: faster than pip, with lockfile support. Falls back to non-frozen install if no lockfile exists.
- No dev dependencies: --no-dev keeps the image lean (no pytest, ruff).
- Source copied last: Docker layer caching means dependency installation only reruns when pyproject.toml or uv.lock change.
Building
docker build -t topic-worker .
The image is ~8-10 GB due to CUDA runtime + PyTorch + LaBSE model.
Pushing to Registry
docker tag topic-worker <registry>/topic-worker:latest
docker push <registry>/topic-worker:latest
RunPod Configuration
Serverless Endpoint Setup
- Create a serverless endpoint on RunPod
- Point it to the Docker image in your registry
- Configure GPU type (any CUDA-capable GPU works; 16GB+ VRAM recommended)
- Set the endpoint URL in the API's .env:
TOPIC_MODEL_WORKER_URL=https://api.runpod.ai/v2/<endpoint-id>/runsync
RUNPOD_API_KEY=<your-key>
Request Flow
API    → POST /v2/<endpoint-id>/runsync
         Body: { input: { items: [...], params: {...} } }
         Headers: { Authorization: Bearer <RUNPOD_API_KEY> }
RunPod → Starts container (or uses warm instance)
       → Calls handler({ input: { items: [...], params: {...} } })
Worker → Returns result dict
RunPod → Wraps in { id, status: "COMPLETED", output: <result> }
       → Returns to API
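The flow above can be traced locally without a network call. The following is a minimal sketch, not the project's actual code: build_runsync_request, wrap_like_runpod, unwrap_output, the body of the stand-in handler, and the example URL and key are hypothetical names and values chosen for illustration. Only the request shape, the Authorization header, and the response envelope are taken from the flow described above.

```python
import json
import urllib.request


def build_runsync_request(url: str, api_key: str, items: list, params: dict) -> urllib.request.Request:
    """Build the POST that the API sends to RunPod."""
    body = json.dumps({"input": {"items": items, "params": params}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def handler(job: dict) -> dict:
    """Hypothetical stand-in for the worker's handler in src/handler.py."""
    job_input = job["input"]
    # Placeholder result: the real worker returns topic-modeling output.
    return {"item_count": len(job_input["items"]), "params": job_input["params"]}


def wrap_like_runpod(job_id: str, result: dict) -> dict:
    """Mimic the envelope RunPod puts around a successful handler result."""
    return {"id": job_id, "status": "COMPLETED", "output": result}


def unwrap_output(envelope: dict) -> dict:
    """Return the worker result, or raise if the job did not complete."""
    if envelope.get("status") != "COMPLETED":
        raise RuntimeError(
            f"RunPod job {envelope.get('id')} ended with status {envelope.get('status')}"
        )
    return envelope["output"]


# Local round trip: no request is actually sent.
request = build_runsync_request(
    "https://api.runpod.ai/v2/example-endpoint/runsync",  # placeholder endpoint id
    "test-key",                                           # placeholder API key
    ["doc one", "doc two"],
    {},
)
job = json.loads(request.data)
envelope = wrap_like_runpod("local-test", handler(job))
result = unwrap_output(envelope)
```

In a deployed worker the handler would typically be registered with the RunPod SDK through runpod.serverless.start({"handler": handler}) rather than called directly, and the envelope would come back from RunPod instead of wrap_like_runpod.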
Scaling
| Setting | Recommended |
|---|---|
| Min workers | 0 (scale to zero when idle) |
| Max workers | 1-2 (topic modeling is a batch operation, not high-throughput) |
| Idle timeout | 30s (keep warm for short periods between pipeline stages) |
| Execution timeout | 300s (matches BULLMQ_TOPIC_MODEL_HTTP_TIMEOUT_MS) |
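Because the execution timeout is meant to match BULLMQ_TOPIC_MODEL_HTTP_TIMEOUT_MS, a caller can derive its HTTP timeout from that variable and compare it with the RunPod setting. This is a small sketch under two assumptions: the variable holds milliseconds (as the _MS suffix suggests), and its default is 300000, inferred from the 300s row above. The helper names are hypothetical.

```python
import os

EXECUTION_TIMEOUT_S = 300  # RunPod execution timeout from the table above


def http_timeout_seconds(env=os.environ) -> float:
    """Read BULLMQ_TOPIC_MODEL_HTTP_TIMEOUT_MS and convert it to seconds.

    The 300000 ms fallback is an assumption inferred from the 300s row.
    """
    raw = env.get("BULLMQ_TOPIC_MODEL_HTTP_TIMEOUT_MS", "300000")
    return int(raw) / 1000


def timeouts_aligned(env=os.environ) -> bool:
    """True when the caller waits at least as long as RunPod lets the job run."""
    return http_timeout_seconds(env) >= EXECUTION_TIMEOUT_S
```

If the caller's timeout were shorter than the execution timeout, it could abandon a request that RunPod is still processing, which is presumably why the two values are kept in step.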
Configuration
The worker has no environment variables — all configuration is in src/config.py:
| Config | Value | Purpose |
|---|---|---|
| LABSE_MODEL | sentence-transformers/LaBSE | Embedding model for KeyBERTInspired |
| DEVICE | cuda or cpu (auto-detected) | PyTorch device |
| WORKER_VERSION | 1.0.0 | Returned in responses, stored on TopicModelRun.workerVersion |
| DEFAULT_PARAMS | RUN 012 values | Hyperparameter defaults |
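A config module along these lines could look roughly like the sketch below. It is an illustration, not the contents of src/config.py: detect_device and resolve_params are hypothetical helpers, the merge behaviour (request values override defaults) is an assumption, and DEFAULT_PARAMS is left empty because the RUN 012 values are not listed here.

```python
from typing import Optional

LABSE_MODEL = "sentence-transformers/LaBSE"
WORKER_VERSION = "1.0.0"


def detect_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch ships in the worker image; this branch only keeps the
        # sketch runnable on machines without it.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = detect_device()

# Placeholder only: the real RUN 012 hyperparameter values are not given here.
DEFAULT_PARAMS: dict = {}


def resolve_params(request_params: Optional[dict]) -> dict:
    """Overlay per-request params on the defaults (assumed: request values win)."""
    return {**DEFAULT_PARAMS, **(request_params or {})}
```

Keeping these as module-level constants means a configuration change requires rebuilding the image, which fits a worker that reads no environment variables.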
Dependencies
Core runtime dependencies (pyproject.toml):
| Package | Version | Purpose |
|---|---|---|
| runpod | ≥ 1.7.0 | RunPod serverless handler framework |
| pydantic | ≥ 2.0 | Request/response validation |
| sentence-transformers | ≥ 3.0 | LaBSE model loading |
| bertopic | ≥ 0.16.0 | Topic modeling pipeline |
| umap-learn | ≥ 0.5.6 | Dimensionality reduction |
| hdbscan | ≥ 0.8.33 | Density-based clustering |
| scikit-learn | ≥ 1.4.0 | Silhouette score, CountVectorizer |
| gensim | ≥ 4.3.0 | NPMI coherence computation |
| numpy | ≥ 1.26.0 | Array operations |
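To confirm that an environment or built image satisfies these minimums, a check like the following could be run inside it. This is a sketch rather than part of the project: find_problems, meets_minimum, and parse_version are hypothetical helpers, and the version comparison is deliberately simplified (numeric components only, so pre-release and local suffixes are ignored).

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "runpod": "1.7.0",
    "pydantic": "2.0",
    "sentence-transformers": "3.0",
    "bertopic": "0.16.0",
    "umap-learn": "0.5.6",
    "hdbscan": "0.8.33",
    "scikit-learn": "1.4.0",
    "gensim": "4.3.0",
    "numpy": "1.26.0",
}


def parse_version(text: str) -> tuple:
    """Keep only the leading numeric components of a version string."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def find_problems(minimums: dict = MINIMUM_VERSIONS) -> dict:
    """Map each missing or outdated package name to a short explanation."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} is older than {minimum}"
    return problems
```

Calling find_problems() with no arguments checks every package in the table; an empty dict means all minimums are met.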