Deployment

Docker image build, RunPod serverless configuration, and production deployment.

The worker is deployed as a RunPod serverless endpoint running on GPU instances.

Docker Image

The Dockerfile uses RunPod's PyTorch base image with CUDA support:

FROM runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
 
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
 
WORKDIR /app
 
COPY pyproject.toml .python-version ./
# uv.loc[k] is a glob, not a typo: it is meant to let the build proceed when uv.lock is absent
COPY uv.loc[k] ./
 
RUN uv sync --frozen --no-dev --no-install-project || uv sync --no-dev --no-install-project
 
# Bake LaBSE into image (~1.8 GB) to avoid cold-start download
RUN uv run python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/LaBSE')"
 
COPY src/ src/
 
CMD ["uv", "run", "python", "-m", "src.handler"]

Key Build Decisions

LaBSE baked in: the model (~1.8 GB) is downloaded during the build, so fresh containers do not pay for a model download on cold start.
uv for dependency management: faster than pip, with lockfile support. If the frozen sync fails (for example, because no uv.lock exists), the build falls back to a non-frozen install.
No dev dependencies: --no-dev keeps the image lean (no pytest, ruff).
Source copied last: thanks to Docker layer caching, dependency installation reruns only when pyproject.toml, .python-version, or uv.lock changes.

Building

docker build -t topic-worker .

The image is roughly 8-10 GB. Most of that comes from the CUDA and PyTorch base image plus the baked-in LaBSE model.

Pushing to Registry

docker tag topic-worker <registry>/topic-worker:latest
docker push <registry>/topic-worker:latest

RunPod Configuration

Serverless Endpoint Setup

1. Create a serverless endpoint on RunPod.
2. Point it to the Docker image in your registry.
3. Configure the GPU type (any CUDA-capable GPU works; 16 GB+ VRAM recommended).
4. Set the endpoint URL in the API's .env:

TOPIC_MODEL_WORKER_URL=https://api.runpod.ai/v2/<endpoint-id>/runsync
RUNPOD_API_KEY=<your-key>
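
For illustration, the sketch below shows how a caller might turn these two settings into a runsync request. It is a minimal Python sketch, not the API's actual client (which may be written in another language); the function names `build_runsync_request` and `call_worker` are hypothetical, and the 300-second default timeout is borrowed from the recommended execution timeout.

```python
import json
import os
import urllib.request


def build_runsync_request(items, params, env=os.environ):
    """Build (but do not send) the POST request for the runsync endpoint."""
    url = env["TOPIC_MODEL_WORKER_URL"]
    api_key = env["RUNPOD_API_KEY"]
    # The job payload travels under the top-level "input" key.
    body = json.dumps({"input": {"items": items, "params": params}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def call_worker(items, params, timeout_s=300):
    """Send the request and unwrap the worker result from the response envelope."""
    request = build_runsync_request(items, params)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        envelope = json.load(response)
    if envelope.get("status") != "COMPLETED":
        raise RuntimeError(f"worker job did not complete: {envelope.get('status')}")
    return envelope["output"]
```

Keeping request construction separate from sending makes the URL, headers, and body easy to check without network access.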

Request Flow

API → POST /v2/<endpoint-id>/runsync
     Body: { input: { items: [...], params: {...} } }
     Headers: { Authorization: Bearer <RUNPOD_API_KEY> }

RunPod → Starts container (or uses warm instance)
       → Calls handler({ input: { items: [...], params: {...} } })

Worker → Returns result dict

RunPod → Wraps in { id, status: "COMPLETED", output: <result> }
       → Returns to API
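
On the worker side, the flow above reduces to a function that receives the event and returns a dict. The following is a minimal sketch of what such a handler could look like, assuming the `runpod` SDK's `runpod.serverless.start` entry point. The result fields (`item_count`, `params`) and the validation rule are placeholders invented for this example; the real handler in src/handler.py runs the topic-modeling pipeline and returns its own result shape.

```python
def handler(event):
    """Receive {"input": {...}} from RunPod and return a plain result dict."""
    job_input = event.get("input") or {}
    items = job_input.get("items")
    params = job_input.get("params") or {}

    if not isinstance(items, list) or not items:
        # Sketch-only choice: report bad input as an error dict instead of raising.
        return {"error": "input.items must be a non-empty list"}

    # Placeholder for the real pipeline; these fields are hypothetical.
    return {"item_count": len(items), "params": params}


def main():
    # Imported lazily so the handler can be exercised without the SDK installed.
    import runpod

    # Registers the handler; RunPod then wraps each return value in its envelope.
    runpod.serverless.start({"handler": handler})
```

Calling `handler({"input": {"items": ["a", "b"], "params": {}}})` directly is enough to try the logic locally; `main()` is only needed inside the container.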

Scaling

Setting	Recommended
Min workers	0 (scale to zero when idle)
Max workers	1-2 (topic modeling is a batch operation, not high-throughput)
Idle timeout	30s (keep warm for short periods between pipeline stages)
Execution timeout	300s (matches `BULLMQ_TOPIC_MODEL_HTTP_TIMEOUT_MS`)

Configuration

The worker reads no environment variables; all configuration lives in src/config.py:

Config	Value	Purpose
`LABSE_MODEL`	`sentence-transformers/LaBSE`	Embedding model for KeyBERTInspired
`DEVICE`	`cuda` or `cpu` (auto-detected)	PyTorch device
`WORKER_VERSION`	`1.0.0`	Returned in responses, stored on `TopicModelRun.workerVersion`
`DEFAULT_PARAMS`	RUN 012 values	Hyperparameter defaults
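
As a rough picture of how these constants might be laid out, here is a minimal sketch of a config module. It assumes device auto-detection is done with `torch.cuda.is_available()`; the helper name `detect_device` and the CPU fallback when PyTorch is missing are assumptions made for this example, not confirmed details of src/config.py.

```python
LABSE_MODEL = "sentence-transformers/LaBSE"
WORKER_VERSION = "1.0.0"


def detect_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch only: fall back to CPU when torch is absent.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


DEVICE = detect_device()

# The RUN 012 hyperparameter values are not listed on this page,
# so they are deliberately left out of this sketch.
DEFAULT_PARAMS = {}
```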

Dependencies

Core runtime dependencies (pyproject.toml):

Package	Version	Purpose
`runpod`	≥ 1.7.0	RunPod serverless handler framework
`pydantic`	≥ 2.0	Request/response validation
`sentence-transformers`	≥ 3.0	LaBSE model loading
`bertopic`	≥ 0.16.0	Topic modeling pipeline
`umap-learn`	≥ 0.5.6	Dimensionality reduction
`hdbscan`	≥ 0.8.33	Density-based clustering
`scikit-learn`	≥ 1.4.0	Silhouette score, CountVectorizer
`gensim`	≥ 4.3.0	NPMI coherence computation
`numpy`	≥ 1.26.0	Array operations
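
Because the build can fall back to a non-frozen install, it may be useful to confirm that the installed versions still satisfy these minimums. The sketch below is one hypothetical way to do that with the standard library's `importlib.metadata`. The helper names are invented for this example, and the version parsing is deliberately simple: it compares only the leading numeric components, so a pre-release such as 1.26.0rc1 is treated like 1.26.0.

```python
from importlib import metadata

# Minimum versions copied from the table above.
MINIMUM_VERSIONS = {
    "runpod": "1.7.0",
    "pydantic": "2.0",
    "sentence-transformers": "3.0",
    "bertopic": "0.16.0",
    "umap-learn": "0.5.6",
    "hdbscan": "0.8.33",
    "scikit-learn": "1.4.0",
    "gensim": "4.3.0",
    "numpy": "1.26.0",
}


def release_tuple(version):
    """Turn "0.8.33" or "2.4.0+cu124" into a tuple of its leading integers."""
    numbers = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        numbers.append(int(digits))
        if digits != piece:
            break  # stop at the first suffix such as "0rc1" or "0+cu124"
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Compare release tuples, padding with zeros so "2" equals "2.0"."""
    have, need = release_tuple(installed), release_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def find_violations(minimums=MINIMUM_VERSIONS, get_version=metadata.version):
    """Return one message per package that is missing or older than required."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (needs >= {minimum})")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name}: {installed} is older than {minimum}")
    return problems
```

An empty list from `find_violations()` means every listed package is present at or above its minimum.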