Topic Modeling Worker

BERTopic-based multilingual topic discovery worker — architecture, pipeline, and integration with the Faculytics analysis system.

The topic modeling worker is a GPU-accelerated microservice that discovers recurring themes in student feedback using BERTopic. It receives pre-cleaned text and pre-computed LaBSE embeddings from the NestJS API, runs unsupervised clustering, and returns discovered topics with per-document assignments and quality metrics.

How It Fits in the System

The worker sits in the middle of the analysis pipeline:

Sentiment analysis runs first, scoring every submission
A sentiment gate filters the corpus (negative/neutral pass; positive needs ≥10 words)
This worker receives the filtered submissions with their LaBSE embeddings
After topics are discovered, the API runs topic labeling (LLM) and then recommendations

Key Characteristics

Property	Value
Runtime	RunPod serverless (GPU)
Language	Python 3.11
ML Stack	BERTopic, UMAP, HDBSCAN, KeyBERTInspired
Embedding model	LaBSE (768-dim, baked into Docker image)
Input	Pre-cleaned text + pre-computed embeddings
Output	Topics, assignments, quality metrics
Error strategy	Domain errors → `status: "failed"` (no retry); infrastructure errors → exception (RunPod retries)

Design Principles

No preprocessing — text arrives pre-cleaned (cleanedComment) from the API. The worker never modifies input text.
No database access — the worker is pure compute. All persistence is handled by the NestJS API after receiving results.
Pre-computed embeddings — LaBSE embeddings are generated separately by the embedding worker and stored in pgvector. The topic worker receives them as input, avoiding redundant computation.
LaBSE loaded once — the model is baked into the Docker image and loaded at container start. It's used only for KeyBERTInspired keyword extraction, not for document embedding.
Deterministic seeds — UMAP and NumPy use random_state=42 for reproducible results across runs with the same input.