Multilingual Support
How the worker handles Cebuano, Tagalog, English, and code-switched student feedback.
Student feedback at the University of Cebu is written in a mix of Cebuano, Tagalog, English, and frequently code-switched combinations within a single response. The worker handles this through LaBSE embeddings and a curated multilingual stop word list.
Why LaBSE?
LaBSE (Language-agnostic BERT Sentence Embeddings) is a multilingual sentence embedding model that supports 109 languages, including Filipino and Cebuano. It maps sentences from different languages into a shared 768-dimensional vector space where semantically similar sentences are close together regardless of language.
This means a Cebuano comment like "Maayo kaayo siya mo explain" and an English comment like "Very good at explaining" should land close together in the embedding space and end up in the same cluster, which is exactly what topic discovery in a multilingual corpus needs.
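The cross-language similarity described above can be checked with a short script. This is a minimal sketch, not the worker's code: it assumes the sentence-transformers package and the public sentence-transformers/LaBSE checkpoint, and the helper names cosine_similarity and compare_with_labse are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_with_labse(sentence_a: str, sentence_b: str) -> float:
    """Embed two sentences with LaBSE and return their cosine similarity.

    Assumption: sentence-transformers is installed and the model can be
    downloaded. The import is kept local so the pure-Python helper above
    works without the dependency.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/LaBSE")
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return cosine_similarity(emb_a.tolist(), emb_b.tolist())
```

Calling compare_with_labse("Maayo kaayo siya mo explain", "Very good at explaining") and comparing the result against an unrelated pair of sentences is a quick way to confirm that paraphrases across languages score higher than unrelated text.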
Stop Words
The CountVectorizer in the BERTopic pipeline uses a curated stop word list (MULTILINGUAL_STOP_WORDS in src/topic_model.py) that covers four categories:
English Function Words
Standard grammatical tokens: articles, prepositions, pronouns, auxiliaries, and common filler words (the, is, are, very, just, really, etc.).
Cebuano Function Words
Grammatical particles and high-frequency words that dominate c-TF-IDF without adding topic signal:
| Category | Examples |
|---|---|
| Markers | ang, nga, sa, ni, si |
| Conjunctions | ug/og, pero |
| Particles | ba, na, man, lang, ra, jud/gyud |
| Pronouns | ko, mo, siya, niya, nako, sila, nila |
| Possessives | iyang, kanyang, among, atong, ilang, inyong |
| Negation, existentials, locatives | dili, wala, naa, adto, diri, didto |
| Filler | kaayo, mao, bawat |
Tagalog Function Words
Grammatical particles common in Tagalog and Filipino:
| Category | Examples |
|---|---|
| Markers | ng, mga, ay |
| Particles | nang, rin/din, po, ho, naman, pa |
| Conjunctions | at, o, kung, dahil, para, kasi |
| Pronouns | namin, natin, ito, iyon |
| Negation | hindi |
| Informal | yung, yon, pag |
Role/Title Words
Common words for teachers and students that appear across all languages and don't contribute to topic differentiation:
propesor, estudyante, guro, magaaral, teacher, professor, instructor, maam, sir, atty, miss, faculty, student, students
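The four categories above can be merged into a single list and applied to tokens roughly as follows. This is an illustrative sketch only: the dictionary holds a small sample of the words listed above, not the full MULTILINGUAL_STOP_WORDS list, and the names ILLUSTRATIVE_STOP_WORDS, build_stop_word_list, and remove_stop_words are hypothetical.

```python
from typing import Dict, Iterable, List

# A small sample per category, taken from the tables above.
# The real list in src/topic_model.py is longer.
ILLUSTRATIVE_STOP_WORDS: Dict[str, List[str]] = {
    "english": ["the", "is", "are", "very", "just", "really"],
    "cebuano": ["ang", "nga", "sa", "ug", "og", "pero", "kaayo", "siya", "mo"],
    "tagalog": ["ng", "mga", "ay", "at", "hindi", "naman"],
    "roles": ["teacher", "professor", "sir", "maam", "student", "students"],
}


def build_stop_word_list(categories: Dict[str, List[str]]) -> List[str]:
    """Merge all categories into one sorted, de-duplicated, lowercase list."""
    merged = set()
    for words in categories.values():
        merged.update(word.lower() for word in words)
    return sorted(merged)


def remove_stop_words(tokens: Iterable[str], stop_words: Iterable[str]) -> List[str]:
    """Drop tokens whose lowercase form is in the stop word list."""
    blocked = set(stop_words)
    return [token for token in tokens if token.lower() not in blocked]
```

A flat list like the one returned by build_stop_word_list is the shape that scikit-learn's CountVectorizer accepts through its stop_words parameter.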
Tokenization for NPMI
The NPMI coherence metric uses a custom tokenizer (_tokenize() in src/evaluate.py) that:
- Extracts only alphabetic tokens using the regex [a-zA-Z\u00C0-\u024F]+ (Latin script, including the accented characters used in Filipino)
- Lowercases all tokens
- Filters out single-character tokens
This ensures Cebuano/Tagalog words with diacritics (less common but present in formal text) are captured correctly.
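The three tokenizer steps can be sketched as below. This is a reconstruction from the description above, not the actual _tokenize() in src/evaluate.py; the function name tokenize and the constant TOKEN_PATTERN are hypothetical.

```python
import re
from typing import List

# Basic Latin letters plus U+00C0..U+024F (Latin-1 Supplement letters,
# Latin Extended-A, and Latin Extended-B), as described above.
TOKEN_PATTERN = re.compile(r"[a-zA-Z\u00C0-\u024F]+")


def tokenize(text: str) -> List[str]:
    """Extract alphabetic tokens, lowercase them, and drop single characters."""
    tokens = TOKEN_PATTERN.findall(text)
    return [token.lower() for token in tokens if len(token) > 1]


print(tokenize("Maayo kaayo siya mo explain, 10/10!"))
# ['maayo', 'kaayo', 'siya', 'mo', 'explain']
```

Digits and punctuation never match the pattern, so ratings such as "10/10" disappear entirely. One caveat of this range is that it also includes the multiplication sign (U+00D7) and division sign (U+00F7), which are not letters; in student feedback these are unlikely to matter.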
Code-Switching Handling
Code-switching (mixing languages within a single response) is handled implicitly:
- LaBSE embeddings capture the semantic meaning regardless of which language fragments are used
- Stop words cover all three languages, so function words from any language are filtered during c-TF-IDF
- KeyBERTInspired re-ranks keywords using LaBSE similarity, which understands multilingual semantics
No explicit language detection or per-language processing is needed.
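Put together, the three components could be wired into BERTopic along these lines. This is a sketch under stated assumptions rather than the worker's actual configuration: it assumes the bertopic, sentence-transformers, and scikit-learn packages, uses the public sentence-transformers/LaBSE checkpoint, and the names SAMPLE_STOP_WORDS and build_topic_model are hypothetical.

```python
from typing import Iterable

# Tiny sample covering all three languages; the real list is
# MULTILINGUAL_STOP_WORDS in src/topic_model.py.
SAMPLE_STOP_WORDS = ["the", "is", "ang", "nga", "kaayo", "ng", "mga", "hindi"]


def build_topic_model(stop_words: Iterable[str]):
    """Build a BERTopic model with LaBSE embeddings and multilingual stop words.

    Imports are local so this module can be loaded without the heavy
    dependencies; calling the function requires them and downloads LaBSE.
    """
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    # One embedding model for every language: no language detection step.
    embedding_model = SentenceTransformer("sentence-transformers/LaBSE")
    # Function words from all three languages are dropped before c-TF-IDF.
    vectorizer_model = CountVectorizer(stop_words=list(stop_words))
    # Keywords are re-ranked by similarity using the same embedding model.
    representation_model = KeyBERTInspired()

    return BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
    )
```

Nothing in this setup branches on language: the same embedding model, stop word list, and representation model are applied to every response, which is why code-switched text needs no special handling.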