Project Manas

State-of-the-art natural language understanding for literary Tibetan.

This project has a working prototype! Performance on word segmentation and part-of-speech tagging is promising. Contact us for more details; we cannot publish the model until Project Ozymandias is complete.

Problem

Even seemingly simple tasks in Tibetan computational linguistics, such as word segmentation, have eluded the small community working on these problems. With the advent of transformer models, we should now be able to pre-train neural networks that "understand" Tibetan.

Tibetan is a relatively simple language with simple grammar and morphology. Its sentences form nested trees that pivot on a small set of particles. Words are built from a fairly small set of atomic subunits, most of which are themselves words with simple meanings of their own. Indeed, legend has it that the Tibetan language was, so to speak, grown in a lab, like a kind of Esperanto for Buddhism.

Unfortunately, Tibetan is also highly ambiguous and very distant from Indo-European languages like English. Tibetan is even fairly distant from Chinese; the very existence of a Sino-Tibetan language group is a matter of some controversy. Many Tibetan texts add to the ambiguity by deliberately dropping particles, such as gender and plural indicators. Spoken Tibetan is notoriously homophonic, and scribal errors are common.

Furthermore, unlike language models for most other low-resource languages, language models for Tibetan are an especially thorny subject. This is due to the genocidal campaign waged against the native Tibetan people by the CCP. Care must be taken not to cause further extreme harm to the Tibetans.

Proposal

Pre-train monolingual language models for literary Tibetan. Existing corpora, such as the Derge Kangyur and Tengyur, can be used for this. Fine-tune the pre-trained models on tasks such as part-of-speech tagging. If possible, design datasets for new language understanding tasks such as named entity recognition.
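
As a rough illustration of how corpus text might be prepared for tasks like word segmentation, the sketch below splits Tibetan text into syllables and frames segmentation as per-syllable tagging. It is not the prototype's actual pipeline. The function names `split_syllables` and `words_to_bi_labels` are hypothetical, and the B/I (begin/inside) labeling scheme is an assumption borrowed from common sequence-tagging practice. The only facts relied on are the Unicode code points for the tsheg (U+0F0B) and shad (U+0F0D) marks.

```python
import re

TSHEG = "\u0f0b"  # intersyllabic mark separating syllables
SHAD = "\u0f0d"   # stroke that closes a clause or sentence


def split_syllables(text):
    """Split a Tibetan string into syllables on tsheg, shad, and whitespace."""
    parts = re.split(f"[{TSHEG}{SHAD}\\s]+", text)
    return [p for p in parts if p]


def words_to_bi_labels(words):
    """Flatten segmented words (each a list of syllables) into parallel
    syllable and label sequences: "B" starts a word, "I" continues it.
    The B/I scheme is an assumption, not the project's documented format."""
    syllables, labels = [], []
    for word in words:
        for i, syllable in enumerate(word):
            syllables.append(syllable)
            labels.append("B" if i == 0 else "I")
    return syllables, labels


# Illustrative input: a four-syllable greeting, hand-segmented into two words.
sample = TSHEG.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"]) + SHAD
sample_syllables = split_syllables(sample)
gold_syllables, gold_labels = words_to_bi_labels(
    [["བཀྲ", "ཤིས"], ["བདེ", "ལེགས"]]
)
```

Under this framing, a fine-tuned model would predict one label per syllable, and word boundaries are recovered wherever a "B" label appears.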

The prevention of harm to the Tibetans from any models trained in this way is the subject of Project Ozymandias. The two directions of research are to proceed hand-in-hand. Ideally, a Tibetan organization should collaborate with us on this.