Research results

Mining Folios for Parallel Sentences

Two available datasets: As of now, 84,000 publishes two datasets: Parallel folios. This takes the form of their translations in XML...

Segmenting Long Documents

The need for segmentation: Modern generative language models are trained on short sentences. While there are summarization models, they only accept...

No Language Left Behind

What is NLLB? In July 2022, FAIR (Facebook AI Research) released a large multilingual transformer model that they call No Language Left Behind, or NLLB...

Injecting Context During Decoding

Ambiguity and context in classical Tibetan: As a written language, Tibetan is a simple language in every way - syntactically, morphologically,...

Sequencing Long Text From Sentence Fragments

What is sequencing? There are two datasets available from 84,000: The raw English translations. This is English text only; it is...
