Masakhane

Masakhane is isiZulu for 'We Build Together'. It is an open-source collaborative project that brings together researchers, engineers, NLP enthusiasts and linguistic experts from across Africa. The project focuses on machine translation for African languages and, through it, on improving the digital presence of those languages.

I am currently working on Kiswahili and Kikuyu language models. This work involves data collection, cleaning, preprocessing, and training a baseline machine translation model.

So far, I have trained a transformer model for English-Kiswahili translation using the SAWA corpus. SAWA is a parallel English-Kiswahili corpus made up of 542.1k English words and 442.9k Swahili words. Here is a link to the work (https://github.com/Freshia/Kiswahili_MT). The test scores for this model were quite low, since the model was evaluated on a test set built from JW300 data, which is biblical text.

This led me to create another test set from data randomly selected from a combination of the JW300 and SAWA corpora. I then trained another transformer model on data from both corpora and tested it against this custom test set. This model recorded much higher BLEU scores than the previous one. Here is a link to the work (https://github.com/Freshia/sawa_jw_MT).
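To make the idea of a randomly selected, mixed-domain test set concrete, here is a minimal sketch of one way to build such a split. It is not the code from the linked repository: the function name `make_test_split`, the deduplication step, the fixed seed, and the toy sentence pairs are all assumptions made for illustration.

```python
import random


def make_test_split(corpora, test_size, seed=42):
    """Pool several parallel corpora and hold out a random test set.

    corpora: list of corpora, each a list of (english, swahili) pairs.
    Returns (train_pairs, test_pairs) with no pair shared between them.
    """
    pooled, seen = [], set()
    for pairs in corpora:
        for src, tgt in pairs:
            pair = (src.strip(), tgt.strip())
            # Skip empty lines and exact duplicates (an assumption: duplicates
            # would otherwise let test sentences leak into training).
            if pair[0] and pair[1] and pair not in seen:
                seen.add(pair)
                pooled.append(pair)

    if test_size > len(pooled):
        raise ValueError("test_size exceeds the number of unique pairs")

    # A seeded generator keeps the split reproducible across runs.
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(pooled)), test_size))

    train = [p for i, p in enumerate(pooled) if i not in test_idx]
    test = [p for i, p in enumerate(pooled) if i in test_idx]
    return train, test


# Toy stand-ins for the two corpora (placeholders, not real corpus lines).
toy_sawa = [("Good morning", "Habari ya asubuhi"), ("Welcome", "Karibu")]
toy_jw300 = [("Thank you very much", "Asante sana"),
             ("I like to read", "Ninapenda kusoma"),
             ("Welcome", "Karibu")]  # duplicate across corpora

train_pairs, test_pairs = make_test_split([toy_sawa, toy_jw300], test_size=2)
print(len(train_pairs), "training pairs,", len(test_pairs), "test pairs")
```

Sampling from the pooled data means the test set reflects both domains, which is one plausible reason a model trained on both corpora would score higher on it than a SAWA-only model scored on a JW300-only test set.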

Given the agglutinative nature of Swahili, I used byte pair encoding (BPE) to segment words into subword units, in an attempt to improve model accuracy. BLEU was used as the evaluation metric.
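As a rough illustration of why BPE suits an agglutinative language, the sketch below learns merge operations from a handful of Swahili verb forms that share prefixes and stems. It is a toy, pure-Python version of the BPE learning procedure, not the tooling used for the models above; the function names, the `</w>` end-of-word marker, the tie-breaking rule, and the word list are assumptions chosen for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself so
        # that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Verb forms sharing subject prefixes (ni-, u-, a-), the tense marker -na-,
# and the stems -soma (read) and -pika (cook).
words = ["ninasoma", "unasoma", "anasoma", "ninapika", "unapika", "anapika"]
merges = learn_bpe(words, num_merges=10)
print(merges)
print(segment("ninasoma", merges))
```

Because frequent character sequences are merged first, recurring morphemes tend to end up as their own units, so the translation model sees a smaller vocabulary of reusable pieces rather than many rare full word forms.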

