#Africa #LLMs - Global telecommunications operator Orange has partnered with ChatGPT-maker OpenAI and Meta to incorporate African regional languages into large language models (LLMs). Orange will fine-tune OpenAI’s open-source speech models and Meta’s openly available Llama 3.1 model to support regional African languages. The initiative aims to enhance digital inclusion across Africa by enabling AI models to understand languages such as Pulaar and Wolof, spoken by 6 million and 16 million people respectively, in West Africa. The program will commence in 2025.
SO WHAT? - The African continent has many colloquial languages that do not have native written forms and use Latin script or a combination of scripts from different languages in their modern written forms. As a result, colloquial African languages are underrepresented in digital form, making it prohibitively difficult and expensive to train large language models to support them. Although underrepresented in the digital world, languages like Pulaar and Wolof are in common use by millions of speakers. The rapid advances in Generative AI, which favour natively written languages, therefore risk leaving African colloquial languages behind in the AI revolution. This new initiative from Orange supports digital inclusion in GenAI.
Here are some key points regarding this initiative:
Global telecom operator Orange has partnered with OpenAI and Meta to incorporate African regional languages into Large Language Models (LLMs).
Orange will fine-tune OpenAI’s Whisper speech model and Meta’s Llama 3.1 text model to better handle regional African languages.
The telco’s overall goal is to work with many AI technology providers to enable future models to recognise all languages spoken and written across Orange’s 18-country Africa footprint.
The initiative will open-source these local language-trained models for non-commercial use, such as in public health and education, to foster innovation and reduce the digital divide.
By collaborating with local startups and technology companies in Africa, the initiative aims both to promote digital inclusion and to drive innovation in African languages.
Starting in the first half of 2025, the initiative will first focus on two regional languages, Pulaar and Wolof, spoken by 6 million and 16 million people respectively in West Africa.
Orange aims to use different colloquial language-specific LLMs to support customer interactions in local languages for Orange’s services.
The initiative also supports Orange’s broader goals on Responsible AI.
ZOOM OUT - Fine-tuning large language models to perform well using colloquial languages is no small task. The core problem is that LLMs, and even small language models (SLMs), need large volumes of high-quality data for training. Large, high-quality, well-rounded data sets for colloquial African languages are hard to come by and normally require a lot of work to source, aggregate, label and tokenise before they can be used to train AI models. In addition, development projects require not only large volumes of quality data, but also both AI experts and language experts to ensure that the latest best practices are used for data preparation, tokenisation and training. While global ICT groups such as Orange may have the compute, technical and financial resources to create local language LLMs, they will certainly need to partner to ensure that they also have the right language experts involved.
Read more about African large language models:
MBZUAI launches Atlas-Chat (Africa AI News)
Research project taps crowdsourced data to build Algerian LLM (Africa AI News)