SabiYarn: Advancing Low-Resource Languages With Multitask NLP Pre-Training [Paper Reflections]
In recent years, Large Language Models (LLMs) have improved mostly through scaling, primarily by increasing the size of the models and the data they are trained on. This makes pre-training a highly resource-intensive process that can cost millions of dollars.
While LLMs have become ubiquitous, the resource-intensive pre-training process poses a threat to the inclusion of low-resource languages, where data is scarce. Often, this is accompanied by a lack of funding for compute resources.
In our paper, SabiYarn: Advancing Low-Resource Languages with Multitask NLP Pre-Training, which was accepted at the AfricaNLP workshop at ACL 2025, we propose a series of optimizations to the LLM pre-training process that made it possible to train a state-of-the-art multilingual foundation model for Nigerian languages on a single 24 GB GPU.
One of these techniques is a mask-based loss computation strategy. This simple idea avoids computing loss on input prompt tokens the model already knows. This allows the loss function to accurately reflect the model’s true performance on the tokens that matter and avoids wasting compute by backpropagating losses that do not contribute to the model’s learning process.
In this article, we’ll explore this approach, how it reflects our broader compute-aware pre-training design, and how it influences the model’s performance.
Prompt tokens are (too) expensive in low-resource settings
During pre-training, LLMs are trained in causal language modeling through a next-token prediction task. This is typically a slow process involving trillions of tokens, whose goal is to reduce the cross-entropy loss between the predicted token and the label through backpropagation. Along the way, the model acquires multiple skills, memorizes facts, and builds a world model.
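In its standard form, the next-token prediction objective over a sequence of T tokens is the average negative log-likelihood of each token given everything that precedes it:

$$
\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$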
For state-of-the-art models like Meta’s Llama 4 or OpenAI’s GPT-4, this computationally intensive process typically involves running thousands of GPUs for months, performing over 10²⁵ floating-point operations (FLOPs).
Let’s look at a concrete example. Given a sequence like “Translate English to Yoruba: I love rice. => Mo fẹ́ràn ìrẹsì,” the model is trained to predict every token, from the prompt to the actual answer:
| Step | Prompt | Next token | Note |
| --- | --- | --- | --- |
| 1 | Translate | English | Static prompt |
| 2 | Translate English | to | Static prompt |
| 3 | Translate English to | Yoruba: | Static prompt |
| 4 | Translate English to Yoruba: | I | |
| 5 | Translate English to Yoruba: I | love | |
| 6 | Translate English to Yoruba: I love | rice. | |
| 7 | Translate English to Yoruba: I love rice. | => | Static prompt |
| 8 | Translate English to Yoruba: I love rice. => | Mo | |
| 9 | Translate English to Yoruba: I love rice. => Mo | fẹ́ràn | |
| 10 | Translate English to Yoruba: I love rice. => Mo fẹ́ràn | ìrẹsì. | |
In this setup, all tokens are treated equally, regardless of whether they are part of the prompt or the answer. On the one hand, this is straightforward to set up. On the other hand, it means spending compute on learning to predict tokens that are already known and static.
While this is fine in settings with virtually unlimited compute, it becomes problematic in resource-constrained training. Every token prediction contributes to the total training FLOPs. If half the sequence is an instruction or prompt that never changes, that’s half your compute spent on learning what the model doesn’t need to learn.
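Concretely, here is what the standard, unmasked setup looks like in PyTorch. This is a minimal sketch, not our training code: the token IDs are placeholders and random logits stand in for a real model, but it shows that every position, prompt and answer alike, is a prediction target.

```python
import torch
import torch.nn.functional as F

vocab_size = 100
# Toy batch: one sequence of 10 placeholder token IDs (prompt + answer).
input_ids = torch.tensor([[12, 45, 7, 88, 3, 19, 54, 23, 61, 9]])

# Stand-in for model(input_ids).logits with shape (batch, seq_len, vocab_size).
logits = torch.randn(1, input_ids.size(1), vocab_size)

# Causal shift: the logits at position t predict the token at position t + 1.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

# Every token, including the static prompt, contributes to the loss.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
```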
Making do without instruction-tuning
Due to severe compute constraints, we could not include a post-training stage where models are typically aligned with user-facing goals using supervised examples and reinforcement learning from human feedback (RLHF). In such stages, models learn not just to predict the next token but to generate helpful and aligned responses.
For example, a pre-trained base model may reply to “How are you today” with “?”, completing the sequence with the most likely next token. In contrast, an instruction-tuned model would try to provide a response that aligns with the goal of being a useful assistant or chatbot, e.g., “I’m doing good.”
Since post-training wasn’t feasible for SabiYarn, we embedded task awareness directly into the pre-training phase. Our goal was to help the model generalize beyond basic next-token prediction and toward solving meaningful tasks like named-entity recognition, sentiment analysis, and translation entirely through prompt-based conditioning.
In our paper, we propose a task-specific training scheme where the model is conditioned on the task it must perform using XML-like prompt tags. Taking inspiration from the T5 paper, we used a template in which the task tag precedes the model’s input, followed by the model’s output. For example, an English-to-Pidgin translation task looks like this:
<translate> let me call my father : Make I go call my Papa
With this structured format, we could compute the cross-entropy loss on just the label tokens (“Make I go call my Papa”).
This is straightforward to implement in PyTorch by masking out the prompt tokens in the label tensor. We use -100 as the ignore index, which PyTorch’s cross_entropy loss function skips:
labels = input_ids.clone()
labels[:, :prompt_len] = -100
Since PyTorch’s cross-entropy loss function uses -100 as its default ignore index, the prompt tokens are simply skipped when computing the loss for that sequence.
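Putting it together, here is a minimal end-to-end sketch of the masked variant, again with placeholder token IDs and random logits standing in for the model. After the usual causal shift, only positions whose labels are not -100 contribute to the loss.

```python
import torch
import torch.nn.functional as F

vocab_size, prompt_len = 100, 6
# Toy batch: the first 6 positions are the task tag + input, the last 4 are the label.
input_ids = torch.tensor([[12, 45, 7, 88, 3, 19, 54, 23, 61, 9]])

labels = input_ids.clone()
labels[:, :prompt_len] = -100  # mask the prompt portion

# Stand-in for model(input_ids).logits.
logits = torch.randn(1, input_ids.size(1), vocab_size)

# Causal shift: the logits at position t predict the token at position t + 1.
shift_logits = logits[:, :-1, :]
shift_labels = labels[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
    ignore_index=-100,  # prompt positions are skipped entirely
)
```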
Learning only what matters
An unexpected benefit of this approach is improved task focus. Since the model is not backpropagating on the input portion of the sequence, the model’s learning signal comes exclusively from task-relevant tokens.
Consider a pre-training scenario where an LLM is presented with:
<translate> let me call my father : Make I go call my Papa
When the loss is computed on every token, the model learns to reproduce the prompt structure, memorize the task tags, and generate the outputs. The learning signal is diluted across the entire sequence.
Using loss masking, the model can still make input-output connections through the self-attention mechanism during the forward pass. However, backpropagation (learning) only occurs when predicting the output tokens:
Make I go call my Papa
We can compare this to how we as humans learn to translate into a new language: we receive the full input as context, but learning occurs when we’re corrected on our translation, not on the input sentence already provided to us.
Masking out the input forces the model to treat prompts as context rather than a prediction target, allowing training to focus on input-output mappings and reducing the tendency to overfit on prompt formatting.
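One way to verify this behavior is to inspect the gradient of the loss with respect to the logits. The sketch below is self-contained and uses random tensors rather than our actual training setup; it shows that positions whose shifted labels are -100 receive exactly zero gradient, so no learning signal flows back from the prompt through the language-modeling head.

```python
import torch
import torch.nn.functional as F

vocab_size, prompt_len, seq_len = 100, 6, 10
labels = torch.randint(0, vocab_size, (1, seq_len))
labels[:, :prompt_len] = -100  # mask the prompt portion

logits = torch.randn(1, seq_len, vocab_size, requires_grad=True)
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
loss.backward()

# The gradient at the masked (prompt) positions is exactly zero.
print(logits.grad[:, : prompt_len - 1].abs().max())  # tensor(0.)
```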
Investigating the impact of task focus on training performance
To substantiate this finding, we ran an experiment in which we trained the model on the non-trivial task of descrambling sentences, once with the masked loss scheme and once with an unmasked loss for comparison.
The task was to turn grammatically incoherent sentences into their coherent forms using the same words in the input. For example, “The equations expensive. show is optimization computationally that.” should be corrected to “The equations show that optimization is computationally expensive.” This task requires learning complex relationships between input and output sequences.
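For illustration, training pairs for such a task can be constructed by shuffling the words of a coherent sentence. The helper below, and the <descramble> tag it uses, are hypothetical rather than the exact construction from the paper; with loss masking, only the coherent target after the separator would be scored.

```python
import random


def make_descrambling_example(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Build a (scrambled prompt, target) pair by shuffling the words of a sentence."""
    words = sentence.split()
    scrambled = words[:]
    rng.shuffle(scrambled)
    # Hypothetical task tag and separator, following the paper's tag-based prompt style.
    prompt = f"<descramble> {' '.join(scrambled)} :"
    return prompt, sentence


rng = random.Random(0)
prompt, target = make_descrambling_example(
    "The equations show that optimization is computationally expensive.", rng
)
print(prompt)
print(target)
```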
Here’s what the loss curves looked like:

[Figure: training loss curves on the descrambling task, masked vs. unmasked loss]
We can see that the model converged faster on the task when the loss on the input prompt wasn’t calculated, and these efficiency gains compound over the entire training run.
The cost of masking: what are we losing?
While masking the prompt tokens during loss computation helps conserve compute and sharpen focus, it’s not without tradeoffs. Excluding the prompts from the learning signal increases the risk that the model will fail to adapt to tasks where the prompt structure or phrasing changes at inference time.
That said, such tradeoffs must be weighed against the reality of resource constraints. In low-resource training scenarios, approaches that reduce compute while preserving core task performance are often preferable to fully supervised, resource-intensive alternatives.
The case for native LLMs for African languages
While the broader African LLM community has focused its efforts on adapting open-source pre-trained models to African languages, pre-training a foundational model from scratch offers the promise of building a model that doesn’t inherit the cultural biases of Euro-American corpora. It also provides invaluable research insights and data about tokenization, transfer learning, linguistic patterns, and training dynamics for African languages.
An often-neglected area is the tokenizer. Tokenizers determine how text is broken into the tokens an LLM can process. Training from scratch enables us to train our own language-specific tokenizers, capturing morphological and phonological structure such as the tonal diacritics in Yoruba, which also carry semantic meaning.
It also helps with efficiency: a tokenizer trained on the target languages splits each language into tokens that align with useful grammatical structures, such as affixes and punctuation, which the model can use to learn meaningful representations. In contrast, an existing tokenizer that was not trained on the target languages produces poor tokenization: tokens that don’t reflect grammatical structure, inflated sequence lengths, and ultimately degraded performance. This is especially true for small models, which are appealing because of their lower compute demands.
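As a concrete illustration, here is a minimal sketch of training a language-specific byte-level BPE tokenizer with the Hugging Face tokenizers library. The corpus file, vocabulary size, and special tokens are placeholders rather than the exact SabiYarn setup.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE keeps tonal diacritics representable without falling back to <unk>.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # placeholder vocabulary size
    special_tokens=["<pad>", "<translate>"],  # task tags can be registered as special tokens
)

# "yoruba_corpus.txt" is a hypothetical plain-text corpus in the target language.
tokenizer.train(["yoruba_corpus.txt"], trainer)

print(tokenizer.encode("Mo fẹ́ràn ìrẹsì.").tokens)
```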
Looking forward, our research group’s future work focuses on exploring modern LLM architectures and bringing reasoning, instruction following, and test-time compute strategies to resource-constrained pre-training. We’re also exploring hardware-specific optimizations for training and inference and expanding our efforts to even more African languages.