Accelerating Language Models with Multi-Token Prediction (2025)

Meta’s new research introduces an improved method for training Large Language Models (LLMs): instead of predicting just one token at each step, the model predicts multiple tokens simultaneously. Unlike previous approaches, it achieves this without additional training overhead. This advancement not only accelerates text generation but could also enhance the model’s capabilities, potentially ushering in a new training paradigm for cutting-edge AI.

When training a deep neural network, you need to define the task you want the model to optimize. Large language models (LLMs) have made great progress in generating human-like text and performing various language tasks. Traditionally, they are trained using a method called next-token prediction. This involves predicting the next word in a sequence based on the previous words. Once predicted, this word is added to the sequence, and the model generates the next word.

As with all neural networks, we need a way to measure each prediction from the model and use this feedback to reduce errors over time.

LLMs don’t just predict the most likely word; they model distributions. This means they assign a probability to every word in their vocabulary, not just the one they think is most suitable. This allows the model to be versatile and creative.

Example:

Imagine the sentence “The cat sat on the ____.” The model might predict “mat” as the most likely next word, but it also considers other possibilities like “bed,” “floor,” or “chair,” assigning a probability to each. By training on these probabilities, the model learns to generate diverse and contextually appropriate text.
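
To make this concrete, here is a minimal sketch of how raw model scores (logits) become a probability distribution via softmax. The words and numbers are made up for illustration, not taken from any real model.

```python
import math

# Hypothetical logits for "The cat sat on the ____." (illustrative numbers only).
logits = {"mat": 4.0, "bed": 2.5, "floor": 2.0, "chair": 1.5, "moon": -1.0}

# Softmax turns raw scores into a probability distribution over the vocabulary.
total = sum(math.exp(v) for v in logits.values())
probs = {word: math.exp(v) / total for word, v in logits.items()}

for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}: {p:.3f}")  # "mat" gets the most mass, but not all of it
```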

So, how do we measure the error in the situation above?

In the case of LLMs, we use the cross-entropy function to measure how well the model’s predictions match the actual words. Cross-entropy comes from information theory and measures the difference between two probability distributions.

In this context, cross-entropy acts as a loss function, helping the model adjust its parameters to make better predictions. When the model predicts the next word in a sentence, cross-entropy tells us how confident the model is about its prediction by comparing it to the true distribution.

By looking at cross-entropy values, we can see how closely the model’s predictions match the actual outcomes, giving us a way to assess the model’s performance quantitatively.
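
As a worked example (with hypothetical probabilities), the per-token cross-entropy loss is simply the negative log of the probability the model assigned to the true next word:

```python
import math

# Cross-entropy for one prediction: -log p(true next token).
# Suppose the true next word is "mat" and the model assigned it p = 0.67.
print(f"{-math.log(0.67):.3f}")  # ~0.40: confident and correct => small loss

# If the model had spread its probability thinly, e.g. p("mat") = 0.05:
print(f"{-math.log(0.05):.3f}")  # ~3.00: a much larger penalty
```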

Perplexity is a metric used to measure the performance of language models. It is the exponential of the average cross-entropy, so it directly quantifies how uncertain a model is in its predictions. When the probability distribution is not heavily skewed toward the correct word, the model is said to be “perplexed,” or uncertain, about its prediction.

  • If the model predicts “The cat sat on the mat” and assigns high probability to “mat,” it has low perplexity.
  • If the model is unsure and assigns similar probabilities to “mat,” “bed,” “chair,” etc., it has higher perplexity.
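
The sketch below computes perplexity from per-token probabilities. The two probability sequences are hypothetical, chosen only to contrast a confident model with an uncertain one.

```python
import math

def perplexity(true_token_probs):
    # Perplexity = exp(average negative log-likelihood of the true tokens).
    avg_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_nll)

confident = [0.90, 0.80, 0.85, 0.90]  # high mass on the true token each time
uncertain = [0.25, 0.30, 0.20, 0.25]  # mass spread over many candidates

print(f"confident: {perplexity(confident):.2f}")  # ~1.16, near the ideal of 1
print(f"uncertain: {perplexity(uncertain):.2f}")  # ~4.04, like guessing among 4
```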

The single-token prediction method, though effective, encounters several difficulties:

  1. Inefficiency: Predicting one token at a time is slow and resource-intensive, particularly for long sequences.
  2. Limited Context Understanding: The model may find it hard to grasp long-range dependencies and maintain coherence over extended text, resulting in text that is locally fluent but globally inconsistent or nonsensical.
  3. High Computational Cost: Generating a sequence requires one full forward pass per token, leading to high computational resource and energy consumption for long outputs.

Meta introduces a new training paradigm that alters the overall model architecture. This approach, called multi-token prediction, changes the traditional method by having the model predict several future words simultaneously instead of just one. At each position in a sentence, the model uses multiple prediction pathways, or “heads,” to forecast the next several words all at once, working collaboratively to improve efficiency and coherence.

Methodology

[Figure: the multi-token prediction architecture, a shared trunk with several output heads. Source: Better & Faster Large Language Models via Multi-token Prediction by Meta]

In simple terms, we modify the LLM to predict the next four words instead of just the next one. To achieve this, we add more output heads to the model.

However, this doesn’t mean we keep 16 tokens in total. Each of the four heads is responsible for one future position, so together they predict the next four words (denoted as words 5, 6, 7, and 8 in the figure). During standard generation, only the first head’s token is kept and the others are discarded.

The core idea of multi-token prediction is to train the model to predict a sequence of future words from each position in the training data, rather than just the next word. This method uses a shared underlying structure called a transformer trunk to understand the context and then employs multiple independent prediction heads to guess future words in parallel.

The researchers explain, “At each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk.”
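
Below is a minimal PyTorch-style sketch of this shared-trunk, multi-head setup. The module sizes, the use of a generic transformer encoder as the trunk, and the loss wiring are illustrative assumptions, not Meta’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenLM(nn.Module):
    """A shared transformer trunk feeding n independent output heads."""

    def __init__(self, vocab_size=32000, d_model=512, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the shared trunk (a real LLM would use a causal decoder).
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One unembedding head per future offset: head i predicts token t + i + 1.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens):
        h = self.trunk(self.embed(tokens))       # (batch, seq, d_model)
        return [head(h) for head in self.heads]  # one logit tensor per head

def multi_token_loss(logits_per_head, tokens):
    # Head i is trained against the input sequence shifted by i + 1 positions,
    # so the total loss is a sum of ordinary next-token cross-entropies.
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift].reshape(-1, logits.size(-1))
        target = tokens[:, shift:].reshape(-1)
        loss = loss + F.cross_entropy(pred, target)
    return loss
```

In this sketch, `loss = multi_token_loss(model(tokens), tokens)` trains all four heads at once, at essentially the cost of a single trunk forward pass.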

For instance, consider the famous quote from Martin Luther King Jr.: “I have a dream that one day…” A standard LLM would predict “this,” then “nation,” then “will,” and finally “rise.”

In contrast, Meta’s model would predict the next four words in one go, significantly speeding up text generation by up to three times.

1. Not All Predictions Are Created Equal

All output heads in the model share the same underlying LLM backbone (the shared trunk in the figure above). This means they all use the same representation and have access to the same information when predicting the next four tokens.

For this to be effective, the shared representation must capture not only the preceding words but also information about which words are likely to follow, for any given prediction.

This is crucial when some words have a direct impact on the following words, a concept referred to by the researchers as a “choice point.” On the other hand, some predictions may have little effect on the subsequent words.

In other words, not all predictions carry the same weight; some are more significant in shaping the sequence. Let’s look at examples of both:

  • Choice Point

“The explorer ventured deeper into the jungle, aware of the dangers. Suddenly, he saw a…”

Here, the next predicted word can greatly affect the story’s direction. For instance, if the model predicts “tiger” and then follows with “lurking in the bushes,” it sets a specific and dramatic course. Thus, “tiger” heavily constrains the next predictions.

  • Inconsequential Prediction

Consider a scenario where a character is performing a routine task: “She opened the cookbook and decided to…”

In this case, the next predicted word is less critical to the narrative. Whether the word is “bake,” “cook,” or “prepare,” the overall direction of the story remains largely unchanged; the activity described is similar in each instance.

The Point

By training the model to predict the next four tokens, it becomes more adept at recognizing whether the next prediction is critical (a choice point) or not. This enhances the overall quality and coherence of the generated text.

2. Syntax is Subtle

In standard next-word prediction, each word is predicted one at a time, conditioned only on the words before it. LLMs can still learn patterns between closely related words (understanding that “I play the guitar” is correct and “I guitar the play” is not), but only one word at a time. By forcing the model to predict multiple words at once, it learns to generate whole sequences in the correct order, reducing the chances of malformed sequences appearing.

With multi-token prediction, the model learns these short local patterns more effectively, enabling it to produce entire sequences correctly in one go. This approach helps the model internalize and replicate accurate syntax, improving its overall text generation capabilities.

3. Incredibly Fast

If you need a faster model, you can run all four heads simultaneously to predict multiple tokens at once.

One method to achieve this is Medusa, which increases generation speed by up to three times compared to standard LLMs.

How does Medusa work? In Medusa, each head is assigned a specific position in the generative process. For instance, with four heads, each head predicts one token: the first head predicts the first token, the second head predicts the second token, and so on.

With the top-k predictions of tokens for each position ready, the model builds a set of candidates. The longest valid candidate is then chosen using a specific heuristic.

What makes a candidate valid? To determine this, the researchers use a typical acceptance scheme: the candidate does not have to be the single most likely continuation, but it must still be plausible and coherent within the context.
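
To illustrate the draft-and-verify idea behind this kind of multi-head decoding, here is a simplified, self-contained sketch with a greedy acceptance rule. The toy model and the strict exact-match check are assumptions made for clarity; real systems such as Medusa verify whole candidate trees in a single batched pass and use looser acceptance criteria.

```python
from typing import List, Sequence

def toy_model(context: Sequence[int], n_heads: int = 4) -> List[int]:
    # Hypothetical stand-in for a multi-head model: head i "predicts" the token
    # i + 1 positions ahead. These drafts are self-consistent by construction.
    return [context[-1] + i + 1 for i in range(n_heads)]

def speculative_step(model, context: List[int]) -> List[int]:
    """One draft-and-verify decoding step (greedy acceptance, unbatched)."""
    draft = model(context)  # several candidate tokens from ONE forward pass
    accepted = [draft[0]]   # head 1 is an ordinary next-token head
    for tok in draft[1:]:
        # Re-check each further guess as a normal next-token prediction.
        # (A real implementation verifies all positions in one batched pass.)
        if model(context + accepted)[0] != tok:
            break
        accepted.append(tok)
    return accepted         # between 1 and n_heads tokens per step

print(speculative_step(toy_model, [1, 2, 3]))  # -> [4, 5, 6, 7]
```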

According to researchers, multi-token prediction offers several key advantages that enhance the capabilities of LLMs:

1. Better Learning Efficiency:

  • Multi-token prediction enables the model to learn more effectively from the same dataset.
  • A 13-billion-parameter model trained with this method solved 12% more problems on the HumanEval coding benchmark and 17% more on MBPP than comparable next-token models.

2. Improved Generative Performance:

  • Models trained with multi-token prediction excel in tasks requiring content generation, such as coding.
  • These models capture and predict long-term patterns more accurately, outperforming other strong models by a significant margin.

3. Faster Inference:

  • Multi-token prediction models are up to three times faster at making predictions.
  • This speed is crucial for real-world applications where efficiency is essential.

Despite its advantages, multi-token prediction has some limitations that need to be addressed:

  1. Context Understanding: Processing multiple tokens independently or in parallel can lead to difficulties in understanding context. This approach might miss nuances and relationships between words that are crucial for accurate comprehension.
  2. Model Complexity: Designing models that can efficiently handle multi-token processing without compromising accuracy is complex and challenging.
  3. Data Requirements: Effective multi-token processing often requires vast amounts of training data to accurately learn the nuances of language patterns.

These challenges indicate that while multi-token prediction holds promise, further research and development are needed to fully realize its potential.

Multi-token prediction represents a significant advancement in the training of large language models (LLMs), offering notable improvements in learning efficiency, generative performance, and inference speed. By enabling models to predict multiple tokens simultaneously, this approach enhances the ability to capture and generate coherent text, particularly benefiting larger models and complex tasks.

The experimental validation highlights its potential, showing pronounced benefits in scaling and algorithmic reasoning, while also addressing issues of memory efficiency. However, challenges remain, such as ensuring effective context understanding, managing model complexity, and handling large data requirements.

As research progresses, addressing these limitations will be crucial for optimizing multi-token prediction. Overall, this method paves the way for more efficient and powerful LLMs, marking an important step forward in the field of artificial intelligence.

References

  1. Meta Research Paper — Better & Faster Large Language Models via Multi-token Prediction
