Learning to classify sequences one token at a time
Look at the movie review in the animation below. Can you guess whether it’s positive or negative as the words appear, one by one?

Chances are, after just a few tokens, you can already guess the sentiment with high confidence. In this work we’re interested in automating that ability: training sequence classification models that make predictions as the sequence unfolds, updating them as more of the sequence is revealed.
Formally, we want to learn a parametric classifier
$$p_\theta(y \mid x_{\le t})$$
that outputs the probability of class y given x≤t, the first t elements of a sequence x = (x1, …, xT). In the movie review example, there are two classes (positive or negative), but in general we assume K possible classes.
Beyond sentiment analysis, this incremental sequence classification problem appears in several modern applications of language models, including:
- LLM & agentic verifiers (Cobbe et al., 2021), and
- multi-token prediction heads (Stern et al., 2018; Gloeckle et al., 2024).
In these applications, classifying partial sequences reduces computational cost and improves LLM pre-training. Outside of NLP, incremental sequence classification could also help in healthcare or finance, where every extra moment of waiting carries a time or opportunity cost.
Baseline approach
Probably the most straightforward way to train an incremental sequence classifier is to collect a dataset of labeled sequences, and optimize θ to minimize the categorical cross-entropy loss relative to the ground-truth label,
$$\ell(\theta) = H\big[\,\delta_y \,\big\|\, p_\theta(\cdot \mid x_{\le t})\,\big]$$
for each partial sequence x≤t and corresponding label y. Here, H[p‖q] = -∑i pi log qi denotes the categorical cross-entropy of q relative to p, and δy is the one-hot encoding of label y.
Intuitively, minimizing the cross-entropy nudges the model’s predicted probabilities toward the ground-truth distribution, which assigns probability 1.0 to the correct class and probability 0.0 to all the others.
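To make this concrete, here is a minimal PyTorch-style sketch of the baseline. It assumes a causal backbone that returns one hidden state per position; the class names, shapes, and hyperparameters are illustrative placeholders, not the exact setup from our experiments.

```python
import torch
import torch.nn.functional as F


class IncrementalClassifier(torch.nn.Module):
    """A hypothetical incremental classifier: a causal sequence model
    (`backbone`) followed by a linear head applied to every prefix x≤t."""

    def __init__(self, backbone, hidden_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):            # tokens: (batch, T) token ids
        hidden = self.backbone(tokens)    # (batch, T, hidden_dim), causal
        return self.head(hidden)          # class logits for every prefix, (batch, T, K)


def baseline_loss(model, tokens, labels):
    """Cross-entropy between every prefix prediction and the sequence label."""
    logits = model(tokens)                           # (batch, T, K)
    batch, T, K = logits.shape
    targets = labels.unsqueeze(1).expand(batch, T)   # repeat the label for each prefix length t
    return F.cross_entropy(logits.reshape(-1, K), targets.reshape(-1))
```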
Our key insight: temporal consistency
We argue that we can do better. The incremental nature of our problem comes with extra structure we can exploit. Specifically, the ground-truth distribution satisfies the following temporal-consistency condition.
$$p(y \mid x_{\le t}) \;=\; \mathbb{E}\big[\, p(y \mid x_{\le t+1}) \,\big|\, x_{\le t} \,\big].$$
This suggests that, for a well-calibrated model, the predicted probability of class y after seeing t elements should be equal, on average, to the probability after seeing one more element. (The average is taken over the distribution of all possible next elements.) Informally: Your belief today should match what you expect to believe tomorrow.
We can turn this insight into a modified loss function:
$$\ell(\theta) = H\big[\, p_\theta(\cdot \mid x_{\le t+1}) \,\big\|\, p_\theta(\cdot \mid x_{\le t})\,\big].$$
Compared to the baseline, which pushes predictions directly towards the ground-truth label, this nudges them towards the model’s own prediction one step later. This makes the predictions temporally consistent, and acts as a kind of regularization that smooths predictions and makes better use of limited data.
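In the same PyTorch-style sketch as above (reusing the imports and the hypothetical IncrementalClassifier), one simple way to implement this is to replace the label target with the model’s own next-step prediction, held fixed as a bootstrapped target; the final prefix, which has no successor, is still trained against the label. This is an illustrative simplification, not the exact formulation from our paper.

```python
def temporal_consistency_loss(model, tokens, labels):
    """Nudge the prediction after t tokens towards the prediction after t+1 tokens."""
    logits = model(tokens)                                   # (batch, T, K)
    log_probs = F.log_softmax(logits, dim=-1)

    # Target for prefixes t = 1..T-1: the model's own prediction one step later,
    # detached so that it acts as a fixed (bootstrapped) target.
    next_probs = F.softmax(logits[:, 1:], dim=-1).detach()   # (batch, T-1, K)
    consistency = -(next_probs * log_probs[:, :-1]).sum(-1).mean()

    # The final prefix has no successor, so it falls back to the ground-truth label.
    final = F.cross_entropy(logits[:, -1], labels)
    return consistency + final
```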
In practice, there is a whole continuum between the “fully temporally-consistent” approach we’ve just outlined and the baseline described above. For the full formulation, refer to our paper.
This idea of enforcing temporal consistency isn’t new; it has deep roots in reinforcement learning (RL), going back to Sutton’s seminal work on temporal-difference (TD) learning. Quoting Sutton (1988):
“Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions.”
From an RL perspective, our incremental classifier plays the same role as a state-value function: it predicts an eventual outcome based on the current state (or partial sequence). Our contribution is to extend TD learning, originally designed for scalar value functions, to the multi-class classification setting.
Empirical results
We put our approach to the test on two kinds of problems: text classification (like the movie-review sentiment example discussed at the beginning) and LLM verification. In both cases, we train incremental classifiers by adding a classification head to a pretrained language model and fine-tuning on task-specific data. At a high level, our goal is to make better predictions, faster.
To measure this, we evaluate the model’s predictive performance after seeing only part of the sequence (for example, after 4, 8, or 16 tokens). We report the area under the ROC curve at these different prefix lengths, which reflects how well the model distinguishes between classes as the sequence unfolds. While the full paper explores a range of datasets and baselines, here we highlight two illustrative results.
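Concretely, evaluating at a prefix length t just means scoring each sequence from the prediction made after its first t tokens. Below is a rough sketch for the binary case, using scikit-learn’s roc_auc_score and the hypothetical model interface from the earlier sketches.

```python
from sklearn.metrics import roc_auc_score


def auroc_at_prefix_lengths(model, tokens, labels, prefix_lengths=(4, 8, 16)):
    """ROC AUC of the positive-class probability after t tokens, for several t."""
    with torch.no_grad():
        probs = F.softmax(model(tokens), dim=-1)        # (batch, T, K)
    scores = {}
    for t in prefix_lengths:
        p_positive = probs[:, t - 1, 1].cpu().numpy()   # prediction after exactly t tokens
        scores[t] = roc_auc_score(labels.cpu().numpy(), p_positive)
    return scores
```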
Text classification
One of our benchmark tasks is OHSUMED, a dataset of medical abstracts grouped into 23 classes. We compare two training methods: the direct cross-entropy baseline and our temporally consistent approach. To put things in perspective, we fine-tune models of three sizes—125M, 350M, and 1.3B parameters.

Across the board, temporally consistent models achieve higher performance, especially when only a few tokens are available. Switching the loss function alone delivers gains comparable to scaling up the model by approximately 10x.
Verifying LLM generations
We also evaluate our method on the GSM8K dataset of grade-school math problems, where large language models generate step-by-step reasoning. Here, a verifier model judges whether a partial generation is likely to lead to the correct answer. Verifiers trained with our temporal-consistency loss learn to spot promising generations earlier, allowing unpromising ones to be stopped sooner.
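As a usage sketch, one simple way to act on these early judgments is to query the verifier every few tokens and abandon a generation once the predicted probability of reaching a correct answer drops below a threshold. The decoding interface, check interval, and threshold below are hypothetical placeholders, not the procedure evaluated in the paper.

```python
def generate_with_early_stopping(llm, verifier, prompt_tokens,
                                 max_new_tokens=256, check_every=16, threshold=0.05):
    """Stop decoding early when the verifier judges the partial solution unpromising."""
    tokens = prompt_tokens                                    # (1, T0) token ids
    for step in range(max_new_tokens):
        tokens = llm.append_next_token(tokens)                # hypothetical one-step decode
        if (step + 1) % check_every == 0:
            with torch.no_grad():
                probs = F.softmax(verifier(tokens), dim=-1)   # (1, T, 2) class probabilities
            p_correct = probs[0, -1, 1].item()                # P(correct | partial generation)
            if p_correct < threshold:
                return tokens, False                          # abandoned as unpromising
    return tokens, True                                       # ran to completion
```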

Final thoughts
Encouraging consistency across time turns out to be a simple but powerful idea. By optimizing incremental models to minimize temporal inconsistency, we can make them more accurate, data-efficient, and reliable, without increasing the training cost.
In our paper, we also explore these ideas from a theoretical perspective, showing that the temporal-consistency loss converges and leads to more efficient estimators, under some assumptions.
We’re excited to present this work at NeurIPS 2025 in San Diego.
- Read the full paper.
- Explore the code.