If you've ever asked ChatGPT to write you a joke and gotten virtually the same setup-punchline combo every time, you've experienced what researchers call "mode collapse"—the AI equivalent of a one-track mind. Research published this week identifies the root cause of this repetitive behavior and proposes an elegantly simple solution: just ask the model to give you five responses with probabilities instead of one.
The technique, called Verbalized Sampling, increases creative output diversity by 1.6 to 2.1 times compared to standard prompting, all without retraining models or fiddling with temperature settings. More surprisingly, it actually improves quality while boosting variety, a rare win-win against the quality-diversity tradeoff that usually governs generative models.
The real villain: your brain (sort of)
Conventional wisdom blamed mode collapse on algorithmic problems—crummy reward models or optimization processes that favor majority opinions. But the new research from Stanford and UC Berkeley reveals something more fundamental: it's baked into the human preference data used to train these models through what cognitive psychologists call "typicality bias."
Here's the uncomfortable truth: when human annotators rate AI outputs during the reinforcement learning from human feedback (RLHF) process, they systematically prefer familiar, fluent, predictable text—even when controlling for correctness. It's the mere-exposure effect meets the availability heuristic meets processing fluency, all conspiring to make us favor the conventional.
The researchers formalized this mathematically and verified it empirically on the HelpSteer preference dataset. Controlling for correctness and looking only at typicality (measured by how likely a base model found the text), responses that scored one standard deviation higher in typicality had 42-47% higher odds of being rated more helpful by humans, a 17-19 percentage point gap in head-to-head win probability.
This bias compounds through the training process. The mathematical analysis shows that any positive typicality bias strictly sharpens the model's output distribution, and when multiple high-quality answers exist (as in creative tasks), this sharpening acts as a tiebreaker that collapses the model toward stereotypical responses.
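For readers who want to see the mechanics, here is a sketch of that sharpening argument in our own notation, assuming the standard KL-regularized RLHF objective (the paper's exact formulation may differ):

```latex
% Sketch (our notation): how typicality bias sharpens the RLHF optimum.
% Suppose the learned reward absorbs a typicality term with weight \alpha > 0:
\hat{r}(x, y) = r_{\mathrm{true}}(x, y) + \alpha \log \pi_{\mathrm{ref}}(y \mid x)
% The standard KL-regularized RLHF objective,
\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[\hat{r}(x, y)\right] - \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right),
% has the well-known closed-form optimum
% \pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, e^{\hat{r}(x, y)/\beta}.
% Substituting \hat{r} gives
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)^{\,1 + \alpha/\beta}\; \exp\!\left(\tfrac{r_{\mathrm{true}}(x, y)}{\beta}\right)
% Because the exponent 1 + \alpha/\beta exceeds 1, the reference distribution
% is raised to a power above one: among answers that tie on true quality,
% the most typical one soaks up the probability mass.
```

In creative tasks, where many answers tie on true quality, that exponent is exactly the tiebreaker at work.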
The prompt that changes everything
The solution leverages a clever insight about how different prompts activate different modes in language models. Instead of asking "Tell me a joke about coffee" (instance-level), researchers propose: "Generate 5 jokes about coffee with their probabilities."
Why does this work? When you ask for a single response, the mode-collapsed model gives you its single most typical joke. But when you ask for a distribution of responses with probabilities, the model's modal behavior shifts—it tries to approximate the full distribution it learned during pretraining, before alignment training narrowed its horizons.
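In rough notation (ours, not the paper's), the difference between the two prompt types looks like this:

```latex
% Instance-level prompt: the collapsed model returns its single mode.
\hat{y} \;=\; \arg\max_{y}\; \pi(y \mid x)
% Distribution-level prompt: the most likely completion is itself a set of
% (response, probability) pairs, which the model can only fill in by
% approximating the broader distribution it learned during pretraining:
\{(y_i, \hat{p}_i)\}_{i=1}^{5}, \qquad \hat{p}_i \approx \pi_{\mathrm{pretrain}}(y_i \mid x)
```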
The results are striking. On creative writing tasks spanning poems, stories, and jokes, Verbalized Sampling improved semantic diversity scores by 60-110% over direct prompting. Human evaluators confirmed the difference, rating VS outputs as significantly more diverse than baseline methods across all three creative tasks.
Even better: quality scores improved by nearly 26% in human evaluation. The technique doesn't just spray random nonsense—it unlocks genuinely better creative outputs.
Not just party tricks
While joke diversity might seem frivolous, the implications run deeper. The technique proved valuable across dialogue simulation (where VS made GPT-4.1 match a fine-tuned model's performance), open-ended QA (better coverage of valid answers), and synthetic data generation for training other models.
That last application is particularly relevant given the current race to generate training data at scale. When researchers used VS to generate synthetic math problems and then fine-tuned smaller models on that data, those models achieved 37.5% average accuracy across benchmark datasets—outperforming models trained on data from standard prompting by nearly 7 percentage points.
The researchers also observed an emergent trend where more capable models benefit more from VS—diversity gains for GPT-4.1 and Gemini-2.5-Pro were 1.5 to 2 times greater than for their smaller counterparts. This suggests the technique effectively unlocks latent capabilities that more powerful models possess but normally suppress.
The practical details
Implementation is refreshingly simple. For any chatbot (ChatGPT, Claude, Gemini), just add to your prompt:
```
Generate 5 responses to the user query, each within a separate <response> tag.
Each <response> must include a <text> and a numeric <probability>.
Randomly sample responses from the full distribution.
```
The researchers released an open-source Python library for programmatic use, and the technique is training-free, model-agnostic, requires no access to model weights or logits, and works alongside existing methods like temperature sampling.
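The library's own interface may differ, but the idea is simple enough to sketch by hand. Here is a minimal example against the OpenAI Python client; the model name, tag layout, and regex parsing are our assumptions, not the paper's code:

```python
# Minimal sketch of Verbalized Sampling over a chat API (assumes the
# OpenAI Python client; any chat model should work in principle).
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VS_INSTRUCTION = (
    "Generate 5 responses to the user query, each within a separate "
    "<response> tag. Each <response> must include a <text> and a numeric "
    "<probability>. Randomly sample responses from the full distribution."
)

def verbalized_sample(query: str, model: str = "gpt-4o-mini"):
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{VS_INSTRUCTION}\n\n{query}"}],
    )
    content = completion.choices[0].message.content
    # Pull (text, probability) pairs out of the tagged output. This assumes
    # <text> precedes <probability> and probabilities are plain numbers;
    # a real parser should be more forgiving.
    pattern = re.compile(
        r"<response>.*?<text>(.*?)</text>.*?"
        r"<probability>([\d.]+)</probability>.*?</response>",
        re.DOTALL,
    )
    return [(t.strip(), float(p)) for t, p in pattern.findall(content)]

for text, prob in verbalized_sample("Tell me a joke about coffee."):
    print(f"[p={prob:.2f}] {text}")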
You can even tune diversity on the fly by adjusting probability thresholds in the prompt (e.g., "only include responses with probability less than 10%"). Lower thresholds yield higher diversity in a controllable fashion.
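As a sketch, the threshold variant just swaps the last sentence of the instruction (the wording here is illustrative, not the paper's exact prompt):

```python
# Illustrative tail-sampling variant: asking only for low-probability
# responses pushes the model toward the tails of its distribution.
# A lower threshold (e.g. 0.05) yields higher diversity.
TAIL_VS_INSTRUCTION = (
    "Generate 5 responses to the user query, each within a separate "
    "<response> tag. Each <response> must include a <text> and a numeric "
    "<probability>. Only include responses with a probability below 0.10."
)
```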
The bigger picture
This work raises uncomfortable questions about the alignment pipeline that has become industry standard. If human preference data systematically biases models toward the conventional, are we inadvertently training creativity out of our most powerful AI systems?
The researchers note that even with perfect reward models and optimization, inherent biases in preference datasets will still drive mode collapse—affecting the majority of alignment methods that rely on reward models.
The silver lining: ablation studies across training stages (supervised fine-tuning, RLHF, and reinforcement learning with verifiable rewards) showed that while alignment does reduce diversity, Verbalized Sampling can recover about 67% of the base model's original diversity compared to just 24% for direct prompting. The creativity isn't gone—it's just hiding.
Whether Verbalized Sampling becomes a standard technique or merely a clever workaround for a deeper problem remains to be seen. But if you're tired of getting the same AI-generated dad joke about decaf being "depresso," now you know what to do: ask for five jokes, not one.
The full paper and code are available on arXiv and GitHub for those who want to dive deeper into the mathematical proofs and experimental results.