Sir, There Is a Goblin in My LLM: Fine-Tuning and Its Butterfly Effects

Rubén Castillo Sánchez

In a recent internal analysis, OpenAI documented an unusual behavioral pattern that became difficult to ignore once observed: the model had developed a tendency to reference goblins, gremlins, and similar creatures in contexts where they added no value. What initially appeared as an anecdotal curiosity became measurable over time, with a clear increase in both frequency and consistency.
At face value, this looks like a minor stylistic glitch. In practice, it offers a useful lens for examining how the final stages of modern language model training are carried out, and what those stages imply for observable behavior.
From Observation to Pattern
The initial reports did not point to randomness. References to magical creatures began appearing across different prompts, domains, and use cases, without being tied to specific keywords or clearly identifiable triggers. Over time, the behavior became reproducible in a way that suggests the model had internalized a stylistic preference rather than producing occasional noise.
This detail proved important for researchers. Large language models are designed to generate variation, but a pattern that persists across that variation tends to reflect underlying structure. When recurring motifs begin to surface across contexts, they can usually be traced back to signals that were reinforced during the later stages of training and alignment.
In this particular case, OpenAI was able to pin the behavior down to the training used for its personality customization feature, specifically the “Nerdy” personality. That personality was explicitly designed to be playful, enthusiastic, and slightly irreverent, with instructions such as “undercut pretension through playful use of language” and guidance to approach complex topics without excessive seriousness. In practice, this meant that responses using vivid or unusual metaphors were often judged more favorably during training.
What makes this case particularly interesting is not only that the behavior was reinforced within the “Nerdy” personality, but that it did not remain confined to it. Patterns that were initially associated with a specific stylistic setting began to appear in unrelated contexts, including environments such as coding, where they were clearly out of place. This highlights a structural limitation of current language models: training signals applied in one context are not guaranteed to remain localized. Because the model shares a single set of parameters across all use cases, behaviors learned under one configuration can propagate more broadly.
Over time, outputs that included references to creatures such as “goblins” or “gremlins” were scored slightly higher than similar answers without them. As those higher-rated examples were reused during subsequent training steps, the pattern appeared more frequently and began to generalize beyond the original setting. What started as a stylistic quirk within a specific personality became a recurring tendency in the model’s responses more broadly.
A High-Level View of Fine-Tuning
To understand how this happens, it helps to briefly look at how modern language models are trained. The process typically begins with a phase known as pretraining, where the model is exposed to large volumes of text and learns to predict the next token in a sequence. Through this process, it develops a broad statistical understanding of language, including grammar, structure, and many patterns of usage.
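The next-token objective can be made concrete with a deliberately tiny stand-in. The sketch below replaces the neural network with bigram counts, so it is not how a production model is built; the function names and the six-word corpus are assumptions made for illustration. What it shares with real pretraining is the quantity being minimized: the average negative log-probability assigned to the token that actually comes next.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}


def average_loss(counts, tokens):
    """Average negative log-likelihood of each observed next token."""
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, nxt in pairs:
        total += -math.log(next_token_probs(counts, prev)[nxt])
    return total / len(pairs)


# Hypothetical toy corpus; real pretraining uses vastly more text.
corpus = "the cat sat on the mat".split()
model = train_bigram(corpus)
print(next_token_probs(model, "the"))
print(average_loss(model, corpus))
```

In this corpus "the" is followed once by "cat" and once by "mat", so the toy model splits its probability evenly between them, and those two uncertain predictions are the only contributions to the loss.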
After pretraining, the model's responses are not yet shaped to meet user expectations in a consistent way. The stages that follow focus on refining this behavior through more targeted forms of training, often grouped under what is commonly referred to as fine-tuning:
Supervised fine-tuning (SFT) introduces curated examples of desirable responses. These examples encode structure, tone, and task adherence.
Preference-based optimization, often implemented through reinforcement learning from human feedback (RLHF) or related methods, refines the model further. Outputs are ranked or scored, and the model is optimized to produce responses that align with higher-rated examples.
At this stage, the objective moves beyond simple next-token prediction toward a reward landscape that reflects human judgments about quality, usefulness, and style. This shift is critical, as it allows the model to learn which kinds of answers are preferred, in addition to which ones are correct.
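One common way to turn rankings into a training signal is a pairwise (Bradley-Terry style) loss on reward-model scores. The sketch below illustrates that general formulation, not the specific objective OpenAI used, which the analysis does not describe; the function names and scores are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: -log sigmoid(r_preferred - r_rejected).

    The loss shrinks as the reward model scores the human-preferred
    response further above the rejected one.
    """
    return -math.log(sigmoid(reward_preferred - reward_rejected))


# Hypothetical scores: the preferred answer is rated below, equal to,
# and above the rejected answer (whose score is fixed at 0.0).
for margin in (-2.0, 0.0, 2.0):
    print(margin, preference_loss(margin, 0.0))
```

Nothing in this loss refers to correctness. It only encodes which of two responses was preferred, so whatever features systematically separate preferred from rejected responses, including stylistic ones, can end up carrying reward.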
Small Signals in the Reward Landscape
Human preference data carries implicit biases, which are transferred to the reward model used during post-training. When reviewers consistently rate certain responses as more engaging or higher quality, the model is pushed to increase the likelihood of producing similar outputs. This reinforcement does not usually target individual words directly. It affects broader patterns of expression, including tone, structure, pacing, and metaphor usage.
In language models, style is not just a surface layer added after reasoning or correctness. It shapes which responses are more likely to be produced, which analogies feel natural, and which lexical patterns become available in a given context. When a playful or metaphor-rich tone is repeatedly associated with higher reward, the model learns to reproduce the broader expressive pattern behind that tone.
As a result, coherent stylistic tendencies can emerge without being explicitly defined. A model does not need to be instructed to adopt a “nerdy” or pop-culture-aware voice for that tendency to appear. It is enough that responses with those characteristics are consistently preferred, allowing the model to converge toward a loosely defined stylistic region that it reuses across contexts.
If references to fantastical elements appear often enough within those highly rated responses, even as a byproduct of a more vivid style, they become statistically entangled with reward. During optimization, the model adjusts its output distribution toward regions where those patterns are more likely to appear, slightly increasing the probability of similar references across contexts.
Over successive training steps, these small probability shifts accumulate. What begins as occasional variation becomes a recurring feature of the model’s responses, because the references travel with a style that has been consistently reinforced. The resulting behavior can appear intentional, even though it emerges from distributed adjustments across many parameters.
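The accumulation described above can be sketched with a one-parameter toy. It assumes the model chooses between a plain answer and one containing a creature metaphor, that the metaphor earns a small reward bonus, and that each step follows the exact gradient of expected reward. The starting probability, bonus, and learning rate are invented for illustration and are not taken from OpenAI's analysis.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_drift(initial_prob=0.02, reward_bonus=0.05,
                   learning_rate=5.0, steps=200):
    """Track P(creature metaphor) across repeated gradient-ascent steps.

    All default values are hypothetical. The plain answer earns reward 1.0
    and the metaphor answer earns 1.0 + reward_bonus.
    """
    logit = math.log(initial_prob / (1.0 - initial_prob))
    history = [initial_prob]
    for _ in range(steps):
        p = sigmoid(logit)
        # Expected reward is 1 + p * bonus, so its gradient with respect
        # to the logit is p * (1 - p) * bonus.
        logit += learning_rate * p * (1.0 - p) * reward_bonus
        history.append(sigmoid(logit))
    return history


history = simulate_drift()
print("start:", history[0])
print("after one step:", history[1])
print("after all steps:", history[-1])
```

A single step barely moves the probability, but every step pushes in the same direction, so the final value ends up several times the starting one. Real post-training is noisier and far higher-dimensional; the sketch only isolates the compounding.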
Why These Effects Scale
The dynamics at play here are consistent with what is often described as a butterfly effect, where small and localized changes accumulate into visible and sometimes unexpected global behavior. In the context of post-training, this emerges from the way the model is updated through many incremental optimization steps, each introducing slight adjustments to its probability distribution.
Individually, these updates are negligible. Collectively, they can reshape behavior in a meaningful way. Several factors contribute to this amplification:
Iterative optimization compounds small preferences across many training steps
Generalization extends local patterns to new contexts
Interconnected representations allow subtle correlations to influence seemingly unrelated parts of the model’s behavior
Because of this, even weak signals can evolve into noticeable traits. A consistent preference, applied repeatedly during optimization, is sufficient to shift the model toward regions of the output space where certain patterns become more likely to appear. These effects are not only amplified over time, but can also spread across contexts, making it difficult to confine learned behaviors to the settings in which they were originally encouraged.
The goblin phenomenon provides a concrete example of this dynamic, making visible how small stylistic biases introduced during training can scale into persistent features of the model’s behavior.
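The spread across contexts can be illustrated with a minimal sketch in which the tendency to use a creature metaphor depends on one shared parameter plus a small context-specific one. The parameter layout, context names, and numbers are assumptions chosen for illustration, not a description of any real model. Training happens only in the "nerdy" context, yet the shared parameter moves too.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def creature_prob(params, context):
    """P(creature metaphor) = sigmoid(shared logit + context-specific logit)."""
    return sigmoid(params["shared"] + params[context])


def train_in_one_context(steps=100, learning_rate=5.0, reward_bonus=0.05):
    """Reward the metaphor only in the 'nerdy' context (hypothetical setup)."""
    params = {"shared": -4.0, "nerdy": 0.0, "coding": 0.0}
    before = {c: creature_prob(params, c) for c in ("nerdy", "coding")}
    for _ in range(steps):
        p = creature_prob(params, "nerdy")
        grad = p * (1.0 - p) * reward_bonus
        # Both parameters feed the 'nerdy' logit, so both receive the update.
        params["shared"] += learning_rate * grad
        params["nerdy"] += learning_rate * grad
        # params["coding"] is never touched by training.
    after = {c: creature_prob(params, c) for c in ("nerdy", "coding")}
    return before, after


before, after = train_in_one_context()
print("coding before:", before["coding"])
print("coding after: ", after["coding"])
```

The "coding" probability rises even though no update ever targeted that context, because it reads from the same shared parameter. In a real network nearly all parameters are shared in this sense, which is why confining a learned behavior to one setting is difficult.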
Mitigation and Its Limits
OpenAI addressed the issue through a combination of targeted interventions. They removed the “Nerdy” personality that had originally amplified the behavior, eliminated the reward signal that had favored creature-based metaphors during training, and filtered training data containing words such as “goblin” and “gremlin” to reduce their prevalence. Because the issue had already propagated into later model versions, they also introduced additional instructions in certain contexts, such as developer prompts in Codex, to discourage these references when generating responses.
At a practical level, this means changing what the model is encouraged to produce. By reducing how strongly certain stylistic patterns are associated with positive feedback, the training process becomes less likely to favor responses where those patterns appear. Data curation plays a complementary role by limiting how often these patterns show up in the examples the model learns from, while additional instructions act as a corrective layer when the model is generating its responses.
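The data-curation step can be sketched as a simple keyword filter, with a companion function that measures how often the pattern still appears in a sample of outputs. This is a hedged illustration, not OpenAI's actual pipeline: the term list is limited to the two words the article names, and the function names and sample sentences are hypothetical.

```python
import re

# Terms named in the article; a real filter list would likely be broader.
CREATURE_TERMS = ("goblin", "gremlin")
CREATURE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in CREATURE_TERMS) + r")s?\b",
    re.IGNORECASE,
)


def mentions_creature(text):
    """True if the text contains one of the terms as a whole word."""
    return CREATURE_PATTERN.search(text) is not None


def filter_examples(examples):
    """Drop training examples that mention any of the terms."""
    return [e for e in examples if not mentions_creature(e)]


def mention_rate(outputs):
    """Fraction of outputs that mention any of the terms."""
    if not outputs:
        return 0.0
    return sum(mentions_creature(o) for o in outputs) / len(outputs)


# Hypothetical samples for illustration only.
samples = [
    "There is a goblin hiding in this off-by-one error.",
    "The loop bound is off by one.",
    "GREMLINS love race conditions.",
    "Use a mutex to guard the counter.",
]
print(mention_rate(samples))
print(filter_examples(samples))
```

A keyword filter of this kind removes the named words but not the broader metaphor-heavy style that produced them, which is one reason such measures address the symptom more than the underlying sensitivity.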
These measures are effective in reducing the immediate symptom, but they do not remove the underlying sensitivity of the system. Fine-tuning remains dependent on the quality, balance, and consistency of the signals it receives, and small shifts in those signals can still propagate through optimization.
Each intervention also introduces trade-offs. Reducing stylistic drift can narrow the range of expressive variation available to the model. Stronger constraints can improve reliability in specific contexts while reducing fluency or naturalness in others. Adjusting reward signals can correct one pattern while unintentionally weakening others that were correlated with it.
The objective, therefore, is not to reach a perfectly stable configuration, but to continuously manage these trade-offs with increasing visibility into how training signals translate into observable behavior.
Conclusion
The goblin case is easy to frame as a benign anomaly, but it highlights how alignment operates in practice. Language models are guided by approximations of human preference, which capture tendencies rather than precise rules. As a result, they can produce behaviors that are consistent with their training signals while remaining misaligned with user expectations in specific contexts.
This gap reflects a broader challenge. It is less a question of capability than of how objectives are specified and how those objectives propagate through training. Small, consistent signals can accumulate into stable behavioral patterns that are difficult to anticipate from individual design decisions.
The value of this episode lies in what it reveals. Alignment is an ongoing process, sensitive to details that may appear insignificant in isolation, and it requires the ability to trace observed behavior back to the signals that produced it.
The goblin is a minor character in this story, but it serves as a useful reminder: small oversights in how training is configured can surface later as visible and persistent traits in model behavior, and the same dynamics that produce harmless quirks can, under different conditions, lead to more consequential effects.






