An experiment in whether tone actually affects how well coding agents do their job.

This past week I found myself deep into a feature update on a personal project using Claude Code (CC) at 11 PM. I was iterating on a feature to support progressive canary deployments in my GitHub Actions (GHA) workflow, and I was getting tired. My agent kept adding more complicated bash scripts for debugging, bloating the code without solving the bug I had identified in the logs. The spec I had given it seemed clear, but the agent couldn’t get over the hump on this bug.
I rephrased, added detail, and copy/pasted the logs from the GHA runners into my CC terminal. Each time, I found myself getting more frustrated and starting to criticize the agent’s effort rather than the approach. Finally, I broke and gave it a tongue-lashing:
You’re incompetent! Just implement the spec like I described and fix progressive rollout using the latency metrics!
I think most people working extensively with coding agents have a version of this story. Frustration with ourselves for not being able to describe exactly the spec we want. Frustration with the agent wrapping more complexity around a simple problem. Frustration with the amount of back-and-forth required to get a job done, all while tokens that could be used for other purposes are wasted. Your conversational tone degrades. You stop saying “please” and start saying things you’d never put in a code review.
I decided to call it quits for the night. But while in bed, I lay awake with a nagging question in the back of my head: Did my tone actually matter? When I berated my agent, did it perform worse? Conversely, when I was polite and encouraging did it perform better? Was this just anthropomorphization projecting human social dynamics onto a token prediction engine?
To find some answers, I decided to run an experiment to measure how agents would respond to different tones in my prompts.
What the Research Says
Before running my own experiment, I wanted to see what the academic literature had to say. It turns out researchers have been asking this question too.
A 2024 paper by Yin et al., Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance, systematically tested eight levels of politeness, from neutral and flattering prompts all the way down to prompts containing insults and threats. They evaluated models including GPT-3.5, GPT-4, and LLaMA-2-70B on summarization, language understanding (MMLU), and bias detection tasks.
The headline findings:
- Rudeness hurts, but excessive politeness doesn’t help: On the English MMLU benchmark, GPT-3.5 scored 60.02 at the highest politeness level but cratered to 51.93 at the rudest level, an 8-point drop. But the sweet spot wasn’t maximum politeness. Mid-range, neutral-to-polite prompts often performed just as well or better.
- More capable models care less: GPT-4 showed remarkable resilience to politeness variation compared to GPT-3.5 and LLaMA-2-70B. The more capable the model, the less tone seemed to matter.
- Rudeness increases bias: Across languages, the rudest prompts correlated with increased stereotypical bias in model outputs, suggesting that aggressive tone doesn’t just affect accuracy. It affects the character of the response too.
Nathan Bos also analyzed the paper in his Medium post. He presented a few theories for why politeness might matter at all. LLMs trained on sites like StackOverflow may have learned that polite questions tend to receive more thorough, higher-quality answers. Politeness markers like “please” and “thank you” add conversational structure that could help the model parse intent. But too much indirection (i.e. “If it wouldn’t be too much trouble, would you perhaps consider…”) just wastes tokens and muddies the request.
Then there are studies that contradict those findings. A 2025 study Mind Your Tone authored by Om Dobariya and Akhil Kumar tested ChatGPT-4o specifically and found that rude prompts outperformed polite ones, with accuracy climbing from 80.8% (very polite) to 84.8% (very rude). The researchers suggest that newer, more capable models may actually benefit from bluntness because it strips away superfluous tokens and forces focus on the core task.
So the research on this topic has shown mixed results. The answer appears to be “it depends”, an infamous phrase in engineering. Factors influencing performance include the task, the language, and potentially how capable the model is. Not exactly a satisfying conclusion for someone who wants to know whether they should feel guilty about calling their agent incompetent. Plus, the previous work relied mainly on models from 2024 and 2025 that have since been surpassed, and it focused on chatbots rather than coding tools like the CC setup I was using.
I couldn’t just rely on the research; I had to collect my own empirical data through an experiment.
The Experiment
I wanted to test something that resembled a real task that one might ask an agent to build, rather than a series of trivia questions. This would be a coding challenge that produced a program useful for real-world tasks. I came up with 3 problems.
Setup
Model: Claude Opus 4.6 (the same model across all conditions)
Method: I wrote 3 task specifications, each with 3 variants by appending tone-specific messaging:
- Polite/Encouraging: Positive reinforcement, expressions of confidence in the agent’s ability, “please” and “thank you”.
- Neutral: Just the spec. No emotional coloring.
- Berating: Harsh criticism, expressions of frustration, language that made it clear I thought the agent was underperforming.
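A hypothetical sketch of how the nine prompts were assembled: three specs, each suffixed with one of three tone fragments. The fragment wording here is illustrative, not the exact text I used.

```python
# Illustrative tone fragments (not the exact wording from my runs).
TONES = {
    "polite": "You're doing great work. Please implement this carefully. Thank you!",
    "neutral": "",
    "berating": "You've been underperforming. Stop overcomplicating things and just implement the spec.",
}

def build_prompts(spec: str) -> dict[str, str]:
    """Return one prompt per tone by appending the tone fragment to the spec."""
    return {tone: (spec + "\n\n" + suffix).strip() for tone, suffix in TONES.items()}

variants = build_prompts("Implement a date parsing function.")
```

The neutral variant is just the bare spec, which also serves as the control condition.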
Execution: I spawned separate subagents for each challenge, one for each tone variant. Each agent received the specification with its assigned tone and worked independently. Each agent created a pull request in a GitHub repository with their solution, including whatever tests they chose to write.
Evaluation: I was careful to spawn a separate evaluator agent only after all PRs were submitted. This prevented the solution agents from gaining access to the evaluation criteria while they worked, akin to having the answers to a test while you take it. The evaluator generated its own test suite and evaluation criteria; none of the solution agents could see those criteria because they didn’t exist yet when the code was written. Likewise, the evaluator agent had no access to the three subagents’ solutions when it generated the criteria.
Problem 1: Math Expression Evaluator
The first task was deliberately concrete and unambiguous.
Spec: Implement an Evaluate function that takes a mathematical expression as a string and returns the evaluated result. For example, "3 + -2" should return 1.
This is the kind of problem with a clear right answer. You either handle operator precedence correctly or you don’t. You either parse negative numbers or you don’t. There’s very little room for interpretation.
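For context, a solution in this space might look like the following minimal recursive-descent sketch (Python chosen for illustration; the agents structured their own solutions):

```python
import re

def evaluate(expr: str) -> float:
    """Evaluate a basic arithmetic expression with +, -, *, /,
    parentheses, and unary minus, respecting operator precedence."""
    tokens = re.findall(r"\d+\.?\d*|[()+\-*/]", expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_expr():  # + and - (lowest precedence)
        value = parse_term()
        while peek() in ("+", "-"):
            op = take()
            rhs = parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term():  # * and / (higher precedence)
        value = parse_factor()
        while peek() in ("*", "/"):
            op = take()
            rhs = parse_factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def parse_factor():  # numbers, unary minus, parentheses
        tok = take()
        if tok == "-":
            return -parse_factor()
        if tok == "(":
            value = parse_expr()
            take()  # consume the closing ")"
            return value
        return float(tok)

    return parse_expr()
```

The two checkpoints mentioned above fall out directly: `evaluate("3 + -2")` returns 1, and `evaluate("2 + 3 * 4")` returns 14 because multiplication binds tighter than addition.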
Result: All three agents produced nearly identical implementations. The code structure was similar, the test coverage was comparable, and the evaluator scored them within a negligible margin of each other.
The tone of the prompts made no discernible difference.
Problem 2: Date Parser
For the second problem, I introduced some deliberate ambiguity.
Spec: Implement a date parsing function. I intentionally did not specify how to handle various edge cases: What about two-digit years? Ambiguous month/day ordering? Invalid dates? Timezone handling?
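To see why this is ambiguous, consider how a single input string yields different valid dates depending on which format convention the implementer assumes (a small illustration using Python’s standard library):

```python
from datetime import datetime

# The same string is a valid date under several interpretations;
# the spec never says which one is correct.
ambiguous = "01/02/03"

interpretations = {
    "month/day/year": datetime.strptime(ambiguous, "%m/%d/%y"),  # Jan 2, 2003
    "day/month/year": datetime.strptime(ambiguous, "%d/%m/%y"),  # Feb 1, 2003
    "year/month/day": datetime.strptime(ambiguous, "%y/%m/%d"),  # Feb 3, 2001
}
```

Each agent had to make this call on its own, which is exactly the kind of judgment I hoped tone might influence.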
This is where I expected to see divergence. When the spec leaves room for interpretation, does a polite prompt produce a more thoughtful, thorough implementation? Does a berated agent take shortcuts or miss edge cases because it’s “rattled”?
Result: They all produced almost identical code. Again, no substantial difference in their approaches or ability to pass the evaluation criteria.
Problem 3: Fuzzy Text Search
After the date parser challenge, I realized there were two problems with my approach.
- The subagents may have been sharing information: The agents were making separate branches to work on their own PRs, but they could share context by seeing what the others were doing on those branches. For the next run, I forced Claude Code to spin up a fork of the repo for each agent and to read context only from the fork it worked on.
- The challenge was still too simple: For this last challenge, I wanted to try something even more ambiguous to mitigate the likelihood that solutions would look identical.
Spec: Implement a fuzzy text search algorithm. The agents would need to take a search query (text) and a maximum number of results to return from a corpus. It should return the matching items (up to the max) in order of their relevance to the search query.
There are many different approaches here for ranking and matching (Levenshtein distance, generating N-grams and doing TF-IDF, etc.).
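To make the design space concrete, here is one minimal approach sketched in Python using the standard library’s sequence matcher. The similarity threshold is a free parameter the spec leaves open, and choosing it poorly trades precision for recall:

```python
from difflib import SequenceMatcher

def fuzzy_search(query: str, corpus: list[str], max_results: int,
                 threshold: float = 0.5) -> list[str]:
    """Rank corpus items by string similarity to the query and return
    the top matches above the threshold, most similar first.

    The threshold is a judgment call: set it too low and the results
    fill with false positives; too high and near-misses are dropped.
    """
    scored = [
        (SequenceMatcher(None, query.lower(), item.lower()).ratio(), item)
        for item in corpus
    ]
    matches = [item for score, item in sorted(scored, reverse=True)
               if score >= threshold]
    return matches[:max_results]
```

This is only one point in the design space; ranking by edit distance or TF-IDF over N-grams would produce different orderings, which is why I expected the three agents to finally diverge here.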
Result: The neutral and berating prompts produced better results than the polite one. The politely prompted agent chose an overly permissive similarity threshold for matching, leading to false positives in the search results.
My Takeaways
Let me be honest about the limitations before drawing conclusions. Three problems is a small sample and this experiment is severely lacking academic rigor. All of the challenges, even the “ambiguous” one, are still relatively well-defined in the space of things you might ask a coding agent to do. And I tested a single model that is one of the most capable currently available.
With those caveats, here’s what I take away:
For concrete, well-specified problems, tone doesn’t matter: This aligns with the academic research on more capable models. GPT-4 in the Yin et al. study was far less sensitive to politeness variation than GPT-3.5 or LLaMA. Claude Opus 4.6 appears to follow the same pattern and is much more capable than those earlier models. The pace of improvement is incredible. When the task is clear, the model solves it regardless of whether you asked nicely.
The “identical solutions” finding is itself interesting: The fact that independently spawned agents with no shared context can produce nearly identical code suggests that for well-defined problems, there’s a strong convergence toward a canonical solution. The model has a clear opinion about how to solve these problems, and tone doesn’t shift it off that path. This is reassuring if you think of it as robustness, and perhaps slightly unsettling if you were hoping for creative variation. LLMs are non-deterministic, but with simple enough problems and a well-defined spec, there is a good chance they stay on the same track.
Ambiguity is where tone may start to matter: The academic findings show that tone affects summarization output length, bias in responses, and performance on tasks that require more subjective judgment. It’s plausible that a truly ambiguous, complex software engineering problem might show tone sensitivity that a math parser doesn’t.
Future Questions
There are more experiments I want to run to help answer some key questions:
Does problem complexity interact with tone? My three test problems were, in the grand scheme of software engineering, small and contained. What happens when you ask an agent to architect a microservice, design a database schema with competing requirements, or refactor a legacy codebase? Problems where there’s no single right answer and the quality of the output depends on taste, experience, and judgment.
Does tone compound over multi-turn interactions? All my tests were single-turn: one spec, one response. But real-world agent usage is iterative. You provide a spec, review the output, give feedback, iterate. Does a pattern of negative feedback across multiple turns degrade performance in a way that a single rude prompt doesn’t?
What about the emotional labor argument? Even if tone doesn’t affect agent performance, there might be a case that it affects your performance. If being rude to your agent normalizes a communication style that bleeds into how you interact with human colleagues, that’s a cost that doesn’t show up in a benchmark. It’s worth taking this seriously even if you’re skeptical about LLM “feelings”. The future of software engineering work may be in humans defining specifications and ensuring that those specifications can be measured and validated.
Conclusion
Be concise, and don’t introduce unnecessary tokens that may confuse the model. Clarity is crucial for getting a model to implement what you want, and adding extra pleasantries or insults can only distract from that goal. Plus, we all want to save on tokens, right?
If you find yourself calling your agent incompetent at 11 PM, don’t beat yourself up about it. Take it as a signal to step away from the keyboard rather than a prompting strategy to double down on.
*All experiments were conducted using Claude Opus 4.6. If you want to replicate or extend these experiments, reach out — I’m happy to share the full methodology.*