underyx 17 hours ago [-]
> the slow performance decays
the decays are just more capable other models entering the population, making all prior models lose more frequently
TekMol 14 hours ago [-]
No, that is not how ELO scores work.
qnleigh 13 hours ago [-]
As far as I understand, this is exactly how ELO scores work. If a more capable model shows up and starts beating all the other models, it literally takes ELO points from everyone else.
> If a more capable model shows up and starts beating all the other models
There is an instance of this in the chart: on 2025-06-24, when Gemini-2.5-pro shows up. As you can see, the ELO of the others does not drop.
harperlee 13 hours ago [-]
Depends on the test design: is an agent competing against another agent in a given match, or against a test? Plus, does the test's ELO fluctuate?
bitshiftfaced 7 hours ago [-]
It's a fitted Bradley-Terry model, scaled to familiar Elo scores, anchored to wins against Mixtral-8x7B at 1114 (at least last time I looked at it). When you fit the model against historical data, and then you add another month of data that contains newer models, the relative strength of a given model might decline even if its absolute ability remained fixed.
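A minimal sketch of that kind of fit, for illustration only. This is not Arena's actual code: the function names, the toy win counts, and the plain MM iteration are assumptions, and the 400-point base-10 scale is just the conventional Elo scaling. Only the Mixtral-at-1114 anchor comes from the description above.

```python
import math

def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    wins[(a, b)] is the number of times model a beat model b.
    Every model needs at least one win and one loss for the fit to be finite.
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(c for (winner, _), c in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (p[i] + p[j])
            updated[i] = total_wins / denom
        # Strengths are only defined up to a common factor, so renormalise.
        norm = sum(updated.values()) / len(models)
        p = {m: v / norm for m, v in updated.items()}
    return p

def to_elo_scale(strengths, anchor, anchor_rating=1114.0, scale=400.0):
    """Map strengths to Elo-like ratings, pinning the anchor model's rating."""
    base = strengths[anchor]
    return {m: anchor_rating + scale * math.log10(s / base)
            for m, s in strengths.items()}

# Hypothetical head-to-head counts, not real Arena data.
wins = {("old-model", "mixtral-8x7b"): 60, ("mixtral-8x7b", "old-model"): 40}
ratings = to_elo_scale(fit_bradley_terry(wins), anchor="mixtral-8x7b")
print(ratings)
```

Refitting on a later snapshot that also contains stronger models keeps the anchor at 1114 but moves every other model's position relative to the top of the table, which is the relative decline described above.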
tasuki 12 hours ago [-]
Yes, that is in fact how Elo can work[0]. There are quite a few ways Elo systems can work.
It depends what you use as an anchor. If the anchor is a fixed model, you’re right. If the anchor is updated to a better model over time, then the Elo of historical models degrades, right?
tedsanders 16 hours ago [-]
For what it's worth, I work at OpenAI and I can guarantee you that we don't switch to heavily quantized models or otherwise nerf them when we're under high load. It's true that the product experience can change over time - we're frequently tweaking ChatGPT & Codex with the intention of making them better - but we don't pull any nefarious time-of-day shenanigans or similar. You should get what you pay for.
selcuka 16 hours ago [-]
> we don't switch to heavily quantized models
That sounded like a press bulletin, so just to let you clarify yourself: Does that mean you may switch to lightly quantized models?
jychang 16 hours ago [-]
There's almost 0% chance that OpenAI doesn't quantize the model right off the bat.
I am willing to bet large amounts of money that OpenAI would never release a model served as fully BF16 in the year of our lord 2026. That would be insane operationally. They're almost certainly doing QAT to FP4 for FFN, and a similar or slightly larger quant for attention tensors.
selcuka 16 hours ago [-]
It's ok if they never release a BF16 model, but it's less ok if they release it, win the benchmarks, then quantise it after a few weeks.
retinaros 12 hours ago [-]
That is for sure what everyone does. Also, they train on evals, using the datasets that they would be benchmarked against.
tedsanders 6 hours ago [-]
What do you mean by this? We don’t train on evals, and if we did I’d quit on the spot.
(The loose version of this that’s true is that there may exist eval data contamination in pretraining. This is a hard problem to fully solve.)
retinaros 4 hours ago [-]
It's not that loose of a version. It's the reality, and it is probably a focus of dedicated post-training: RL-ing on these kinds of GitHub repos. Of course you would train specifically on the task. You would mix this eval data with other data from thousands of GitHub repos.
tedsanders 6 hours ago [-]
Thanks - let me clarify that we don’t switch to lightly quantized models by time of day or when under heavy load either.
(I used the adjective heavily because that’s what the original post said. I have no intention of making misleading but technically true statements.)
Ciph 16 hours ago [-]
Thank you for your answer. I have a similar question to OP's, but with regard to the GPT models in MS Copilot. My experience is that the response quality is much better when calling the API directly or through the webUI.
I know this might be a question that's impossible for you to answer, but if you can shed any light on this matter, I'd be grateful, as I am doing an analysis of which AI solutions would be suitable for my organisation.
sans_souse 14 hours ago [-]
As phrased, the only answer is the question: "as opposed to what?"
aiscoming 14 hours ago [-]
webUIs have giant system prompts built in
APIs have much smaller ones
_kidlike 14 hours ago [-]
It's very interesting to see that this only happens to American companies. What gives?
jdw64 15 hours ago [-]
This is great, but personally, I really wish we had an Elo leaderboard specifically for the quality of coding agents.
Honestly, in my opinion, GPT-5.5 Codex doesn't just crush Claude Code 4.7 Opus; it's writing code at a level so advanced that I sometimes struggle to fully comprehend it. Even when navigating fairly massive codebases spanning four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming.
How would we even go about properly measuring and benchmarking the Elo for autonomous agents like this?
vachanmn123 15 hours ago [-]
Isn't code that you fail to understand literally a sign that it's worse?
jdw64 15 hours ago [-]
It was often much faster, and when I revisited the code later, there were cases where I realized it had moved the implementation toward a better abstraction.
jdw64 15 hours ago [-]
I should also add that I am not claiming to be a particularly great programmer. I have never worked at FAANG, and I haven't had much exposure to the kind of massive codebases those engineers deal with every day.
Most of the code I've worked with comes from Korean and Chinese startups, industrial contractors, or older corporate research-lab environments. So I know my frame of reference is limited.
When I write code, I usually rely on fairly conservative patterns: Result-style error handling instead of throwing exceptions through business logic, aggressive use of guard clauses, small policy/strategy objects, and adapters at I/O boundaries. I also prefer placing a normalization layer before analysis and building pure transformation pipelines wherever possible.
So when Codex produced a design that decoupled the messy input adapter from the stable normalized data, and then separated that from the analyzer, it wasn't just 'fancier code.' It aligned perfectly with the architectural direction I already value, but it pushed the boundaries of that design further than I would have initially done myself.
This is exactly why I hesitate to dismiss code as 'bad' just because I don't immediately understand it. Sometimes, it really is just bad code. But sometimes, the abstraction is simply a bit ahead of my current local mental model, and I only grasp its true value after a second or third requirement is introduced.
To be completely honest, using AI has caused a significant drop in my programming confidence. Since AI is ultimately trained on codebases written by top-tier programmers, its output essentially represents the average of those top developers—or perhaps slightly below their absolute peak.
I often find myself realizing that the code I write by hand simply cannot beat it
kimjune01 15 hours ago [-]
Although Arena is adversarial and resistant to goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthiness
cherioo 15 hours ago [-]
The interesting thing I find is how Anthropic has been improving more consistently over the last few years, which has allowed it to catch up to and surpass OpenAI and Google. The latter two have pretty much plateaued over the last year or so. GPT 5.5 is somehow not moving the needle at all.
I hope the other labs can bring back competition soon!
XCSme 15 hours ago [-]
GPT 5.5 is quite a big leap; it's a lot better than Opus 4.7 for agentic coding.
energy123 15 hours ago [-]
Arena only allows very small context sizes, so it's a noisy benchmark for what we care about IRL.
mettamage 14 hours ago [-]
Better in what ways? I'm just curious about your experience.
XCSme 14 hours ago [-]
Consistency, not making mistakes.
mettamage 14 hours ago [-]
Ahh... that is indeed an issue I have with Claude. I'll check it out!
eis 17 hours ago [-]
The Elo rating system measures relative performance to the other models. As the other models improve or rather newer better models enter the list, the Elo score of a given existing model will tend to decrease even though there might be no changes whatsoever to the model or its system prompt.
You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
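To make the "relative" part concrete, here is a rough sketch of the textbook Elo update. The K-factor of 32, the 400-point scale, and the starting ratings are conventional or made-up values, not anything taken from the leaderboard in question.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b); score_a is 1 for a win, 0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses: the update is zero-sum.
    return rating_a + delta, rating_b - delta

# An unchanged old model keeps losing to a stronger newcomer.
old_model, newcomer = 1200.0, 1200.0
for _ in range(20):
    newcomer, old_model = elo_update(newcomer, old_model, score_a=1.0)
print(round(old_model), round(newcomer))
```

The old model's weights never change in this toy run, yet its rating falls, because every point the newcomer gains is taken from it.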
mayerwin 10 hours ago [-]
Yes, that is definitely a limitation. If all models become worse at the same pace, we won't see any degradation either. I couldn't find any historical dataset of model benchmarks (I'd really have loved that, to see how performance holds over time vs. the initial announcement), so the Elo data from Arena AI was the least imperfect proxy I could find.
bob1029 14 hours ago [-]
The relative and auto-scaling nature of Elo ranking feels like an advantage here.
Relative ranking systems extract more information per tournament. You will get something approximating the actual latent skill level with enough of them.
eis 14 hours ago [-]
Advantage for what exactly though? I'm not saying Elo ranking doesn't give any information. It just doesn't give the information that the OP's project claims to be able to give: that models get nerfed over time. You could extract this kind of information from the raw results of each evaluation round between two models, ignoring any new model entries, and compare these over time, but not from the resulting Elo scores with an ever-changing list of models.
New models are on average better than older models, so the average skill of the population of models increases over time, and you are mathematically guaranteed that any existing model will degrade in Elo score over time even though it didn't change in any way.
It's like benchmarking a model against a list of challenges that over time are made more and more difficult and then claiming the model got nerfed because its score declined.
Elo is good at establishing an overall ranking order across models but that's not what this is about.
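A rough sketch of the raw-results approach mentioned above: track the win rate of one fixed pairing per time period and ignore everything else. The `(period, winner, loser)` tuple format and the function name are hypothetical; this is not the shape of any real Arena export.

```python
from collections import defaultdict

def pairwise_win_rate_by_period(battles, model, opponent):
    """Win rate of `model` against `opponent` only, grouped by period.

    battles is an iterable of (period, winner, loser) tuples.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for period, winner, loser in battles:
        if {winner, loser} != {model, opponent}:
            continue  # skip every other pairing, including new entrants
        games[period] += 1
        if winner == model:
            wins[period] += 1
    return {period: wins[period] / games[period] for period in sorted(games)}

# Made-up records for illustration.
battles = [
    ("2025-01", "model-a", "model-b"),
    ("2025-01", "model-b", "model-a"),
    ("2025-02", "model-a", "model-b"),
    ("2025-02", "model-c", "model-a"),  # new entrant, ignored
]
print(pairwise_win_rate_by_period(battles, "model-a", "model-b"))
```

A falling win rate in a fixed pairing is only suggestive if the opponent itself is known to be unchanged, and only with enough games per period to beat the noise.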
Is that strictly true? ELO rankings do also inflate over time (looking at you, Chess GMs)
tasuki 12 hours ago [-]
Elo systems often include one or more ways new points can enter the system. The system used by the European Go Federation has three ways iirc: 1. Cannot go under 100, 2. Cannot lose more than 100 points in one tournament, 3. Weaker player beating a stronger one (which is countered by the stronger player beating the weaker one, but it's not balanced: if two people only play each other forever and ever, both of their Elos will grow).
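As a toy illustration of the first rule only (a rating floor), and not the European Go Federation's actual formula: the K-factor, the 400-point scale, and the function names below are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update_with_floor(rating_a, rating_b, score_a, k=32.0, floor=100.0):
    """Zero-sum Elo update, except that nobody can drop below `floor`."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    new_a = max(floor, rating_a + delta)
    new_b = max(floor, rating_b - delta)
    return new_a, new_b

# Two players sitting on the floor; A loses.
a, b = 100.0, 100.0
new_a, new_b = update_with_floor(a, b, score_a=0.0)
print(a + b, new_a + new_b)
```

The loser is clamped at the floor while the winner still collects the full gain, so the total number of points in the system goes up.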
ponyous 14 hours ago [-]
Seems like Chinese labs are the only ones that are trustworthy (at least when it comes to this specific issue). This feels so ironic haha
mordae 13 hours ago [-]
I am using novita-hosted DeepSeek V4 (Flash) for work and DeepSeek API for personal projects.
Novita's has occasional problems counting whitespace. DeepSeek hosted does not.
No idea why.
lukewarm707 12 hours ago [-]
there is something greatly trustworthy about open source
tedsanders 16 hours ago [-]
FYI, Elo isn't an acronym - it's a person's name. No need to capitalize it as ELO.
mayerwin 10 hours ago [-]
You're right, thanks for the heads up! Corrected (I can't edit the post on HN though).
JKCalhoun 7 hours ago [-]
Don't bring me down…
alex_duf 13 hours ago [-]
Electric Light Orchestra anyone?
andrewshadura 15 hours ago [-]
Unless you've just missed your last train to London.
Thank you, I just looked at the chart and said to myself: ELO? YOLO!
That Elo ranking is also called chess ranking
andrewshadura 15 hours ago [-]
Élő. Meaning alive (él = it lives, -ő = adjective)
whiplash451 14 hours ago [-]
Neat. Would you add the option to normalize the elo over time (e.g update the model used as an anchor for the elo computation) so the diff between labs is more visible?
bbstats 11 hours ago [-]
The logic for which models stay active when you click on a group of them is extremely not working
mayerwin 10 hours ago [-]
It'd be amazing if you could open an issue with a screenshot so I can take a look, I haven't been able to find issues when clicking on a group of models: https://github.com/mayerwin/AI-Arena-History/issues. Note: the model change points label being hidden when more than one curve is active is by design (to avoid cluttering), if this is what you were referring to.
fph 14 hours ago [-]
Very neat! It would be great to extend it to non-flagship models as well.
Thomashuet 15 hours ago [-]
It seems to be a USA only thing, Chinese models and Mistral don't show any downward trend.
TurdF3rguson 12 hours ago [-]
Sure they do. Most models are on a downward trend because newer models are moving into top spots.
patall 15 hours ago [-]
Wouldn't it be really weird if an open-weight model dropped in performance? In that case, the drop would have to come from the Elo ranking itself.
refulgentis 16 hours ago [-]
Is this slop? It has wildly aggressive language that agrees with a subset of pop sentiment, re: models being “nerfed”. It promises to reveal this nerfing. Then, it goes on to…provide an innocuous mapping of LM Arena scores that always go up?
ninjalanternshk 13 hours ago [-]
It links to the GitHub repo for the project, and while it’s not inconceivable that an AI bot would create and populate a functioning public GitHub repo, it’s pretty unlikely.
refulgentis 3 hours ago [-]
Whether there's a public GitHub repo is orthogonal to whether the content on the site was written by AI, especially given it doesn't make any sense on its own terms.
Orthogonal to that, 6 months ago, an AI making a repo was trivial. Wouldn't read it as a sign of anything.
https://en.wikipedia.org/wiki/Elo_rating_system
[0]: https://en.wikipedia.org/wiki/Elo_rating_system
To detect nerfing of a model, projects like https://marginlab.ai/trackers/claude-code/ are much much better (I'm not affiliated in any way).