It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to decline to answer when they are unsure, and punishes them for trying to bullshit their way through.
A lot of benchmarks are set up to not punish false positives (irrelevant answers or extra text) while punishing false negatives (missing the snippet being looked for).
This leads to answer bloat and/or hallucination if you benchmaxx on those.
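To make the incentive concrete, here is a rough sketch, assuming an answer can be reduced to a set of claims and the grader holds a set of expected snippets. The function names and the toy data are made up for illustration and are not taken from any real benchmark harness.

```python
def recall(claims: set[str], gold: set[str]) -> float:
    """Share of expected snippets the answer contains. Extra claims cost nothing."""
    return len(claims & gold) / len(gold)


def precision(claims: set[str], gold: set[str]) -> float:
    """Share of the answer's claims that were actually expected."""
    return len(claims & gold) / len(claims) if claims else 0.0


def f1(claims: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall, so padding the answer is penalized."""
    p, r = precision(claims, gold), recall(claims, gold)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical toy data: two expected snippets, one short answer, one padded answer.
gold = {"paris", "1889"}
concise = {"paris"}
bloated = {"paris", "1889", "london", "1901", "berlin", "1850"}

print("recall:", recall(concise, gold), recall(bloated, gold))
print("f1:    ", f1(concise, gold), f1(bloated, gold))
```

Under the recall-only grader the padded answer wins, while a grader that also counts wrong claims ranks the short answer higher.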
There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.
So we have a situation where models that can solve challenging problems also tend to have problems with hallucinating, but those hallucinations seem to be the breeding ground for the solutions that earned them their high "Wow" factor intelligence.
Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something, in hopes it's correct. Which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
If this is the case, then the GLM 5.2 model seems better than GPT 5.5 or maybe even "Fable", depending on what you are trying to achieve.
The Fable model was removed from Anthropic's offering because of security concerns raised by the US government (or, well, also partly because of the personal vendetta between the US govt and Anthropic).
Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.
(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is incorrect". It has to output something, by design.)
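The loop being described looks roughly like this. It is a toy sketch only: `toy_next_token_distribution` is a made-up stand-in for a real model, and the probabilities are invented to show the end-of-text token as a low-probability outcome at each step.

```python
import random

EOS = "<eos>"  # stand-in for the end-of-text token


def toy_next_token_distribution(context: list[str]) -> dict[str, float]:
    """Hypothetical model: returns a fixed next-token distribution, ignoring context."""
    return {"the": 0.40, "cat": 0.30, "sat": 0.25, EOS: 0.05}


def generate(prompt: list[str], max_tokens: int = 50, seed: int = 0) -> list[str]:
    """Sample tokens one by one until EOS is drawn or max_tokens is reached."""
    rng = random.Random(seed)
    output: list[str] = []
    for _ in range(max_tokens):
        dist = toy_next_token_distribution(prompt + output)
        tokens, weights = zip(*dist.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == EOS:
            break  # the only way the loop ends early
        output.append(token)
    return output


print(" ".join(generate(["hello"])))
```

There is no "say nothing" branch in the loop: every step emits some token, and stopping is just one more token the model has to pick.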
They are, especially multiple choice questions. The same happens with human exams:
Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.
If instead a wrong answer is -1 point, by just guessing you get on average 25 - 75 = -50/100, way worse than 0/100.
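The arithmetic is easy to check. A small sketch, assuming uniform random guessing and +1 per correct answer; `expected_guess_score` and its parameters are names picked for this example:

```python
from fractions import Fraction


def expected_guess_score(n_questions: int, n_choices: int, wrong_penalty=0) -> Fraction:
    """Expected total score when every answer is a uniform random guess.

    A correct answer is worth +1, a wrong answer costs `wrong_penalty` points.
    """
    p_correct = Fraction(1, n_choices)
    per_question = p_correct * 1 - (1 - p_correct) * wrong_penalty
    return n_questions * per_question


print(expected_guess_score(100, 4))                   # no penalty: 25
print(expected_guess_score(100, 4, wrong_penalty=1))  # -1 per wrong answer: -50
print(expected_guess_score(100, 4, Fraction(1, 3)))   # break-even penalty: 0
```

With 4 choices, a penalty of 1/3 point per wrong answer is the break-even where blind guessing scores the same as not answering. Anything harsher makes abstaining the better strategy when you have no idea.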
Where do you see that? I see they have GPT-5.5 (xhigh) at 55, GPT-5.5 (high) at 53, and Muse Spark at 43. Muse Spark does beat GPT-5.4 mini (xhigh) which scores 40, but the key there is "mini".
In the coding index, GPT-5.5 gets 59.1, 58.5, 56.2, and 52.1 for xhigh, high, medium, and low while Muse Spark is behind at 47.5. For agentic, GPT-5.5 gets 74.1, 72.0, 69.4, and 59.7 (xhigh, high, medium, low) while Muse Spark gets 62.0 (beating only GPT-5.5 low).
GPT-5.5 only gets beaten by Opus 4.8 in their general index, is the top spot for coding, and is #3 behind Opus 4.8 and GLM-5.2 for agentic (excluding Fable 5 which takes the top spot, but is unavailable).
I'm glad those tests apparently work out for you, but a benchmark where three of the top 5 models are different flavors of Gemini Flash and zero are anything by Anthropic is just so wildly divergent from my personal experience with the models that it's not useful to me.
Whatever it is you're measuring, it's not anything related to what I use models for.
Not only coding but also general knowledge work, anything from learning about how some things work (e.g. walking me through PNP vs NPN transistors) to summarizing texts, doing web research, and occasionally some OCR.
I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two.
But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very, very close.
PS: Just added a cool feature, so you can filter the leaderboard for multiple models at once, by using a comma, like: https://aibenchy.com/?q=glm,claude
I think it will eventually result in regulation and a potential grey market, and/or implosion of the centralized LLM services — I doubt they can keep hardware from becoming cheaper forever, and diminishing returns will make consumer hardware suitable for all but the hardest problems. At that point, the hardware “moat” will be completely gone and have become an extreme unrecoverable sunk cost.
There’s no way Anthropic can keep jacking up the prices like this for every marginally better model. I think even tokenmaxxing companies are going to soon balk at $50/million output tokens.