It's interesting how they contrast with OpenAI in that they didn't try to hype up the tension with some ridiculous model codename on LMSys, the way OpenAI did with GPT-4o.
When it does show up on the leaderboard, how long until OpenAI plays their hand to be #1 again? Weeks? Will it take months? I really wonder whether they have GPT-5 ready already, or whether it's true, as some are saying, that they have only now started pre-training it. I'm still leaning toward them already having GPT-5 pretrained and still testing it and getting it to work, but I think both possibilities are plausible.
They have completed at least one pre-training run of what would effectively be a 'GPT-5'-scale model; they are running multiple training runs and then need time for fine-tuning and red-teaming. Sources say partners and insiders have been told to expect roughly 12 months from May this year, so the earlier half of next year is most likely. Nothing I've heard has ruled out a 4.5 model appearing before then.
It's clear Sam is just a hype man like Elon: building hype to raise money because their companies are on the verge of running out of it. They over-promise, and sometimes it comes back to bite them in the ass.
It needs to accumulate more data. Wait a few more days.
If so, shouldn't it appear in the arena sometimes? Or do they do internal audits first?
https://preview.redd.it/j134a59h348d1.png?width=797&format=png&auto=webp&s=7464e9a6b78480aa91579d6cf3cc821141b4de4e It's there; you can talk to it under Direct Chat. It will appear in battles too, but they have so many models nowadays, I imagine it might be rarer.
They reduce the appearance frequency of bad/old models, so the old models barely show up anyway. Most battles are between the top models.
How are they able to offer so many high-quality SOTA models for free?
It is there, I got it several times; you can also select it in the Side by Side Arena or Direct Chat.
I got a response from Sonnet 3.5 yesterday. It won that round for me :-)
Yes, it is there, and it won a round of making the Mona Lisa out of emojis. It was not perfect, but much better than its competitor.
It's in the arena. I just opened your link and my second matchup was gpt-4 vs claude-3-5
Ok, I was just unlucky for a long time then.
New models usually appear in a few days; they have a hard-coded threshold for how many votes a model needs before it appears.
In my opinion the LMSys leaderboard doesn't matter anymore. Claude 3.5 is the best because of its Artifacts feature, which isn't captured by the leaderboard, and its coding ability, which also won't be evaluated there.
The model takes a few days to show up; the benchmark needs enough human votes to narrow the confidence interval before it can be ranked.
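The confidence-interval point above can be sketched. This is a toy illustration only: the real leaderboard fits Elo-style ratings with bootstrapped intervals, and the vote counts below are made up. A simple normal-approximation interval on a raw win rate shows why more votes are needed before a ranking is meaningful:

```python
import math

def winrate_ci(wins, n, z=1.96):
    """95% normal-approximation confidence interval for a win rate."""
    p = wins / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Same 60% win rate, very different certainty:
lo, hi = winrate_ci(wins=12, n=20)        # wide interval: too few votes to rank
lo2, hi2 = winrate_ci(wins=600, n=1000)   # narrow interval: safe to rank
```

With 20 votes the interval spans roughly ±0.21 around the win rate; with 1000 votes it shrinks to about ±0.03, which is why new models are held back until enough battles accumulate.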
Forget about the lmsys leaderboard - it's not accurate. The data is the result of AI nerds, like us, judging responses from LLMs, with all the issues that raises. Use more objective leaderboards: https://livebench.ai/ (look where Sonnet 3.5 sits on this one). https://scale.com/leaderboard hasn't included 3.5 yet.
I’d also add MixEval to this list: https://mixeval.github.io/#leaderboard Honestly, lmsys is probably the most “realistic” and least “gameable” benchmark, because there’s no chance of contamination. It also captures human preference, which other benchmarks don’t. E.g. Llama 3 is so high because it has a very “human” way of talking; it might not be the smartest model for its size, but it sounds the least robotic by far.
Not only is the lmsys leaderboard accurate, it's the most useful one. It looks at actual usage by the people using AI, and it judges how useful people actually find each model. Premade benchmarks aren't graded by humans; lmsys is. That makes lmsys the best benchmark out there. And there is no doubt Sonnet will top the lmsys leaderboard, because lmsys works. Just watch.
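The mechanism this comment describes (turning pairwise human votes into a ranking) can be sketched with a toy Elo update. The model names and vote log below are invented, and the real leaderboard uses a more careful statistical fit; this just shows the shape of the computation:

```python
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=4):
    # score_a is 1 if A won the vote, 0 if B won, 0.5 for a tie.
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Hypothetical vote log: A wins twice, loses once.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
for a, b, score in [("model-a", "model-b", 1),
                    ("model-a", "model-b", 1),
                    ("model-a", "model-b", 0)]:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], score)
```

After these three votes model-a ends up slightly above model-b; with thousands of votes across many model pairs, the ratings converge to a stable ordering.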
Public user judgments can be biased or inconsistent, and may not always align with objective performance metrics. Devs can program an LLM to output nicely formatted, nice-sounding text that is inaccurate or based on poor reasoning. The lmsys leaderboard favours models with internet access. It can, of course, add a dimension for comparison; however, someone relying on this leaderboard alone would be misguided. Enlighten yourself, and read this comparison: [https://medium.com/@olga.zem/exploring-llm-leaderboards-8527eac97431](https://medium.com/@olga.zem/exploring-llm-leaderboards-8527eac97431) "However, there are worries that the system might prefer models that give the most agreeable answers instead of the most accurate ones. This could twist the results and not truly show what the models can do. Since it relies on human judgment, people’s different standards and preferences could make the evaluations less objective." Many leaderboards are mentioned there.
Enlighten yourself by thinking for yourself about this hypothetical situation: if one LLM is good overall at a million different things you don't care about, and another is good at fewer things but ones that matter to you, the more useful one might score worse; despite its higher overall score, you would find the jack-of-all-trades less useful. Jack of all trades, master of none. People don't care if an LLM is better than another at something obscure, like (badly) folding proteins or translating some imaginary language. People want an LLM to be good at what interests them, and that is more likely with a model that's better at what interests most of the people rating LLMs. After all, the point of an LLM is to be useful, or did I miss something? Also, you don't know what you are talking about. Any general benchmark that requires knowledge favours a model with internet access, because RAG makes models objectively better; that has nothing to do with lmsys. You don't have a very good understanding of what makes a benchmark perform better or worse.
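For context on the RAG point: retrieval-augmented generation means fetching relevant documents and prepending them to the prompt, so the model can answer from retrieved text rather than only its training data. A toy sketch, with an invented corpus and a deliberately naive word-overlap retriever (real systems use embedding-based search):

```python
def retrieve(query, corpus, k=1):
    # Naive retriever: rank documents by word overlap with the query.
    qwords = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(qwords & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, corpus):
    # Prepend retrieved context so the model grounds its answer in it.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["Claude 3.5 Sonnet was released in June 2024",
        "The Eiffel Tower is in Paris"]
prompt = build_prompt("When was Claude 3.5 Sonnet released", docs)
```

This is why internet or retrieval access helps on knowledge-heavy questions regardless of which leaderboard is doing the judging.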
Oh no, your comment is so limited. An insight into your thought processes, it seems. I don't know where to start. "Also, you don't know what you are talking about. Any general benchmark that requires knowledge favours a model with internet access, because RAG makes models objectively better; that has nothing to do with lmsys. You don't have a very good understanding of what makes a benchmark perform better or worse" lol. Keep on dreaming, friend.
Ohh, Scale just added 3.5 Sonnet in coding and instruction following... interesting.