djm07231

It is interesting how they contrast with OpenAI in that they didn't try to build up hype with some ridiculous model name on LMSys, the way OpenAI did with GPT-4o.


Puzzleheaded_Pop_743

When it does show up on the leaderboard, how long until OpenAI plays their hand to be #1 again? Weeks? Will it take months? I really wonder whether they already have GPT-5 ready, or whether it's true, as some are saying, that they've only just now started pre-training it. I'm still leaning towards them already having GPT-5 pretrained and still testing it / getting it to work, but both possibilities are plausible.


RollingWallnut

They have completed at least one pre-training run of what would effectively be a 'GPT-5'-scale model; they are running multiple training runs and then need time for fine-tuning and red-teaming. Partners and insiders have reportedly been told to expect a wait of about 12 months from May this year, so the first half of next year is most likely. Nothing I've heard rules out seeing a 4.5 model before then.


latamxem

It's clear Sam is just a hype man, like Elon: building hype to raise money because their companies are on the verge of running out of it. They over-promise, and sometimes it comes back to bite them on the ass.


LegitimateLength1916

It needs to accumulate more data. Wait a few more days.


XvX_k1r1t0_XvX_ki

If so, then shouldn't it appear in the arena sometimes? Or do they do internal audits first?


FosterKittenPurrs

https://preview.redd.it/j134a59h348d1.png?width=797&format=png&auto=webp&s=7464e9a6b78480aa91579d6cf3cc821141b4de4e It's there; you can talk to it under Direct Chat. It will appear in battles too, but they have so many models nowadays that I imagine it might show up less often.


uishax

They reduce how often the bad/old models are sampled, so the old models barely appear anyway. Most battles are between the top models.
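(For illustration, a minimal sketch of what that weighted sampling could look like; the model names and weights below are made-up assumptions, not LMSys's actual code.)

```python
import random

# Hypothetical sampling weights: newer/stronger models are shown more often,
# old models are down-weighted so they rarely appear in battles.
# Names and numbers are illustrative assumptions, not LMSys's real config.
MODEL_WEIGHTS = {
    "claude-3-5-sonnet": 10.0,
    "gpt-4o": 10.0,
    "llama-3-70b": 5.0,
    "gpt-3.5-turbo": 0.5,  # old model, almost never sampled
}

def sample_battle_pair(weights):
    """Draw two distinct models with probability proportional to their weights."""
    models = list(weights)
    first = random.choices(models, weights=[weights[m] for m in models])[0]
    rest = [m for m in models if m != first]
    second = random.choices(rest, weights=[weights[m] for m in rest])[0]
    return first, second

print(sample_battle_pair(MODEL_WEIGHTS))
```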


GreedyWorking1499

How are they able to offer so many high-quality SOTA models for free?


Utoko

It is there; I've gotten it several times. You can also select it in the Side by Side Arena or in Direct Chat.


hippydipster

I got a response from Sonnet 3.5 yesterday. It won that round from me :-)


Dron007

Yes, it's there, and it won a round of "Mona Lisa made of emojis". It was not perfect, but much better than the competitor.


KrazyA1pha

It's in the arena. I just opened your link and my second matchup was gpt-4 vs claude-3-5


XvX_k1r1t0_XvX_ki

Ok, I was just unlucky for a long time then.


pxp121kr

New models usually appear within a few days; there's a hard-coded threshold for how many votes they need to collect before showing up on the leaderboard.


Antiprimary

In my opinion the LMSys leaderboard doesn't matter anymore. Claude 3.5 is the best because of its Artifacts feature, which isn't captured by the leaderboard, and its coding ability, which also won't really be evaluated there.


Pretty_Afternoon9022

The model takes a few days to show up. It needs enough human votes to narrow the confidence interval before it can be ranked.
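(Rough sketch of why the vote count matters; this is my own illustration with made-up numbers, not LMSys's actual pipeline, which fits a Bradley-Terry rating rather than a raw win rate. With few battles, the bootstrap confidence interval around a win rate is too wide to rank on.)

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05):
    """Bootstrap (1 - alpha) confidence interval for a model's win rate."""
    n = len(outcomes)
    stats = sorted(
        sum(random.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

random.seed(0)
# 1 = win, 0 = loss; simulate a model with a true 60% win rate.
few_votes = [int(random.random() < 0.6) for _ in range(30)]
many_votes = [int(random.random() < 0.6) for _ in range(3000)]

print(bootstrap_ci(few_votes))   # wide interval -> too early to rank
print(bootstrap_ci(many_votes))  # narrow interval -> safe to rank
```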


bnm777

Forget about the LMSys leaderboard; it's not accurate. The data is the result of AI nerds, like us, judging responses from LLMs, with the many issues that raises. Use more objective leaderboards: https://livebench.ai/ (look where Sonnet 3.5 sits on that one). https://scale.com/leaderboard hasn't included 3.5 yet.


itsjase

I'd also add MixEval to this list: https://mixeval.github.io/#leaderboard. Honestly, LMSys is probably the most "realistic" and least "gameable" benchmark, because there's no chance of contamination. It also captures human preference, which other benchmarks don't. E.g., Llama 3 ranks so high because it has a very "human" way of talking; it might not be the smartest model for its size, but it sounds the least robotic by far.


GraceToSentience

Not only is the LMSys leaderboard accurate, it's the most useful one. It reflects actual usage by the people using AI and judges how useful they find each model. Premade benchmarks don't use LLMs; humans do. That makes LMSys the best benchmark out there. And there is no doubt Sonnet will top the LMSys leaderboard, because LMSys works. Just watch.


bnm777

Public user judgments can be biased or inconsistent, and may not always align with objective performance metrics. Devs can program an LLM to output very nicely formatted, nice-sounding text that is nonetheless inaccurate or based on poor reasoning. The LMSys leaderboard also favours models with internet access. It can, of course, add a useful dimension for comparison; anyone relying on it alone, however, would be misguided.

Enlighten yourself, and read this comparison: [https://medium.com/@olga.zem/exploring-llm-leaderboards-8527eac97431](https://medium.com/@olga.zem/exploring-llm-leaderboards-8527eac97431)

"However, there are worries that the system might prefer models that give the most agreeable answers instead of the most accurate ones. This could twist the results and not truly show what the models can do. Since it relies on human judgment, people's different standards and preferences could make the evaluations less objective."

Many leaderboards are covered there.


GraceToSentience

Enlighten yourself by thinking for yourself about this hypothetical: if one LLM is good overall at a million different things you don't care about, and another is good at fewer things that actually matter to you, the jack-of-all-trades may score higher overall, yet you would find it less useful. Jack of all trades, master of none.

People don't care if an LLM beats another at something obscure like (badly) folding proteins or translating some imaginary language. People want an LLM to be good at what interests them, and that is more likely with a model that's better at what interests the people actually rating it. After all, the point of an LLM is to be useful, or did I miss something?

Also, you don't know what you are talking about: any general benchmark that requires knowledge favours a model with internet access, because RAG makes models objectively better. That has nothing to do with LMSys specifically. You don't have a very good understanding of what makes a benchmark perform better or worse.


bnm777

Oh, no, your comment is so limited. An insight into your thought processes, it seems. I don't know where to start.

"Also, you don't know what you are talking about: any general benchmark that requires knowledge favours a model with internet access, because RAG makes models objectively better. That has nothing to do with LMSys specifically. You don't have a very good understanding of what makes a benchmark perform better or worse."

lol. Keep on dreaming, friend.


roti_bao

Ohh, Scale just added 3.5 Sonnet in coding and instruction following... interesting.