AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 9 days ago

@lemmy.world · edited 7 days ago

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It’s a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

The CMU boffins put the following models through their paces and evaluated them based on their task success rates. The results were underwhelming.

⚫ Gemini-2.5-Pro (30.3 percent)
⚫ Claude-3.7-Sonnet (26.3 percent)
⚫ Claude-3.5-Sonnet (24 percent)
⚫ Gemini-2.0-Flash (11.4 percent)
⚫ GPT-4o (8.6 percent)
⚫ o3-mini (4.0 percent)
⚫ Gemini-1.5-Pro (3.4 percent)
⚫ Amazon-Nova-Pro-v1 (1.7 percent)
⚫ Llama-3.1-405b (7.4 percent)
⚫ Llama-3.3-70b (6.9 percent)
⚫ Qwen-2.5-72b (5.7 percent)
⚫ Llama-3.1-70b (1.7 percent)
⚫ Qwen-2-72b (1.1 percent)

“We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks,” the authors state in their paper.

vane@lemmy.world · 6 days ago

Reading with CEO mindset. 3 out of 10 employees can be fired.

gargle@lemmy.world · 7 days ago

I asked Claude 3.5 Haiku to write me a quine in COBOL in the BS2000 dialect. Claude does know that creating a perfect quine in COBOL is challenging due to the need to represent the self-referential nature of the code. After a few suggestions Claude restated its first draft, without proper BS2000 incantations, without a PERFORM statement, and without any self-referential redefines. It’s a lot of work. I stopped caring and moved on.

For those who wonder: https://sourceforge.net/p/gnucobol/discussion/lounge/thread/495d8008/ has an example.

Colour me unimpressed. I dread the day when they force the use of ‘AI’ on us at work.

iopq@lemmy.world · 7 days ago

Now I’m curious, what’s the average score for humans?

sircac@lemmy.world · 6 days ago

Why would they be right beyond word sequence frequencies?

dan69@lemmy.world · 7 days ago

And it won’t be until humans can agree on what’s a fact and true vs not… there is always someone or some group spreading mis/dis-information

TheGrandNagus@lemmy.world · edited 8 days ago

LLMs are an interesting tool to fuck around with, but I see things that are hilariously wrong often enough to know that they should not be used for anything serious. Shit, they probably shouldn’t be used for most things that are not serious either.

It’s a shame that, because the same “AI” name gets applied to a whole host of different technologies, LLMs being limited in usability - yet hyped to the moon - is hurting other, more impressive advancements.

For example, speech synthesis is improving so much right now, which has been great for my sister who relies on screen reader software.

Recognising speech in loud environments and removing background noise from recordings are improving loads too.

My friend is involved in making a mod for Fallout 4, and there was an outreach for people recording voice lines - she says that there are some recordings of dubious quality that would’ve been unusable before that can now be used without issue thanks to AI denoising algorithms. That is genuinely useful!

As are things like pattern/image analysis, which appear very promising in medical analysis.

All of these get branded as “AI”. A layperson might not realise that they are completely different branches of technology, and might therefore reject useful applications of “AI” tech, because they’ve learned not to trust anything branded as AI after being let down by LLMs.

snooggums@lemmy.world · 8 days ago

LLMs are like a multitool, they can do lots of easy things mostly fine as long as it is not complicated and doesn’t need to be exactly right. But they are being promoted as a whole toolkit as if they are able to be used to do the same work as effectively as a hammer, power drill, table saw, vise, and wrench.

TeddE@lemmy.world · 8 days ago

Because the tech industry hasn’t had a real hit of its favorite poison “private equity” in too long.

The industry has played the same playbook since at least 2006. Likely before, but that’s when I personally started seeing it. My take is that they got addicted to the dotcom bubble and decided they can and should recreate the magic every 3-5 years or so.

This time it’s AI, last it was crypto, and we’ve had web 2.0, 3.0, and a few others I’m likely missing.

But yeah, it’s sold like a panacea every time, when really it’s revolutionary for like a handful of tasks.

rottingleaf@lemmy.world · 8 days ago

That’s because they look like “talking machines” from various sci-fi. Normies feel as if they are touching the very edge of the progress. The rest of our life and the Internet kinda don’t give that feeling anymore.

NarrativeBear@lemmy.world · 8 days ago

Just had a search yesterday on the App Store and Google Play Store to see what new “productivity apps” are around. Pretty much every app now has AI somewhere in its name.

Punkie@lemmy.world · 8 days ago

I’d compare LLMs to a junior executive. Probably gets the basic stuff right, but check and verify for anything important or complicated. Break tasks down into easier steps.

Katana314@lemmy.world · 8 days ago

I’m in a workplace that has tried not to be overbearing about AI, but has encouraged us to use them for coding.

I’ve tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that’s both wrong and doesn’t verify anything.

I’m aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it’s not even saving time. I would do this with a human in the hopes that they would continue to retain the knowledge, but I don’t even have hopes for AI to apply those lessons in new contexts. In a way, it’s been a sigh of relief to realize just like Dotcom, just like 3D TVs, just like home smart assistants, it is a bubble.

jj4211@lemmy.world · 7 days ago

I’ve found that as an ambient code completion facility it’s… interesting, but I don’t know if it’s useful or not…

So on average, it’s totally wrong about 80% of the time, 19% of the time the first line or two is useful (either correct or close enough to fix), and 1% of the time it seems to actually fill in a substantial portion in a roughly acceptable way.

It’s exceedingly frustrating and annoying, but not sure I can call it a net loss in time.

So reviewing the proposal for relevance, cutting it off, and editing it all add time to my workflow. Let’s say that on average for a given suggestion I will spend 5% more time deciding whether to trash it, use it, or amend it versus not having a suggestion to evaluate in the first place. If the 20% useful time is 500% faster for those scenarios, then I come out ahead overall, though I’m annoyed 80% of the time. My guess as to whether the suggestion is even worth looking at improves: if I’m filling in a pretty boilerplate thing (e.g. taking some variables and starting to write out argument parsing), then it has a high chance of a substantial match. If I’m doing something even vaguely esoteric, I just ignore the suggestions popping up.

However, the 20% is a problem still since I’m maybe too lazy and complacent and spending the 100 milliseconds glancing at one word that looks right in review will sometimes fail me compared to spending 2-3 seconds having to type that same word out by hand.

That 20% success rate allowing for me to fix it up and dispose of most of it works for code completion, but prompt driven tasks seem to be so much worse for me that it is hard to imagine it to be better than the trouble it brings.
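The break-even reasoning in this comment can be written out as a small calculation. This is only a minimal sketch under stated assumptions: the time to write something unaided is normalised to 1, every suggestion costs a fixed review overhead on top, and "500% faster" is read as a 5x speedup (one fifth of the time). The function names are hypothetical and not from the comment.

```python
def expected_time(useful_fraction: float, review_overhead: float, speedup: float) -> float:
    """Expected time per task with suggestions on, relative to 1.0 without them.

    Assumptions (illustrative, not measured):
    - a useless suggestion costs the full unaided time plus the review overhead
    - a useful suggestion costs (1 / speedup) of the unaided time plus the same overhead
    """
    useless = (1.0 - useful_fraction) * (1.0 + review_overhead)
    useful = useful_fraction * (1.0 / speedup + review_overhead)
    return useless + useful


def break_even_speedup(useful_fraction: float, review_overhead: float) -> float:
    """Speedup on the useful cases at which suggestions cost exactly as much as no suggestions."""
    if useful_fraction <= review_overhead:
        raise ValueError("review overhead eats all possible gains; no break-even exists")
    return useful_fraction / (useful_fraction - review_overhead)


# Numbers from the comment: 20% useful, 5% review overhead, 5x faster when useful.
print(round(expected_time(0.20, 0.05, 5.0), 2))   # 0.89
print(round(break_even_speedup(0.20, 0.05), 2))   # 1.33
```

Under these assumptions the suggestions save roughly 11% of the time overall, and they would still pay off as long as the useful cases are more than about 1.33x faster. The sketch does not model the cost of mistakes that slip through a quick glance, which is the concern raised in the comment.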

TimewornTraveler@lemmy.dbzer0.com · edited 8 days ago

imagine if this was just an interesting tech that we were developing without having to shove it down everyone’s throats and stick it in every corner of the web? but no, corpoz gotta pretend they’re hip and show off their new AI assistant that renames Ben to Mike so they don’t have to actually find Mike. capitalism ruins everything.

jsomae@lemmy.ml · edited 8 days ago

I’d just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time – Amazon’s new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

outhouseperilous@lemmy.dbzer0.com · 8 days ago

Please stop.

jsomae@lemmy.ml · 8 days ago

I’m not claiming that the use of AI is ethical. If you want to fight back you have to take it seriously though.

outhouseperilous@lemmy.dbzer0.com · 8 days ago

It can’t do 30% of tasks correctly. It can do tasks correctly as much as 30% of the time, and since it’s llm shit you know those numbers have been more massaged than any human in history has ever been.

jsomae@lemmy.ml · 8 days ago

I meant the latter, not “it can do 30% of tasks correctly 100% of the time.”

outhouseperilous@lemmy.dbzer0.com · 8 days ago

You get how that’s fucking useless, generally?

jsomae@lemmy.ml · 8 days ago

yes, that’s generally useless. It should not be shoved down people’s throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.
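To make the "programmatically verified" point concrete, here is a minimal sketch of a generate-and-verify loop. Everything in it is an assumption for illustration: `first_verified`, `is_sorted`, and the canned list of candidates are hypothetical stand-ins, with the canned candidates playing the role of an unreliable model. The idea is only that an unreliable generator becomes usable when a cheap, trustworthy check can reject its bad outputs.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def first_verified(
    generate: Callable[[], T],
    verify: Callable[[T], bool],
    max_attempts: int,
) -> Tuple[Optional[T], int]:
    """Call `generate` until `verify` accepts a candidate or attempts run out.

    Returns (accepted candidate or None, number of attempts used).
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts


def is_sorted(values: List[int]) -> bool:
    """Cheap verifier: checks the output without needing to know how it was produced."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical stand-in for an unreliable generator: two wrong answers, then a right one.
canned = iter([[3, 1, 2], [2, 1, 3], [1, 2, 3]])
result, attempts = first_verified(lambda: next(canned), is_sorted, max_attempts=5)
print(result, attempts)  # [1, 2, 3] 3
```

This only helps when the verifier is much cheaper and more reliable than the generator. For tasks with no programmatic check, the loop has nothing to reject bad outputs with.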

Knock_Knock_Lemmy_In@lemmy.world · 8 days ago

Run something with a 70% failure rate 10x and you get to a cumulative pass rate of about 97% (1 - 0.7^10), assuming the attempts are independent. LLMs don’t get tired and they can be run in parallel.
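The arithmetic behind repeated attempts can be checked with a short calculation. This is a sketch that assumes every attempt fails independently with the same probability, which real model runs on the same task may not satisfy; the function names are hypothetical.

```python
import math


def cumulative_pass_rate(failure_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - failure_rate ** attempts


def attempts_needed(failure_rate: float, target_pass_rate: float) -> int:
    """Smallest number of independent attempts whose cumulative pass rate reaches the target."""
    return math.ceil(math.log(1.0 - target_pass_rate) / math.log(failure_rate))


print(round(cumulative_pass_rate(0.70, 10), 4))  # 0.9718
print(attempts_needed(0.70, 0.98))               # 11
```

With a 70% failure rate, ten independent attempts give roughly a 97% chance that at least one succeeds, and an eleventh attempt is needed to pass 98%. The figure says nothing about knowing which attempt succeeded; that still requires some way to check the results.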

outhouseperilous@lemmy.dbzer0.com · edited 8 days ago

Less broadly useful than 20 tons of mixed texture human shit, and more ecologically devastating.

NarrativeBear@lemmy.world · 8 days ago

The ones being implemented into emergency call centers are better though? Right?

TeddE@lemmy.world · 8 days ago

Yes! We’ve gotten them up to 94% wrong at the behest of insurance agencies.

ApeNo1@lemmy.world · 8 days ago

They’ve done studies, you know. 30% of the time, it works every time.

lepinkainen@lemmy.world · 8 days ago

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code: any free model can easily generate simple scripts and utilities with maybe a 10% error rate, definitely not 70%.

TimewornTraveler@lemmy.dbzer0.com · 8 days ago

it specifies the tasks in the article

Frenezul0_o@lemmy.world · 7 days ago

I notice that the research didn’t include DeepSeek. It would have been nice to see how it compares.

kinsnik@lemmy.world · 8 days ago

I haven’t used AI agents yet, but my job is kinda pushing for them. but i have used the google one that creates audio podcasts, just to play around, since my coworkers were using it to “learn” new things. i fed it some of my own writing and created the podcast. it was fun, it was an audio overview of what i wrote. about 80% was cool analysis, but 20% was straight out of nowhere bullshit (which i know because I wrote the original texts that the audio was talking about). i can’t believe that people are using this for subjects that they have no knowledge of. it is a fun toy for a few minutes (which is not worth the cost to the environment anyway)