AI agents wrong ~70% of time: Carnegie Mellon study

Jaden Norman@lemmy.world · 9 days ago


jsomae@lemmy.ml · edited 9 days ago

I’d just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time – Amazon’s new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

outhouseperilous@lemmy.dbzer0.com · 9 days ago

Please stop.

jsomae@lemmy.ml · 9 days ago

I’m not claiming that the use of AI is ethical. If you want to fight back you have to take it seriously though.

outhouseperilous@lemmy.dbzer0.com · 9 days ago

It can’t do 30% of tasks correctly. It can do tasks correctly as much as 30% of the time, and since it’s LLM shit you know those numbers have been more massaged than any human in history has ever been.

jsomae@lemmy.ml · 9 days ago

I meant the latter, not “it can do 30% of tasks correctly 100% of the time.”

outhouseperilous@lemmy.dbzer0.com · 9 days ago

You get how that’s fucking useless, generally?

jsomae@lemmy.ml · 9 days ago

Yes, that’s generally useless. It should not be shoved down people’s throats. 30% accuracy still has its uses, especially if the result can be programmatically verified.

Knock_Knock_Lemmy_In@lemmy.world · 8 days ago

Run something with a 70% failure rate 10x and you get to a cumulative ~97% pass rate. LLMs don’t get tired and they can be run in parallel.

jsomae@lemmy.ml · 8 days ago

The problem is they are not i.i.d., so this doesn’t really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we’re already looking at “agents,” so they’re probably already doing chain-of-thought.
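
To put rough numbers on the i.i.d. point, here is a minimal sketch. The task mix in the second case is made up for illustration and is not from the study; `pass_at_k` is just a helper name chosen here. Both cases average a 30% success rate per attempt, but in the second one the failures are concentrated in tasks that never succeed, so retrying buys much less.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


K = 10

# Case 1: every attempt on every task succeeds with probability 0.30,
# independently of the others (the i.i.d. assumption).
iid = pass_at_k(0.30, K)

# Case 2 (assumed mix, illustration only): tasks differ in difficulty.
# Half the tasks succeed 60% of the time per attempt, half never succeed.
# The average per-attempt success rate is still 30%.
task_probs = [0.60, 0.0]
mixed = sum(pass_at_k(p, K) for p in task_probs) / len(task_probs)

print(f"i.i.d. attempts: {iid:.3f}")    # 0.972
print(f"mixed difficulty: {mixed:.3f}")  # 0.500
```

Under this assumed mix, ten tries get you to about 50% of tasks solved rather than about 97%, because no number of retries helps on the tasks the model cannot do at all.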

Log in | Sign up@lemmy.world · 8 days ago

What’s 0.7^10?

outhouseperilous@lemmy.dbzer0.com · edited 9 days ago

Less broadly useful than 20 tons of mixed texture human shit, and more ecologically devastating.

jsomae@lemmy.ml · 9 days ago

Are you just trolling or do you seriously not understand how something which can do a task correctly with 30% reliability can be made useful if the result can be automatically verified?
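
A minimal sketch of what "automatically verified" could look like, under loud assumptions: `flaky_solver` is a hypothetical stand-in for an agent that is right about 30% of the time, the task (factoring a small number) is chosen only because checking it is trivial, and the stand-in's attempts are independent by construction, which real agents need not be.

```python
import random
from typing import Optional


def flaky_solver(n: int, rng: random.Random) -> tuple[int, int]:
    """Hypothetical stand-in for an agent: returns a correct factor pair
    of n about 30% of the time and a wrong guess otherwise."""
    if rng.random() < 0.30:
        for a in range(2, n):
            if n % a == 0:
                return a, n // a
    return 2, n  # deliberately wrong: 2 * n != n


def verify(n: int, answer: tuple[int, int]) -> bool:
    """Cheap programmatic check: both factors are nontrivial and multiply to n."""
    a, b = answer
    return 1 < a < n and 1 < b < n and a * b == n


def solve_with_retries(
    n: int, max_attempts: int, rng: random.Random
) -> Optional[tuple[int, int]]:
    """Keep asking the solver until the verifier accepts an answer.
    Returns None if no attempt passes verification."""
    for _ in range(max_attempts):
        answer = flaky_solver(n, rng)
        if verify(n, answer):
            return answer
    return None


rng = random.Random(0)  # fixed seed so the run is repeatable
result = solve_with_retries(91, max_attempts=50, rng=rng)
print(result)
```

The point of the design is that wrong answers never leave the loop: only output that passes `verify` is returned, so the unreliable part costs extra attempts rather than correctness. This only works for tasks where checking is much cheaper than solving.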