Run something with a 70% failure rate 10x and you get to a cumulative 97% pass rate. LLMs don’t get tired and they can be run in parallel.
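The arithmetic behind that claim can be spelled out directly. This is a sketch that assumes the 10 runs are genuinely independent (an assumption disputed later in the thread):

```python
# Probability of at least one success in n independent runs,
# assuming a fixed 70% per-run failure rate (i.i.d. assumption).
p_fail = 0.7
n_runs = 10

p_all_fail = p_fail ** n_runs       # every single run fails: ~0.028
p_at_least_one = 1 - p_all_fail     # at least one run succeeds: ~0.97

print(f"all fail: {p_all_fail:.4f}, at least one success: {p_at_least_one:.4f}")
```

Note the result is closer to 97% than 98%: 0.7^10 ≈ 0.028, not 0.02.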
The problem is they are not i.i.d., so this doesn’t really work. It works a bit, which is in my opinion why chain-of-thought is effective (it gives the LLM a chance to posit a couple answers first). However, we’re already looking at “agents,” so they’re probably already doing chain-of-thought.
Very fair comment. In my experience, even when increasing the temperature, you get stuck in local minima.
I was just trying to illustrate how 70% failure rates can still be useful.
What’s 0.7^10?
About 0.03
So the chances of it being right ten times in a row are 3%.
No, the chances of being wrong 10x in a row are 3%. So the chances of being right at least once are 97%.
Ah, my bad, you’re right: for being consistently correct, I should have done 0.3^10 = 0.0000059049
so the chances of it being right ten times in a row are less than one thousandth of a percent.
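The two quantities being conflated in this exchange can be computed side by side. A minimal sketch, again assuming i.i.d. runs with a 30% per-run success rate:

```python
# "Right every time" vs "right at least once" over n independent runs,
# assuming a 30% per-run success rate (i.i.d. assumption).
p_success = 0.3
n = 10

all_correct = p_success ** n              # correct on every run: ~5.9e-06
at_least_one = 1 - (1 - p_success) ** n   # correct at least once: ~0.97

print(f"all correct: {all_correct}, at least one correct: {at_least_one:.4f}")
```

The gap between the two numbers is the whole point: retrying helps enormously if you can verify answers, and not at all if you need every answer to be right.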
No wonder I couldn’t get it to summarise my list of data right and it was always lying by the 7th row.
That looks better. Even with a fair coin, 10 heads in a row is almost impossible.
And if you are feeding the output back into a new instance of a model then the quality is highly likely to degrade.
Whereas if you ask a human to do the same thing ten times, the probability that they get all ten right is astronomically higher than 0.0000059049.
Dunno. Ask 10 humans at random to do a task and probably one will do it better than the AI. Just not as fast.
don’t you dare understand the explicitly obvious reasons this technology can be useful and the essential differences between P and NP problems. why won’t you be angry >:(