The biggest issue with generative AI, at least to me, is that it’s trained on human-made works whose original authors never consented to, or even knew about, their work being used to train the AI. Are there any initiatives to address this? I’m thinking of something like an open-source AI model and training-data store containing only works that are public domain or under highly permissive no-attribution licenses, along with original works submitted by the open-source community and explicitly licensed to allow AI training.

I guess the hard part is moderating the database: ensuring all works are properly licensed and that people are actually submitting their own works. But does anything like this exist?
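The moderation step described above could start with something as simple as an allowlist of training-friendly licenses plus an authorship attestation. A minimal sketch, with all names (`Work`, `ALLOWED_LICENSES`, `accept_for_training`) being hypothetical illustrations rather than any real project's API:

```python
# Hypothetical sketch: filter submitted works by an allowlist of
# AI-training-friendly licenses and an authorship attestation.
from dataclasses import dataclass

# Assumed policy: only public-domain-equivalent, no-attribution licenses.
ALLOWED_LICENSES = {"CC0-1.0", "Unlicense", "public-domain"}

@dataclass
class Work:
    title: str
    license_id: str        # SPDX-style identifier supplied by the submitter
    author_attested: bool  # submitter attests the work is their own

def accept_for_training(work: Work) -> bool:
    """Accept only works with an allowed license and an authorship attestation."""
    return work.license_id in ALLOWED_LICENSES and work.author_attested

submissions = [
    Work("Sunset photo", "CC0-1.0", True),
    Work("Blog post", "CC-BY-NC-4.0", True),   # non-commercial license: rejected
    Work("Poem", "public-domain", False),      # no authorship attestation: rejected
]
accepted = [w.title for w in submissions if accept_for_training(w)]
print(accepted)  # ['Sunset photo']
```

Of course, an attestation flag only records a claim; catching people submitting works that aren't theirs is the genuinely hard part.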

  • trxxruraxvr@lemmy.world · 27 days ago

    I’d say the biggest issue with generative AI is the energy use and the fact that it’s increasing the rate at which we’re destroying the climate and our planet.

    • Ceedoestrees@lemmy.world · 27 days ago

      Do we know how energy usage of AI compares to other daily tasks?

      Like: rendering a minute of a fully animated film, flying from L.A. to New York, watching a whole series on Netflix, scrolling this site for an hour, or manufacturing a bottle of tylenol?

      How does asking AI “2+2” compare to generating a three second animation in 1080p? There has to be a wide gamut of energy use per task.

      And then the impact depends on where your energy comes from. Which is a whole other thing: we should be demanding cleaner, more efficient energy sources.

      A quick search on AI energy consumption brings up a list of articles repeating the mantra that it’s substantial, but the sources are vague or non-existent. None provide enough detail to confidently answer any of the above questions.

      That’s not to say AI doesn’t consume significant power; it’s that most people don’t regulate their lives by energy consumption.
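The per-task comparison asked for above can at least be framed as back-of-envelope arithmetic. The figures below are rough, assumed ballpark estimates for illustration only (commonly cited orders of magnitude, not measured data):

```python
# Back-of-envelope energy comparison. All Wh figures are assumed
# ballpark estimates for illustration, not measurements.
WH_PER_TASK = {
    "one LLM query (assumed ~0.3 Wh)": 0.3,
    "one hour of video streaming (assumed ~80 Wh)": 80.0,
    "one LA-to-NYC flight seat (assumed ~500 kWh)": 500_000.0,
}

def queries_equivalent(task_wh: float, wh_per_query: float = 0.3) -> float:
    """How many LLM queries match a task's energy, under the assumptions above."""
    return task_wh / wh_per_query

for task, wh in WH_PER_TASK.items():
    print(f"{task}: roughly {queries_equivalent(wh):,.0f} queries")
```

Under these assumptions, an hour of streaming is on the order of a few hundred queries, which is why the "where does your energy come from" question arguably matters more than any single task.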

      • kadup@lemmy.world · 27 days ago

        We do have fairly precise numbers for how much energy it takes to train the models on the best GPUs available, and slightly less precise but still reasonable estimates of how much it costs to run the servers users interact with.

        It’s extremely high, but not different from what it would be like if these were cloud gaming or 3D rendering servers.

        The main question is usually whether it’s worth it, and that’s highly subjective.

        • Ceedoestrees@lemmy.world · 27 days ago

          That’s my point. We don’t allocate energy resources based on importance, and when this argument is brought up there’s no scale for comparison when someone says AI, specifically, is destroying the planet.

  • Artisian@lemmy.world · 27 days ago

    As I understand it, there are many such models, especially those made for academic use. Some common training corpora are listed here: https://www.tensorflow.org/datasets

    Examples include Wikipedia edits and discussions, and open-access scientific articles.

    Almost all research models are going to be trained on stuff like this. Many of them have demos, open code, and local installation instructions. They generally don’t have a marketing budget. Some of the models listed here certainly qualify: https://github.com/eugeneyan/open-llms?tab=readme-ov-file

    Both of these lists are not so difficult to get on, so I imagine some entries have trouble with falsification or mislabeling, as you point out. But there’s little reason for people to do so (beyond improving a paper’s results, I guess).

    Art generation seems to have had a harder time, but there are Stable Diffusion equivalents trained only on CC-licensed work. A few minutes of searching turned up Common Canvas, which claims to be competitive.

  • solrize@lemmy.ml · 27 days ago

    For text generation the result would be almost useless, since most public-domain works are very old. For images, you could maybe train on video feeds.

  • General_Effort@lemmy.world · 27 days ago (edited)

    For images, yes. Most notable is probably Adobe. Their AI, which powers Photoshop’s generative fill among other things, is trained on public-domain and licensed works.

    For text, there’s nothing similar. LLMs get better the more data you have, so the less training data you use, the less useful they are. I think there are one or a few small models for research purposes, but that really doesn’t get you there.

    Of course, such open source projects are tricky. When you take these extreme, maximalist views of (intellectual) property, then giving stuff away for free isn’t the obvious first step.

    • kadup@lemmy.world · 27 days ago

      It’s also very hard to keep track of licenses for text-based content on the internet. Do most users know the default license for their comments on Reddit? On Facebook? In the comments section of a random blog? For the title of their Medium post? And so on.

      • General_Effort@lemmy.world · 27 days ago

        The usual tends to be that the platform can do basically whatever. That shouldn’t really be surprising. But I see your point. If you literally want consent, not just legally licensed material, then you need more than just a clause in the TOS.

        You could raise the same issue with permissively licensed material. People who released it may not have foreseen AI training as a use, and might not have wanted to actually allow it.

        • kadup@lemmy.world · 27 days ago

          Exactly, the platform owner can usually do everything. Can a third-party crawler? I don’t know.

          • General_Effort@lemmy.world · 27 days ago

            You mean legally? Yeah, no problem. It depends on the location, though. In the EU, the rights-holder can opt out, so if you want to do it in the EU you have to pay off Reddit, Meta, and so on. In Japan, it’s fine regardless. In the US, it should turn out similarly, but it’s up to the courts to work out the details, and it’s quite up in the air whether you can trust the system to work.
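In practice, the closest thing to a machine-readable opt-out today is blocking known AI crawlers in robots.txt. This is only a voluntary convention, and whether it satisfies the EU's opt-out requirement is still unsettled; `GPTBot` and `CCBot` are real crawler user agents, but the list of agents worth blocking changes over time:

```
# robots.txt — ask known AI-training crawlers not to fetch this site.
# Purely advisory: compliance is voluntary, legal effect is unsettled.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```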

  • sturlabragason@lemmy.world · 27 days ago

    Hey PubDomainLLM tell me something that only exists in that proprietary dataset? “I’m sorry, you’ve caught me lackin’”

    You would want your LLM trained on as comprehensive a dataset as you can get. But I'd suggest we come up with better ways to license proprietary works for uses like this, instead of walling them up into the cable-TV of proprietary knowledge gardens.

    I agree with you partially in principle, but not in practice.

    Ultimately we want LLMs that are as smart as we can make them. Just compare the best models with the mediocre ones, or use them all day long: there is a vast difference.