The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.
My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and run them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected through a 10GbE network. I can get a 10GbE network running for this project.
Does Ollama or any other local AI project support this? Getting a server motherboard with CPU is going to get expensive very quickly, but this would be a great alternative.
Thanks
There are several solutions:
https://github.com/b4rtaz/distributed-llama
https://github.com/exo-explore/exo
https://github.com/kalavai-net/kalavai-client
I haven’t tried any of them and haven’t looked for 6 months, so maybe something better has arrived…
Thank you for the links. I will go through them
I’ve tried Exo and it worked fairly well for me. Combined my 7900 XTX, GTX 1070, and M2 MacBook Pro.
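For anyone wanting to try the same thing: once the Exo nodes discover each other, the head node exposes a ChatGPT-style HTTP API you can call from anything on the network. Rough sketch below; the port and model name are assumptions (recent Exo builds default to 52415, but check whatever your install prints at startup).

```python
# Minimal sketch: query an Exo cluster via its OpenAI-compatible endpoint.
# Assumptions: Exo is already running on this host and listening on port 52415
# (check the startup log), and the model tag matches one your cluster serves.
import requests

resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "llama-3.2-3b",  # placeholder model tag
        "messages": [{"role": "user", "content": "Hello from my mixed GPU cluster"}],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```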
Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.
Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.
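To put rough numbers behind the x8 point (back-of-the-envelope only, using roughly 1 GB/s of usable bandwidth per lane for PCIe 3.0 and 2 GB/s for PCIe 4.0):

```python
# Back-of-the-envelope PCIe link bandwidth per link width.
GB_PER_S_PER_LANE = {"PCIe 3.0": 1.0, "PCIe 4.0": 2.0}  # approximate usable throughput

for gen, per_lane in GB_PER_S_PER_LANE.items():
    for lanes in (4, 8, 16):
        print(f"{gen} x{lanes}: ~{per_lane * lanes:.0f} GB/s")

# For inference the weights stay in VRAM after the initial load, so the
# steady-state traffic across the link is small even compared to an x4 link.
```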
Aren’t Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.
Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.
Hey, I built a micro-ATX Epyc system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack) board. I can find the details tomorrow if you’d like. Just let me know!
Edit: well, it looks like I remembered wrong and it was ATX, not micro-ATX. I think it’s the ASRock Rack ROMED8-2T, and it has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don’t think it’s sold anymore, other than at really high prices on eBay.
Thank you, and that highlights the problem: I don’t see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than buying Frankenstein boards from AliExpress, which isn’t going to be a thing for much longer with tariffs, so I’m looking elsewhere.
Yes, I inadvertently emphasized your challenge :-/
Wow, so you want to use inefficient models super cheap. I guarantee nobody has ever thought of this before. Good move coming to Lemmy for tips on how to do so. I bet you’re the next Sam Altman 🤣
I don’t understand your point, but I was going to use 4 GPUs (something like used 3090s when they get cheaper or the Arc B580s) to run the smaller models like Mistral small.
You’re entering the realm of enterprise AI horizontal scaling which is $$$$
I’m not going to do anything enterprise. I’m not sure why people keep framing it that way when I didn’t even mention it.
I plan to use 4 GPUs with 16-24GB VRAM each to run smaller 24B models.
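For what it’s worth, the weight-only VRAM math on a 24B model is easy to sanity check (rule of thumb only; KV cache and runtime overhead add several more GB on top):

```python
# Rule-of-thumb VRAM needed just for the weights of a 24B-parameter model.
# Real usage is higher once KV cache, activations and runtime overhead are added.
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8  # billions of params * bytes/param = GB

for bits in (16, 8, 4):
    print(f"24B model at {bits}-bit: ~{weight_vram_gb(24, bits):.0f} GB of weights")
# fp16 ~48 GB, 8-bit ~24 GB, 4-bit ~12 GB
```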
I didn’t say you were, I said you were asking about a topic that enters that area.
I see. Thanks
If you want to use supercomputer software, set up the Slurm scheduler on those machines. There are many tutorials on how to do distributed GPU computing with Slurm. It’s on my todo list.
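If you do try it, the usual pattern is to let Slurm launch one task per GPU and have the script read its rank from Slurm’s environment. A rough PyTorch-flavoured sketch (illustrative only; assumes it’s launched with srun, that NCCL can reach the other nodes over your network, and that MASTER_ADDR/MASTER_PORT are exported by the batch script):

```python
# Sketch: join a distributed process group from inside a Slurm-launched task.
# Assumes launch via `srun` (so the SLURM_* variables exist), PyTorch with NCCL
# on every node, and MASTER_ADDR/MASTER_PORT exported by the batch script.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total tasks in the job
local_rank = int(os.environ["SLURM_LOCALID"])  # task index on this node -> GPU index

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

# Quick sanity check that all GPUs can talk to each other.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # result should equal world_size on every rank
print(f"rank {rank}/{world_size}: all_reduce -> {x.item()}")

dist.destroy_process_group()
```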
https://github.com/SchedMD/slurm
https://slurm.schedmd.com/
Thanks, but I’m not going to run supercomputers. I just want to run 4 GPUs in separate machines to run 24B-30B models, because a single computer doesn’t have enough PCIe lanes.
I believe you can run 30B models on a single used RTX 3090 24GB; at least, I run 32B DeepSeek-R1 on it using Ollama. Just make sure you have enough RAM (> 24GB).
Heavily quantized?
I run this one: https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M with this frontend: https://github.com/open-webui/open-webui on a single RTX 3090 with 64GB of system RAM. It works quite well for what I wanted it to do. I wanted to connect 2x 3090 cards with Slurm to run 70B models but haven’t found the time to do it.
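If you ever want to script against that box instead of going through Open WebUI, Ollama also answers on its local HTTP API (default port 11434); a minimal sketch:

```python
# Minimal sketch: chat with the model above through a local Ollama instance.
# Assumes `ollama serve` is running on the default port 11434 and the
# deepseek-r1:32b-qwen-distill-q4_K_M tag has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
        "messages": [{"role": "user", "content": "In one paragraph, what does 4-bit quantization trade away?"}],
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```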
I see. Thanks
Why?
You’re trying to run a DC setup in your home for AI bullshit?
It is because modern consumer GPUs do not have enough VRAM to load 24B models. I want to run Mistral Small locally.
I assume you’re talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I’m mostly pointing out how pointless it is for you to do this. What a waste of time and money.
Edit: others are also pointing this out, but I’m still being downvoted. Mkay.
Used 3090s go for $800. I was planning to wait for the Arc B580s to come down in price and buy a few. The reason for the networked setup is that I couldn’t find enough PCIe lanes in any of the used computers I was looking at. If there’s either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with a lot of PCIe lanes under $200, then I’ll gladly drop the idea of distributed AI. I just need lots of VRAM and this is the only way I could think of.
Thanks
PLEASE look back at the crypto mining rush of a decade ago. I implore you.
You’re buying into something that doesn’t exist.