The problem is simple: consumer motherboards don’t have that many PCIe slots, and consumer CPUs don’t have enough lanes to run 3+ GPUs at full PCIe gen 3 or gen 4 speeds.
My idea was to buy 3-4 cheap computers, slot a GPU into each of them, and run them in tandem. I imagine this will require some sort of agent running on each node, with the nodes connected through a 10GbE network. I can get a 10GbE network running for this project.
Does Ollama or any other local AI project support this? Getting a server motherboard with CPU is going to get expensive very quickly, but this would be a great alternative.
Thanks
There are several solutions:
https://github.com/b4rtaz/distributed-llama
https://github.com/exo-explore/exo
https://github.com/kalavai-net/kalavai-client
I haven’t tried any of them and haven’t looked for 6 months, so maybe something better has arrived…
Thank you for the links. I will go through them
I’ve tried Exo and it worked fairly well for me. Combined my 7900 XTX, GTX 1070, and M2 MacBook Pro.
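For anyone wanting to try the same thing: once the Exo nodes discover each other, the head node exposes a ChatGPT-style HTTP API you can call from anything on the network. Rough sketch below; the port and model name are assumptions (recent Exo builds default to 52415, but check whatever your install prints at startup).

```python
# Minimal sketch: query an Exo cluster via its OpenAI-compatible endpoint.
# Assumptions: Exo is already running on this host and listening on port 52415
# (check the startup log), and the model tag matches one your cluster serves.
import requests

resp = requests.post(
    "http://localhost:52415/v1/chat/completions",
    json={
        "model": "llama-3.2-3b",  # placeholder model tag
        "messages": [{"role": "user", "content": "Hello from my mixed GPU cluster"}],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```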
Basically no GPU needs a full PCIe x16 slot to run at full speed. There are motherboards out there which will give you 3 or 4 slots of PCIe x8 electrical (x16 physical). I would look into those.
Edit: If you are willing to buy a board that supports AMD Epyc processors, you can get boards with basically as many PCIe slots as you could ever hope for. But that is almost certainly overkill for this task.
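To put rough numbers behind the x8 point (back-of-the-envelope only, using roughly 1 GB/s of usable bandwidth per lane for PCIe 3.0 and 2 GB/s for PCIe 4.0):

```python
# Back-of-the-envelope PCIe link bandwidth per link width.
GB_PER_S_PER_LANE = {"PCIe 3.0": 1.0, "PCIe 4.0": 2.0}  # approximate usable throughput

for gen, per_lane in GB_PER_S_PER_LANE.items():
    for lanes in (4, 8, 16):
        print(f"{gen} x{lanes}: ~{per_lane * lanes:.0f} GB/s")

# For inference the weights stay in VRAM after the initial load, so the
# steady-state traffic across the link is small even compared to an x4 link.
```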
Aren’t Epyc boards really expensive? I was going to buy 3-4 used computers and stuff a GPU in each.
Are there motherboards on the used market that can run the E5-2600 v4 series CPUs and have multiple PCIe x16 slots? The only ones I found were super expensive/esoteric.
Hey, I built a micro-ATX Epyc system for work that has tons of PCIe slots. Pretty sure it was an ASRock (or ASRock Rack) board. I can find the details tomorrow if you’d like. Just let me know!
Edit: well, it looks like I remembered wrong and it was ATX, not micro-ATX. I think it’s the ASRock Rack ROMED8-2T, and it has 7 PCIe 4.0 x16 slots (I needed a lot). Unfortunately I don’t think it’s sold anymore, other than at really high prices on eBay.
Thank you, and that highlights the problem: I don’t see any affordable options (around $200 or so for a motherboard + CPU combo) with a lot of PCIe lanes, other than buying Frankenstein boards from AliExpress, which isn’t going to be a thing for much longer with tariffs, so I’m looking elsewhere.
Yes, I inadvertently emphasized your challenge :-/
Wow, so you want to use inefficient models super cheap. I guarantee nobody has ever thought of this before. Good move coming to Lemmy for tips on how to do so. I bet you’re the next Sam Altman 🤣
I don’t understand your point, but I was going to use 4 GPUs (something like used 3090s when they get cheaper or the Arc B580s) to run the smaller models like Mistral small.
You’re entering the realm of enterprise AI horizontal scaling which is $$$$
I’m not going to do anything enterprise. I’m not sure why people keep framing it that way when I didn’t even mention it.
I plan to use 4 GPUs with 16-24GB VRAM each to run smaller 24B models.
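For what it’s worth, the weight-only VRAM math on a 24B model is easy to sanity check (rule of thumb only; KV cache and runtime overhead add several more GB on top):

```python
# Rule-of-thumb VRAM needed just for the weights of a 24B-parameter model.
# Real usage is higher once KV cache, activations and runtime overhead are added.
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * bits_per_weight / 8  # billions of params * bytes/param = GB

for bits in (16, 8, 4):
    print(f"24B model at {bits}-bit: ~{weight_vram_gb(24, bits):.0f} GB of weights")
# fp16 ~48 GB, 8-bit ~24 GB, 4-bit ~12 GB
```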
I didn’t say you were, I said you were asking about a topic that enters that area.
I see. Thanks
If you want to use supercomputer software, set up the Slurm scheduler on those machines. There are many tutorials on how to do distributed GPU computing with Slurm. It’s on my todo list.
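If you do try it, the usual pattern is to let Slurm launch one task per GPU and have the script read its rank from Slurm’s environment. A rough PyTorch-flavoured sketch (illustrative only; assumes it’s launched with srun, that NCCL can reach the other nodes over your network, and that MASTER_ADDR/MASTER_PORT are exported by the batch script):

```python
# Sketch: join a distributed process group from inside a Slurm-launched task.
# Assumes launch via `srun` (so the SLURM_* variables exist), PyTorch with NCCL
# on every node, and MASTER_ADDR/MASTER_PORT exported by the batch script.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total tasks in the job
local_rank = int(os.environ["SLURM_LOCALID"])  # task index on this node -> GPU index

dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

# Quick sanity check that all GPUs can talk to each other.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # result should equal world_size on every rank
print(f"rank {rank}/{world_size}: all_reduce -> {x.item()}")

dist.destroy_process_group()
```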
https://github.com/SchedMD/slurm
https://slurm.schedmd.com/
Thanks, but I’m not going to run supercomputers. I just want to run 4 GPUs in separate machines to run 24B-30B models, because a single computer doesn’t have enough PCIe lanes.
I believe you can run 30B models on a single used RTX 3090 24GB; at least, I run 32B DeepSeek-R1 on it using Ollama. Just make sure you have enough RAM (> 24GB).
Heavily quantized?
I run this one: https://ollama.com/library/deepseek-r1:32b-qwen-distill-q4_K_M with this frontend: https://github.com/open-webui/open-webui on a single RTX 3090 with 64GB of system RAM. It works quite well for what I wanted it to do. I wanted to connect 2x 3090 cards with Slurm to run 70B models but haven’t found the time to do it.
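If you ever want to script against that box instead of going through Open WebUI, Ollama also answers on its local HTTP API (default port 11434); a minimal sketch:

```python
# Minimal sketch: chat with the model above through a local Ollama instance.
# Assumes `ollama serve` is running on the default port 11434 and the
# deepseek-r1:32b-qwen-distill-q4_K_M tag has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
        "messages": [{"role": "user", "content": "In one paragraph, what does 4-bit quantization trade away?"}],
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```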
I see. Thanks
Why?
You’re trying to run a DC setup in your home for AI bullshit?
It is because modern consumer GPUs do not have enough VRAM to load 24B models. I want to run Mistral Small locally.
I assume you’re talking about a CUDA implementation here. There are ways to do this with that system, and even sub-projects that expand on it. I’m mostly pointing out how pointless it is for you to do this. What a waste of time and money.
Edit: others are also pointing this out, but I’m still being downvoted. Mkay.
Used 3090s go for $800. I was planning to wait for the Arc B580s to come down in price and buy a few. The reason for the networked setup is that I couldn’t find enough PCIe lanes in any of the used computers I was looking at. If there’s either an affordable card with good performance and 48GB of VRAM, or an affordable motherboard + CPU combo with a lot of PCIe lanes under $200, then I’ll gladly drop the idea of distributed AI. I just need lots of VRAM and this is the only way I could think of.
Thanks
PLEASE look back at the crypto mining rush of a decade ago. I implore you.
You’re buying into something that doesn’t exist.