
How to Share a GPU Across Machines

📖 This guide was prepared by the ToolPazar team. All of our tools are free and ad-free.

Three ways to share, ordered by what you’re trying to do

You don’t want to combine GPUs — you want one fast GPU to serve a whole household or team. The right pattern is a model-serving daemon on the GPU host that exposes an HTTP endpoint everyone else points their tools at. Pick one of:

1. One GPU serves many clients (the 90% case)

The usual serving stacks for this pattern (Ollama, vLLM, and TGI) all speak the OpenAI HTTP wire format, so clients (Cursor, Continue.dev, custom scripts, agents) need only a base URL change to start using the shared GPU.
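
Concretely, that base URL change is the only code-level difference. A minimal sketch with the official openai Python client, where the host name gpu-box, port 8000, and the model name are placeholders for whatever your serving daemon actually exposes:

```python
from openai import OpenAI

# Same client code you would use against api.openai.com, pointed at the GPU host
# on your LAN. "gpu-box", port 8000, and the model name are placeholders; most
# local serving daemons ignore the API key unless you configure one.
client = OpenAI(base_url="http://gpu-box:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "Say hello from the shared GPU."}],
)
print(resp.choices[0].message.content)
```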

2. One model split across multiple GPUs on the same machine

If you have two 24-GB cards in one box and want to run a model that doesn’t fit on either card alone (say, a 70B model at 4-bit quantization, roughly 40 GB of weights), that’s tensor parallelism. Both vLLM and TGI handle it natively:
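
As a sketch of what that looks like through vLLM's Python API (the model name is a placeholder; when serving over the LAN you would instead pass the equivalent --tensor-parallel-size 2 flag to vLLM's OpenAI-compatible server):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards each weight matrix across both GPUs in the box.
# The model name is a placeholder; pick a checkpoint whose weights fit in 2 x 24 GB.
llm = LLM(model="your-quantized-70b-model", tensor_parallel_size=2)

params = SamplingParams(max_tokens=64)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```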

The two GPUs need to be in the same chassis with NVLink or at least PCIe 4.0 x16 each. Splitting one model across PCIe between machines is technically possible but latency-prohibitive — do that with pipeline parallelism (next section) instead of tensor parallelism.

3. One model split across multiple machines (pipeline parallelism)

Throughput math you should run before buying anything

The aggregate throughput of a serving stack is tokens-per-second-per-user multiplied by the number of concurrent users. Both numbers move with the model size and the request mix, so establish a rough single-GPU ballpark (measured with vLLM on an RTX 4090 with 24 GB, say) before committing to more hardware.
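
To make that multiplication concrete, here is the back-of-the-envelope arithmetic as a small script; the numbers at the bottom are illustrative placeholders, not measurements, so substitute figures you record on your own hardware:

```python
def aggregate_tokens_per_sec(per_user_tps: float, concurrent_users: int) -> float:
    """Serving throughput = tokens/sec/user multiplied by simultaneous users."""
    return per_user_tps * concurrent_users

def seconds_per_answer(answer_tokens: int, per_user_tps: float) -> float:
    """How long one user waits for a complete answer at a given decode rate."""
    return answer_tokens / per_user_tps

# Illustrative placeholders only; measure these on your own GPU and request mix.
per_user_tps = 30.0      # decode speed each user sees while others are active
concurrent_users = 6     # simultaneous in-flight requests
answer_tokens = 400      # typical completion length for your workload

print(f"aggregate: {aggregate_tokens_per_sec(per_user_tps, concurrent_users):.0f} tok/s")
print(f"one answer: ~{seconds_per_answer(answer_tokens, per_user_tps):.1f} s per user")
```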

The pattern: small models with continuous batching deliver near-linear scaling up to 4–8 simultaneous users. Past that the math depends on cache pressure and prompt length.

Network requirements (less than you think for case 1)

For the “one GPU, many clients” pattern, the network sees compact request / response tokens — usually 1–10 KB per round trip. A standard 1 GbE LAN handles 50+ concurrent users without breaking a sweat. The sensitive number is latency, not bandwidth: ping the GPU host from each client; if it’s under 5 ms, you’re fine. Wi-Fi is usable but adds a noticeable first-token delay versus wired.
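
Beyond a plain ping, you can time the first streamed token from a client machine, which captures the delay a user actually feels. A minimal sketch, assuming an OpenAI-compatible endpoint at a placeholder host, port, and model name:

```python
import time
from openai import OpenAI

# Placeholder host, port, and model; point at whatever your GPU host serves.
client = OpenAI(base_url="http://gpu-box:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="your-local-model",
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
)
for _chunk in stream:
    # The first streamed chunk approximates the first-token delay over your LAN.
    print(f"first token after {time.perf_counter() - start:.3f} s")
    break
```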

For tensor-parallel splitting across machines (rare, hard, and slow over PCIe-class hardware), you’re in 25 GbE+ territory. Skip it for home labs.

What about Apple Silicon, ROCm, and CPU-only hosts?

Auth and access control

The 30-minute starter
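
A minimal sketch of what that starter setup can look like, assuming Ollama as the serving daemon on the GPU host; the hostname gpu-box and the model tag are placeholders, and the host-side steps are shown as comments:

```python
# On the GPU host (shell, not Python):
#   OLLAMA_HOST=0.0.0.0 ollama serve    # listen on the LAN, not just localhost
#   ollama pull llama3.1:8b             # any model that fits your VRAM
#
# From any client machine, confirm the shared endpoint is reachable:
import json
from urllib.request import urlopen

# Ollama listens on port 11434 by default; /api/tags lists the models it has pulled.
with urlopen("http://gpu-box:11434/api/tags", timeout=5) as resp:
    models = json.load(resp)["models"]

print("shared GPU serves:", [m["name"] for m in models])
```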

That’s the foundation. Add vLLM later if you outgrow Ollama’s throughput, add a pod (Hyperspace / exo) when you outgrow a single GPU’s memory budget.