It is possible because training is already done with distributed methods inside the datacenter. It is not practical because of the sheer volume of network delay piled on top of the computation.
Does it even make sense to ask this? Is it reasonable or feasible?
I understand there are many nuances, such as the size and source of the training data, the size of the model (which would be far too large for any single browser to hold), network overhead, and the challenge of merging all the pieces back together, among others. Still, as a back-of-envelope estimate: published figures put GPT-3's training compute at roughly 3x10^23 FLOPs, which might (very speculatively) be equivalent to about 20,000 regular GPUs, each sustaining an average of 6 TFLOPs, training for ~30 days (which also sounds silly, I understand).
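To make that arithmetic explicit, here is a tiny Python sketch. The 3.14x10^23 figure is the training compute reported for GPT-3 175B; the 6 TFLOPs of sustained throughput per consumer GPU is my own assumption, and the sketch ignores utilization and communication overhead entirely.

```python
# Back-of-envelope: how many consumer GPUs match GPT-3's training compute?
# Assumes perfect utilization and zero communication overhead.

TOTAL_FLOPS = 3.14e23    # training compute reported for GPT-3 175B
GPU_FLOPS = 6e12         # assumed sustained throughput per consumer GPU (6 TFLOP/s)
SECONDS_PER_DAY = 86_400

def gpus_needed(days: float) -> float:
    """GPUs required to accumulate TOTAL_FLOPS within the given wall-clock time."""
    return TOTAL_FLOPS / (GPU_FLOPS * SECONDS_PER_DAY * days)

print(f"{gpus_needed(30):,.0f} GPUs for 30 days")    # ~20,000
print(f"{gpus_needed(180):,.0f} GPUs for 180 days")  # ~3,400
```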
Of course, these are naive and highly speculative calculations that don’t account for whether it’s even possible to split the dataset, model, and training process into manageable pieces across such a setup.
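For what it's worth, the simplest way I can imagine splitting the training process is plain data parallelism: every volunteer keeps a full copy of the model, trains on its own shard of data, and a coordinator averages the resulting updates each round. Below is a minimal sketch of just that averaging step, with numpy standing in for a real framework; everything here is illustrative, not how any particular system actually does it.

```python
import numpy as np

def average_updates(updates: list[np.ndarray]) -> np.ndarray:
    """Average the updates (gradients or weight deltas) sent by volunteers.

    This is the heart of naive data parallelism: each volunteer computes
    an update on its own data shard, and a coordinator merges them into
    a single global step.
    """
    return np.mean(np.stack(updates), axis=0)

# Hypothetical round: 3 volunteers, model flattened to 5 parameters.
rng = np.random.default_rng(0)
global_weights = rng.normal(size=5)
volunteer_updates = [rng.normal(scale=0.01, size=5) for _ in range(3)]

learning_rate = 1.0
global_weights -= learning_rate * average_updates(volunteer_updates)
print(global_weights)
```

The averaging itself is trivial; the hard part is that for a GPT-3-sized model each volunteer would have to ship the whole update over a home connection every round.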
But if this direction is not totally nonsensical, does it mean that, even with tremendous network overhead, there is huge potential for scaling (after all, there are an enormous number of laptops connected to the internet that could in principle be volunteered for training)?
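To put a rough number on that network overhead (assuming GPT-3-scale parameters, fp16 updates, a 20 Mbit/s home upload link, and no gradient compression, all of which are just my assumptions):

```python
# Rough per-round communication cost for naive data parallelism.
# All numbers are assumptions for illustration, not measurements.

PARAMS = 175e9                    # GPT-3-scale parameter count
BYTES_PER_PARAM = 2               # fp16 update
UPLOAD_BYTES_PER_SEC = 20e6 / 8   # assumed 20 Mbit/s home upload link

update_bytes = PARAMS * BYTES_PER_PARAM            # bytes per full update
upload_seconds = update_bytes / UPLOAD_BYTES_PER_SEC

print(f"Update size: {update_bytes / 1e9:.0f} GB")              # ~350 GB
print(f"Upload time per round: {upload_seconds / 3600:.0f} h")  # ~39 hours
```

So without aggressive compression or a much smaller model, a single synchronization round would cost a volunteer more than a day of upload time alone, which is exactly where the "sheer volume of network delay" bites.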