It is possible because training is already done with distributed methods inside the datacenter. It is not practical because of the sheer volume of network delay piled on top of the computation.
Does it even make sense to ask this? Is it reasonable or feasible?
I understand there are many nuances, such as the size and source of the training data, the size of the model (which would be far too large for any single browser to hold), network overhead, and the challenge of merging all the pieces back together, among others. Still, as a back-of-envelope estimate: published figures put GPT-3's training compute at roughly 3x10^23 FLOPs, which might (very speculatively) be equivalent to about 20,000 regular GPUs, each sustaining an average of 6 TFLOPs, training for ~30 days (which also sounds silly, I understand).
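To make that arithmetic explicit, here is a tiny Python sketch. The 3.14x10^23 figure is the training compute reported for GPT-3 175B; the 6 TFLOPs of sustained throughput per consumer GPU is my own assumption, and the sketch ignores utilization and communication overhead entirely.

```python
# Back-of-envelope: how many consumer GPUs match GPT-3's training compute?
# Assumes perfect utilization and zero communication overhead.

TOTAL_FLOPS = 3.14e23    # training compute reported for GPT-3 175B
GPU_FLOPS = 6e12         # assumed sustained throughput per consumer GPU (6 TFLOP/s)
SECONDS_PER_DAY = 86_400

def gpus_needed(days: float) -> float:
    """GPUs required to accumulate TOTAL_FLOPS within the given wall-clock time."""
    return TOTAL_FLOPS / (GPU_FLOPS * SECONDS_PER_DAY * days)

print(f"{gpus_needed(30):,.0f} GPUs for 30 days")    # ~20,000
print(f"{gpus_needed(180):,.0f} GPUs for 180 days")  # ~3,400
```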
Of course, these are naive and highly speculative calculations that don’t account for whether it’s even possible to split the dataset, model, and training process into manageable pieces across such a setup.
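For what it's worth, the simplest way I can imagine splitting the training process is plain data parallelism: every volunteer keeps a full copy of the model, trains on its own shard of data, and a coordinator averages the resulting updates each round. Below is a minimal sketch of just that averaging step, with numpy standing in for a real framework; everything here is illustrative, not how any particular system actually does it.

```python
import numpy as np

def average_updates(updates: list[np.ndarray]) -> np.ndarray:
    """Average the updates (gradients or weight deltas) sent by volunteers.

    This is the heart of naive data parallelism: each volunteer computes
    an update on its own data shard, and a coordinator merges them into
    a single global step.
    """
    return np.mean(np.stack(updates), axis=0)

# Hypothetical round: 3 volunteers, model flattened to 5 parameters.
rng = np.random.default_rng(0)
global_weights = rng.normal(size=5)
volunteer_updates = [rng.normal(scale=0.01, size=5) for _ in range(3)]

learning_rate = 1.0
global_weights -= learning_rate * average_updates(volunteer_updates)
print(global_weights)
```

The averaging itself is trivial; the hard part is that for a GPT-3-sized model each volunteer would have to ship the whole update over a home connection every round.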
But if this direction is not totally nonsensical, does it mean that, even with tremendous network overhead, there is huge potential for scaling (after all, there are an enormous number of laptops connected to the internet that could in principle be volunteered for training)?
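To put a rough number on that network overhead (assuming GPT-3-scale parameters, fp16 updates, a 20 Mbit/s home upload link, and no gradient compression, all of which are just my assumptions):

```python
# Rough per-round communication cost for naive data parallelism.
# All numbers are assumptions for illustration, not measurements.

PARAMS = 175e9                    # GPT-3-scale parameter count
BYTES_PER_PARAM = 2               # fp16 update
UPLOAD_BYTES_PER_SEC = 20e6 / 8   # assumed 20 Mbit/s home upload link

update_bytes = PARAMS * BYTES_PER_PARAM            # bytes per full update
upload_seconds = update_bytes / UPLOAD_BYTES_PER_SEC

print(f"Update size: {update_bytes / 1e9:.0f} GB")              # ~350 GB
print(f"Upload time per round: {upload_seconds / 3600:.0f} h")  # ~39 hours
```

So without aggressive compression or a much smaller model, a single synchronization round would cost a volunteer more than a day of upload time alone, which is exactly where the "sheer volume of network delay" bites.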