Since Nvidia owns most of the GPGPU market and has top-notch networking and interconnect, I wonder if they have a plan to own all datacenter hardware in the future. Maybe they also plan to release CPUs, motherboards, storage and whatever else is needed.
Google has its own TPUs and doesn't really use GPUs except to sell them to end customers on cloud, I think. So using Nvidia networking for Nvidia GPUs across many machines on cloud is really just a reflection of what external customers want to buy.
Disclaimer: I work at Google but have no non-public info about this.
If they have other opportunities for investment with higher margins, they should seize those, of course. And perhaps even call up investors for more capital, if required.
https://ultraethernet.org/wp-content/uploads/sites/20/2023/0...
https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...
Google's DC networking is interesting because of how deeply integrated it is into the entire software stack. Click on some of the links and you'll see it mentions SDN (Software-Defined Networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago this was (IIRC) 40Gbps connections. It's probably 100Gbps now, but that's just a guess.
But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.
Anyway, from reading this it doesn't sound like Google is abandoning their custom TPU silicon (i.e. it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter. That's probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time. I would assume that Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.
Disclaimer: Xoogler. I didn't work in networking though.
[I know nothing about Jupiter, and little about RDMA in practice, but I used ConnectX for VMA, Nvidia's hardware-accelerated kernel-bypass tech.]
Otherwise the outcome is really bad GPU-GPU latency & bandwidth between machines. My understanding is ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.
RoCE (RDMA over Converged Ethernet) is basically IB over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5 based Dell Z9864F switch) for our own 400G cluster.
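If anyone wants to poke at this on their own boxes, here's a minimal sketch (Python) that lists the RDMA-capable NICs a host can see, ConnectX or Thor alike, by reading the rdma-core sysfs entries on Linux. The exact paths and value strings can vary by driver and kernel, so treat it as illustrative rather than definitive:

    from pathlib import Path

    IB_SYSFS = Path("/sys/class/infiniband")  # populated by the RDMA drivers (mlx5, bnxt_re, ...)

    def read(p: Path) -> str:
        try:
            return p.read_text().strip()
        except OSError:
            return "n/a"

    if not IB_SYSFS.exists():
        print("no RDMA devices visible -- is the NIC driver / rdma-core loaded?")
    else:
        for dev in sorted(IB_SYSFS.iterdir()):
            for port in sorted((dev / "ports").iterdir()):
                state = read(port / "state")        # e.g. "4: ACTIVE"
                rate = read(port / "rate")          # e.g. "400 Gb/sec"
                link = read(port / "link_layer")    # "Ethernet" -> RoCE, "InfiniBand" -> IB
                print(f"{dev.name} port {port.name}: {state}, {rate}, link_layer={link}")

The link_layer field is the quick way to tell whether a given port is doing RoCE or native IB.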
Just goes to show how drastic and extraordinary these levels of scale can be.
Oh, and make your data centers smaller. Not so big they can be seen on Google Maps. Because otherwise you will be unable to move those whale-sized workloads to an alternative.
https://youtu.be/mDNHK-SzXEM?t=564
https://news.ycombinator.com/item?id=35713001
"Unmasking Google Cloud: How to Determine if a Region Supports Physical Zone Separation" - https://cagataygurturk.medium.com/unmasking-google-cloud-how...
If I check London (where europe-west2 is kinda located) on Google Maps right now, I can easily discern manhole covers or people. If I check Jakarta (asia-southeast2), things smaller than a car get hard to make out, but you can definitely see them.
The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.
Larger data centres actually improve availability through several mechanisms: having more power components such as generators means the failure of any one costs a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.
Some large clouds have two update domains and five fault domains on top of three zones that are more than 10km apart. You can't beat ~30 individual partitions with your own data centres at a reasonable cost!
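Back-of-the-envelope version of that redundancy argument, with made-up availability numbers purely for illustration (Python):

    from math import comb

    def p_at_least(n: int, k: int, p_up: float) -> float:
        """P(at least k of n independent components are up)."""
        return sum(comb(n, i) * p_up**i * (1 - p_up)**(n - i) for i in range(k, n + 1))

    p_up = 0.99  # assumed availability of a single generator -- illustrative only

    # One big generator: any failure is a total blackout.
    print(f"1 generator, need 1:    {p_at_least(1, 1, p_up):.4f}")

    # N+1 style: eleven smaller generators, only ten needed to carry the load.
    print(f"11 generators, need 10: {p_at_least(11, 10, p_up):.4f}")

    # And the blast radius of losing any single one of ~30 partitions:
    print(f"capacity lost per partition: {1/30:.1%}")

The exact numbers don't matter; the point is that many smaller fault domains turn a total outage into a few percent of lost capacity.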
It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.
It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.
In my experience, the set of issues that would affect two buildings close to each other but not two buildings a mile apart is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), plus things like building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).
Everything else is either done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last-mile fiber or power cuts, inclement weather, regional power starvation, etc.).
There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?
It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.
Like, all cloud providers still have regional outages.
> One of the vanishingly small set of issues
At your scale, this attitude is even more concerning since the rare event at scale is not rare anymore.
That concept is useful when the number of things you run is on the same order of magnitude as the inverse of the failure rate. But we clearly don't have that here, because even at scale these events aren't common. Like I said, there have been fewer than a handful across all cloud providers over a decade.
Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.
Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.
There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.
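For the dual-region storage bit, the shape of it is roughly this (Python with google-cloud-storage; the project and bucket names are placeholders, and NAM4 is one of GCP's predefined dual-regions -- check the current docs for valid pairings):

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client(project="my-project")  # placeholder project ID

    # Dual-region bucket: objects are replicated across two specific regions,
    # e.g. NAM4 pairs us-central1 with us-east1.
    bucket = client.create_bucket("my-dr-bucket", location="NAM4")  # placeholder bucket name

    print(bucket.name, bucket.location, bucket.location_type)

You get a single bucket endpoint whose data survives the loss of either region, without building the replication yourself.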
If you use Google Cloud... with the 100 ms of latency that will add to every interaction...
On GCP it sounds like you want a multi-region architecture, not multi-zone (if you want firewalls outside the same data center).
> Resources that live in a zone, such as virtual machine instances or zonal persistent disks, are referred to as zonal resources. Other resources, like static external IP addresses, are regional. Regional resources can be used by any resource in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.
https://cloud.google.com/compute/docs/regions-zones
(No affiliation with Google, just had a similar confusion at one point)
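If it helps make the zone/region split concrete, here's a quick sketch that groups zones by region from gcloud's JSON output (Python; assumes an installed, authenticated gcloud CLI, and "my-project" is a placeholder):

    import json
    import subprocess
    from collections import defaultdict

    out = subprocess.run(
        ["gcloud", "compute", "zones", "list", "--project", "my-project", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout

    zones_by_region = defaultdict(list)
    for zone in json.loads(out):
        region = zone["region"].rsplit("/", 1)[-1]  # the region field is a full URL
        zones_by_region[region].append(zone["name"])

    # Zonal resources (VMs, zonal persistent disks) live in exactly one of these
    # zones; regional resources (e.g. static external IPs) span the whole region.
    for region, zones in sorted(zones_by_region.items()):
        print(f"{region}: {', '.join(sorted(zones))}")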
Is that the worst for training? Namely: do superior solutions exist?
Can't wait to see the FreeBSD Netflix version of that post.
This also goes back to how increasing throughput is relatively easy and has a very strong roadmap, while increasing storage is difficult. I notice YouTube has been serving higher-bitrate H.264 video in recent years, rather than storing yet another copy of the files in VP9 or AV1 unless they are 2K+.
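Rough arithmetic on that trade-off; every number below is an assumption for illustration, not anything YouTube has published (Python):

    # Illustrative assumptions only -- not real YouTube figures.
    catalog_hours        = 1_000_000     # hours of 1080p video in the catalog
    h264_mbps            = 8.0           # assumed H.264 1080p bitrate
    vp9_mbps             = 5.0           # assumed VP9 bitrate at similar quality
    hours_watched_per_yr = 50_000_000    # assumed annual watch time for that catalog

    def petabytes(hours: float, mbps: float) -> float:
        """Hours of video at a given Mbit/s, expressed in petabytes."""
        return hours * 3600 * mbps / 8 / 1e9  # Mbit -> MB -> PB

    extra_storage = petabytes(catalog_hours, vp9_mbps)                      # one more copy, kept forever
    egress_saved  = petabytes(hours_watched_per_yr, h264_mbps - vp9_mbps)   # per year, if VP9 is served

    print(f"extra storage for a VP9 copy: {extra_storage:.1f} PB")
    print(f"egress saved per year:        {egress_saved:.1f} PB")

Whether the extra copy pays for itself hinges entirely on how often the catalog actually gets watched, which is presumably why the extra encodes are reserved for the higher resolutions.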
Choose any two.
You absolutely can have speed, scale and reliability. You can't have speed, scale, reliability and low cost.