Since Nvidia owns most of the GPGPU market and has top-notch networking and interconnect, I wonder if they have a plan to own all datacenter hardware in the future. Maybe they also plan to release CPUs, motherboards, storage and whatever else is needed.
Google has its own TPUs and doesn't really use GPUs except to sell them to end customers on cloud, I think. So using Nvidia networking for Nvidia GPUs across many machines on cloud is really just a reflection of what external customers want to buy.
Disclaimer: I work at Google but have no non-public info about this.
If they have other opportunities for investment with higher margins, they should seize those, of course. And perhaps even call up investors for more capital, if required.
https://ultraethernet.org/wp-content/uploads/sites/20/2023/0...
https://www.youtube.com/live/Y2F8yisiS6E?si=GbyzzIG8w-mtS7s-...
Google's DC networking is interesting because of how deeply integrated it is into the entire software stack. Click on some of the links and you'll see it mentions SDN (Software-Defined Networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago this was (IIRC) 40Gbps connections. It's probably 100Gbps now, but that's just a guess.
But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.
Anyway, from reading this it doesn't sound like Google is abandoning their custom TPU silicon (i.e. it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter. That's probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time. I would assume that Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.
Disclaimer: Xoogler. I didn't work in networking though.
[I know nothing about Jupiter, and little about RDMA in practice, but I used ConnectX for VMA, Nvidia's hardware-accelerated kernel-bypass tech.]
Otherwise the outcome is really bad GPU-GPU latency & bandwidth between machines. My understanding is ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.
RoCE (RDMA over Converged Ethernet) is basically IB over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5 based Dell Z9864F switch) for our own 400G cluster.
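If anyone wants to poke at this on their own boxes, here's a minimal sketch (Python) that lists the RDMA-capable NICs a host can see, ConnectX or Thor alike, by reading the rdma-core sysfs entries on Linux. The exact paths and value strings can vary by driver and kernel, so treat it as illustrative rather than definitive:

    from pathlib import Path

    IB_SYSFS = Path("/sys/class/infiniband")  # populated by the RDMA drivers (mlx5, bnxt_re, ...)

    def read(p: Path) -> str:
        try:
            return p.read_text().strip()
        except OSError:
            return "n/a"

    if not IB_SYSFS.exists():
        print("no RDMA devices visible -- is the NIC driver / rdma-core loaded?")
    else:
        for dev in sorted(IB_SYSFS.iterdir()):
            for port in sorted((dev / "ports").iterdir()):
                state = read(port / "state")        # e.g. "4: ACTIVE"
                rate = read(port / "rate")          # e.g. "400 Gb/sec"
                link = read(port / "link_layer")    # "Ethernet" -> RoCE, "InfiniBand" -> IB
                print(f"{dev.name} port {port.name}: {state}, {rate}, link_layer={link}")

The link_layer field is the quick way to tell whether a given port is doing RoCE or native IB.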
Just goes to show how drastic and extraordinary these levels of scale can be.
Oh, and make your data centers smaller. Not so big they can be seen on Google Maps. Because otherwise you will be unable to move those whale-sized workloads to an alternative.
https://youtu.be/mDNHK-SzXEM?t=564
https://news.ycombinator.com/item?id=35713001
"Unmasking Google Cloud: How to Determine if a Region Supports Physical Zone Separation" - https://cagataygurturk.medium.com/unmasking-google-cloud-how...
If I check London (where europe-west2 is kinda located) on Google Maps right now, I can easily discern manhole covers or people. If I check Jakarta (asia-southeast2), things smaller than a car get hard to make out, but you can definitely see them.
The scale of cloud data centres reflects the scale of their customer base, not the size of the basket for each individual customer.
Larger data centres actually improve availability through several mechanisms: having more power components such as generators means the failure of any one costs a few percent of capacity instead of causing a total blackout. You can also partition core infrastructure like routers and power rails into more fault domains and update domains.
Some large clouds have two update domains and five fault domains on top of three zones that are more than 10km apart. You can't beat ~30 individual partitions with your own data centres at a reasonable cost!
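Back-of-the-envelope version of that redundancy argument, with made-up availability numbers purely for illustration (Python):

    from math import comb

    def p_at_least(n: int, k: int, p_up: float) -> float:
        """P(at least k of n independent components are up)."""
        return sum(comb(n, i) * p_up**i * (1 - p_up)**(n - i) for i in range(k, n + 1))

    p_up = 0.99  # assumed availability of a single generator -- illustrative only

    # One big generator: any failure is a total blackout.
    print(f"1 generator, need 1:    {p_at_least(1, 1, p_up):.4f}")

    # N+1 style: eleven smaller generators, only ten needed to carry the load.
    print(f"11 generators, need 10: {p_at_least(11, 10, p_up):.4f}")

    # And the blast radius of losing any single one of ~30 partitions:
    print(f"capacity lost per partition: {1/30:.1%}")

The exact numbers don't matter; the point is that many smaller fault domains turn a total outage into a few percent of lost capacity.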
It is true that the nomenclature "AWS Availability Zone" has a different meaning than "GCP Zone" when discussing the physical separation between zones within the same region.
It's unclear why this is inherently a bad thing, as long as the same overall level of reliability is achieved.
In my experience, the set of issues that would affect two buildings close to each other but not two buildings a mile apart is vanishingly small: usually just last-mile fiber cuts or power issues (which are rare and mitigated by having multiple independent providers), plus things like building fires (which are exceedingly rare; we know of perhaps two of notable impact in more than a decade across the big three cloud providers).
Everything else is either done at the zone level no matter what (onsite repair work, rollouts, upgrades, control plane changes, etc.) or can impact an entire region (non-last-mile fiber or power cuts, inclement weather, regional power starvation, etc.).
There is a potential gain from physical zone isolation, but it protects against a relatively small set of issues. Is it really better to invest in that, or to invest the resources in other safety improvements?
It is true, and obvious, that GCP and AWS and Azure use different architectures. It does not obviously follow that any of those architectures are inherently more reliable. And even if it did, it doesn't obviously follow that any of the platforms are inherently more reliable due to a specific architectural decision.
Like, all cloud providers still have regional outages.
> One of the vanishingly small set of issues
At your scale, this attitude is even more concerning since the rare event at scale is not rare anymore.
That concept is useful when the number of things you run is on the same order of magnitude as the inverse of the failure rate. But we clearly don't have that here, because even at scale these events aren't common. Like I said, there have been fewer than a handful across all cloud providers over a decade.
Like, you seem to be proclaiming that these kinds of events are common and, well, no, they aren't. That's why they make the top of HN when they do happen.
Google even recognizes this, and suggests that for disaster recovery planning, you should use multiple regions. AWS on the other hand does acknowledge some use cases for multiple regions (mostly performance or data sovereignty), but maintains the stance that if your only concern is DR, then a single region should be enough for the vast majority of workloads.
There's more to the story though, of course. GCP makes it easier to use multiple regions, including things like dual-region storage buckets, or just making more regions available for use. For example GCP has ~3 times as many regions in the US as AWS does (although each region is comparatively smaller). I'm not sure if there's consensus on which is the "right" way to do it. They both have pros and cons.
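For the dual-region storage bit, the shape of it is roughly this (Python with google-cloud-storage; the project and bucket names are placeholders, and NAM4 is one of GCP's predefined dual-regions -- check the current docs for valid pairings):

    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client(project="my-project")  # placeholder project ID

    # Dual-region bucket: objects are replicated across two specific regions,
    # e.g. NAM4 pairs us-central1 with us-east1.
    bucket = client.create_bucket("my-dr-bucket", location="NAM4")  # placeholder bucket name

    print(bucket.name, bucket.location, bucket.location_type)

You get a single bucket endpoint whose data survives the loss of either region, without building the replication yourself.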
If you use Google Cloud... with the 100 ms of latency that will add to every interaction...
On GCP it sounds like you want a multi-region architecture, not multi-zone (if you want firewalls outside the same data center).
> Resources that live in a zone, such as virtual machine instances or zonal persistent disks, are referred to as zonal resources. Other resources, like static external IP addresses, are regional. Regional resources can be used by any resource in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.
https://cloud.google.com/compute/docs/regions-zones
(No affiliation with Google, just had a similar confusion at one point)
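If it helps make the zone/region split concrete, here's a quick sketch that groups zones by region from gcloud's JSON output (Python; assumes an installed, authenticated gcloud CLI, and "my-project" is a placeholder):

    import json
    import subprocess
    from collections import defaultdict

    out = subprocess.run(
        ["gcloud", "compute", "zones", "list", "--project", "my-project", "--format=json"],
        check=True, capture_output=True, text=True,
    ).stdout

    zones_by_region = defaultdict(list)
    for zone in json.loads(out):
        region = zone["region"].rsplit("/", 1)[-1]  # the region field is a full URL
        zones_by_region[region].append(zone["name"])

    # Zonal resources (VMs, zonal persistent disks) live in exactly one of these
    # zones; regional resources (e.g. static external IPs) span the whole region.
    for region, zones in sorted(zones_by_region.items()):
        print(f"{region}: {', '.join(sorted(zones))}")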
Is that the worst for training? Namely: do superior solutions exist?
Can't wait to see the FreeBSD Netflix version of that post.
This also goes back to how increasing throughput is relatively easy and has a very strong roadmap, while increasing storage is difficult. I notice YouTube has been serving higher-bitrate H.264 video in recent years, rather than storing yet another copy of the files in VP9 or AV1 unless they are 2K+.
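Rough arithmetic on that trade-off; every number below is an assumption for illustration, not anything YouTube has published (Python):

    # Illustrative assumptions only -- not real YouTube figures.
    catalog_hours        = 1_000_000     # hours of 1080p video in the catalog
    h264_mbps            = 8.0           # assumed H.264 1080p bitrate
    vp9_mbps             = 5.0           # assumed VP9 bitrate at similar quality
    hours_watched_per_yr = 50_000_000    # assumed annual watch time for that catalog

    def petabytes(hours: float, mbps: float) -> float:
        """Hours of video at a given Mbit/s, expressed in petabytes."""
        return hours * 3600 * mbps / 8 / 1e9  # Mbit -> MB -> PB

    extra_storage = petabytes(catalog_hours, vp9_mbps)                      # one more copy, kept forever
    egress_saved  = petabytes(hours_watched_per_yr, h264_mbps - vp9_mbps)   # per year, if VP9 is served

    print(f"extra storage for a VP9 copy: {extra_storage:.1f} PB")
    print(f"egress saved per year:        {egress_saved:.1f} PB")

Whether the extra copy pays for itself hinges entirely on how often the catalog actually gets watched, which is presumably why the extra encodes are reserved for the higher resolutions.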
Choose any two.
You absolutely can have speed, scale and reliability. You can't have speed, scale, reliability and low cost.