It's time to replace TCP in the datacenter (2023)

187 points by ilove_banh_mi 6 days ago | 156 comments

wmf 6 days ago |
Previous discussions:
Homa, a transport protocol to replace TCP for low-latency RPC in data centers https://news.ycombinator.com/item?id=28204808
Linux implementation of Homa https://news.ycombinator.com/item?id=28440542
unsnap_biceps 6 days ago |
The original paper was discussed previously at https://news.ycombinator.com/item?id=33401480
parasubvert 6 days ago |
and here, from an LWN analysis. https://news.ycombinator.com/item?id=33538649
ksec 6 days ago |
Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities
https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18....
7e 6 days ago |
TCP was replaced in the data centers of certain FAANG companies years before this paper.
wmf 6 days ago |
If they keep it secret they don't get credit for it.
andrewflnr 6 days ago |
How do you figure? The right decision is the right decision, even if you don't tell people. (granting, for the sake of argument, that it is the right decision)
wmf 6 days ago |
Yeah, you get the benefit of secret tech (in this case faster networking) but people shouldn't give social credit for it because that creates incentives to lie. And, sadly, tech adoption runs entirely on social proof.
andrewflnr 6 days ago |
We're not really trying to allocate social credit here, not as our main goal anyway. We're evaluating raw effectiveness of the tech. So if they made an effective decision, we give them credit for, uh, making an effective decision. You don't have to love them for it.
michaelt 6 days ago |
When you are an outsider it's wise to take such claims with a grain of salt, because as the "secret" made its way to you, recounted by one person to another to another, there might have been an exaggeration, over-simplification or misunderstanding.
It's easy to imagine how, in the hands of tech journalists and youtubers optimising for clicks, "Google likes QUIC" and "Some ML clusters use infiniband" could get distorted into several faangs and the complete elimination of TCP.
andrewflnr 6 days ago |
True as far as it goes, but "they didn't actually do it" is a different story from "they did it secretly". The two claims exclude each other, so you can't really compare them in the context of "credit".
michaelt 6 days ago |
Allow me to rephrase.
I think the statement is not true, in a literal sense. I do not think there are multiple FAANG companies with data centres where TCP has been entirely eliminated.
But the statement is ambiguous enough you could come up with true interpretations, if you diluted it to the point of meaninglessness.
Therefore, without a clear public statement of what is being claimed, it's very difficult for me to be impressed.
andrewflnr 6 days ago |
Very reasonable. It does seem dubious. In that case I just think my comment, itself already responding to a very different concern, was a weird place to raise that question. :)
bushbaba 6 days ago |
*minority of the fangs.
avardaro 6 days ago |
A minority? What large tech company has not prioritized this?
cdchn 6 days ago |
Which have and with what?
albert_e 6 days ago |
Curious ... replaced with what, I would like to know.
parasubvert 6 days ago |
HTTP/3 aka QUIC (UDP).
JoshTriplett 6 days ago |
Or SRD, in AWS: https://aws.amazon.com/blogs/hpc/in-the-search-for-performan...
dradra67 6 days ago |
https://research.google/pubs/snap-a-microkernel-approach-to-...
akira2501 6 days ago |
> If Homa becomes widely deployed, I hypothesize that core congestion will cease to exist as a significant networking problem, as long as the core is not systemically overloaded.
Yep. Sure; but, what happens when it becomes overloaded?
> Homa manages congestion from the receiver, not the sender. [...] but the remaining scheduled packets may only be sent in response to grants from the receiver
I hypothesize it will not be a great day when you do become "systemically" overloaded.
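For readers who want the quoted grant mechanism made concrete, here is a toy sketch of receiver-driven sending. It is not Homa's implementation: the names `Message` and `simulate`, and the sizes used, are assumptions for illustration, and priorities, overcommitment and packet loss are not modelled at all. It only shows the sender stalling until the receiver authorises more bytes.

```python
from dataclasses import dataclass


@dataclass
class Message:
    total: int        # message size in bytes
    sent: int = 0     # bytes the sender has transmitted so far
    granted: int = 0  # bytes the receiver has authorised so far


def simulate(total, unscheduled, grant_size):
    """Return the burst sizes a sender transmits, in order (toy model)."""
    if grant_size <= 0:
        raise ValueError("grant_size must be positive")
    # The first `unscheduled` bytes may be sent without waiting for a grant.
    msg = Message(total=total, granted=min(unscheduled, total))
    bursts = []
    while msg.sent < msg.total:
        if msg.sent == msg.granted:
            # Receiver side: authorise the next chunk (hypothetical policy).
            msg.granted = min(msg.granted + grant_size, msg.total)
        # Sender side: transmit everything that is currently authorised.
        burst = msg.granted - msg.sent
        msg.sent += burst
        bursts.append(burst)
    return bursts


print(simulate(10000, 4000, 2500))  # [4000, 2500, 2500, 1000]
```

In this model an overloaded receiver would simply issue grants more slowly, so the sender stalls instead of pushing more packets into the network.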
andrewflnr 6 days ago |
Will it be a worse day than it would be with TCP? Either way, the only solution is to add more hardware, unless I'm misunderstanding the term "systemically overloaded".
bayindirh 6 days ago |
I think so. If your core saturates, you add more capacity to your core switch. In HOMA, you need to add more receivers, but what if you can't add them because the core can't handle more ports?
Ehrm. Looks like core saturation all over again.
Edit: Just popped to my mind. What prevents maliciously reducing "receive quotas" on compromised receivers to saturate an otherwise capable core? Looks like it's a very low bar for a very high impact DOS attack. Ouch.
klysm 6 days ago |
This is designed for in data center use, so the security tradeoff is probably worth it
bayindirh 6 days ago |
Nope. Having tended a datacenter for close to two decades, I can say that putting people behind NATs and in jails/containers/VMs doesn't prevent security incidents all the time.
With all the bandwidth, processing power and free cooling, a server is always a good target, and sometimes people will come equipped with 0-days or exploits which are very, very fresh.
Been there, seen and cleaned that mess. I mean, reinstallation is 10 minutes, but the event is always ugly.
andrewflnr 6 days ago |
The "receivers" are just other machines in your data center. Their number is determined by the workload, same as always. Adding more will tend to increase traffic.
I'm not a datacenter expert, but is "not enough ports" really a showstopper? It just sounds like bad planning to me. And still not a protocol issue.
bayindirh 6 days ago |
Depends on where you run out of ports, actually.
A datacenter has layers of networking, and some of it is what we call "the core" or "the spine" of the network. Sometimes you need to shuffle things around, and you need to move things closer to core, and you can't because there are no ports. Or you procure the machines, and adding new machines requires some ports closer or at the core, and you are already running at full capacity there.
I mean, it can be dismissed like "planning/skill issue", but these switches are fat machines in terms of bandwidth, and they do not come cheap, or you can't get them during lunch break from the IT shop at the corner when required.
Being able to carry 1.2Tbit/sec of aggregated network from a dozen thin fibers is exciting, but scaling expensive equipment at a whim is not.
At the end of the day "network core" is more monolithic than your outer network layers and server layout. It's more viscous and evolves slowly. This inevitably corners you in some cases, but with careful planning you can postpone that, but not prevent.
andrewflnr 6 days ago |
Ok, that's good to know. But I still don't see how having congestion control be driven by receivers instead of senders makes it harder to fix than it is currently.
I mean, I don't actually see why you would need more ports at all. You still just have a certain number of machines that want to exchange a certain amount of traffic. That number is either above or below what your core can handle (guessing, again, at what the author means by "systemically overloaded").
bewo001 5 days ago |
I don't understand the difference to TCP here. If the path is not congested but the receiving endpoint is, the receiver can control the bandwidth by reducing the window size. Ultimately, it is always the sender that has to react to congestion by reducing the amount of traffic sent.
RPC is something of a red flag as well. RPCs will never behave like local procedure calls, so the abstraction will always leak (the pendulum of popularity keeps swinging back and forth between RPC and special purpose protocols every few years, though).
yesbut 6 days ago |
Another thing not worth investing time into for the rest of our careers. TCP will be around for decades to come.
t-writescode 6 days ago |
True! And chances are, if you're developing website software or video game software, you'll never think about these sorts of things, it'll just be a dumb pipe for you, still.
And that's okay!
But there are other sorts of computer people than website writers and business application devs, and they're some of the people this would be interesting for!
imtringued 6 days ago |
>True! And chances are, if you're developing website software or video game software, you'll never think about these sorts of things, it'll just be a dumb pipe for you, still.
Wrong. I've experienced most of the complaints in the paper when developing multiplayer video games. These days I simply use websockets instead of raw TCP because it is not worth the effort and yet you still have to do manual heartbeats.
UltraSane 6 days ago |
I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter. It is a very robust L3 protocol. It was designed to connect block storage devices to servers while making the OS think they are directly connected. OSs do NOT tolerate dropped data when reading and writing to block devices and so Fibre Channel has an extremely robust Token Bucket algorithm. The algo prevents congestion by allowing receivers to control how much data senders can send. I have worked with a lot of VMware clusters that use FC to connect servers to storage arrays and it has ALWAYS worked perfectly.
Sebb767 6 days ago |
> I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter
But it is often used for block storage in datacenters. Using it for anything else is going to be hard, as it is incompatible with TCP.
The problem with not using TCP is the same thing HOMA will face - anything already speaks TCP, nearly all potential hires know TCP and most problems you have with TCP have been solved by smart engineers already. Hardware is also easily available. Once you drop all those advantages, either your scale or your gains need to be massive to make that investment worth it, which is why TCP replacements are so rare outside of FAANG.
ksec 6 days ago |
I wonder if there is any work on making something similar (conceptually) to TCP, a superset or subset of TCP, while offering 50-80% of the benefits of HOMA.
I guess I am old. Every time I see new tech that wants to be hyped, that completely throws out everything that is widely supported and working for 80-90% of use cases, and that is not battle tested and may be conceptually complex, I simply pass.
Sebb767 6 days ago |
If you have a sufficiently stable network and/or known failure cases, you can already tune TCP quite a bit with nodelay, large congestion windows etc. There's also QUIC, which basically is a modern implementation of TCP on top of UDP (with some trade-offs chosen with HTTP in mind). Once you stray too far, you'll lose the ability to use off-the-shelf hardware, though, at which point you'll quickly hit the point of diminishing returns - especially when simply upgrading the speed of the network hardware is usually a cheap alternative.
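As a rough illustration of the per-socket part of that tuning, the sketch below sets `TCP_NODELAY` and asks for a larger receive buffer through the standard sockets API. The 1 MiB buffer size is an arbitrary assumption, and the kernel may round or cap it. The initial congestion window is usually a route or system level setting, not a socket option, so it is not shown here.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm so small writes are sent immediately.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

# Request a larger receive buffer (assumed value; the kernel decides the final size).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print("TCP_NODELAY enabled:", bool(nodelay))
print("receive buffer bytes:", rcvbuf)
sock.close()
```

Both options can be set before the socket is connected, which is why no server is needed to try this.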
mikepurvis 6 days ago |
QUIC feels very pragmatic in terms of being built on UDP. As a lay person I don’t have a sense what additional gains might be on the table if the UDP layer were also up for reconsideration.
soneil 6 days ago |
UDP has very low cost, the header is pretty much source and dest ports. For this low, low price, you get compatibility with existing routing, firewalling, NAT, etc.
mananaysiempre 6 days ago |
One issue with QUIC in e.g. C is how heavyweight it feels to use compared to garden-variety BSD sockets (and that’s already not the most ergonomic of APIs). I haven’t encountered a QUIC library that didn’t feel like it would absolutely dominate a simple application in both code size and API-design pressure. Of course, for large applications that’s less relevant, but the result is that switching to QUIC gets perceived as a Big Deal, a step for when you’re doing Serious Stuff. That’s not ideal.
I’d love to just play with QUIC a bit because it’s pretty neat, but I always get distracted by this problem and end up reading the RFCs, which so far I haven’t had the patience to get through.
jbverschoor 6 days ago |
But you might not need TCP. For example, using file-sockets between an app, db, and http server (rails+pgsql+nginx for example) has many benefits. The beauty of OSI layers.
oneplane 6 days ago |
That would work on a single host, but the context of the datacenter probably assumes multihost/manyhost workloads.
wutwutwat 6 days ago |
Unix sockets can use tcp, udp, or be a raw stream
https://en.wikipedia.org/wiki/Unix_domain_socket#:~:text=The....
Puma creates a `UnixServer` which is a ruby stdlib class, using the defaults, which is extending `UnixSocket` which is also using the defaults
https://github.com/puma/puma/blob/fba741b91780224a1db1c45664...
Those defaults are creating a socket of type `SOCK_STREAM`, which is a tcp socket
> SOCK_STREAM will create a stream socket. A stream socket provides a reliable, bidirectional, and connection-oriented communication channel between two processes. Data are carried using the Transmission Control Protocol (TCP).
https://github.com/ruby/ruby/blob/5124f9ac7513eb590c37717337...
You still have the tcp overhead when using a local unix socket with puma, but you do not have any network overhead.
FaceValuable 6 days ago |
Hey! I know it’s muddled but that’s not quite correct. SOCK_STREAM is more general than TCP; SOCK_STREAM just means the socket is a byte stream. You would need to add IPPROTO_TCP on top of that to pull in the TCP stack.
UDS using SOCK_STREAM does not do that; ie, it is not using IPPROTO_TCP.
shawn_w 6 days ago |
Unix domain stream sockets do not use tcp. Nor do unix datagram sockets use udp. They're much simpler.
wutwutwat 6 days ago |
>The type parameter should be one of two common socket types: stream or datagram.[10] A third socket type is available for experimental design: raw.
> SOCK_STREAM will create a stream socket. A stream socket provides a reliable, bidirectional, and connection-oriented communication channel between two processes. Data are carried using the Transmission Control Protocol (TCP).
> SOCK_DGRAM will create a datagram socket.[b] A Datagram socket does not guarantee reliability and is connectionless. As a result, the transmission is faster. Data are carried using the User Datagram Protocol (UDP).
> SOCK_RAW will create an Internet Protocol (IP) datagram socket. A Raw socket skips the TCP/UDP transport layer and sends the packets directly to the network layer.
I don't claim to be an expert, I just have a certain confidence that I'm able to comprehend words I read. It seems you can have 3 types of sockets, raw, udp, or tcp.
https://en.wikipedia.org/wiki/Unix_domain_socket
FaceValuable 6 days ago |
Interesting! The Wikipedia article is quite wrong here. SOCK_STREAM certainly doesn’t imply TCP in all cases. I see the source is the Linux Programming Interface book; quite likely someone’s interpretation of that chapter was just wrong when they wrote this article. It is a subtle topic.
shawn_w 6 days ago |
Not the first time a Wikipedia article has been wrong. That one seems to be talking about IP sockets and local ones at the same time instead of focusing on local ones. Could definitely stand to be rewritten.
rcxdude 4 days ago |
Yeah, someone's gotten confused. SOCK_DGRAM and SOCK_STREAM imply TCP and UDP when using AF_INET sockets, but not when using AF_UNIX sockets, though unix domain sockets do often somewhat do an impression of a TCP socket (e.g. they will report a connection state in the same way as TCP). Reads and writes to unix domain sockets essentially amount to the kernel copying memory between different processes (interestingly, linux will also do the same for local TCP/UDP connections as well, as an optimisation. So the data never actually gets formatted into separate packets). This also accounts for some of the things you can do with unix sockets you can't do with a network protocol, like pass permissions and file descriptors across them.
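A quick way to check this yourself, sketched in Python: create an `AF_UNIX` stream socket pair, confirm it carries bytes reliably, then try to set a TCP-level option on it. On Linux the `setsockopt` call is expected to fail because there is no TCP stack behind the socket. The exact error may differ on other platforms, so treat that part as an assumption.

```python
import socket

# A connected pair of Unix domain stream sockets.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
family = a.family

# It behaves like a reliable byte stream...
a.sendall(b"ping")
reply = b.recv(4)

# ...but TCP options do not apply, because TCP is not involved.
try:
    a.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    tcp_option_supported = True
except OSError:
    tcp_option_supported = False

a.close()
b.close()

print(family, reply, tcp_option_supported)
```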
kjs3 6 days ago |
How do you think 'file-sockets' are implemented over a network?
jbverschoor 5 days ago |
They don’t have to use TCP. The point was to use sockets as the abstraction layer and use another inter-connect instead of TCP/IP. That way you’ve easily replaced TCP in the datacenter without major changes to many applications
kjs3 5 days ago |
Oh...your argument is replacing TCP is the easy part. Gotcha. Sure.
YZF 6 days ago |
Are you suggesting some protocol layer of Fibre Channel to be used over IP over Ethernet?
TCP (in practice) runs on top of (mostly) routed IP networks and network architectures. E.g. a spine/leaf network with BGP. Fibre Channel as I understand it is mostly used in more or less point to point connections? I do see some mention of "Switched Fabric" but is that very common?
UltraSane 6 days ago |
Fibre Channel is a routed L3 protocol that can support loop-free multi-path topologies.
YZF 5 days ago |
I'll admit I'm not familiar with the routing protocols used for Fibre Channel. Is there some equivalent of BGP? How well does it scale? What vendors sell FC switches and what's the cost compared to Ethernet/IP/TCP?
UltraSane 5 days ago |
FC uses Fabric Shortest Path First, which is a lot like OSPF. It can scale to 2^64 ports. There is no FC equivalent of BGP. Broadcom, Cisco, HP, Lenovo, IBM sell FC switches, but some of them are probably rebadged Broadcom switches. The worst thing about FC is that the switches are licensed per port so you might not be able to use all the ports on the device. A Brocade G720 with 24 32Gb usable ports is $28,000 on CDW. It has 64 physical ports. A 24 port license is $31,000 on CDW. So it is REALLY freaking expensive. But for servers a company can't make money without, it is absolutely worth it. One place I worked had an old EMC FC SAN with 8 years of 100% uptime.
wejick 6 days ago |
I'm imagining having a shared memory mounted as block storages then do the RPC thru this block. Some synchronization and polling/notifications work will need to be done.
creshal 6 days ago |
The literal version of this is used by sanlock et al. to implement cluster-wide locks. But the whole "pretending to be block devices" overhead makes it not exactly ideal for anything else.
Drop the "pretending to be block devices" part and you basically end up with InfiniBand. It works well, if you ignore the small problem of "you need to either reimplement all your software to work with RDMA based IPC, or reimplement Ethernet on top of InfiniBand to remain compatible and throw away most advantages of InfiniBand again".
fmajid 6 days ago |
That’s essentially what RDMA is, except it is usually run over Infiniband although hyperscalers are wary of Nvidia’s control over the technology and looking for cheaper Ethernet-based alternatives.
https://blogs.nvidia.com/blog/what-is-rdma/
https://dl.acm.org/doi/abs/10.1145/3651890.3672233
hylaride 6 days ago |
If it's a secure internal network, RDMA is probably what you want if you need low-latency data transfer. You can do some very performance-oriented things with it and it works over ethernet or infiniband (the quality of switching gear and network cards matters, though).
Back in ~2012 I was setting up a high-frequency network for a forex company and at the time we deployed Mellanox. They had some very (at the time) bleeding edge networking drivers that significantly reduced the overhead of writing to TCP/IP sockets (particularly zero-copy, which TL;DR meant data didn't get shifted around in memory as much and was written to the ethernet card's buffers almost straight away), and that made a huge difference.
I eventually left the firm and my successors tried to replace it with cisco gear and Intel NICs and the performance plummeted. That made me laugh as I received so much grief pushing for the Mellanox kit (to be fair, they were a scrappy unheard of Israeli company at the time).
slt2021 6 days ago |
my take is that within-datacenter traffic is best served by Ethernet.
Anything on top of Ethernet, and we no longer know where this host is located (because of software defined networking). Could be next rack server, or could be something in the cloud, could be third party service.
And that's a feature, not a bug: because everything speaks TCP: we can arbitrarily cut and slice network just by changing packet forwarding rules. We can partition network however we want.
We could have a single global IP space shared by cloud, dc, campus networks, or could have Christmas Tree of NATs.
as soon as you introduce something other than TCP to the mix, now you will have gateways: chokepoints where traffic will have to be translated TCP<->Homa and I don't want to be a person troubleshooting a problem at the intersection of TCP and Homa.
in my opinion, the lowest level Ethernet should try its best to mirror the actual physical signal flow. Anything on top becomes a software-defined network
mafuy 6 days ago |
In data centers/HPC, you need to know which data is flowing where and then you design the hardware around that. Not the other way around. What you describe is a lower requirement level that is much easier to handle.
marcosdumay 6 days ago |
That may be true for HPC, but "datacenter" is a generic name that applies to all kinds of structures.
gsich 6 days ago |
>Anything on top of Ethernet, and we no longer know where this host is located (because of software defined networking). Could be next rack server, or could be something in the cloud, could be third party service.
Ping it and you can at least deduce where it's not.
KaiserPro 6 days ago |
FC was much more expensive than ethernet, so needed a reason to be used.
For block storage it is great, if slower than ethernet.
UltraSane 5 days ago |
Is there a fundamental reason for it being more expensive than Ethernet or is it just greed?
KaiserPro 5 days ago |
At the time it was rarer and required more specialised hardware.
FC was one of the first non-mainframe specific storage area network fabrics. One of the key differences between FC and ethernet is the collision detection/avoidance and guaranteed throughput. All that extra coordination takes effort on big switches, so it cost more to develop/produce.
You could shovel data through it at "line speed" and know that it'll get there, or you'll get an error at the fabric level. Ethernet happily takes packets, and if it can't deliver, then it'll just drop them. (well, kinda, not all ethernet does that, because ethernet is the english language of layer 2 protocols)
holowoodman 6 days ago |
Fibrechannel is far too expensive, you need expensive switches, cables/transceivers and cards in addition to the Ethernet you'll need anyways. And this Fibrechannel hardware is quite limited in what you can do with it, by far not as capable as the usual Ethernet/IP stuff with regards to routing, encryption, tunneling, filtering and what not.
Similar things are happening with stuff like Infiniband, it has become far too expensive and Ethernet/RoCE is making inroads in lower- to medium-end installations. Availability is also an issue, Nvidia is the only Infiniband vendor left.
bluGill 6 days ago |
there is ip over fiber channel. no need for separate ethernet. At least in theory; in practice I'm not sure if anyone implemented enough parts to make it useful, but the spec exists.
UltraSane 5 days ago |
No. When I heard about Cisco having FC over Ethernet for their UCS servers I was grossed out because of how Ethernet is a L2 protocol that can't handle multi-path without ugly hacks like Virtual Port Channel and discovered that there is no real support for IP over Fiber Channel. There is a wikipedia page for IPFC but it seems to be completely dead.
https://en.wikipedia.org/wiki/IPFC
UltraSane 5 days ago |
Is there a fundamental reason why FC is more expensive than Ethernet?
convolvatron 5 days ago |
since its entire reason to exist was to effect an artificial market segmentation, I guess the answer is .. yes?
UltraSane 5 days ago |
It isn't an artificial market segmentation. Fibre Channel is a no compromise technology with a single purpose, to connect servers to remote storage with performance and reliability close to directly attached storage, and it does that really, REALLY well. It is by far the single most bullet proof technology I have ever used. In a parallel universe where FC won over ethernet and every Ethernet port in the world was an FC port I don't see why it would be any more expensive than ethernet.
markhahn 6 days ago |
the question is really: does it have anything vendor-specific, interop-breakers?
FC seems to work nicely in a single-vendor stack, or at least among specific sets of big-name vendors. that's OK for the "enterprise" market, where prices are expected to be high, and where some integrator is getting a handsome profit for making sure the vendors match.
besides consumer, the original non-enterprise market was HPC, and we want no vendor lock-in. hyperscale is really just HPC-for-VM-hosting - more or less culturally compatible.
besides these vendor/price/interop reasons, FC has never done a good job of keeping up. 100/200/400/800 Gb is getting to be common, and is FC there?
resolving congestion is not unique to FC. even IB has more, probably better solutions, but these days, datacenter ethernet is pretty powerful.
UltraSane 5 days ago |
FC speeds have really lagged. 64Gbps is available but not widely used and 128Gbps was introduced in 2023. But since by definition 100% of FC bandwidth can only be used for storage it has been adequate.
https://fibrechannel.org/preview-the-new-fibre-channel-speed...
tonetegeatinst 6 days ago |
Fiber is attractive. As someone who wants to upgrade to fiber, the main barrier to entry is the cost of switches and a router.
Granted I'm also trying to find a switch that supports RoCE and RDMA. Not easy to find a high bandwidth switch that supports this stuff without breaking the bank.
jabl 6 days ago |
Fiber, as in Fibre Channel (FC, https://en.wikipedia.org/wiki/Fibre_Channel ), not fiber as in "optical fiber" instead of copper cabling.
_zoltan_ 6 days ago |
SN2100/2700 from eBay?
UltraSane 5 days ago |
Fibre Channel is a routed protocol invented specifically to connect block storage arrays to remote servers while making the block storage look locally attached to the OS. And it works REALLY REALLY well.
Sylamore 6 days ago |
InfiniBand would make more sense than Fibre Channel
UltraSane 5 days ago |
Maybe but InfiniBand is really expensive and InfiniBand switches are in short supply. And it is an L2 protocol while FC is L3
bmitc 6 days ago |
Unrelated to this article, are there any reasons to use TCP/IP over WebSockets? The latter is such a clean, message-based interface that I don't see a reason to use TCP/IP.
tacitusarc 6 days ago |
Websockets is a layer on top of TCP/IP.
bmitc 6 days ago |
Yes, I know that WebSockets layer over TCP/IP. But that both misses the point and is part of the point. The reason that I ask is that WebSockets seem to almost always be used in the context of web applications. TCP/IP still seems to dominate control communications between hardware. But why not WebSockets? Almost everyone ends up building a message framing protocol on top of TCP/IP, so why not just use WebSockets which has bi-directional message framing built-in? I'm just not seeing why WebSockets aren't as ubiquitous as TCP/IP and only seem to be relegated to web applications.
j16sdiz 6 days ago |
WebSocket is a fairly inefficient protocol, it needs to deal with the upgrade from HTTP, and you still need to implement your app specific protocol. This is adding complexity without additional benefit.
It makes sense only if you have a WebSocket based stack and don't want to maintain a second protocol.
imtringued 6 days ago |
You can easily build a JSON based RPC protocol in a few minutes using WebSockets and be done. With raw TCP you're going to be spending a week doing something millions of other developers have done again and again in your own custom bespoke way that nobody else will understand.
Your second point is very dismissive. You're inserting random application requirements that the vast majority of application developers don't care about and then you claim that only in this situation do WebSockets make sense when in reality the vast majority of developers only use WebSockets and your suggestion involves the second unwanted protocol (e.g. the horror that is protobuffers and gRPC).
bmitc 6 days ago |
> WebSocket is fairly inefficient protocol
In what way?
> This is adding complexity without additional benefit
I'm not sure that's a given. For example, WebSockets actually implement a message protocol. You have guarantees that you sent or received the whole message. That may not be the case for TCP/IP, which is a byte streaming protocol.
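To make the framing point concrete, here is a minimal sketch of the length-prefix framing people typically end up writing on top of a byte stream. The helper names `send_msg`, `recv_exact` and `recv_msg` and the 4-byte big-endian header are assumptions for illustration. This is not the WebSocket wire format, just the same idea in its simplest form.

```python
import socket
import struct


def send_msg(sock, payload):
    # Prefix each message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    # A stream socket may return fewer bytes than requested, so loop.
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Demonstrate on a local stream socket pair: two messages sent back to back
# come out as two separate messages, not one merged blob of bytes.
left, right = socket.socketpair()
send_msg(left, b"hello")
send_msg(left, b"world!")
first = recv_msg(right)
second = recv_msg(right)
left.close()
right.close()

print(first, second)  # b'hello' b'world!'
```

Without the length header the receiver would have no way to tell where "hello" ends and "world!" begins.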
wmf 6 days ago |
Interesting point. For example, Web apps cannot speak BitTorrent (because Web apps are not allowed to use TCP) but they can speak WebTorrent over WebRTC and native apps can also speak WebTorrent. So in some sense a protocol that runs over WebSockets/WebRTC is superior because Web apps and native apps can speak it.
dataviz1000 6 days ago |
There isn't much of a difference between a router between two machines physically next to each other and a router in Kansas connecting a machine in California with a machine in Miami. The packets of data are wrapped with an address of where they are going in the header.
WebSockets are long-lived socket connections designed specifically for use on the 'web'. TCP is data sent wrapped in packets with ordering and guaranteed delivery. This causes a massive overhead cost. This is different from UDP which doesn't guarantee order and delivery. However, a packet sent over UDP might arrive tomorrow after it goes around the world a few times.
With fetch() or XMLHttpRequest, the client has to use energy and time to open a new HTTP connection while a WebSocket opens a long-lived connection. When sending lots of bidirectional messages it makes sense to have a WebSocket. However, a simple fetch() request is easier to develop. A developer needs a good reason to use the more complicated WebSocket.
Regardless, they both send messages using TCP which ensures the order of packets and guaranteed delivery which features have a lot to do with why TCP is the first choice.
There is UDP which is used by WebRTC which is good for information like voice or video which can have missing and unordered packets occasionally.
If two different processes on the same machine want to communicate, they can use a Unix socket. A Unix socket creates a special file (socket file) in the filesystem, which both processes can use to establish a connection and exchange data directly through the socket, not by reading and writing to the file itself. But the Unix Socket doesn't have to deal with routing data packets.
(ChatGPT says "Overall, you have a solid grasp of the concepts, and your statements are largely accurate with some minor clarifications needed.")
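A small sketch of the Unix socket point above, assuming a Unix-like OS: the server binds to a path, which creates a socket file, and the client connects through that path. The file name `demo.sock` is an arbitrary choice for illustration. The data itself moves through the kernel, not through the file's contents.

```python
import os
import socket
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.sock")  # hypothetical socket path

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)   # creates the socket file on the filesystem
    server.listen(1)

    # The path exists and is a socket, not a regular file.
    is_socket_file = stat.S_ISSOCK(os.stat(path).st_mode)

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(path)        # the path is only the rendezvous point
    conn, _ = server.accept()

    client.sendall(b"hello")
    received = conn.recv(5)

    for s in (conn, client, server):
        s.close()

print(is_socket_file, received)
```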
tkin1980 6 days ago |
Well, Websocket is over TCP, so you already need it for that.
znpy 6 days ago |
> Unrelated to this article, are there any reasons to use TCP/IP over WebSockets?
Performance. TCP over TCP is pretty bad.
OpenVPN can do that (tcp-based vpn session over a tcp connection) and the documentation strongly advises against that.
slt2021 6 days ago |
the problem with trying to replace TCP only inside the DC is that TCP will still be used outside the DC.
Networking Engineering is already convoluted and troublesome as it is right now, using only tcp stack.
When you start using homa inside, but TCP from outside things will break, because a lot of DC requests are created as a response for an inbound request from outside DC (like a client trying to send RPC request).
I cannot imagine trying to troubleshoot hybrid problems at the intersection of tcp and homa, its gonna be a nightmare.
Plus I don't understand why you would create a new L4 transport protocol for a specific L7 application (RPC). This seems like a suboptimal choice, because RPC of today could be replaced with something completely different, like RDMA over Ethernet for AI workloads or transfer of large streams like training data/AI model state.
I think tuning TCP stack in the kernel, adding more configuration knobs for TCP, switching from stream(tcp) to packet (udp) protocols where it is warranted, will give more incremental benefits.
One major thing author missed is security applications, these are considered table stakes: 1. encryption in transit: handshake/negotiation 2. ability to intercept and do traffic inspection for enterprise security purposes 3. resistance to attacks like flood 4. security of sockets in containerized Linux environment
nicman23 6 days ago |
The only case where Homa makes sense is when there is no external TCP to the peers, or at least not in the same context, i.e. for RoCE.
slt2021 6 days ago |
1. add software defined network, where transport and signaling is done by vendor-specific underlay, possibly across multiple redundant uplinks
2. term "external" is really vague as modern networks have blended boundaries. Things like availability zone, region make dc-dc connection irrelevant, because at any point of time you will be required to failover to another AZ/DC/region.
3. when I think of inter-Datacenter, I can only think of Ethernet. That's really it. Even in Ethernet, what you think of a peer and existing in your same subnet, could be a different DC, again due to software-defined network.
jayd16 6 days ago |
Are you imagining external TCP traffic will be translated at the load balancer or are you actually worried that requests out of an API Gateway need to be identical to what goes in?
I could see the former being an issue (if that's even implied by "inside the data center") and I just don't see how it's a problem for the latter.
slt2021 6 days ago |
A typical software L7 load balancer (like nginx) will parse the entire TCP stream and the HTTP headers, and apply a bunch of logic based on the URL and various HTTP headers.
There is a lot of work going on in userland: filling up the TCP buffer, parsing the HTTP stream, applying a bunch of business logic, creating a downstream connection, sending data, getting the response, etc.
Because of all that userland work, a default nginx config allows something like 1024 concurrent connections per core, so not a lot.
An L4 load balancer, on the other hand, works purely in packet-switching mode or NAT mode. The work consists of just rewriting header fields (src.ip, src.port, dst.ip, dst.port, proto), and it can use various frameworks like Intel vectorized packet processing or Intel DPDK for accelerated packet switching.
Because of that, an L4 load balancer can perform very close to line rate, meaning it can load-balance connections as fast as packets arrive at the network interface card. Line rate is the theoretical maximum of packet processing.
In the case of stateless L4 load balancing there is no upper bound on the number of concurrent sessions to balance; it will be almost as fast as the core router that feeds it the data.
As you can see, L4 is clearly superior in performance, but the reason an L4 LB is possible is that it has TCP inbound and TCP outbound, so the only work required is to rewrite the headers and recalculate the checksums.
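As a rough illustration of how little work that header rewrite is, here is a sketch in Python of swapping the destination address in a 20-byte IPv4 header and recomputing the header checksum (the RFC 1071 one's-complement sum). The helper names and the sample header are made up for illustration, and a real L4 balancer would also have to fix up the TCP/UDP checksum, which covers the addresses via the pseudo-header; that part is left out here.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_dst_ip(header: bytes, new_dst: bytes) -> bytes:
    """Copy a 20-byte IPv4 header with a new destination and a fresh checksum."""
    hdr = bytearray(header)
    hdr[16:20] = new_dst      # bytes 16..19 hold the destination address
    hdr[10:12] = b"\x00\x00"  # checksum field must be zero while summing
    hdr[10:12] = struct.pack("!H", internet_checksum(bytes(hdr)))
    return bytes(hdr)


# Sample header: 192.168.0.1 -> 192.168.0.199, protocol UDP, checksum 0xb861.
SAMPLE = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")
```

A header with a valid checksum sums to zero, so `internet_checksum(rewrite_dst_ip(SAMPLE, bytes([10, 0, 0, 5])))` is a quick self-check. No stream reassembly or buffering is involved, which is the contrast with the L7 case.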
With Homa, you would need to fully process TCP stream, before you initiate Homa connection, meaning you will waste a lot of RAM on keeping TCP buffers and rebuilding the stream according to the TCP sequence. Homa will lose all its benefits in the load balancing scenario.
Author pitches only one use case for homa: East-West traffic, but again - these days the software is really agnostic of this East-West direction. What your software thinks is running in the server in next rack, could as well be a server in a different Availability Zone or read replica in different geo region.
And that's the beauty of modern infra: everything is software, everything is ephemeral, and we don't really care if we are running this in a single DC or multiple DCs.
Because of that, I think we will still stick to TCP as a proven protocol that will seamlessly interop when crossing different WAN/LAN/VPN networks
I am not even talking about software-defined networks like SD-WAN, where transport & signaling is done by the vendor-specific underlay network, and the overlay network is really just an abstraction for users that hides a lot of network discovery and network management underneath.
runlaszlorun 6 days ago |
For those who might not have noticed, the author is John Ousterhout- best known for TCL/Tk as well as the Raft consensus protocol among others.
signa11 6 days ago |
and more recently (?) the book : “a philosophy of software design”, highly recommended !
stiray 6 days ago |
How long did we need to support ipv6? Is it supported yet and more widely in use than the ipv4, like in mobile networks where everything is stashed behind NAT and ipv4 kept?
Another protocol, something completely new? Good luck with that, i would rather bet on global warming to put us out of our misery (/s)...
https://imgs.xkcd.com/comics/standards.png
detaro 6 days ago |
Mobile networks especially are widely IPv6, with IPv4 being translated/tunneled where still needed. (End-user connections in general skew IPv6 in many places - it's observable how traffic patterns shift with people being at work vs at home. Corporate networks without IPv6 leading to more IPv4 traffic during the day, in the evening IPv6 from consumer connections takes over)
stiray 6 days ago |
Android: Settings -> About (just checked mine, 10...*), check your IP. We have 3 providers in our country, all 3 are using ipv4 "lan" for phone connectivity, behind NAT and I am observing this situation around most of EU (Germany, Austria, Portugal, Italy, Spain, France, various providers).
freetanga 6 days ago |
So, back to the mainframe and SNA in the data centers?
wmf 6 days ago |
If Rosenblum can get an award for rediscovering mainframe virtualization, why not give Ousterhout an award for rediscovering SNA?
(SNA was before my time so I have no idea if Homa is similar or not.)
parasubvert 6 days ago |
This has already been done at scale with HTTP/3 (QUIC), it's just not widely distributed beyond the largest sites & most popular web browsers. gRPC for example is still on multiplexed TCP via HTTP/2, which is "good enough" for many.
Though it doesn't really replace TCP, it's just that the predominant requirements have changed (as Ousterhout points out). Bruce Davie has a series of articles on this: https://systemsapproach.substack.com/p/quic-is-not-a-tcp-rep...
Also see Ivan Pepelnjak's commentary (he disagrees with Ousterhout): https://blog.ipspace.net/2023/01/data-center-tcp-replacement...
jpgvm 6 days ago |
Plus in a modern DC you can trivially convert it to be essentially lossless with DSCP/PFC and ECN, which both work perfectly with any UDP-based protocol (and is why NVMeOF, FCoE and RoCEv2 all work so well today).
ECN isn't a necessity unless you need a truly lossless network, but the rest should get you pretty far as long as you are reasonably careful about communication patterns and the blocking ratio of spine/core.
For anything that really needs the lowest possible latency at the cost of all other considerations there is still always Infiniband.
wbl 6 days ago |
QUIC is not trying to solve the same problem as Ousterhout is. End-user networks are very different from datacenter networks.
parasubvert 5 days ago |
How so? The same RPC-oriented L7 protocols are largely in use, just with a lot more east/west communications.
aseipp 5 days ago |
QUIC and Homa are not remotely similar and have completely different design constraints. I have no idea why people keep bringing up QUIC in this thread other than "It's another thing that isn't TCP." Yes, many things are not-TCP. The details are what matter.
parasubvert 5 days ago |
Not remotely similar? Both are RPC (request/response) optimized and are focused on removing head of line blocking and strict need for FIFO message ordering in favour of multiplexing.
QUIC is more focused on global web applications, but most datacentres also leverage the web protocols (REST on HTTP 1.1 or HTTP/2, gRPC on HTTP/2) for their inter-process communication, just with a lot more east-west traffic (arguably 10x for every 1x N/S flow). There's also a fair number of app-specific messaging stacks (usually L7 over TCP) like Kafka, NATS or AMQP which have their own L7 facilities for dealing with TCP drawbacks that might benefit from a retrofit like Homa, but it's not clear if it's worth the effort.
They are design approaches for solving similar requirements. Yes, homa deals with other things (makes ECMP load balancing easier) but also has blindspots on datacenter requirements like security: a lot of data centre traffic requires hop by hop TLS for authentication, integrity and privacy, QUIC explicitly focuses on improving latency of TLS handshakes.
dveeden2 6 days ago |
Wasn't something like HOMA already tried with SCTP?
iforgotpassword 6 days ago |
And QUIC. And that thing tesla presented recently, with custom silicon even.
And as usual, hardware gets faster, better and cheaper over the next years and suddenly the problem isn't a problem anymore - if it even ever was for the vast majority of applications. We only recently got a new fleet of compute nodes with 100gbit NICs. The previous one only had 10, plus omnipath. We're going ethernet only this time.
I remember when saturating 10gbit/s was a challenge. This time around, reaching line speed with tcp, the server didn't even break a sweat. No jumbo frames, no fiddling with tunables. And that actually was while testing with 4 years old xeon boxes, not even the final hw.
Again, I can see how there are use cases that benefit from even lower latency, but thats a niche compared to all DC business, and I'd assume you might just want rdma in that case, instead of optimizing on top of ethernet or IP.
silisili 6 days ago |
This is a solid answer, as someone on the ground. TCP is not the bogeyman people point it out to be. It's the poison apple where some folks are looking for low hanging fruit.
GoblinSlayer 6 days ago |
> For many years, RDMA NICs could cache the state for only a few hundred connections; if the number of active connections exceeded the cache size, information had to be shuffled between host memory and the NIC, with a considerable loss in performance.
A massively parallel task? Sounds like something doable with GPGPU.
mafuy 6 days ago |
This has nothing to do with computing, it is about memory access.
dboreham 6 days ago |
Now the information has to be shuffled between the NIC and host memory and the GPU.
GoblinSlayer 5 days ago |
Just plug ethernet cable into the graphics card, then you need to shuffle memory between GPGPU and the wire.
Woodi 6 days ago |
You want to replace TCP because it is bad? Then give a better "connected" protocol over raw IP and other raw network topologies. Use it. Done.
Don't mess with another IP -> UDP -> something
tsimionescu 6 days ago |
Within a data center? Maybe. On the Internet? You try offering a consumer service over SCTP first, and that is decades old by this point.
indolering 6 days ago |
So token ring?
pif 6 days ago |
> Although Homa is not API-compatible with TCP,
IPv6 anyone? People must start to understand that "Because this is the way it is" is a valid, actually extremely valid, answer to any question like "Why don't we just switch technology A with technology B?"
Despite all the shortcomings of the old technology, and the advantages of the new one, inertia _is_ a factor, and you must accept that most users will simply refuse even to acknowledge the problem you want to describe.
For your solution to get any traction, it must deliver value right now, in the current ecosystem. Otherwise, it's doomed to fail by being ignored over and over.
bamboozled 6 days ago |
Also there needs to be a big push to educate people on <new thing>. I know TCP very well, it would need quite a lot of incentive for me to drop that for something I don't yet understand as well. TCP was highly beneficial which is why we all adopted it in the first place, whatever is to replace it needs to be at least that beneficial...which will be a tall order.
nine_k 6 days ago |
I'd say that usually it's not about the balance of advantages but the balance of pain. You go through the pains of switching to a new and unfamiliar solution if your current solution is giving you even more pain.
If you don't feel much pain, you can and should stay with your current solution. If it's not broken, or not broken badly enough, don't fix it by radical surgery.
stonemetal12 6 days ago |
> inertia _is_ a factor
why would inertia be a factor? If I want to support protocol ABC in my data center, then I buy hardware that supports protocol ABC, including the ability to down shift to TCP when data leaves the data center. We aren't talking about the internet at large so there is no need to coordinate support with different organizations with different needs.
Google, could mandate you can't buy a router or firewall that doesn't support IPv6. Then their entire datacenter would be IPv6 internally. The only time to convert to IPv4, would be if the local ISP doesn't support v6.
ironhaven 6 days ago |
But tcp/ipv6 is API compatible with tcp/ipv4? You can even accept IPv4 connections on an IPv6 listening socket if you have an IPv4 address assigned. The issue is IPv4 is not forward binary compatible with IPv6 because you can't fit more than 2^32 addresses in an IPv4 packet.
But yeah if you are a large bloated enterprise like amazon or microsoft that owns large amounts of ipv4 address space and expensive ipv4 routing equipment there is not a ton of value in switching
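The dual-stack behaviour mentioned above (IPv4 clients landing on an IPv6 listening socket) can be sketched in Python as follows. This is a loopback-only illustration under the assumption that the platform supports dual-stack sockets; the function name is made up, and it returns `None` instead of failing when dual-stack is unavailable.

```python
import socket
from typing import Optional


def ipv4_peer_seen_by_ipv6_listener() -> Optional[str]:
    """Connect over IPv4 to an IPv6 listener; return the peer address seen."""
    if not socket.has_dualstack_ipv6():
        return None  # platform cannot do dual-stack; nothing to demonstrate
    # dualstack_ipv6=True clears IPV6_V6ONLY, so the one socket serves both.
    with socket.create_server(
        ("::", 0), family=socket.AF_INET6, dualstack_ipv6=True
    ) as srv:
        srv.settimeout(5)
        port = srv.getsockname()[1]
        # Plain IPv4 client talking to the IPv6 listening socket.
        with socket.create_connection(("127.0.0.1", port), timeout=5):
            conn, peer = srv.accept()
            conn.close()
            return peer[0]
```

On Linux the returned address is typically the IPv4-mapped form `::ffff:127.0.0.1`, which is how the IPv6 socket API represents an IPv4 peer.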
kmeisthax 6 days ago |
Dumb question: why was it decided to only provide an unreliable datagram protocol in standard IP transit?
michaelt 6 days ago |
Because when you're sending a signal down a wire or through the air, fundamentally the communication medium only provides "Send it, maybe it arrives"
At any time, the receiver could lose power. Or a burst of interference could disrupt the radio link. Or a backhoe could slice through the cable. Or many other things.
IP merely reflects this physical reality.
kmeisthax 5 days ago |
Ok, but why does TCP exist, then? If we could make streams reliable in the late 1970s why didn't we apply that to datagrams as well?
mhandley 6 days ago |
It's already happening. For the more demanding workloads such as AI training, RDMA has been the norm for a while, either over Infiniband or Ethernet, with Ethernet gaining ground more recently. RoCE is pretty flawed though for reasons Ousterhout mentions, plus others, so a lot of work has been happening on new protocols to be implemented in hardware in next-gen high performance NICs.
The Ultra Ethernet Transport specs aren't public yet so I can only quote the public whitepaper [0]:
"The UEC transport protocol advances beyond the status quo by providing the following:
● An open protocol specification designed from the start to run over IP and Ethernet
● Multipath, packet-spraying delivery that fully utilizes the AI network without causing congestion or head-of-line blocking, eliminating the need for centralized load-balancing algorithms and route controllers
● Incast management mechanisms that control fan-in on the final link to the destination host with minimal drop
● Efficient rate control algorithms that allow the transport to quickly ramp to wire-rate while not causing performance loss for competing flows
● APIs for out-of-order packet delivery with optional in-order completion of messages, maximizing concurrency in the network and application, and minimizing message latency
● Scale for networks of the future, with support for 1,000,000 endpoints
● Performance and optimal network utilization without requiring congestion algorithm parameter tuning specific to the network and workloads
● Designed to achieve wire-rate performance on commodity hardware at 800G, 1.6T and faster Ethernet networks of the future"
You can think of it as the love-child of NDP [2] (including support for packet trimming in Ethernet switches [1]) and something similar to Swift [3] (also see [1]).
I don't know if UET itself will be what wins, but my point is the industry is taking the problems seriously and innovating pretty rapidly right now.
Disclaimer: in a previous life I was the editor of the UEC Congestion Control spec.
[0] https://ultraethernet.org/wp-content/uploads/sites/20/2023/1...
[1] https://ultraethernet.org/ultra-ethernet-specification-updat...
[2] https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acm...
[3] https://research.google/pubs/swift-delay-is-simple-and-effec...
rwmj 6 days ago |
On a related topic, has anyone had luck deploying TCP fastopen in a data center? Did it make any difference?
In theory for shortlived TCP connections, fastopen ought to be a win. It's very easy to implement in Linux (just a couple of lines of code in each client & server, and a sysctl knob). And the main concern about fastopen is middleboxes, but in a data center you can control what middleboxes are used.
In practice I found in my testing that it caused strange issues, especially where one side was using older Linux kernels. The issues included not being able to make a TCP connection, and hangs. And when I got it working and benchmarked it, I didn't notice any performance difference at all.
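For anyone wanting to try it, the server-side "couple of lines" look roughly like this in Python. It is a sketch, not a tuning recommendation: the function name and the queue length of 16 are arbitrary choices, `socket.TCP_FASTOPEN` is not exposed on every platform, and on Linux the `net.ipv4.tcp_fastopen` sysctl still has to allow server-side Fast Open for it to take effect.

```python
import socket


def make_tfo_listener(
    host: str = "127.0.0.1", port: int = 0, qlen: int = 16
) -> socket.socket:
    """Create a listening TCP socket and request Fast Open where supported."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    tfo = getattr(socket, "TCP_FASTOPEN", None)  # missing on some platforms
    if tfo is not None:
        try:
            # qlen bounds the number of pending Fast Open requests.
            srv.setsockopt(socket.IPPROTO_TCP, tfo, qlen)
        except OSError:
            pass  # kernel refused the option; socket still works as plain TCP
    return srv
```

The client side on Linux is similarly small (sending the first data with the `MSG_FASTOPEN` flag instead of a separate connect), but it is left out here because it is Linux-specific.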
ezekiel68 6 days ago |
I feel like there ought to be a corollary to Betteridge's law which gets invoked whenever any blog, vlog, paper, or news headline begins with "It's Time to..."
But the new law doesn't simply negate the assertion. It comes back with: "Or else, what?"
If this somehow catches on, I recommend the moniker "Valor's Law".
KaiserPro 6 days ago |
TLDR: No, it's not.
HOMA is great, but not good enough to justify a wholesale ripout of TCP in the "datacentre"
Sure, a lot of traffic is message oriented, but TCP is just a medium to transport those messages. Moreover it's trivial to do external requests with TCP because it's supported. There is no need to have HOMA terminators at the edge of each datacentre to make sure that external RPC can be done.
The author assumes that the main bottleneck to performance is TCP in a datacentre. That's just not the case; in my datacentre, the main bottleneck is that 100 gigs point to point isn't enough.
javier_e06 6 days ago |
In your data center usually there is a collection of switches from different vendors purchased through the years. Vendor A tries to outdo vendor B with some magic sauce that promise higher bandwidth. With open standards. To avoid vendor lock-ins. The risk averse manager knows the equipment might need to be re-used or re-sold elsewhere. Want to try something new? Plus: Who is ready to debug-maintain the new fancy standard?
cletus 6 days ago |
Network protocols are slow to change. Just look at IPv6 adoption. Some of this is for good reason. Some isn't. Because of everything from threat reduction to lack of imagination, equipment at every step of the process will tend to throw away anything that looks weird, a process known as ossification. You'll be surprised how long-lasting some of these things are.
Story time: I worked on Google Fiber years ago. One of the things I worked on was services to support the TV product. Now if you know anything about video delivery over IP you know you have lots of choices. There are also layers like the protocols, the container format and the transport protocol. The TV product, for whatever reason, used a transport protocol called MPEG2-TS (Transport Streams).
What is that? It's a CBR (constant bit rate) protocol that stuffs seven 188-byte payloads into a single UDP packet that was (IPv4) multicast. Why 7? Well because 7 payloads (plus headers) was under 1500 bytes and you start to run into problems with any IP network once you have larger packets than that (ie an MTU of 1500 or 1536 is pretty standard). This is a big issue with high bandwidth NICs such that you have things like Jumbo frames to increase throughput and decrease CPU overhead but support is sketchy on a heterogeneous network.
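The "why 7" arithmetic is easy to check. A quick sketch in Python, assuming an IPv4 header without options (20 bytes) and a plain UDP header (8 bytes), with no extra encapsulation such as RTP:

```python
TS_PACKET_BYTES = 188   # fixed MPEG-TS packet size
IPV4_HEADER_BYTES = 20  # assumes no IP options
UDP_HEADER_BYTES = 8


def ts_packets_per_datagram(mtu: int = 1500) -> int:
    """How many whole 188-byte TS packets fit in one unfragmented datagram."""
    payload_budget = mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES
    return payload_budget // TS_PACKET_BYTES


print(ts_packets_per_datagram(1500))  # 7 (7 * 188 = 1316; 8 * 188 = 1504 is too big)
print(ts_packets_per_datagram(9000))  # 47 with a 9000-byte jumbo-frame MTU
```

So the 7 falls straight out of the 1500-byte MTU: an eighth packet would push the payload to 1504 bytes, past the 1472 left after headers.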
Why 188 byte payloads? For compatibility with Asynchronous Transfer Mode ("ATM"), a long-dead fixed-packet size protocol (53 byte packets including 48 bytes of payload IIRC; I'm not sure how you get from 48 to 188 because 4x48=192) designed for fiber networks. I kind of thought of it as Fibre Channel 2.0. I'm not sure that's correct however.
But my point is that this was an entirely owned and operated Google network and it still had 20-30+ year old decisions impacting its architecture.
Back to Homa, a few thoughts:
1. Focusing on at-least once delivery instead of at-most once delivery seems like a good goal. It allows you to send the same packet twice. Plus you're worried about data offset, not ACKing each specific packet;
2. Priority never seems to work out. Like this has been tried. IP has an urgent bit. You have QoS on even consumer routers. If you're saying it's fine to discard a packet then what happens to that data if the receiver is still expecting it? It's well-intentioned but I suspect it just won't work in practice, like it never has previously;
3. Lack of connections also means lack of a standard model for encryption (ie SSL). Yes, encryption still matters inside a data center on purely internal connections;
4. QUIC (HTTP3) has become the de-facto standard for this sort of thing, although it's largely implementing your own connections in userspace over UDP; and
5. A ton of hardware has been built to optimize TCP and offload as much as possible from the CPU (eg checksumming packets). You see this effect with QUIC. It has significantly higher CPU overhead per payload byte than TCP does. Now maybe it'll catch up over time. It may also change as QUIC gets migrated into the Linux kernel (which is an ongoing project) and other OSs.
efitz 6 days ago |
I wonder which problem is bigger: modifying all the things to work with IPv6 only, or modifying all the things to work with (something-yet-to-be-standardized-that-isn’t-TCP)?
stego-tech 6 days ago |
As others have already hit upon, the problem forever lies in standardization of whatever is intended to replace TCP in the data center, or the lack thereof. You’re basically looking for a protocol supported in hardware from endpoint to endpoint, including in firewalls, switches, routers, load balancers, traffic shapers, proxies, etcetera - a very tall order indeed. Then, to add to that very expensive list of criteria, you also need the talent to support it - engineers who know it just as thoroughly as the traditional TCP/IP stack and ethernet frames, but now with the added specialty of data center tuning. Then you also need the software to support and understand it, which is up to each vendor and out of your control - unless you wrap/encapsulate it in TCP/IP anyway, in which case there goes all the nifty features you wanted in such a standard.
By the time all of the proverbial planets align, all but the most niche or cutting edge customer is looking at a project the total cost of which could fund 400G endpoint bandwidth with the associated backhaul and infrastructure to support it - twice over. It’s the problem of diminishing returns against the problem of entrenchment: nobody is saying modern TCP is great for the kinds of datacenter workloads we’re building today, but the cost of solving those problems is prohibitively expensive for all but the most entrenched public cloud providers out there, and they’re not likely to “share their work” as it were. Even if they do (e.g., Google with QUIC), the broad vibe I get is that folks aren’t likely to trust those offerings as lacking in ulterior motives.
klysm 6 days ago |
If anybody is gonna do it, it's gonna be someone like amazon that vertically integrates through most of the hardware
stego-tech 6 days ago |
That’s my point: TCP in the datacenter remains a 1% problem, in the sense that only 1% of customers actually have this as a problem, and only 1% of those have the ability to invest in a solution. At that point, market conditions incentivize protecting their work and selling it to others (e.g., Public Cloud Service Providers) as opposed to releasing it into the wild as its own product line for general purchase (e.g., Cisco). It’s also why their solutions aren’t likely to ever see widespread adoption, as they built their solution for their infrastructure and their needs, not a mass market.
wbl 6 days ago |
Nevertheless InfiniBand exists.
MichaelZuo 6 days ago |
Which makes the prospects of a replacement for either or both even more unlikely.
stego-tech 6 days ago |
As does Fibre Channel and a myriad of other solutions out there. The point wasn’t to bring every “but X exists” or “but Company A said in this blog post they solved it” response out of the fog, but to point out that these issues are incredibly fringe to begin with yet make the rounds a few times a year every time someone has an “epiphany” about the inefficiencies of TCP/IP in their edgiest of edge-case scenarios.
TCP isn’t the most efficient protocol, sure, but it survives and thrives because of its flexibility, cost, and broad adoption. For everything else, there’s undoubtedly something that already exists to solve your specific gripe about it.
panzagl 6 days ago |
Amazon's solutions to 1% problems are the next batch of cargo cult corporate solution speak that we're going to have to deal with. How can we have a web scale agile big data devops monorepo machine learning microservice oriented architecture if we're limited by TCP? I mean, it was developed in the 70s...
stego-tech 6 days ago |
Ugh, the fact I have to explain that we don’t need any of that cult-speak for our enterprise VMs (because enterprise software hasn’t even moved to containers yet) makes me grimace. I loathe how marketers and salesfolk have turned technology from a thing that can be adapted to solve your problems, into a productized solution for problems you never knew you had until the magazine they advertise in told your CIO about it.
iTokio 6 days ago |
They already started researching that in 2014 and have integrated the resulting protocol in some products like EBS:
https://www.allthingsdistributed.com/2024/08/continuous-rein...
skeptrune 6 days ago |
> Even if they do (e.g., Google with QUIC), the broad vibe I get is that folks aren’t likely to trust those offerings as lacking in ulterior motives.
It's pretty unfortunate that we've landed here. Hordes of venture-backed companies building shareware-like software with an "open source" label has done some severe damage.
getcrunk 6 days ago |
Quic is an attack on network based filtering (ad blockers) similar to doH. There likely are convenient ulterior motives.
stego-tech 6 days ago |
Which is ironic, because I remember TCP/IP maturing in the protocol wars of the 90s. My Cisco course specifically covered the protocols separately from the media layers because you couldn’t know if your future employer still leveraged Token Ring, or ATM, or IPX, or TCP; a decade later, the course had drastically simplified to “ethernet” and “TCP/IP” only.
Many of these came from companies who created the protocol solely to push products, which meant the protocols themselves had to compete outside of the vacuum chamber of software alone and instead operate in real world scenarios and product lines. This also meant that as we engineers and SysAdmins deployed them in our enterprises, we quite literally voted with our wallets where able on the gear and protocols that met our needs. Unsurprisingly, TCP/IP won out for general use because of its low cost of deployment and ongoing support compared to alternatives, and that point is lost on the modern engineer that’s just looking at this stuff as “paper problems”.
kjs3 3 days ago |
> I remember TCP/IP maturing in the protocol wars of the 90s
Good times. But it didn't matter because ATM was the future. /s
> Many of these came from companies who created the protocol solely to push products
Like 100Base-VG? That was a good laugh.
> TCP/IP won out for general use because of its low cost of deployment and ongoing support compared to alternatives, and that point is lost on the modern engineer that’s just looking at this stuff as “paper problems”
Welcome to my world...
Shawnj2 5 days ago |
Obvious answer is to run something on top of UDP like QUIC does
ghaff 6 days ago |
There was a ton of effort and money especially during the dot-com boom to rethink datacenter communications. A fair number of things did happen under the covers--offload engines and the like. And InfiniBand persevered in HPC, albeit as a pretty pale shadow of what its main proponents hoped for--including Intel and seemingly half the startups in Austin.
tails4e 6 days ago |
The cost of standards is very high, probably second to the cost of no standards!
Joking aside, I've seen this first hand when using things like ethernet/tcp to transfer huge amounts of data in hardware. The final encapsulation of the data is simple, but there are so many layers on top, and it adds huge overhead. Then standards have modes, and even if you use a subset the hardware must usually support all of them to be compliant, adding much more cost in hardware. A clean room design could save a lot of hardware power and area, but the loss of compatibility and interop would cost later in software... hard problem to solve for sure.
gafferongames 6 days ago |
Game developers have been avoiding TCP for decades now. It's great to finally see datacenters catching up.
gafferongames 6 days ago |
Downvote away but it's the truth :)
tonetegeatinst 6 days ago |
Maybe I am misunderstanding something about the issue, but isn't DCTCP a standard?
See the rfc here: https://www.rfc-editor.org/rfc/rfc8257
DCTCP doesn't seem to be a silver bullet for the issue, but it does seem to address some of the pitfalls of TCP in an HPC or data center environment. I've even spun up some VMs and used some old hardware to play with it to see how smooth it is and what hurdles might exist, but that was so long ago and stuff has undoubtedly changed.
wmf 6 days ago |
Homa has much lower latency than DCTCP.
cryptonector 6 days ago |
Various RDMA protocols were all about that. Where are they now?
SoftTalker 6 days ago |
Infiniband is still widely used in data centers, RDMA and IPoIB. Intel tried Omnipath but that died quickly (I don't know specifically why).
ltbarcly3 6 days ago |
Just using IP or UDP packets lets you implement something like Homa in userspace.
What's the advantage of making it a kernel level protocol?