Yeah. Hot loading is clearly better than anything else when you've got a million clients connected and you want to make a code change. Of course, we didn't have any of these fancy 'release' tools, we just used GNU Make to rsync the code to prod and run erlc. Then you can grab a debug shell and l(module). (we did write utilities to see what code was modified, and to provide the right incantations so we wouldn't load if it would kill processes)
In the contexts in which I’ve worked, this was solved by issuing a command to the server to enter a lame-duck mode and stop accepting new connections, then restarting the process with updated code after all existing connections ended.
This worked in our case because connections had a TTL with a “reasonable” time, couldn’t have been more than an hour. We could always wait it out.
I suppose hot reloading is more necessary when you have connections without a set TTL.
For a small change, you can hot load your change and be done in minutes. This means you can push several small changes in an hour. Which helps get things done rapidly.
It's also really nice to be able to see that the change works as expected (or not) at full load right away. If you've got to wait for connections to accumulate on the new server, that takes longer without hot load too.
Some changes can't be effectively hot loaded[1], and for those you do need to do something to kick out users and let them reconnect elsewhere, and you could do all your updates that way, but it means a lot more client time spent reconnecting.
On the one hour TTL. Sometimes that's reasonable, but sometimes it's really not. Someone downloading a large file on a slow connection is better served by letting the download continue to trickle for hours than forcing them to reconnect and resume. A real time call is better served by letting it run until the participants are done. For someone on a low end phone, staying connected for as long as they can is probably better than forcing a reconnect where they'll need to generate new ephemeral keys and do a key exchange exercise.
[1] At the very least, BEAM updates and kernel changes are much more easily done by restarting. But not all userspace Erlang changes are easy to make hot loadable, either.
I’m still new to the erlang/elixir community and I haven’t run it in prod yet but this is my experience coming from Java, Node, and Python.
You can build os packages, and push those however you like.
You can use rsync.
You could push the files over dist, if you want.
You could probably do something cool with bittorrent (maybe that trend is over?)
If you write Makefiles to push, you can use make -j X to get low effort parallelization, which works ok if your node count isn't too big, and you don't need as instant as possible updates.
Erlang source and beam files don't tend to get very large. And most people's dist clusters aren't very large either; I don't think I've seen anyone posting large cluster numbers lately, but I'd be surprised if anyone was pushing to 10,000 nodes at once. Assuming they're well connected, pushing to 10,000 nodes takes some prep, but not that much; if you're driving it from your laptop, you probably want an intermediate pusher node in your datacenter, so you can push once from home/office internet to the pusher node, and then fork a bunch of pushers in the datacenter to push to the other hosts. If you've got multiple locations and you're feeling fancy, have a pusher node at each location, push to the pusher node nearest you; that pushes to the node at each location and from there to individual nodes.
Other issues are more pressing; like making sure you write your code so it's hotload friendly, and maybe trying to test that to confirm you won't use the immense power of hotloading to very rapidly crash all your server processes.
I wish I had used a pusher node to deploy things when a colleague was using almost all the upstream bandwidth in the office making a video call when my bosses were giving demo and the fix I coded for an issue discovered during the demo could not deploy via Capistrano
Honestly I wish we had had the ability to do both. Sometimes a change is so tricky that the argument that "hot code updates are complicated and it'll cause more issues than it will solve" is very true, and maybe a deploy that forces everyone to reconnect is best for that sort of change. But often times we'd deploy some mundane thing where you don't have to worry about upgrading state in a running gen server or whatever, and it'd be nice to have minimal impact.
Obviously that's even more complexity piled onto the system, but every time I pushed some minor change and caused a retry that (in a perfect world at least...) didn't need to retry, I winced a bit.
It's sometimes helpful to have the option to just restart one little corner of the full system, to minimize impact, but it is helpful to customer experience (if we don't screw it up) and very much the opposite for developer experience (it's crippling to velocity to need to discuss each change with multiple experts and determine the appropriate type of release).
It's one of those thing that sound nice on paper but a actually couple your runtime with ci/cd, if you have anything else beside Erlang what do you do? You now need a second solution to deploy code.
Ofc you can say "don't do that" but sometimes it's just the way it is...
But I agree, 99% of the time a rolling update is easier and works fine.
BEAM encourages you to structure your program as isolated interacting processes and so it’s not that far from a container runtime in itself.
With rolling deployments, your only choice is to wait until that connection drains by completing or failing the download. If that doesn't fit your use case, you're out of options.
If your application is an Erlang app, you could hot code reload an unaffected part of the application while the download finishes. Or, if the part of the application handling the download is an OTP pattern that supports hot code reloading (like a gen_server) you could even make changes to that module and release e.g. speed improvements mid download stream. This is why Erlang shines in applications like telephony, which it was originally designed for.
one of the cool things about unix is (and perhaps windows can do this in the right modes, idk), the running copy of a program is a link to the code on the disk (a link is a reference to the file, without the file name). You can delete a running program from the disk and replace it with a new program, but the running copy will continue and not be aware that you've done that. You don't need to wait till the program finishes anything.
on an every day basis, this is what happens when you run software updates while you are still using your machine, even if your currently active programs get updated. you'll sometimes notice this in a program like Firefox, it will lose its ability to open new tabs; that's because they go out of their way to do that, they wouldn't have to if they wanted to avoid it, just fork existing processes.
An even cooler thing is the running code is just mmaped into memory. One of the nifty things about mmaped files is if you change the backing file, you change it everywhere.
Not my recommended way to hot load code, but it might work in a pinch.
unlink, replace, start a new one, have the old one stop listening does work for many things. Some OSes have/had issues with dropping a couple pending connections sometimes; or you have to learn the secret handshake to do it right. A bigger problem is if your daemon is sized to fit your memory, you might not be able to run a draining and a filling daemon at once.
It also doesn't really solve the issue of changing code for existing connections, it's a point in time migration. Easier to reason about, for sure, but not always as effective.
Seems like they deploy Elixir on embedded Linux. The embedded Linux distro is Nerves which replaces systemd and boots to the BEAM VM instead as process 1, putting Elixir as close to the metal as they can.
I know nothing about any of the above (assumption is I'm fool enough to try and simplify) plus I know I've misused the concepts I wrote but that's my point so read the article. All simplifications are salads
What I would add is: you do not have bash (or sh, or any other "conventional" shell). The BEAM is your userspace.
You can SSH in, but what you end up with is an IEx prompt (the Elixir REPL). This is surprisingly fine once you get used to it (and once you've built a few helpers for your usecase).
It's absolute magic and allows for very rapid development and ease of deploying fixes and updates.
We do use have to use distillery though and have had to resort to a bunch of custom glue bash scripts which I wish was more standardized because it's such a killer feature.
Due to Elixirs efficiency, everything is running on a single node despite thousands of concurrents so haven't really experienced how it handles multiple nodes.
Most of the benefits of hot code updates (with better understanding of the boundaries of changes) can be found through judicious rolling restarts that things like k8s make easier these days. Any time you have the capacity to hot patch code on a node, you probably have the capacity to hot patch the node's setup as well.
That said I think that someone could use the code reloading abilities of erlang to make a genuinely unparalleled production problem diagnostic toolkit - where you can take apart a problem as it is happening in real time. The same kinds of people who are excited about time traveling debugging should be excited about this imo.
The tooling would let us patch multiple modules at a time, which basically wrapped `:rpc.call/4` and `Code.eval_string/1` to propagate the update across the cluster, which is to say, the hot-patch was entirely deployed over erlang's built-in distribution.
As for relups, I once tried starting a project to make them easier but eventually decided that the number of bazookas pointed at each and every toe made them basically a non-starter for anything that isn’t trivial. And if its trivial it was already covered by the nl (network load, send a local module to all nodes in the cluster and hot load it) style tooling.
This and everything else said sounds so much like PHP+FTP workflow. It's so good.
Are you suggesting that hot code replacement is somehow a attack vector? Ericsson has been using this method for decades on critical infrastructure to patch switches without dropping live calls/connections it works.
No need to fear Erlang/BEAM.
As a security person, this seems inherently dangerous. I asked why it is safe, because I presumed I’m missing something due to the lack of ever hearing about exploitation in the wild.
If someone were to exploit a running Erlang process, the description of this feature sounds to me like they would have access to code paths that allow pushing new code to other Erlang processes on cooperating nodes.
If someone can exploit one Erlang node, they can easily take over the cluster. But in a more typical horizontally scaled system, usually if they can get into one node, they can get into all the other nodes running the same software the same way.
Security wise, I think of the whole cluster as one unit. There's no meaningful way to separate it, so it's just one thing. Best not to let anyone in who can't be trusted, because either they have access or they don't; there's no limited access.
But given that, may as well push code updates over dist in a straight forward way, because it's possible, so it may as well be straight forward.
In a use case like clustering together identical web servers, or message broker nodes like RabbitMQ, I don't think it's all that scary. It gives an attacker easier lateral movement, but that doesn't gain them a whole lot if all the nodes have the same permissions, operate on the same data, etc.
Depending on risk appetite and latency requirements you can also isolate clusters at the deployment / datacenter level. RabbitMQ for instance uses Erlang clustering within a deployment (nodes physically close together, in the same or nearly the same configuration) and a separate federation protocol between clusters. This acts as a bulkhead to isolate problems and attackers.
Hot patching is a bit unsatisfying because you are still running the same VM afterwards. WIth socket migration you can launch a new VM if you want to upgrade your Erlang version. I don't know of a way to do it with existing software, but in principle using something like HAProxy with suitable extensions, it should be possible to even migrate connections across machines.
Migrating socket state across machines is possible too, but I don't think it's anywhere close to mainstream. HAProxy is a lovely tool, but I'm pretty sure I saw something in its documentation that explicitly states that sort of thing is out of scope; they want to deal with user level sockets.
Linux has a TCP Repair feature which can be used as part of socket migration; but you'll also need to do something to forward packets to the new destination. Could be arping for the address from a new machine, or something fancier that can switch proportionally or ??? there's lots of options, depending on your network.
As much as I'd love to have a use case for TCP migration, it's a little bit too esoteric for me ... reconnecting is best avoided when possible, but I'm counting TCP migration as non-possible for purposes of the rule of thumb.
The saying in the Erlang crowd is that a non-distributed system can't be really reliable, since the power cord is a single point of failure. So a non-painful way to migrate across machines would be great. It just hasn't been important enough (I guess) for make anyone willing to deal with the technical obstacles.
I wonder whether other OS's have supported anything like that.
I worked on a phone switch (programmed in C) a long time ago that let you do both software and hardware upgrades (swap CPU boards etc.) while keeping connections intact, but the hardware was specially designed for that.
I don't think I've seen it, but I don't see everything, and it'd be pretty esoteric. From my memory of working with the FreeBSD tcp stack, I suspect it wouldn't be too hard to make something like this work there, too; other than the security aspects, but could probably do something like ok to 'repair' a connection that matches a listen socket you also pass or something. But you'd really need the use case to make the hassle worth it, and I don't think most regular server applications are enough to warrant it.
While I can't imagine hot reloading is super practicle in production, it does highlight that erlang/beam/otp has great primitives for building reliable production systems.
Yet I've never been able to find it again.
I literally used hot code reloading a few weeks back to fix a 4-20 mA circuit on a new beta firmware while a client was watching in remote Colorado. Told them I was “fixing a config”. Tested it on our device and then they checked it out over a satellite PLC system. Then I made an update Nerves FW, uploaded it. Made the client happy!
Note that I’ve found that using scp to copy the files to /tmp and then use Code.compile to work better than copy and paste in IEx. The error messages get proper line numbers.
It’s also very simple to write a helper function to compile all the code in /tmp and then delete it. I’ve got a similar one in my project that scp’s any changed elixir files in my project over. It’s pretty nice.