> This is the story of how (not) to build development environments in the cloud.
I'd like to request that the comment thread not turn into a bunch of generic k8s complaints. This is a legitimately interesting article about complicated engineering trade-offs faced by an organization with a very unique workload. Let's talk about that instead of talking about the title!
Super useful negative example, and the lengths they pursued to make it fit! And no knock on the initial choice or impressive engineering, as many of the k8s problems they hit likely weren't understood gaps at the time they chose k8s.
Which makes sense, given k8s's roots in (a) not being a security isolation tool & (b) targeting up-front configurability over runtime flexibility.
Neither of which mesh well with the co-hosted dev environment use case.
Because I don't understand most of the article if it's the former. How are things like performance a concern for internal development environments? And why are so many things stateful - ideally there should be some kind of configuration/secret management solution so that deployments are consistent.
If it's the latter, then this is incredibly niche and maybe interesting, but unlikely to be applicable to anyone else.
> This is not a story of whether or not to use Kubernetes for production workloads that’s a whole separate conversation. As is the topic of how to build a comprehensive soup-to-nuts developer experience for shipping applications on Kubernetes.
> This is the story of how (not) to build development environments in the cloud.
their question isn't asking anything. It's both about development environments AND a service they sell, which is dev envs.
Even so
> This is the story of how (not) to build development environments in the cloud.
that is what the article is about. Their words. The person asked what the article is about, and this is it, from the authors themselves.
read the damn article
I ended up with a mix of nix and its VM build system, which is based on qemu. The issue is that it's too tied to NixOS and all services run in the same place, which forces you to manage ports and other things.
How I wish it could work is having a flake that defines certain services; these services could or could not run in different µVMs sharing an isolated linux network layer. Your flake could define your versions and your commands to interact with and manage the lifecycle of those µVMs. As the nix store can be cached/shared, it can provide fast and reproducible builds after the first build.
Can you expand on this? Are you talking about containers you create?
The infrastructure is now incredibly understandable, simple and cost effective.
Kubernetes cost us over a million dollars, unnecessarily, in both DevOps time and actual Google Cloud costs, and even worse it cost us time to market. Stay off of Kubernetes as long as you can in your company, unless you are basically forced onto it. You should view it as an unnecessary evil that comes with massive downsides in terms of complexity and cost.
But this is really a spurious concern. I myself used to care about it years ago. But in practice, rarely do people switch between cloud providers because the incremental benefits are minor, they are nearly equivalent, there is nothing much to be gained by moving from one to the other unless politics are involved (e.g. someone high up wants a specific provider.)
I feel like Kubernetes' downfall, for me, is the number of "enterprise" features it was convinced into supporting, and enterprise features doing what they do best: turning the simplest of operations into a disaster.
Github Actions CI. Take this, add a few more dependencies and a matrix strategy, and you are good to go: https://github.com/bhouston/template-typescript-monorepo/blo... For dev environments, you can add suffixes to the service names based on branches.
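For the branch-suffix idea, a rough sketch of what that deploy step can look like (service name, project and region are placeholders, not taken from the linked repo; branch names may need extra sanitizing to be valid service names):

    # derive a suffix from the current branch and deploy a per-branch Cloud Run service
    BRANCH="$(git rev-parse --abbrev-ref HEAD | tr '/' '-')"
    gcloud run deploy "api-${BRANCH}" \
      --source . \
      --region us-central1 \
      --project my-project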
> How do you share storage?
I use managed DBs and Cloud Storage for shared storage. I think that provisioning your own SSDs/HDs to the cloud is indicative of an anti-pattern in your architecture.
> How do the docker containers know how to find each other?
I try to avoid too much communication between services directly, rather try to go through pub-sub or similar. But you can set up each service with a domain name and access them that way. With https://web3dsurvey.com, I have an api on https://api.web3dsurvey.com and then a review environment (connected to the main branch) with https://preview.web3dsurvey.com / https://api.preview.web3dsurvey.com.
> How does security work?
You can configure Cloud Run services to be internal only and not to accept outside connections. Otherwise one can just use JWT or whatever is normal on your routes in your web server.
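For reference, the internal-only setup is roughly a one-flag change on the service (service name and region are placeholders):

    # restrict the service to traffic from inside the project's VPC / internal sources
    gcloud run services update api --ingress internal --region us-central1
    # or require IAM-authenticated callers instead of allowing public access
    gcloud run services update api --no-allow-unauthenticated --region us-central1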
I have been converging on a similar stack, but trying to avoid using a load balancer in an effort to keep fixed costs low.
Yeah, I definitely want to avoid a load balancer or gateway or endpoints as well, for cost purposes.
https://github.com/bhouston/template-typescript-monorepo
This is my living template of best practices.
If you want some guidance, shoot me an email (in profile). You can run most stuff for peanuts.
Or just https://github.com/mightymoud/sidekick or coolify or dokku or dockify; there are a million such things. Oh, just remembered Kamal deploy from DHH, and docker swarm IIRC (though people seem to have forgotten docker swarm!)
I like this idea very much !
I appreciate the offer! But it is not as robust and it is more expensive and misses a lot of benefits.
Back in the 1990s I did FTP my website to a VPS after I graduated from Geocities.
Google Cloud charges based on CPU used. Thus when my servers have no traffic, they cost less than $1/month. If they have traffic, they are still cost effective. https://web3dsurvey.com has about 500,000 hits per month and it costs me $4/month to run both the Remix web server and the Fastify API server. Details here: https://x.com/benhouston3d/status/1840811854911668641
Also it will autoscale under load. Thus when one of my posts was briefly the top story on Hacker News last month, Google Cloud Run added more instances to my server to handle the load (because I do not run my personal site behind a CDN; a CDN costs too much, and I prefer to pay $1/month for hosting).
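That behaviour falls out of the default scale-to-zero settings; a sketch of the relevant flags (image name and limits are illustrative):

    # --min-instances 0 scales to zero when idle; --max-instances caps a traffic spike
    gcloud run deploy web \
      --image gcr.io/my-project/web:latest \
      --min-instances 0 \
      --max-instances 10 \
      --region us-central1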
Also deploying Docker containers that build on Github Actions CI in a few minutes is a great automated experience.
I do also use Google services like Cloud Storage, Firestore, BigQuery etc. And it is easier to just run it on GCP infrastructure for speed.
I also have to version various tools that get installed in the Docker image, like Blender, Chromium, etc. This is the perfect use case for Docker.
I feel this is pretty close to optimal. Fast, cheap, scalable, automated and robust.
Why are you actually using google cloud for blog post hosting?
Also you said Kubernetes cost a million $.
Wait a second, have you converted those million $ to $4 per month?
What tomfoolery is this?
But I am also the Founder/CTO of Threekit.com.
I hope that makes sense now.
Yup. Isn't it Knative Serving or a home grown Google alternative to it? https://knative.dev/docs/serving/
The key is I am not managing Kubernetes and I am not paying for it - it is a fool's errand, and incredibly rarely needed. Who cares what is underneath the simple Cloud Run developer UX? What matters for me is cost, simplicity, speed and understandability. You get that with Cloud Run, and you don't with Kubernetes.
Kubernetes can be as complex or as expensive as you'd like but it's also fairly possible to run a pretty bulletproof simple Kube cluster.
Here are my concerns:
With Kubernetes, you need to pay for a few nodes just to keep it up, and then you need to pay for your nodes no matter how much you use them.
Remember that Cloud Run charges based on usage, so if a service sits unused for a while, which often happens in a heterogeneous microservices environment, you don't pay for it.
Also, Kubernetes autoscaling is slow unless you over-provision, which is costly (Cloud Run autoscales really quickly, about as fast as your Docker image can be loaded and started, which for me is 1-3 seconds, whereas I found Kubernetes autoscales on the order of minutes). This lets one scale to zero without much of a hit.
I also ran into massive issues trying to get GPUs to work in Kubernetes - it was a driver nightmare that has wasted weeks of time collectively over the years. Whereas they are auto-provisioned properly on Cloud Run if you request them.
Lastly job systems on Kubernetes are a nightmare of configuration. The built-in scheduler cannot handle a lot of jobs but Argo also has its own issues if you actually try to use it. We've wasted weeks of effort on this. Cloud Run Tasks just skips this and is ultra fast too and handles scaling up to do a lot of jobs in such a simple fashion.
Honestly, managing Kubernetes is just overall a pain that has little benefit.
It is really hard to figure out what the benefits of Kubernetes are from my point of view. It has been a massive source of pain and costs and lost developer time.
1.) What would you think of things like hetzner / linode / digitalocean (if stable work exists)
2.) What do you think of https://sst.dev/ or https://encore.dev/ ? (They support rather easier migration)
3.) Could you please indicate the split of that $1 million between DevOps time and unnecessary Google Cloud costs? And were there some outliers (like "oh, our intern didn't add this specific variable, misconfigured the cloud and wasted $10k on gcloud, oops!"), or was it that bandwidth costs that much more in gcloud (I don't think the latter is the case, though)?
Looking forward to chatting with you!
Perhaps a followup article will go into detail about their replacement.
Gitpod Flex is runner-based. The runner interface is intentionally generic so that we can support different clouds, on-prem or just Linux in future.
The first implemented runner is built around AWS primitives like EC2, EBS and ECS. But because of the more generic interface Gitpod now supports local / desktop environments on MacOS. And again, future OS support will come.
There’s a bit more information in the docs, but we will do some follow ups!
- https://www.gitpod.io/docs/flex/runners/aws/setup-aws-runner... - https://www.gitpod.io/docs/flex/gitpod-desktop
(I work at Gitpod)
>> We’ll be posting a lot more about Gitpod Flex architecture in the coming weeks or months.
Cramming more detail into this post would have exceeded the average user read time ceiling.
Did you use consul?
You also have GitHub and GitLab VCSs that have their own hosted runners for pipelines, but also let you configure a runner to use private resources to offload jobs to.
Based on this information, it is hard to justify even considering k8s for the problem that gitpod has.
https://static.googleusercontent.com/media/research.google.c...
I am not sure what differences k8s has compared to Borg. At the concept level these are pretty comparable.
In my opinion, k8s is great for stable and consistent deployment/orchestration of applications. Dev environments by default are in a constant state of flux.
I don’t understand the need for “cloud development environments” though. Isn’t the point of containerized apps to avoid the need for synchronizing dev envs amongst teams?
Or maybe this product is supposed to decrease onboarding friction?
It's also much cheaper to hire contractors and give them a CDE that can be terminated at a moment's notice.
The rest of our eng team just did dev on their laptops though. I do think there was a level of batteries-included-ness that came with the ephemeral dev envs which our less technical data scientists appreciated, but the rest of our developers did not. Just my 2c
Oddly, I left with a funny alternate takeaway: One by one, their clever inhouse tweaks & scheduling preferences were recognized by the community and turned into standard k8s knobs
So I'm back to the original question... What is fundamentally left? It sounds like one part is maintaining a clean container path to simplify a local deploy, which a lot of k8s teams do (ex: most of our enterprise customers prefer our docker compose & AMIs over k8s). But more importantly, something fundamental architecturally about how envs run that k8s cannot do, but they do not identify?
Bottom of the post.
Still, some of the core challenges remain:
- the flexibility Kubernetes affords makes it hard to build and distribute a product with such specific requirements across the broad swath of differently set up Kubernetes installations. Managed Kubernetes services help, but come with their own restrictions (e.g. Kernel versions on GKE).
- state handling and storage remains unsolved. PVCs are not reliable enough, subject to a lot of variance (see point above), and depending on the backing storage have vastly different behaviour. Local disks (which we use to this day) make workspace startup and backup expensive from a resource perspective and hard to predict timing wise.
- user namespaces have come a long way in Kubernetes, but by themselves are not enough. /proc is still masked, FUSE is still not usable.
- startup times, specifically container pulls and backup restoration, are hard to optimize because they depend on a lot of factors outside of our control (image homogeneity, cluster configuration)
Fundamentally, Kubernetes simply isn't the right choice here. It's possible to make it work, but at some point the ROI of running on Kubernetes simply isn't there.
AFAICT, a lot of that comes down to storage abstractions, which I'll be curious to see the answer on! Pinned localstorage <> cloud native is frustrating.
I sense another big chunk is the fast secure start problems that firecracker (noted in the blogpost) solves but k8s is not currently equipped for. Our team has been puzzling over that one for a while, and part of our guess is incentives. It's been 5+ years since firecracker came out, so it's likewise been frustrating to see.
What we have seen work, especially when you are building a developer-centric product, is exposing these native issues around network, memory, compute and storage to engineers; they are then more willing to work around them. Abstracting those issues away shifts responsibility onto the product.
Having said that, I still think k8s is an upgrade when you have a large team.
If you really need consistency for the environment - Let them own the machine, and then give them a stable base VM image, and pay for decent virtualization tooling that they run... on their own machine.
I have seen several attempts to move dev environments to a remote host. They invariably suck.
Yes - that means you need to pay for decent hardware for your devs, it's usually cheaper than remote resources (for a lot of reasons).
Yes - that means you need to support running your stack locally. This is a good constraint (and a place where containers are your friend for consistency).
Yes - that means you need data generation tooling to populate a local env. This can be automated relatively well, and it's something you need with a remote env anyways.
---
The only real downside is data control (ie - the company has less control over how a developer manages assets like source code). In my experience, the vast majority of companies should worry less about this - your value as a company isn't your source code in 99.5% of cases, it's the team that executes that source code in production.
If you're in the 0.5% of other cases... you know it and you should be in an air-gapped closed room anyways (and I've worked in those too...)
The developers also lack knowledge about the environment; can't evolve the environment; can't test the environment for bugs; and invariably interfere with each other because it's never isolated well. And also, yes, it adds lag.
Anyway, yes, working locally on false data that bears little resemblance to production still beats remote environments.
I'm not recommending this as a best practice. I just believe that we, as developers, end up creating some myths to ourselves of what works and what doesn't. It's good to re-evaluate these beliefs now and then.
python -m venv .venv
It's trivial to setup a venv, but sometimes it's just not worth it for me.
I watched it last week. With 4 (I hope junior) Devs in a "pair programming" session that forced me to figure out how VSCode does virtual envs, and still I had to tell them like 3 times "stop opening a damn new terminal, it's obviously not setup with our python version, run the command inside the one that has the virtual env activated".
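For anyone hitting the same new-terminal confusion, the fix in each fresh terminal is just (assuming the venv lives at .venv as above):

    source .venv/bin/activate   # puts the venv's python/pip first on PATH for this terminal only
    python -V                   # confirm it's the project's interpreter, not the system one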
When it comes to opening a new terminal, you would have the exact same problem by... running commands in a terminal; can't see how VSCode-related that is.
If you stick to the tried and true libs and change your function kwargs or method names when getting warnings, then I’ve had pretty rock steady reproducibility using even an un-versioned “python -m pip install -r requirements.txt” experience
I could also be a slob or just not working at the bleeding edge of python lib deployment tho so take it with a grain of salt.
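If you do want to go from that un-versioned install to something more reproducible, the low-effort version is just freezing whatever currently works (a sketch, not a full lockfile workflow):

    python -m pip freeze > requirements.txt     # capture the exact versions that work today
    python -m pip install -r requirements.txt   # later installs then reproduce those versions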
One of the benefits of moving away from Kubernetes, to a runner-based architecture, is that we can now seamlessly support cloud-based and local environments (https://www.gitpod.io/blog/introducing-gitpod-desktop).
What's really nice about this is that with this kind of integration there's very little difference in setting up a dev env in the cloud or locally. The behaviour and qualities of those environments can differ vastly though (network bandwidth, latency, GPU, RAM, CPUs, ARM/x86).
Kubernetes is another mess of userspace ops tools. Userspace is for composable UI not backend. Kube and Chef and all those other ops tools are backend functionality being used like UI by leet haxxors
For example, when you're running on your local machine you've actually got the amount of RAM and CPU advertised :)
https://github.com/89luca89/distrobox
It is sorta like Vagrant, but instead of using virtualbox virtual machines you use podman containers. This way you get to use OCI images for your "dev environment" that integrates directly into your desktop.
There are some challenges related to usermode networking for non-root-managed controllers, and desktop integration has some additional complications. But besides that it has almost no overhead and you can have unfettered access to things like GPUs.
Also it is usually pretty easy to convert your normal docker or kubernetes containers over to something you can run on your desktop.
Also it is possible to use things like Kubernetes pod definitions to deploy sets of containers with podman and manage them with systemd and such things. So you can have "clouds of containers" that your dev container needs access to locally.
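For anyone who hasn't tried it, the basic distrobox flow looks roughly like this (image and container name are just examples):

    distrobox create --name dev --image fedora:40   # container that shares your $HOME and session
    distrobox enter dev                             # interactive shell inside it
    distrobox enter dev -- make test                # or run a single command and exit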
If there is a corporate need for Windows-specific applications then running Windows VMs or doing remote applications over RDP is a possible workaround.
If everything you are targeting as a deployment is going to be Linux-everything, then it doesn't make a lot of sense to jump through a bunch of hoops and cause a bunch of headaches just to avoid having it as your workstation OS.
You'll run into occasional issues (e.g. if everyone is trying to run default node.js on default port) but with some basic guardrails it feels like it should be OK?
I'm remembering back to when my old company ran a lot of PHP projects. Each user just had their own development environment and their own Apache vhost. They wrote their code and tested it in their own vhost. Then we'd merge to a single separate vhost for further testing.
I am trying to remember anything about what was painful about it but it all basically Just Worked. Everyone had remote access via VPN; the worst case scenario for them was they'd have to work from home with a bit of extra latency.
Distrobox and podman are such a charm to use, and so easily integrated into dev environments and production environments.
The intentional daemon free concept is so much easier to setup in practice, as there's no fiddly group management necessary anymore.
Just a 5 line systemd service file and that's it. Easy as pie.
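As a rough illustration of that "few-line service file" idea (names, image and port are placeholders; a real setup would tune this):

    mkdir -p ~/.config/systemd/user
    cat > ~/.config/systemd/user/dev-db.service <<'EOF'
    [Unit]
    Description=Dev Postgres via podman

    [Service]
    ExecStart=/usr/bin/podman run --rm --name dev-db -e POSTGRES_PASSWORD=dev -p 5432:5432 docker.io/library/postgres:16
    ExecStop=/usr/bin/podman stop dev-db

    [Install]
    WantedBy=default.target
    EOF
    systemctl --user daemon-reload && systemctl --user enable --now dev-db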
I have worked on developing VMs for other developers that rely on a local IDE. The main sticking point is syncing and schlepping source code (something my setup avoids because the source code and editor are on the remote machine). I have tried a number of approaches, and I sympathize with the article author. So, in response to "Devs need to create the software tooling to make remote dev less painful. I mean, they're devs... making software is kind of their whole thing." <-- syncing and schlepping source code is by no means a solved problem.
I can also say that, my spacemacs config is very vanilla. Like my phone, I don't want to be messing with it when I want to code. Writing tooling for my editor environment is a sideshow for the work I am trying to finish.
It was never an issue to use X Windows on them, with hummingbird on my Windows thin client.
I guess a new generation has to learn the ways of timesharing development.
Nowadays Web frontends and SSH/cloud shell, have replaced what used to be X Windows / telnet / rsh, but the underlying workflows aren't much different than running an IDE / emacs /vi / joe /... from a UNIX development server in a 1990's office.
I honestly don't understand why nobody has simply invented some software to solve this problem, after 50 years.
We did, it is called GUIs and language REPLs, like the Smalltalk and Interlisp-D development environments, with graphics-based terminals, not dependent on replicating virtual teletypes.
Still, it's something that seems to have trouble taking off the way it should.
That's probably the difference. Throw elasticsearch, kafka and a bunch of Java services in and you'll be easily exhausting your RAM (at least at startup).
So the solution here is to not have that kind of "stack".
I mean, if it's all so big and complex that it can't be run on a laptop then you almost certainly got a lot of problems regardless. What typically happens is tons of interconnected services without clear abstractions or interfaces, and no one really understands this spaghetti mess, and people just keep piling crap on top of it.
This leads to all sorts of problems. Everywhere I've seen this happen they had real problems running stuff in production too, because it was a complex spaghetti mess. The abstracted "easy" dev-env (in whatever form that came) is then also incredibly complex, finicky, and brittle. Never mind running tests, which is typically even worse. It's not uncommon for it all to be broken for every other new person who joins because changes somewhere broke the setup steps which are only run for new people. Everyone else is afraid to do anything with their machine "because it now works'.
There are some exceptions where you really need a big beefy machine for a dev env and tests, maybe, but they're few and far between.
Reminds me of my favorites debugging technique. It's super fast: Don't write any bugs!
With things that messy it's fairly likely there would be dependency loops or problems (thundering herd, etc) trying to get things going from a cold start.
ie after a complete outage or similar for whatever reason
Isn't this problem solved by CICD? When the developer is ready to test, they make a commit, and the pipeline deploys the code to a dev/test environment. That's how my teams have been doing it.
I don't quite understand how people get into the situation where their work can't fit on their workstation. I've worked on huge projects at huge tech companies, and I could run everything on my workstation. I've worked at startups where the CI situation was passing 5% of the time and required 3 hours to run, that you can now run on your workstation in seconds. What you do is fix the stuff that doesn't fit.
The most insidious source of slowness I've encountered is tests that use test databases set to fsync = on. This severely limits parallelism and speed in a way that's difficult to diagnose; you have plenty of CPU and memory available, but the tests just aren't going very fast. (I don't remember how I stumbled upon this insight. I think I must have straced Postgres and been like "ohhhhhhhhh, of course".)
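For anyone wanting to try the same tuning, the test-only knobs are along these lines (never do this to a database you care about):

    # start a throwaway test cluster with durability turned off for speed
    pg_ctl -D "$PGDATA" start -o "-c fsync=off -c synchronous_commit=off -c full_page_writes=off"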
When you are working on a software project that has 1,000 active developers checking in code daily and requires a stable system build, you need lots of compute.
Also, if you're booting kernel or device drivers you need the hardware. Some of this is not desktop hardware.
When you're developing and only need to touch 0.1% of the product and 0.001% of the code, that's a total and complete waste of time.
How tightly coupled are these systems?
It doesn't have to be like that. I've worked on a 10MLOC codebase with 500+ committers - all perfectly runnable locally, on admittedly slightly beefy dev machines. It's true that systems will grow without limit unless some force exists to counter this, but keeping your stack something you can sanely run on a development machine is well worth spending some actual effort on.
But that's only the resource problem. Another problem I have seen my entire career, is devs can't keep their machines configured the same. They have different model laptops, they don't pin their app versions, they configure and install things by hand. Each time they change something by accident it takes them hours, days, sometimes weeks, to get it working again. That also can lead to bugs developing the app, which wastes a huge amount of time.
And then there's the fact that their local copy runs completely differently than it does in production. This leads to the app being written with certain assumptions about how it runs, that turn out to be false in production. I've seen this lead to catastrophe, as well as just weeks to months of wasted time, trying to track down issues. This is an undeniable, existential issue.
Finally, it's rare for local setups to be secure. Often devs get too much access from their local machines, and this is stolen by infostealer malware and compromise happens. A protected remote environment is easier to secure. A lot of development is hampered by all the crappy corporate security tools that are on laptops now. Remote dev allows you to bypass all that and have a fully working yet protected network without restrictions.
Is remote dev a pain? Right now, yeah, because nobody has made it less painful. So of course it's easier on the local machine. But it's not ideal. I'm sure eating with a fork was more painful than eating with your hands, until forks were popularized in the 18th century. Change took a long time, but I think most of us prefer the change once new tools became widely available.
Yes and no. Realistically, the range between the beefiest possible remote server and the beefiest possible workstation is what, one order of magnitude? So in a growing environment doing remote dev will maybe let you kick the can down the road a year or two, but you'll still have to deal with whatever was causing your requirements to grow pretty soon.
> But that's only the resource problem. Another problem I have seen my entire career, is devs can't keep their machines configured the same. They have different model laptops, they don't pin their app versions, they configure and install things by hand. Each time they change something by accident it takes them hours, days, sometimes weeks, to get it working again.
> Finally, it's rare for local setups to be secure. Often devs get too much access from their local machines, and this is stolen by infostealer malware and compromise happens. A protected remote environment is easier to secure. A lot of development is hampered by all the crappy corporate security tools that's on laptops now. Remote dev allows you to bypass all that and have a fully working yet protected network without restrictions.
This isn't a remote versus local question, it's a question of how much control developers have over their environment and how much you manage and standardise what's installed. You can have a fully locked down local machine where developers can't install anything except a short whitelist (of course you may get some pushback) and you can have a remote VM where developers curl|sh whatever random repack of Python they wanted this week - I've seen both these things happen in practice.
Security junkware I sort of agree with you, but I think that's more of an artifact of bad laws/policies and if and when remote dev takes off we'll see just as much junkware on remote dev machines as on local ones.
Sounds like you have a different problem.
CPU resources required to run your stack should be very minimal if it's a single user accessing it for local testing; idle threads don't consume oodles of CPU cycles to do nothing.
Memory use may be significant even in that case (depending on your stack) but let's be realistic. If your stack is so large that it alone requires more memory than a dev machine can spare with an IDE open, the cost of providing developers with capable workstations will pale in comparison to the cost of running the prod environment.
I have a client whose prod environment is 2x load balancer; 2x app server; 3x DB cluster node - all rented virtual machines. We just upgraded to higher spec machines to give headroom over the next couple of years (ie most machines doubled the RAM from the previous generation).
My old workstation bought in 2018 had enough memory that it could virtualise the current prod environment with the same amounts of RAM as prod, and still have 20GB free. My current workstation would have 80+ GB free.
In 95% of cases if you can't run the stack for a single user testing it, on a single physical machine, you're doing something drastically wrong somewhere.
I once had to burn a ton of political capital (including some on credit), because someone who didn't understand software thought that cutting-edge tech startup software developers, even including systems programmers working close to metal, could work effectively using only virtual remote desktops... with a terrible VM configuration... from servers literally halfway around the world... through a very dodgy firewall and VPN... of 10Mb/s total bandwidth... for the entire office of dozens of developers.
(And no other Internet access from the VMs. Administrators would copy whatever files from the Internet that are needed for work. And there was a bureaucratic form for a human process, if you wanted to request any code/data to go in or out. And the laptops/workstations used only as thin-clients for the remote VMs would have to be Windows and run this ridiculous obscure 'endpoint security' software that had changed hands from its ancient developer, and hadn't even updated the marketing materials (e.g., a top bulletpoint was keeping your employees from wasting time on a Web site that famously was wiped out over a decade earlier), and presumably was littered with introduced vulnerabilities and instabilities.)
Note that this was not something like DoD, nor HIPAA, nor finance. Just cutting-edge tech on which (ironically) we wanted first-mover advantage.
This escalated to the other top-titled software engineer and I together doing a presentation to C-suite, on why not only would this kill working productivity (especially in a startup that needed to do creative work fast!), but the bad actors someone was paranoid about could easily circumvent it anyway to exfiltrate data (using methods obvious to the skilled software people like they hired, some undetectable by any security product or even human monitoring they imagined), and all the good rule-following people would quit in incredulous frustration.
Unfortunately, it might not have been even the CEO's call, but a crazy investor.
> I have seen several attempts to move dev environments to a remote host. They invariably suck.
You go from that to “therefore they will always suck and have no benefits and nobody should ever use them ever”. Apologies for the hyperbole but I’m making a point that comments like these tend to shut down interesting explorations of the state of the art of remote computing and what the pros/cons are.
Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
Ex - even on a VERY good connection, RTT on the network is going to exceed your frame latency for a computer sitting in front of you (before we even get into the latency of the actual frame rendering of that remote computer). There's just not a solution for "make the light go faster".
Then we get into the issues the author actually laid out quite compellingly - Shared resources are unpredictable. Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?
And that's before we even touch the "My machine is down/unreachable, I don't know why and I have no visibility into resolving the issue, when was my last commit again?" style problems...
> Edit: In a world where users demand that companies implement excellent security then we must allow those same companies to limit physical access to their machines as much as possible.
And this... is just bogus. We're not talking about machines running production data. We're talking about a developer environment. Sure - limit access to prod machines all you like, while you're at it, don't give me any production user data either - I sure as hell don't want it for local dev. What I do want is a fast system that I control so that I can actually tweak it as needed to develop and debug the system - it is almost impossible to give a developer "the least access needed" to do development locally because if you know what that access was you wouldn't be developing still.
They do suck due to lack of effort or investment. FANG companies have remote dev experiences that are decent - or even great - because they invest obscene amounts into dev tooling.
There are physical constraints on the flip side, especially for gigantic codebases or datasets that don't fit on dev laptops or need lower latencies to other services in the DC.
Added bonus: smaller attack surface area for adversaries who want to gain access to your code.
At least with Google, they also have a data center near where most developers work, so that they have much lower latency.
They can't make the light go faster, but they can make it so it doesn't go as far. Smaller companies usually don't have a lot of flexibility with that though.
Are you imagining the implementation as some kind of Remote Desktop setup, where no software runs on the local machine (except the Remote Desktop, of course)? This is not the state of the art for using remote developer machines: Typically some editor/IDE components run locally, for example.
> Shared resources are unpredictable.
Then don’t share them! We should do something akin to physically moving your local computer into the data center next to the servers.
> Is my code running slowly right now because I just introduced an issue, or is it because I'm sharing an env and my neighbor just ate 99% of the CPU/IO, or my network provider has picked a different route and my latency just went up 500ms?
If the software you’re developing is going to run in a shared environment then it’s better you experience these issues while developing otherwise you’re asking for a lot of “works on my machine” problems.
> And that's before we even touch the "My machine is down/unreachable
This seems less about remote and more about stability/software design?
> I don't know why and I have no visibility into resolving the issue…
This seems more about observability/software design?
> We're not talking about machines running production data.
The source code itself is something that a company should protect for multiple reasons, not the least of which is preventing an attacker from reading your source to find exploits. There are also various legal and compliance reasons for limiting the distribution of source as much as possible.
> What I do want is a fast system that I control so that I can actually tweak it as needed to develop and debug the system
I don’t understand why this is impossible on a remote machine. Can you elaborate?
> it is almost impossible to give a developer "the least access needed" to do development locally because if you know what that access was you wouldn't be developing still.
I’m sure we can imagine all sorts of setups, from free-for-all root access to so locked down it’s impossible to do work. The sweet spot is typically “you can sudo within reason but we’re logging your activity.”
It'll work if the company can offer something similar to EC2. Unfortunately most of the companies are not capable of doing so if they are not on cloud.
Overall I agree with you that this is how it should be, but as DevOps working with so many development teams, I can tell you that too many developers know a language or two but beyond that barely know how to use a computer. Most developers (yes, even most of the ones in Silicon Valley or the larger Bay Area) with Macbooks will smile and nod when you tell them that Docker Desktop runs a virtual machine running a copy of Linux to run OCI images, and then not too much later reveal themselves to have been clueless.
Commenters on this site are generally expected to be in a different category. Just wanted to share that, as a seasoned DevOps pro, I can tell you it's pretty rough out there.
Even when provided a means to instantiate virtual machines where they can have root access within the virtual machine, a lot of them will bitch.
Well, yeah. I spent a year or so doing all my work in a VM (for other reasons) and it sucked.
> proudly ignorant to how fast the sensitive medical data or financial data they're working on can fly out of the machine
Hey, this is an easy choice! If I can have local root XOR sensitive production data on my machine, I pick local root. Keep that PII the fuck away from my disk, please!! (Hell, do that whether I have root or not.)
Unfortunately, after a few hires (hand-picked by me), this is what happened:
1) People didn't want to learn Nix, neither did they want to ask me how to make something work with Nix, neither did they tell me they didn't want to learn Nix. In essence, I told them to set the project up with it, which they'd do (and which would be successful, at least initially), but forgot that I also had to sell them on it. In one case, a developer spent all weekend (of HIS time) uninstalling Nix and making things work using the "usual crap" (as I would call it), all because of an issue I could have fixed in probably 5 minutes if he had just reached out to me (which he did not, to my chagrin). The first time I heard them comment their true feelings on it was when I pushed back regarding this because I would have gladly helped... I've mentioned this on various Slacks to get feedback and people have basically said "you either insist on it and say it's the only supported developer-environment-defining framework, or you will lose control over it" /shrug
2) Developers really like to have control over their own machines (but I failed to assume they'd also want this control over the project dependencies, since, after all, I was the one who decided to control mine with the flake.nix in the first place!)
3) At a startup, execution is everything and time is possibly too short (especially if you have kids) to learn new things that aren't simple, even if better... that unfortunately may include Nix.
4) Nix would also be perfect for deployments... except that there is no (to my knowledge) general-purpose, broadly-accepted way to deploy via Nix, except to convert it to a Docker image and deploy that, which (almost) defeats most of the purpose of Nix.
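On point 4, the Docker-image route typically ends up looking something like this (assumes the flake exposes a dockerTools-built image output, here hypothetically named dockerImage; image and registry names are placeholders):

    nix build .#dockerImage        # produces a ./result tarball of the image
    docker load < result           # import it into the local docker daemon
    docker tag my-app:latest registry.example.com/my-app:latest
    docker push registry.example.com/my-app:latest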
I still believe in Nix but actually trying to use it to "perfectly control" a team's project dependencies (which I will insist it does do, pretty much, better than anything else) has been a mixed bag. And I will still insist that for every 5 minutes spent wrestling with Nix trying to get it to do what you need it to do, you are saving at least an order of magnitude more time spent debugging non-deterministic dependency issues that (as it turns out) were only "accidentally" working in the first place.
Worse is better, sadly.
which is why you gotta find your peeps that believe in the better world and are thus believing it into existence. (OT: reminds me of this song title: https://soundcloud.com/anjunabeats/mat-zo-see-it-when-i-beli...)
How so? With what other software does Nix interfere?
> Nix requires a broad set of changes to your system, from creating new users to installing and running a daemon to creating a root volume and beyond
What the post is trying to do there is motivate the creation of a new installer, including to the existing Nix community. The snippet you've highlighted is essentially correct, but I still wouldn't characterize Nix as particularly invasive.
The only thing that Nix strictly needs is to be plugged into your shell. That's it. It doesn't need deep or special hooks into a system just to function.
But including the daemon enables sandboxing for builds that Nix performs, which improves both the security and isolation of those builds, and it also lets Nix be shared nicely between unprivileged users on multiuser systems. For those reasons, daemonful installs are the default and with them come the system users.
(Adding system users is pretty much bog standard stuff for Unix system software, since the main kind of security boundary designed into that system is boundaries between users. Indeed, that's exactly what that's used for with Nix, too.)
The two things I described above comprise the totality of what is required to enable all of Nix's functionality. Everything else that the Determinate Nix installer does as of now is to work around or avoid macOS quirks, and is totally unnecessary for using Nix on any other OS.
The 'root volume' stuff is the result of a collision between the historical and conventional location of the Nix store at `/nix` and Apple's later imposition of a read-only root partition. So Nix installers do a little Apple-specific dance that creates a kind of filesystem volume that doesn't take up any real space or involve any physical partitioning of the disk when they run on macOS.
The other thing this installer does is build in an attempt to self-repair the damage that Apple inflicts upon Nix's sole real requirement by having macOS unconditionally clobber the shell config files under /etc during major macOS updates.
That's it. That's an exhaustive list of all the things a Nix installer does and why. It's not particularly tricky, or hard to remember or figure out. It's not even hard to undo manually— before the Determinate Nix installer existed, I sometimes uninstalled Nix by hand while manually testing the macOS bootstrap scripts for my dotfiles. It was annoying to do, and the uninstallation functionality of the Determinate Nix installer is extremely reliable and convenient and nice. But anyone who knows what `$PATH` is and has ever run `man` before could completely uninstall Nix even if some joker walked over to their machine and deleted the uninstaller.
At the same time, none of the changes Nix installers make on your system affect the behavior of outside programs at all, except by exposing what you choose to install via Nix through standard Unix environment variables like PATH.
Lacking things like kernel components, automatic self-updates, or the requirement for privileged APIs (e.g., on macOS, the endpoint security APIs and accessibility APIs), Nix is not only far less invasive than any endpoint security software, monitoring software, or MDM software you are likely to run on a work machine, but I'd argue tons of common desktop software like Zoom, Discord, DisplayLink and tons of popular macOS powertools like Amphetamine, SteerMouse, SoundSource, etc.
Plus the uninstall procedure with the DetSys installer and its forks is totally conventional and leaves nothing behind: run uninstaller, thing gone.
Nix on macOS is admittedly not an installer-free, drag-and-drop app bundle like some lovely applications get to be. But at most workplaces it's not likely to crack the top 10 most invasive applications installed on the average developer machine, either. Nix installers are just very up front about the things they do set up.
All that said, there are reasonable people who find having a daemon at all offensive. People who are deeply committed to minimalism or simplicity might prefer a single-user install or to use some other tool. But I think for most people, Nix is imo more than fine in terms of invasiveness.
I'm writing a test to check whether a tool I'm writing can work without Nix (it works with it perfectly, but I want it to also work without it, because there are a lot of folks like you, and like me about 3 years ago, who still think they'd rather struggle with manually installing the right glibc that goes with the right python dependency installed with the right pip and venv versions, to the right location, that goes with the right python version that makes Whisper models work (literally the thing I'm currently working on), instead of just running `nix develop`, getting a coffee, and being done).
And all I have to do to simulate "no Nix" is to remove all the nix paths from PATH (I suppose I could purge it from the linker paths as well, now that I think about it). But that's it.
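Concretely, the "simulate no Nix" part of that test is roughly:

    # crude but effective for a test: drop any PATH entry that mentions nix
    export PATH="$(printf '%s' "$PATH" | tr ':' '\n' | grep -vi nix | paste -sd: -)"
    command -v nix || echo "nix is no longer visible to this shell"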
What Nix does is put its entire repo into a separate part of your hard drive owned by root, and create a few build users for security reasons. That's (to me) not particularly "invasive," but YMMV (and if you use the Determinate Nix installer, it's even more trivial to uninstall than the official way). Also, when you run `nix develop`, the environment changes it does to make everything "just work" (like PATH changes etc) are only valid for that terminal session. Again, this is the least intrusive thing possible while also providing the guarantees it does, and is also (more or less) guaranteed to work.
The Nix whitepaper is pretty readable and not that long. I recommend it to understand why it's important and useful: https://edolstra.github.io/pubs/nspfssd-lisa2004-final.pdf
There is also Guix, which is like Nix but uses Guile (a Scheme dialect) as its scripting language all the way down to the bare metal (literally, the boot loader is written in it, I believe, as soon as the interpreter is loaded somehow). Their strategy seems to be to let Nix take the lead and make all the mistakes and then implement the way that seems to work the best, in its own ecosystem/tooling: https://guix.gnu.org/ But they have a lot fewer packages than Nix does.
Both of these let you define an entire machine with a single configuration file that is far more guaranteed to work than running a Dockerfile.
Fewer packages, yes, but the packages are by far the most common ones. It's easy to add packages for yourself, if needed. Nonguix channel and others for stuff upstream won't accept.
I believe Guix is innovating on a number of things in relation to Nix these days. So I've heard. I don't know much about Nix, honestly.
I heartily agree with your last paragraph in the case of Guix.
In my current work project, we use Windows and .NET with some libs and tools. Nothing too complicated, but automating it would be nice. I could probably push for it a bit, but I'm not familiar with automating Windows environments, since I mainly use Linux at home.
I really, really struggle to deal with the fact that people don't know as much as I do (I wrote my first program when I was 4 and I'm 39 now), but I have accepted that it's not a weakness on their part, it's a weakness on my part. I wouldn't lower my standards (as a manager once suggested), but I do feel like it's my obligation to lead them on a journey of learning. That is to say, people don't learn without teaching, so be a teacher.
Where "job" is defined in a narrowest way possible to assume minimum responsibility. Still want to get 200k+ salaries though...
This may sound extreme (it really isn't) but as Dr of Eng, TP's job was to suss those folks out as early as possible and part ways (the kind where they go work for someone else). Some folks are completely irrational about their setups and no amount of appeasement in the form of "whys" and training is usually sufficient.
I don't think you know what you're talking about. Just because you know people who do not want to waste their time on a set of unproductive chores you arbitrarily singled out, that does not mean they are against learning.
Your take is particularly absurd considering the topic: engineers working on distributed services.
Do you actually believe that you build up enough knowledge on this topic to become a professional in the field if you "straight-up don't want to learn"? There is not a single developer in the field who, at least to some degree, is not self-taught.
> They want to do their job, and get paid.
Everyone wants to get paid. Do you know anyone who works non-profit?
What you're failing to understand is the "do their job" part. Software developers are trained to solve the problems they face, and not waste time with the problems they do not have. Time is precious, and they invest it where it has the largest return on investment.
> Reading man pages is sadly not in that list.
Man pages are notoriously a colossal waste of time. In general they are poorly thought out, they are incomplete, they were written with complete disregard for user experience, and more often than not they are way out of date.
Why do you think sites like Stack Overflow are so popular? Because all those "incurious" people in tech feel the need to ask questions and dig through answers on how to solve problems?
I think you're just picking a very personal definition of competence which conveniently boils down to "do the things I do, and do not do the things I don't". Except the bulk of the people in the field is smart, and some have already solved problems that you aren't aware exist, such as wasting precious time deciphering unreadable documents that are systematically out of date.
Uh, what? What man pages are you reading? I read manpages all the time, and I've never run into an issue where one contained info that was untrue because outdated. The only manpages I've ever read that I'd characterize as incomplete are Apple's.¹
> Why do you think sites like Stack Overflow are so popular? Because all those "incurious" people in tech feel the need to ask questions and dig through answers on how to solve problems?
One of the reasons Stack Overflow is so popular is that people who can't/won't read docs can use it to have answers spoonfed to them, often by people who only differ from them in being more willing/able to read the docs. Isn't that extremely obvious?
> unreadable documents
Reading isn't a singular skill— each genre requires its own skills, and you gradually pick those up by reading in that genre. Reading novels doesn't much prepare you to read math textbooks, but that doesn't make all math textbooks 'unreadable'.
The same things goes for skimming. Skimming a text is likewise a (set of) genre-specific skill(s), built up through practice.
Frankly, moving from your terminal to your web browser to look up how to use a CLI tool is only consistently faster than working with the docs native to that CLI environment (man pages, info pages, usage messages, --help flags, help subcommands, tldr pages, etc.) if you don't have very good reading skills in the genres of those native docs.
As someone who does not have difficulty skimming or navigating manpages quickly, when someone tells me that digging through StackOverflow seems like less of a waste of time than reading docs and so they never read docs, I have to wonder if the real issue is that a reading skills deficit is caught in a self-reinforcing loop.
And indeed, a trip to StackOverflow never ends at StackOverflow for a person with much curiosity. Because even if a curious person finds a solution to their immediate problem, they will wonder things like:
- is this solution outmoded by some other fix?
- how is the feature/option/change used in this solution actually supposed to work?
- are there any alternatives I should know about?
- if I wanted to do things slightly differently, could I still use the method/feature/option referenced in this solution? does it have any parameters that are easy to swap or tweak?
- is this scenario what the feature/method/option in the solution is actually intended for? should I care?
... and the quickest way to answer questions like that is usually a glance at a manual.
-----
1: In some cases with GNU stuff the literal `man` pages are abridged versions of the `info` pages. But even then, the `man` pages direct you to the `info` pages. It's not like they leave you hanging.
Every single man page out there leads to a user experience that is at best subpar.
> One of the reasons Stack Overflow is so popular is that people who can't/won't read docs can use it to have answers spoonfed to them (...)
Pause and look at what you're saying. Your only criticism of SO is how it improves the task of providing meaningful information to users.
The way you opt to spin improvements to user experience as "spoonfed" speaks volumes of your inability to understand the problem and the value you place on gratuitous ladder-pulling. You even contradict your remarks on man pages.
> Reading isn't a singular skill— each genre requires its own skills (...)
No. Writing is a skill. Producing content that the target audience is able to consume and that brings value is a skill. The moment you, as an end-user, feel the need to hunker down and decipher arcane texts is the moment you should realize the documentation is bad.
Again, Stack Overflow is widely used as ad-hoc crowd-sourced documentation for a reason. Some project maintainers even go as far as to make it their own channel for providing technical support. Why so? Do you honestly believe it's because the whole world is not smart enough to read man pages?
Again, those who do not waste their time on man pages are the smart ones who put their own time to better use.
How is wanting others to learn ladder-pulling? Also, how do you assume people will have this kind of information handed to them when the people who are interested in deeply learning stop doing so, die off, etc.? If you say AI, first of all, best of luck with the hallucinations, but secondly, who is going to work on and train the AI?
> No. Writing is a skill. Producing content that the target audience is able to consume and brings value is a skill. The moment you, as an end-user, feel the need to hunker down and decipher arcane texts is the moment you should realize the documentation is bad.
I think I see the root disagreement here. You continue to mention "value," as though reading is itself not valuable. Sitting down to read a work of fiction arguably brings no value to anyone (except perhaps the author and publisher), yet millions do it anyway. Similarly, if I find a way to do something, I usually want to know if there are also other ways, and if so, if they're better. There's not much "value" there most of the time, but it brings me happiness, and enhances my knowledge of the subject.
My favorite variety of SO question is "how do I do X in $LANGUAGE," because inevitably, people pile in with various answers, and then someone starts benchmarking all of them and providing graphs. Occasionally someone even breaks down the assembly instructions for each solution and explains why one is superior to the other. All in all, a fanatical obsession over something small and relatively unimportant, because they like to learn, and they like to share what they've learned.
Given the modern hiring practice of "can you pass Leetcode," and "can you memorize and regurgitate how to architect a link shortener," yes, yes I do. There is a vast difference between learning to pass a test, and learning because you're sincerely interested in the topic.
> Everyone wants to get paid. Do you know anyone who works non-profit?
Of course we all want to get paid. The intent of the sentence, as I think you know, was that many lack intrinsic motivation, of learning for the sake of learning.
> What you're failing to understand is the "do their job" part. Software developers are trained to solve the problems they face, and not waste time with the problems they do not have.
I think what you're failing to understand is that there is a difference between a factory worker and a craftsman. There is absolutely nothing wrong with factory work, to be clear here – I in no way intend to disparage honest work – I just personally find it a difficult personality to work alongside.
> Time is precious, and they invest it where it has the largest return on investment.
To me, this reads as "be selfish." The fastest way to get an answer is to ask someone who knows. This is not, however, the best way to retain knowledge, nor is it considerate of others' time. That's not to say you shouldn't ask for help, but it's a much different ask when you come to someone saying, "this is what I'm trying to do, this is what I've done, and this has been my result – can you help?"
I can't tell you the number of times someone has DM'd me asking for help on something I've never touched, but by reading docs, have solved. I always try to reinforce that by linking to the docs in the answer, but it hasn't proven to be a successful method of deterring future LMGTFY.
> Man pages are notoriously a colossal waste of time.
Citation needed.
> In general they are poorly thought out
Do you have some specific examples?
> They are incomplete
See above; also, if you've found this to be true, have you considered giving back by updating them?
> They were written with complete disregard for user experience
They were and are written for people who wish to understand their tools, not for people who want a 5 minute Medium post that contains the code necessary to complete a task.
> And more often than not they are way out of date.
I can't think of a time where the man pages _included with a tool_ were out of date. If your system is itself out of date, I can see where this could be true. Again, do you have some specific examples?
> Why do you think sites like Stack overflow is so popular? Because all those "incurious" people in tech feels the need to ask questions and dig through answers on how to solve problems?
SO is a great site, with a dizzying variety of quality in its questions and answers. Take one of (the?) most upvoted answers ever, on branch prediction [0]. The question itself isn't easily answerable via reading docs, and as the answer shows, is surprisingly deep. Next, a highly-upvoted question about how to reset local git commits [1]. This is a question that _is_ easily answerable by reading docs [2]. Or a question on what `__main__` is [3] in Python. A fair question (it is somewhat odd from the outside, especially if you have no experience in Python, have no idea what dunder methods are, etc.), but again, one that's easily answerable by reading docs [4].
> I think you're just picking a very personal definition of competence which conveniently boils down to "do the things I do, and do not do the things I don't".
Of course I think that the way I do things is mostly correct; otherwise why would I be doing them?
> Except the bulk of the people in the field is smart, and some have already solved problems that you aren't aware exist, such as wasting precious time deciphering unreadable documents that are systematically out of date.
Strawman aside, I never said people in tech aren't smart, I said they're largely incurious. Words matter.
[0]: https://stackoverflow.com/a/11227902/4221094
[1]: https://stackoverflow.com/questions/927358/how-do-i-undo-the...
[2]: https://git-scm.com/docs/git-reset#Documentation/git-reset.t...
[3]: https://stackoverflow.com/questions/419163/what-does-if-name...
You are contradicting yourself. If there's anything that requires studying and preparation, that's leetcode.
Also, "memorize and regurgitate how to architect a link shortener" is also known as learning and knowing the basics of systems architecture and software architecture. That's an odd way of criticising others for being more competent than you.
"My way is correct" doesn't necessarily imply "other ways are incorrect". Sometimes there's not one single solution to a problem. I love using Linux for my personal machines, but I don't think that people who don't are doing things wrong; they just have different preferences on how to do things, and that's fine.
I ended up going with Bazel, not because of this particular problem alone (though it was part of it; people we hired spent WEEKS trying to get a happy edit/test/debug cycle going), but because proper dependency-based test caching was sorely needed. Using Bazel and Buildbuddy brought CI down from about 17 minutes per run to 3-4 minutes for a typical change, which meant that even if people didn't want to get a local setup going, they could at least be slightly productive. I also made sure that every dependency / tool useful for developing the product was versioned in the repository, so if something needs `psql` you can `bazel run //tools/postgres/psql` and have it just work. (Hate that Postgres can't be statically linked, though.)
It was a lot of work for me, and people do gripe about some things ("I liked `go test ./...`, I can't adjust to `bazel test ...`"), but all in all, it does work well. I would do it again. Day 1 at the company; git clone our thing, install bazelisk, and your environment setup is done. All the tests pass. You can run the app locally with a simple `bazel run`. I'm pretty happy with the outcome.
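For concreteness, a minimal sketch of that day-1 flow (the repo URL and the exact wrapper targets are placeholders, not the actual setup):

    git clone https://example.com/monorepo && cd monorepo
    # bazelisk reads the pinned Bazel version from the repo and fetches it on first use
    bazel test ...
    # tools are versioned in-repo and exposed as runnable targets
    bazel run //tools/postgres/psql -- -h localhost -p 5432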
Nix is something I looked into for our container images, but they just end up being too big. I never figured out why; I think a lot of things are dynamically linked and they include their own /usr/lib tree with the entire transitive dependency chain for that particular app, even if other things you have installed have some overlap with that dependency chain. I prefer the approach of statically linking everything and only including what you need. I compromised by basing things on Debian and rules_distroless, which at least lets you build a container image with the exact same sha256 on two different machines. (We previously just did "FROM scratch; COPY <statically linked binary> /app; ENTRYPOINT /app", but then started needing things like pg_dump in our image. If you can just have a single statically-linked binary be your entire app, great. Sometimes you can't, and then you need some sort of reasonable solution. Also everything ends up growing a dependency on ca-certificates...)
Wherever there's such overlap, those dependencies are already shared. Static linking in such a situation means more disk usage, not less.
Packages in Nixpkgs have large closure sizes for entirely other reasons, like not splitting packages as aggressively as they could be split, or enabling/including most optional dependencies by default. Distros like Alpine typically lean the other way for their defaults.
It's true that if you're willing to manually mangle them, static binaries are nice because you can very easily strip all docs and examples or even executables that you don't need, and still know your executable will have the libs it's linked against. In one place at work I actually do this with Nixpkgs— there's a pkgsStatic that includes only statically compiled packages. I pull just the tiny parts of some package I need out and copy them onto a blank OCI image because it was the path of least resistance.
But Nix also has some really nice tools for inspecting dependency graphs to figure out why large packages are getting pulled in. nix-tree is my favorite, but there's also the older nix-du that gives the same info via graphviz instead of the terminal, and the built-in `nix why-depends`.
-----
Edited to add: wait, are you saying you used some other base distro to create Docker images where some things were supplied by Nix and others came from the base distro? If so, yeah, Nix is going to bring all the dependencies along, all the way down to libc or whatever. That's required for the kind of hermeticity that is its goal.
Mixed images like that are always going to be larger. But you also don't need a base distro at all with Nix. You can use one of the existing Nix libraries for Docker/OCI stuff to generate a complete image from scratch, or just copy your Nix packages' dependency closure onto an empty image with a `FROM scratch` Dockerfile.
If you can't do that, you can do various things to try to slim things down but it's best to just Nixify whatever other packages you're using so you don't need a base distro. (And if you're trying to save space, Nix itself doesn't need to be in your Docker images either, which can also cut out some deps.)
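For illustration, a rough sketch of that from-scratch route using nixpkgs' dockerTools (with `hello` standing in for your own package; the image name is made up):

    { pkgs ? import <nixpkgs> { } }:
    # builds an OCI image containing only the closure of the listed packages
    pkgs.dockerTools.buildLayeredImage {
      name = "my-app";
      tag = "latest";
      contents = [ pkgs.hello pkgs.cacert ];   # cacert covers the usual ca-certificates need
      config.Cmd = [ "${pkgs.hello}/bin/hello" ];
    }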
It's not the learning new things that's a problem, but rather the fact that every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix. And that's because the feedback loop and general UX is just awful - I really started to feel like I needed a sacrificial chicken.
Docker may be a dumpster fire, but at least it's generally easy to see what you did wrong and fix it.
You're not actually "fixing" anything, you're just passing the ball of shit down the responsibility chain to the ops/infra team.
Which is fine if you work in a large corporation where this is a valid strategy.
Unfortunately though the software supply chain problem is a) very difficult and b) unavoidable.
Nix is the best (or maybe only) attempt to solve this problem with programmatic (vs organizational) tooling.
(See npm or the clusterfuck of Python packaging for proof.)
If Docker builds were as deterministic as Nix, then all that would need to be distributed would be Dockerfiles and perhaps a cache of base images somewhere.
Looking at a build as a pure function, where each dependency (including any compiler(s), plus the environment) is an "input argument" to it, was a revelation (since I already realized the advantages of pure functions while working in functional languages).
Running a Dockerfile and hoping to get a working image out of it is like running a function which checks the time when it runs and errors when the seconds end in 0 due to a bug.
> every little issue turns into a 2-day marathon that's eventually solved with a 1-line fix
There is spotty education in the space. Did you ever take this (very cool) Nix tutorial? Not actually understanding Nix is going to make any troubleshooting of Nix much harder. https://nixcloud.io/tour/
> I really started to feel like I needed a sacrificial chicken.
Have you looked at Guix? A lot of people think it's "Nix without the warts." Plus it uses a Lisp, which some people prefer, or can at least grok better than the Nix language. https://guix.gnu.org/
I'll take a look at guix, though...
I think Nix fits a pattern that's happened in plenty of other domains where the technology that focused on doing things "right" failed to win out against a competitor doing things "wrong" but optimizing for a lower barrier of entry. The logic that a perfect solution is worth an up-front cost is compelling, since having an imperfect solution has a long-term cost that never goes away, but this misses the fact that pushing the cost until later has value of its own; making things easier today at the cost of tomorrow buys time to improve things before the cost is incurred.
At the risk of a convoluted metaphor, imagine that someone moves into a new house and calls two plumbers and asks them to hook up the water in their bathroom. The first plumber says that they can get it done so the bathroom can be used today, but they'll need to come by again in a week or two since they might need to make additional adjustments. The second plumber says they've come up with a way to make sure that they never need to come back to make adjustments, but it will take them a full week to finish setting it up before anyone can use it.
For most people, it doesn't matter if the second plumber's solution will be better next week if they need to use the bathroom today, as long as the first plumber's solution can last long enough before it needs to be fixed.
Having a kid has drastically altered my ability to learn new things outside of work, simply due to lack of time. I never could have imagined how big of an impact having a kid would have; it's crazy!
The worst thing is when you actually manage to carve out some time to do some learning or experimentation with a new tool, library, etc only to find out that it sucks or you just don't have the time to pick it up or whatever.
yeah, I could have written this verbatim. Either I was not warned enough, or I did not pay enough attention/heed whatever I was warned of. I don't have a large family, so I've basically had ZERO kid experience since I was a kid... yikes... almost 50 years ago LOL. What worries me though is that it's kind of been an assumption at this job that you DO spend some off-duty time learning/tinkering. And I enjoyed it!
> The worst thing is when you actually manage to carve out some time to do some learning or experimentation with a new tool, library, etc only to find out that it sucks
I got briefly excited about the V language to maybe use for little utility scripts and maybe even as a first teaching language for my kid, then realized that when you scratch the surface of it it's basically kind of ugly underneath. (Example- The "this should never happen" error was literally most of the errors, lol.) It looks like something with a lot of great ideas but slipshod not-deeply-thought-out implementation. And the final nail in the coffin was all the evidence that the language creator simply bans anyone with valid criticism- I'm a free-speech near-absolutist so that one was the killer for me.
One of a few examples of what you're referring to. The thing is, before kids, we could afford to waste that time. Now we cannot. :/
`nix copy .#my-crap --to ssh://remote`
What you do with it then on the remote depends on your environment. At the minimum do a `nix-store --add-root` to make a symlink to whatever you just copied.
(The most painless path is if you're deploying an entire NixOS system, but that requires converting the remote host to NixOS first.)
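Spelled out a little more (the store path and the root location below are placeholders for whatever you actually copied):

    nix copy .#my-crap --to ssh://remote
    # on the remote, pin the copied path so garbage collection won't remove it
    nix-store --realise /nix/store/<hash>-my-crap --add-root /var/lib/my-crap/current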
It took me a couple of days to get a supervisor-based setup working locally. I was the only person on the team who would run the backend and frontend when trying things out, because nobody was actually using the dev environments fully anyways. There was no buy-in for the dev environment!
I really feel like if you are in a position to determine tooling, it's so much more helpful to lean into whatever people on the ground want to use. Obviously there are times when the people on the ground don't care, but if you're spending your sweat and tears to put the square peg into the square hole suddenly you're the person with superpowers, and not the person pushing their pet project.
And sometimes that's just "wrap my thing with your thing".
This might mean picking something that you think/know kind of sucks for the task, but that will be easier for most people to grok - while it might subjectively feel unfortunate, it's probably the right thing to do, for the sake of the majority of the team having an easier time.
Pushing your interests more strongly, or even in a top down fashion, might work, but that's more risky both in regards to letting everyone get things done, as well as team cohesion and turnover.
That's true for any architectural decision in an organization with more than 1 person.
It's really not something that should make you reconsider a decision. At the end of the day, an architecture that "people" actually want to use doesn't exist; "people" don't want any single thing.
I wonder if Microsoft's approach for Dev Box is the right one.
If it doesn't fit on one machine, though, you don't have another option: Meta, for example, will never have a local dev env for Instagram or Blue. Then you need to make some hard choices.
Personally, my ideal cloud dev env is:
1. Local checkout of the code you're working on. You can use whatever IDE or text editor you prefer. For large monorepos, you'll need some special tooling to make sure it's easy to only check out slices of the repo.
2. Sync the code to the remote execution environment automatically, with hot-reloading.
3. Auto-port-forward from your local machine to the remote (a rough sketch of 2 and 3 follows this list).
4. Optionally be able to run dependent services on your personal remote to debug/test their interactions with each other, and optionally be able to connect to a well-maintained shared environment for dependencies you aren't working on. If you have a shared environment, it can't be viewed as less-important than production: if it's broken, it's a SEV and the team that broke it needs to drop everything and fix it immediately. (Otherwise the shared env will be broken all the time, and your shipping speed will either drop, or you'll constantly be shipping bugs to prod due to lack of dev care.)
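A crude stand-in for points 2 and 3, just to make them concrete (the host name and port are made up; real setups usually use something like mutagen or an in-house sync daemon rather than raw rsync/ssh):

    # push local edits to the remote workspace, then forward the app port back
    rsync -az --delete ./src/ devbox:/workspace/src/
    ssh -N -L 8080:localhost:8080 devbox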
At Meta we didn't have (1): everyone had to use VSCode, with special in-house plugins that synced to the remote environment. It was okay but honestly a little soul-sucking; I think customizing your tooling is part of a lot of people's craft and helps maintain their flow state. Thankfully we had the rest, so it was tolerable if not enjoyable. At Airbnb we didn't have the political will to enforce (4), so the dev env was always broken. I think (4) is actually the most critical part: it doesn't matter how good the rest of it is, if the org doesn't care about it working.
But yeah — if you don't need it, that's a lot of work and politics. Use local environments as long as you possibly can.
Trying to boot the full service on a single machine required every single developer in the company installing ~50ish microservices on their machine, for things to work correctly. Became totally intractable.
I guess one can grumble about bad architecture all day, but this had to be solved. We had to move to remote development environments, which restored everyone’s sanity.
Both FAANG companies I’ve worked at had remote dev environments that were built in house.
Why not break a microservice into a series of microservices? It's microservices all the way down.
Multiple services connecting to the same database has been considered a bad idea for a long time. I don't necessarily agree, but I have no experience in that department. It does mean more of your business logic lives in the database (rules, triggers, etc).
Not true at all.
You're conflating the need for distributed transactions with the definition of microservices. That's not it.
> Multiple services connecting to the same database has been considered a bad idea for a long time.
Not the same thing at all. Microservices do have the database per service pattern, and even the database instance per service instance pattern, but shared database pattern is also something that exists in the real world. That's not what makes a microservice a microservice.
You should read up on microservices, because that's definitely not what they are, nor anything resembling one of their traits.
No, this does indeed match reality. At least for those who work with microservices. This is microservices 101. It's baffling how this is even being argued.
We have industry behemoths building their whole development experience around this fact. Look at Microsoft. They even went to the extents of supporting Connected Services in Visual Studio 2022. Why on earth do you believe one of the most basic traits of backend development is unreal?
> I work at ~50 ish employee company and we have layers of dependencies between at least 6 or 7 various microservices.
Irrelevant. Each service has dependencies and consumers. When you need to run an instance of one of those services locally, you point it to its dependencies and you unplug it from its consumers. Done. This is not rocket science.
I'm talking about how Microsoft added support for connected services to Visual Studio. It's literally a tool that anyone in the world can use. They added the feature to address existing customer needs.
How does that solve the problem of a mess of interconnected services where you may have to change 3 or more of them simultaneously in order to implement a change?
Yes. That's the point.
> How does that solve the problem of a mess of interconnected services (...)
I don't think you got the point.
The whole point is that you only need to connect your local deployment to services that are up and running. There is absolutely no need to launch a set of ad-hoc self-contained services to run a service locally and work on it. That is the whole point.
No. My whole argument is open your eyes, and look at what you're doing. Make it make sense.
Does it make sense to launch 50 instances locally to be able to work on a service? No. That's a stupid way of going about a problem.
What would make sense? Launch the services you need to change, of course. Whatever you need to debug, that's what you need to run locally. Everything else you consume it from a cloud environment that's up and running.
That's it. Simple.
If there's something preventing you from doing just that, then that's an artificial constraint that you created for yourself, and thus one that you need to fix. We're talking about things like auth. Once you fix that, go back to square one.
True. Your run-of-the-mill shop should have a simpler and more straight-forward system.
But you seem to want the reverse.
You have to test the changes you want to push. That's the whole basis of CI/CD. The question is at which stage you're OK with seeing your pipeline break.
If you accept that you can block your whole pipeline by merging a bad PR then that's ok.
In the meantime, it is customary to configure pipelines to run unit tests, integration tests, and sometimes even contract tests when creating a feature branch. Some platforms even provide high-level support for spinning up sandbox environments as part of their pipeline infrastructure.
No, not really. You only find yourself in that spot if you completely failed to do any semblance of integration test, or any acceptance test whatsoever.
That's not a microservices problem. That's a you problem.
You talk about a feedback loop. Other than automated tests, what do you believe that is?
I just also like to have an option to run service locally and connect to either cloud instances (test) or local instances depending on what I am troubleshooting/testing. Much better than debugging on prod which may still be required at some point but hopefully not often.
Ouch. Were they using macOS at the time, with laptops having not-enough-ram?
I've seen that go poorly on macOS with java based microservices. Largely due to java VMs wanting ram pre-assigned for each, which really chews through ram that mostly sits around unused.
This was a few years ago though, at the tail end of Intel-based Macs, where 32GB of RAM in a Mac laptop wasn't really an option.
"Oh it's not running locally, you need to also run service_18_v2.js, and include the right env variables"
I think nobody really talks about this, but unless you have a docker-compose.yml that includes everything you need for local development, it's increasingly more likely that you'll end up coupling things to Kubernetes to such a degree that running without it (and its abstractions) will become more effort than a person can muster.
So while people try to create services that are decoupled from one another, they end up instead coupling them to Kubernetes concepts and a service mesh, service discovery, configuration and secret management mechanisms, persistent storage abstractions and so on.
Which you can obviously do if you want to, but which might make running things locally in a minimalistic fashion that much more complex.
It's the same as with for example using a web server as a reverse proxy for my applications and ending up putting some logic in there (e.g. route rewrites, headers etc.) and then realizing that I must also run a similar web server locally for 1:1 compatibility because something like Vue dev server proxy to the locally running API won't be able to give me all that.
Is this a problem?
I mean, in this scenario docker compose serves two main purposes: launch a few mock services, and configure those services according to your needs. This means configuring them to consume services already deployed to a cloud environment of your choice. This is something you control.
Then all that's left is the service (or set of services) you need to modify. You run those locally and configure them to consume a mix of the mocked services and the services deployed to a cloud environment.
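As an illustration only (service names, the mock image, and URLs are made up), a compose file for that mix might look something like:

    services:
      orders:                     # the one service actually being changed, built locally
        build: ./services/orders
        ports:
          - "8080:8080"
        environment:
          PAYMENTS_URL: http://payments-mock:8080        # local mock
          USERS_URL: https://users.dev.example.com       # shared non-prod environment
      payments-mock:
        image: wiremock/wiremock  # any stub/mock server works here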
> (...) it's increasingly more likely that you'll end up coupling things to Kubernetes to such a degree that running without it (and its abstractions) will become more effort than a person can muster.
That coupling can only happen if you intentionally add it yourself.
If you need to run your services in isolation, you will be more mindful of the need to not introduce that sort of coupling.
Running the dev environments remotely (or rewriting in Go) were the options being considered before the whole project was canned and people redistributed to other things.
> Largely due to java VMs wanting ram pre-assigned for each
Do you mean the JVM min heap size was set rather large? Otherwise, there is no need for a very large min heap size on a modern JVM (11+).
Not really. It all depends on what your needs are.
> It defeats a major feature of the jvm.
You're confusing things. Just because Java addressed the deployability problem for Java applications before containerization was even a word, this does not mean that deploying a JVM per service is a bad idea. Just think about it for a second. Why do you need to deploy and scale services independently? Do you mention the JVM in your response? No.
I did mention the jvm in my response. You quoted me doing so, in fact.
Your dev environment is expected to mimic your production environment, not the other way around.
This is certainly one of the critical mistakes you made.
No developer needs to launch half of the company's services to work on a local deployment. That's crazy, and awfully short-sighted.
The only services a developer ever needs to launch locally are the ones that are being changed. Anything else they can consume straight out of a non-prod development environment. That's what non-prod environments are for. You launch your local service locally, you consume whatever you need to consume straight from a cloud environment, you test the contract with a local test set, and you deploy the service. That's it.
> I guess one can grumble about bad architecture all day but this had to be solved.
Yes, it needs to be solved. You need to launch your service locally while consuming dependencies deployed to any cloud environment. That's not a company problem. That's a problem plaguing that particular service, and one which is trivial to solve.
> Both FAANG companies I’ve worked at had remote dev environments that were built in house.
All FANG companies I personally know did indeed have remote dev environments. They also had their own custom tool sets to deploy services locally, either in isolation or consuming dependencies deployed to the cloud.
This is not a FANG cargo cult problem. This is a problem you created for yourself out of short-sightedness and for thinking you're too smart for your own good. Newbies know very well they need to launch one service instance alone because that's what they are changing. Veterans know that too well. Why on earth would anyone believe it's reasonable to launch 50 services to do anything at all? Just launch the one service you're working on. That's it. If you believe something prevents you from doing that, that's the problem you need to fix. Simple. Crazy.
Yes what you're saying is correct, but why many words when few do trick:
This hypothetical IT department isn't able to host its own development environment, yet suddenly they would have the skills if they switched to Gitpod.
If your services are mostly stateless and/or your development team is very small, that can work. If not, you will quickly run into problems sharing the data. Making schema changes to the shared cloud services. Cleaning up dev/test/etc data that has accumulated, etc. Then you are back to thinking of provisioning an isolated cloud environment per dev.
Why do you think that other people using a cloud environment prevents you from using the environment?
At a previous place of work I worked with a monolith structure, and it was actually perfectly fine. Development got done separately on several large substructures in the monolith, and devs could install the whole project locally and run it just fine.
I'm really wondering why we're all using microservice architecture if we're all convinced that to actually develop on them, devs need to reproduce 50-odd of those services locally for debugging. Then what was the point?
Resume chasing and trying to paper over the fact that you don't understand architecture, plus bad tooling that e.g. doesn't properly support incremental compilation and so makes monoliths painful.
No one does microservices for resume chasing anymore, because everyone is already doing it for practical reasons. I never came across a monolith that wasn't in the process of peeling responsibilities to either microservices or function-as-a-service. For project managers to open their eyes, all that's needed is something like a deployment going wrong due to a single bad commit, or things scaling weird because a background process caused a brownout, or even external teams screwing up a deployment after pushing bad code.
No, not really. The size of your teams has nothing to do with whether your services can corrupt data. That's on you. Don't pin the blame on a service architecture for the design errors you introduced. Everyone else does not have that problem. Why are you having it and blaming the same system architecture used by everyone else?
> Making schema changes to the shared cloud services.
What are you talking about? There are absolutely zero system architectures where you can change schemas willy-nilly without consequences. Why are you trying to pin the blame on microservices for not knowing the basics of how to work with databases?
In all the times I had to work on schema changes, the development was done with a db deployed in a sandbox environment, and when work was done we had to go with a full database migration with blue-green deployments along with gradual rollout. Why on earth are you expecting you can just drop by and change a schema?
> Cleaning up dev/test/etc data that has accumulated, etc.
Isn't this a non-issue to anyone who works with databases? I mean, in extreme scenarios you can spin up a fake database with your local service, but don't even try to argue this is the justification you need to launch dozens of services.
The truth of the matter is that this is extremely simple: if you want to work on a service, launch that service locally configuring it to consume all dependencies running in a non-prod environment. That's what they are for. If you have extremely specialized needs, stub specific dependencies locally. That's it.
Make is just a framework for you to do your builds. Sure, you can cram anything into it, but that is exactly the kind of area where other tools like Ansible or even Terraform shine.
Make isn't a silver bullet.
EDIT: Just to make sure, I'm using "fetish" to mean something you spend an unreasonable amount of time on.
You would list the services you need (or service groups) in a config file, start a command, and all services would start in containers. Sure, you need a lot of RAM with that, but on 32GB it was working fine.
So 64GB to include the RAM for the IDE and web browsing / JavaScript apps like Slack?
Imagine the goal is to fix the problems (e.g. make it possible to run less of the services or something like that): How do you do that without first running all the services, making the proper changes, and then testing those changes? You need to be able to run all the services in that interim period.
So, wouldn't it be nice if there were a solution for this in-general? And, maybe, it would lead to better conditions later on. But in the meantime there is really no way around the existing design/decisions/etc. You simply have to deal with that reality and engineer around it.
Yeah, I get that, I was deliberate about the phrasing of "your company" rather than just "your".
Obviously we don't know anything about the parent commenter's company and situation; perhaps 12 people per microservice genuinely is the right solution for them, but it seems like it would be better not to get into this situation in the first place, though once there you obviously have to tackle the problem as it presents itself.
Why do you need to run all services in isolation to be able to troubleshoot and isolate a problem?
This is certainly not universal among FAANGs though.
Requiring 50 services to be up is absolutely nuts, but it’s actually pretty trivial using something like Nomad locally.
I've worked in a remote, secured development environment and it sucked, but to their credit the company did it for exactly this reason - control over the source. But bear in mind that source control is a two-way street.
Losing proprietary source can be harmful (especially in compiled languages where the source might carry much more information than the distributable). But they were mostly worried about the opposite way...that something malicious gets INTO the source which could pose an existential threat. You'd be correct to say "well that should be the domain of source control, peer review etc", but in this case the company assessed the risk high enough to do both.
IMO there are some workloads, where it is beneficial for a developer to have access to a local repository with at least some snippets based on previous projects.
Having a leftover PoC of some concept written for a previous employer but never elevated to team use/production is both handy (at least to confirm that the build environment is still viable after an unspecified period of toolchain updates) and ethical (copying production code is not ethical - even if the old and new products are vastly different e.g. last job was taxi app, new app is banking app).
Making it all 'remote' and 'cloud' will eventually result in a bike reinvention penalty on each new employment - not everything can be rebuilt from memory only, especially things that are done 1-2 times a year; sure there is open-source documentation/examples, but at some point it'll just introduce even heavier penalty for a need to either know a lot of opensource stuff to have some reference points, or to work on a pet projects to get the same amount of references.
And the new company would also be liable for using trade secrets that they shouldn’t.
However I do write 1-2 hour PoCs in my spare time and on my own equipment, using only publicly available stuff - they sometimes come in handy at some point later. If we assume 'remote first' development is okay - with no possibility to test stuff locally, well, we're back to either bookmark managers or pet projects to keep at least a bit of knowledge between jobs.
From a resource provider perspective, the only way to squeeze a margin out of that space would be to reverse engineer 100% of human developer behavior so that you can ~perfectly predict "slack" in the system that could be reallocated to other users. Otherwise it's just a worse DX, like TFA gives examples of. Not a business I'm envious to be in... Just give everyone a dedicated VM or desktop, and make sure there's a batch system for big workloads.
I think this approach works best in small teams where everyone agrees to drink the Nix juice. Otherwise, it's caused nothing but strife in my company.
Also, there is a long tail of issues to be fixed if you do it with Kubernetes.
Kubernetes does not just give you scaling, it gives you many things: run on any architecture, be close to your deployment etc.
Anyway, as always it depends on what you want to use it for.
And that they're desperate to tell customers that they've fixed their problems.
Kubernetes is absolutely the wrong tool for this use case, and I argue that this should be obvious to someone in a CTO-level position, or their immediate advisors.
Kubernetes excels as a microservices platform, running reasonably trustworthy workloads. The key features of Kubernetes are rollout (highly available upgrades), elasticity (horizontal scaleout), bin packing (resource limits), CSI (dynamically mounted block storage), and so on. All this relates to a highly dynamic environment.
This is not at all what Gitpod needs. They need high performance disks, ballooning memory, live migrations, and isolated workloads.
Kubernetes does not provide you sufficient security boundaries for untrusted workloads. You need virtualization for that, and ideally physically separate machines.
Another major mistake they made was trying to build this on public cloud infrastructure. Of course the performance will be ridiculous.
However, one major reason for using Kubernetes is sharing the GPU. That is, to my knowledge, not possible with virtualization. But again, do you want to risk sharing your data, on a shared GPU?
Are you aware of the limits? It must run as root and privileged?
Example: What performance do you get out of your NVMe disks? Because these days you can build storage that delivers 100-200 GB/s.
https://www.graidtech.com/wp-content/uploads/2023/04/Results...
I bet few public cloud customers are seeing that kind of performance.
To clarify one of your points, Kubernetes itself has nothing to do with actually setting the security boundaries. It only provides a schema to describe resources and policies, and then an underlying system (perhaps Cilium for networking, or Kata Containers for micro VMs) can ensure that the resources created actually follow those schemas and policies.
For example, Neon have built https://github.com/neondatabase/autoscaling which manages Neon Instances with Kubernetes by running them with QEMU instead. This allows them to do live migrations and resource (de)allocation while the service is running, without having to replace Kubernetes. These workloads are, as far as I understand it, stateless.
What Neon is doing is quite a feat: Live migration (of a VM) while preserving TCP connections. It also took a lot of customization to achieve that.
But I agree that Kubernetes can indeed be used this way.
If anything, it further cements my original point about the Gitpod leadership.
The problem was never Kubernetes, but the dimwitted notion of using containers.
And then blaming Kubernetes for it: We're leaving you.
We've always had issues with stateful kubernetes setups. Can you share what makes it easier today than before? Genuinely interested.
I guess the team just wants to rewrite everything; it happens. Managers should prevent that.
Does anyone have any links for cluster-autoscaler plugins? Searching is drawing a blank, even in the cluster-autoscaler repo itself. Did this concept get ditched/removed?
For anything stateful, monolithic, or that doesn't require autoscaling, I find LXC more appropriate:
- it can be clusterized (LXD/Incus), like K8S but unlike Compose
- it exposes some tooling to the data plane, especially a load balancer, like K8S
- it offers system instances with a complete distribution and an init system, like a VM but unlike a Docker container
- it can orchestrate both VMs (including Windows VMs) and LXC containers at the same time in the same cluster
- LXC containers have the same performance as Docker containers unlike a VM
- it uses a declarative syntax
- it can be used as a foundation layer for anything stateful or stateless, including the Kubernetes cluster
LXD/Incus sits somewhere between Docker Swarm and a vCenter cluster, which makes it one of the most versatile platforms. Nomad is also a nice contender; it cannot orchestrate LXC containers but can autoscale a variety of workloads, including Java apps and qemu VMs.
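For a quick flavor of the CLI side (image aliases, instance names, and limits here are just examples):

    incus launch images:debian/12 app1          # system container with a full init system
    incus launch images:debian/12 build1 --vm   # or a full VM, managed through the same tooling
    incus config set app1 limits.cpu=2
    incus config set app1 limits.memory=4GiB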
> A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.
Are you sure SSD is that slow? NVMe devices are so fast that I hardly believe there's any need for RAID 0.
You're running hot pods for crypto miners and against people who really want to see the rest of the code that box has ever seen. You should be isolating with something purpose built like firecracker, and do your own dispatch & shred for security.
So if you started with kubernetes and fought the whole process of why it's not a great solution to the problem, I have to assume you didn't understand the problem. I :heart: kubernetes, its complexity pays my bills - but it's barely a good CI solution when you trust everyone involved, it's definitely not a good one where you're trying to be general-purpose to everyone with a makefile.
Kubernetes has never ever struck me as a good idea for a development environment. I'm surprised it took the author this long to figure out.
K8s can be a lifesaver for production, staging, testing, ... depending on your requirements and infrastructure.
Sounds sane. Am I missing anything?
A heterogeneous architecture with multi-tenancy poses some unique challenges because, as mentioned in the article, you get highly inconsistent usage patterns across different services. Also, arbitrary code execution (with sandboxing) can present a significant challenge. For security, you ideally need full isolation between services which belong to different users; this isolation wasn't a primary design goal of Kubernetes.
That said, you can probably still use K8s, but in a different way. For smaller customers, you could co-locate on the same cluster, but for larger customers which have high scalability requirements, you could have a separate K8s cluster for each one. Surely for such customers, it's worth the extra effort.
So in conclusion, I don't think the problems which were identified necessarily warrant abandoning K8s entirely, but maybe just a rethinking of how K8s is used. K8s still provides a lot of value in treating a whole cluster of computers as a single machine, especially if all your architecture is already set up for it. In addition to scheduling/orchestration, K8s offers a lot of very nice-to-have features like performance monitoring, dashboards, aggregated logs, ingress, health checks, ...
1. Some operations that treat the remote environment as if it were local are time-consuming and hard to manage.
2. With a vendor-specific approach, our skills would become obsolete, creating a dependency on the vendors.
3. Kubernetes is not the best tool, but it is popular.
As always, a custom solution is the most powerful, but it should eventually be replaced with a more unified approach for the sake of development stability.
To the people saying ultra-modern hardware could handle it: worth remembering the companies in question started on this path X years ago with Y set of technologies and Z set of experiences.
Just because it made sense for Google in 2012 or whatever doesn't necessarily mean they would choose it again --or not-- given a do-over (but there's basically no way back).
> Kubernetes seems like the obvious choice for building out remote, standardized and automated development environments
- Is it really Obvious Choice™ though, Fred?
- Hmm, let's consult the graphs.
> Kubernetes is a container orchestration system for automating software deployment.
- It's about automating deployment, Carl, not development environments!
> Kubernetes is not the right choice for building development environments, as we’ve found.
All the problems in the article also seem self-imposed. k8s can run stateful workloads just fine. Don't start and stop them. Figure out the math on how much it costs to run a container 24/7, add your margin, and pass that cost to the customer. Customer can decide to stop the containers to save $$, so the latency won't hurt, they'll accept it because they know they're saving money.
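To make that math concrete with entirely made-up numbers: a workspace reserving 2 vCPU / 8 GB at roughly $0.10/hour comes to about 730 × $0.10 ≈ $73/month if it never stops; with a 30% margin that's ~$95/month billed. A customer who shuts it down outside a 40-hour work week pays for roughly 173 hours instead, about a quarter of that.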
Glad someone said it out loud. So true. Apptainer has been a far better development experience for us.