https://www.bnnbloomberg.ca/investing/2024/09/16/ai-boom-is-...
This is definitely extending the runway of O&G at a crisis point in the climate disaster when we’re supposed to be reducing and shutting down these power plants.
Update: clarified that the 200 number is for the US. There are far more worldwide.
AI is a red herring. If it wasn’t that it would be EV power demand. If it wasn’t that it would be reshoring of manufacturing. If it wasn’t that it would be population growth from immigration. If it wasn’t that it would be replacing old coal power plants reaching EOL.
Replacing coal with gas is an improvement by the way. It’s around half the CO2 per kWh, sometimes less if you factor in that gas turbines are often more efficient than aging old coal plants.
And the process of delivering that methane leaks like a sieve into the atmosphere at every stage.
Sure it’s probably “better than coal,” but not by much. It’s a bit like comparing what’s worse: getting burned by fire or being drowned in acid.
So in a way, it is providing value to someone, whether we like it or not.
Or Drug Cartels. https://www.context.news/digital-rights/how-crypto-helps-lat...
But this is the promise of uncontrollable decentralization providing value, for good or bad?
meanwhile "AI" is used to produce infinity+1 pictures of shrimp jesus and more spam than we've ever known before
and if we're really lucky, it will put us all out of work
LLMs (and the image, sound, and movie generating models) are power hogs more by circumstance than by design: people are at least trying to make them better at fixed compute, and to use less compute at fixed quality.
Because whether we're using tons of compute to provide value or not doesn't change the fact that we are using tons of compute, and tons of compute requires tons of energy: both for the chips themselves and for the extensive infrastructure that has to be built around them to let them work. And not just electricity: refrigerants, many of which are environmentally questionable themselves, are a big part; hell, just water. Clean, usable water.
If we truly need these data centers, then fine. Then they should be powered by renewable energy, or if they absolutely cannot be, then the costs their nonrenewable energy sources inflict on the biosphere should be priced into their construction and use, and in turn, priced into the tech that is apparently so critical for them to have.
This is, like, a basic calculus that every grown person makes dozens of times a day: do I need this? And they don't get to distribute the cost of that need, however pressing it may be, onto their wider community because they can't afford it otherwise. I don't see why Microsoft should be able to either. If this is truly the tech of the future as it is constantly propped up to be, cool. Then charge a price for it that reflects what it costs to use.
Combined with the increased cost effectiveness of renewables & batteries, & the new build-out of nuclear, it could plausibly speed up the clean energy transition, rather than just disincentivising building out more polluting power plants.
There are two main options for what to do with revenue from a carbon tax. The one that makes the most macroeconomic sense is to use those proceeds to fund subsidies for clean energy roll outs & grid adaptation. You are directly taxing the polluting power grid to fund the construction of a non-polluting power grid. As CO2 emitting industry (and thus carbon tax revenue) declines, we have less required spend on clean energy roll out, so the tax would balance nicely. The downside would be that a carbon tax would increase cost of living and this does nothing about that.
The other option is a disbursement. Give everyone in society a payment directly from the proceeds of the carbon tax. This would offset the regressive aspects of a carbon tax (because that tax would increase consumer costs), and would also act as a sort of auto-stimulus to stop the economy from turning down due to consumption costs increasing. The downside of this is that the clean energy transition happens slower than the above, and that there may be political instability & perverse incentives as people maybe come to rely on this payment that has to go away over the next few decades.
They're both good options. I don't know which is better and I think that's likely something individual countries will probably choose based on their situation. But we do need some sort of way to make those emitting CO2 pay for its negative externalities.
I think the rapidly decreasing costs of renewables and storage are likely to make the transition happen before the political will to get a carbon tax arrives, but if you reckon you can push the right buttons, I encourage you to try it :)
Methane plants are favored in many cases because they can be quickly ramped up and down to handle momentary peaks in demand or spotty supply from renewables.
Without knowing more details about those projects it is difficult to make the claim that these plants have anything to do with increased demand due to LLMs, though if anything, they’d just add to base load demands and lead to slower decommissioning of old coal plants like we’ve seen with bitcoin mines.
I'm curious what people's thoughts are on what the future of LLMs would be like if we severely overshoot our carbon goals. How bad would things have to get for people to stop caring about this technology?
The growth in this technology isn’t outpacing car pollution and O&G extraction… yet, but the growth rate has been enough in recent years to put it on the radar of industries to watch out for.
I hope the compute efficiencies are rapid and more than commensurate with the rate of growth so that we can make progress on our climate targets.
However it seems unlikely to me.
It’s been a year of progress for the tech… but also a lot of setbacks for the rest of the world. I’m fairly certain we don’t need AGI to tell us how to cope with the climate crisis; we already have the answer for that.
Although if the industry does continue to grow and the efficiency gains aren’t enough… will society/investors be willing to scale back growth in order to meet climate targets (assuming that AI becomes a large enough segment of global emissions to warrant reductions)?
Interesting times for the field.
"""
LLMs need better criticism
A lot of people absolutely hate this stuff. In some of the spaces I hang out (Mastodon, Bluesky, Lobste.rs, even Hacker News on occasion) even suggesting that “LLMs are useful” can be enough to kick off a huge fight.
I like people who are skeptical of this stuff. The hype has been deafening for more than two years now, and there are enormous quantities of snake oil and misinformation out there. A lot of very bad decisions are being made based on that hype. Being critical is a virtue.
If we want people with decision-making authority to make good decisions about how to apply these tools, we first need to acknowledge that there ARE good applications, and then help explain how to put those into practice while avoiding the many unintuitive traps.
"""
LLMs are here to stay, and there is a need for more thoughtful critique rather than just "LLMs are all slop, I'll never use it" comments.
The signal-to-noise ratio just goes completely out of control.
https://en.wikipedia.org/wiki/The_Human_Use_of_Human_Beings
https://en.wikipedia.org/wiki/Inventing_the_Future:_Postcapi...
https://en.wikipedia.org/wiki/The_Right_to_Be_Lazy
https://en.wikipedia.org/wiki/In_Praise_of_Idleness_and_Othe... (That's Bertrand Russell)
https://en.wikipedia.org/wiki/The_Abolition_of_Work
https://en.wikipedia.org/wiki/The_Society_of_the_Spectacle
https://en.wikipedia.org/wiki/Bonjour_paresse
AI systems are literally the most amazing technology on earth for this exact reason. I am so glad that it is destroying the minds of time thieves world-wide!
Capital --> capitalist, capitalism.
Commune --> communist, communism.
Ned Ludd --> Luddite, Luddism.
Not "capitalistism" or "communistism", so not "ludditism" either.
Yup, English may be the most inconsistent of languages. When I was a kid, we used to blame French for being "just exceptions to rules, exceptions to exceptions, and exceptions to those exceptions!", but with a few decades of perspective... Nope, English is far worse.
One reason is that it's cheaper to use AI, even if the result is poor. It doesn't have to be high quality, because most of the time we don't care about quality unless something interests us. I wonder what kind of shift in power dynamics will occur, but so far it looks like many of us will just lose a job. There's no UBI (or the social credit proposed by Douglas), salaries are low, and not everyone lives in a good location, yet corporations try to enforce RTO. Some will simply get fired and won't be able to find a new job, which won't be sustainable for a personal budget unless you already have low costs of living and are debt-free, or have a somewhat wealthy family that will cover for you.
Well, maybe at least the government will protect us? Low chance: the world is shifting right, and it will get worse once we start to experience more and more of the results of global warming. I don't see a scenario where the world becomes a better place in the foreseeable future. We're trapped in a society of achievement, but soon we may not be able to deliver achievements, because if business can get similar results for a fraction of the price needed to hire human workers, then guess what will happen?
These are sad times, full of depression and suffering. I hope that some huge transformation in societies happens soon, or that AI development slows down enough that some future generation has to deal with the consequences instead (people will prioritize saving their own, and it won't be pretty, so it's better to just pass it down like debt).
I suspect people don't particularly hate or despise LLMs per se. They're probably reacting mostly to "tech industry" boom-bust bullsh*tter/guru culture. Especially since the cycles seem to burn increasingly hotter and brighter the less actual, practical value they provide. Which is supremely annoying when the second-order effect is having all the oxygen (e.g. capital) sucked out of the room for pretty much anything else.
These are the people who regulate and legislate for us; they are the risk-averse fools who would rather things be nice and harmless than bad but working.
Personally, I think my only serious ideology in this area is that I am fundamentally biased towards the power of human agency. I'd rather not need to, but in a (perhaps) Nietzschean sense I view so-called AI as a force multiplier to totally avoid the above people.
AI will enable the creative to be more concrete, and drag those on the other end of the scale towards the normie mean. This is of great relevance to the developing world too - AI may end up a tool for enforcing western culture upon the rest of the world, but perhaps also a force decorrelating it from the McKinseys of tall buildings in big cities.
Then, several headings later:
> I have it on good authority that neither Google Gemini nor Amazon Nova (two of the least expensive model providers) are running prompts at a loss.
So...which is it?
They're not running at a loss. I'll fix that.
This means that they could make a profit off inference models without the revenue being large enough to pay the energy costs.
If it's the case I don't know. I'm more concerned with getting rid of those corporations altogether since interacting with them is generally forbidden due to the lack of data protection regulations in the US.
Things that didn’t work 6 months ago do now. Things that don’t work now, who knows…
Or do you actually mean that the same routines and data that didn't work before suddenly work?
Each new model opens up new possibilities for my work. In a year it's gone from sort of useful but I'd rather write a script, to "gets me 90% of the way there with zero shots and 95% with few-shot"
That's an indication that most business-sized models won't need some giant data center. This is going to be a cheap technology most of the time. OpenAI is thus way overvalued.
The non-skeptical interpretation is that it's a threshold function, a flat-out race with an unambiguous finish line. If someone actually hit self-improving AGI first there's an argument that no one would ever catch up.
What matters is how you use the AGI, not how much you have, with wrong or bad or limiting regulations it will not lead anywhere.
That is of course, assuming AGI is possible and exponential, and that marketshare goes to a single entity instead of a set of entities. Lots of big assumptions. Seems like we're heading towards a slow-lackluster singularity though.
That's if AGI is possible and not easily replicated. If AGI can be copied and/or re-developed like other software then the value of owning OpenAI stock is more like owning stock in copper producers or other commodity sector companies. (It might even be a poorer investment. Even AGI can't create copper atoms, so owners of real physical resources could be in a better position in a post-human-labor world.)
Nothing is truly exponential for long, but the logistic curve could be big enough to do almost anything if you get imaginative. Without new physics, there are still some places where we can do some amazing things with the equivalent of several trillion dollars of applied R&D, which AGI gets you.
It astounds me that people don't realize how much of this cutting-edge science stuff literally does NOT happen overnight, and not even close to that; typically it takes on the order of decades!
My point being that even if Science ends today, we still have a lot more engineering we can benefit from.
I don't see how OpenAI wouldn't crash and burn here. Given the history of models it would be at most a year before you'd have open AGI, then the horse is out of the barn and the horse begins to self-improve. Pretty soon the horse is a unicorn, then it's a Satyr, and so on.
(I am a near-term AGI skeptic BTW, but I could be wrong.)
OpenAI's valuation is a mixture of hype speculation and the "golden boy" cult around Sam Altman. In the latter sense it's similar to the golden boy cults around Elon Musk and (politically) Donald Trump. To some extent these cults work because they are self-fulfilling feedback loops: these people raise tons of capital (economic or political) because everyone knows they're going to raise tons of capital so they raise tons of capital.
IMO we’re going to hit the point where AI can work on designing automation to replace physical labor before we hit true AGI, much like we’re seeing with coding.
I heard people on HN saying this (even without the money condition) and I fail to grasp the reasoning behind it. Suppose in a few years Altman announces a model, say o11, that is supposedly AGI, and in several benchmarks it hits over 90%. I don't believe it's possible with LLMs because of their inherent limitations but let's assume it can solve general tasks in a way similar to an average human.
Now, how come "the entire human economy stops making sense"? In order to eat, we need farmers, we need construction workers, shops etc. As for white collar workers, you will need a whole range of people to maintain and further develop this AGI. So IMHO the opposite is true: the human economy will work exactly as before, but the job market will continue to evolve, with people using AGI in a similar way that they use LLMs now, but probably with greater confidence. (Or not.)
What does this mean in terms of making me coffee or building houses?
Rinse and repeat.
That is exponential take off.
At the point where you have an army of AIs running at 1000x human speed, you can just ask it to design the mechanisms for, and write the code to make, robots that automate any possible physical task.
Not if you remember to count all the computations being done by the quintillions of nanobots across the world known as "human cells."
That's not only inside cells, and not just neurons either. For example, your thymus is busy brute-forcing the impossibly large space of immune receptor combinations, and putting every candidate cell through a very rigorous set of acceptance tests before releasing it.
We also have people brilliant enough to maybe solve the AGI problem or cause our extinction. Some are amoral. Many mechanisms pushed human intelligences in other directions. They probably will for our AGIs, assuming we even give them all that power unchecked. Why are they so worried the intelligent agents will not likewise be misdirected or restrained?
What smart, resourceful humans have done (and not done) is a good starting point for what AGI would do. At best, they'll probably help optimize some chips and LLM runtimes. Patent minefields with sub-28nm design, especially mask-making, will keep unit volumes of true AGIs much lower, at higher prices, than systems driven by low-paid workers with some automation.
The guy running Anthropic thinks the future is in biotech, developing the cure to all diseases, eternal youth etc.
Which is technology all right, but it's unclear to me how these chatbots (or other AI systems) are the quickest way to get there.
The big problem with LLMs is that most of the time they act smart, and some of the time they do really, really dumb things and don't notice. It's not the ceiling that's the problem. It's the floor. Which is why, as the article points out, "agents" aren't very useful yet. You can't trust them to not screw up big-time.
It's the simple fact that the ability of assets to generate wealth has far outstripped the ability of individuals to earn money by working.
Somehow real estate has become so expensive everywhere that owning a shitty apartment is impossible for the vast majority.
When the world's population was exploding during the 20th century, housing prices were not a problem, yet somehow nowadays, it's impossible to build affordable housing to bring the prices down, though the population is stagnant or growing slowly.
A company can be worth $1B if someone invests $10m in it for 1% stake - where did the remaining $990m come from? Likewise, the stock market is full of trillion-dollar companies whose valuations beggar all explanation, considering the sizes of the markets they are serving.
The rich elites are using the wealth to control access to basic human needs (namely housing and healthcare) to squeeze the working population for every drop of money. Every wealth metric shows the 1% and the 1% of the 1% control successively larger portions of the economic pie. At this point money is ceasing to be a proxy for value and is becoming a tool for population control.
And the weird thing is it didn't use to be nearly this bad even a decade ago, and we can only guess how bad it will get in a decade, AGI or not.
Anyway, I don't want to turn this into a fully-written manifesto, but I have trouble expressing these ideas in a concise manner.
The last 5 years have reflected a substantial decline in QOL in the States; you don't even have to look back that far.
The coronacircus money-printing really accelerated the decline.
Approximately 2/3s of homes in the US are owner occupied.
Approximately 2/3rds of Australians live in an owner-occupied home.
That's to be expected when governments forbid people from building housing. The only thing I find surprising is when people blame this on "capitalism".
In Canada, the population is still growing at a fairly impressive rate (https://www.macrotrends.net/global-metrics/countries/CAN/can...), and that growth tends to concentrate in major population centres. There are advocacy groups that seek to push Canadian population growth well above UN projections (e.g. the https://en.wikipedia.org/wiki/Century_Initiative "aims to increase Canada's population to 100 million by 2100") through immigration. In Japan, where the population is declining, housing prices are not anything like the problem we observe in North America.
There's also the supply side. "Impossible to build affordable housing" is in many cases a consequence of zoning restrictions. (Economists also hold very strongly that rent control doesn't work - see e.g. https://www.brookings.edu/articles/what-does-economic-eviden... and https://www.nmhc.org/research-insight/research-notes/2023/re... ; real "affordable housing" is just the effect of more housing.)
So take the entire economy and ask the question: what does AI not impact? Net that out and assume there’s pricing efficiencies, then build in a risk buffer.
1.5t to 15t seems right.
People are buying shares at $x because they believe they will be able to sell them for more later. I don't think there's a whole lot more to it than that.
OpenAI predicts more revenue from ChatGPT than api access through 2029.
It’s the old Netflix / HBO trope of which can become the other first: HBO figuring out streaming, or Netflix figuring out original programming.
I bet Google will figure this out and thus OpenAI won’t disrupt as much as people think it will.
Tangential: So how is that race going, has either taken a commanding lead? (Or, hey, is it over already; has either of them won and the other lost? (Yeah, guess if I'm very well-informed on that industry or not...))
They run on a laptop, yes - you might squeeze up to 10 token/sec out of a kinda sorta GPT-4 if you paid $5K plus for an Apple laptop in the last 18 months.
And that's after you spent 2 minutes watching 1000 token* prompt prefill at 10 tokens/sec.
Usually it'd be obvious this'd trickle down, things always do, right?
But...Apple infamously has been stuck on 8GB of RAM in even $1500 base models for years. I have 0 idea why, but my intuition is RAM was ~doubling capacity at same cost every 3 years till early 2010s, then it mostly stalled out post 2015.
And regardless of any of the above, this absolutely melts your battery. Like, your 16 hr battery life becomes 40 minutes, no exaggeration.
I don't know why prefill (loading in your prompt) is so slow for local LLMs, but it is. I assume if you have a bunch of servers there's some caching you can do that works across all prompts.
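To put numbers on that prefill pain, here's the back-of-the-envelope arithmetic (the 10 tok/sec figures are the rough numbers above, not measurements):

    # Rough latency math for a local LLM on a laptop.
    prompt_tokens = 1000   # ~3 pages of prompt
    prefill_tps = 10       # prompt-processing speed, tokens/sec
    decode_tps = 10        # generation speed, tokens/sec
    output_tokens = 300    # a modest answer

    time_to_first_token = prompt_tokens / prefill_tps              # 100 seconds
    total_time = time_to_first_token + output_tokens / decode_tps  # 130 seconds

    print(f"wait before the first word appears: {time_to_first_token:.0f}s")
    print(f"total time for the full answer:     {total_time:.0f}s")

Server-side, the KV cache for a shared prompt prefix can be reused across requests, which is presumably a big part of why hosted prefill feels so much faster.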
I expect the local LLM community to be roughly the same size it is today 5 years from now.
* ~3 pages / ~750 words; what I expect is a conservative average for prompt size when coding
Llama 3.2 1.0B - 650 t/s
Phi 3.5 3.8B - 60 t/s.
Llama 3.1 8.0B - 37 t/s.
Mixtral 14.0B - 24 t/s.
Full GPU acceleration, using llama.cpp, just like LM Studio.
second-state/llama-2-7b-chat-gguf netted me around ~35 tok/sec
lmstudio-community/granite-3.1-8b-instruct-GGUF - ~50 tok/sec
MBP M3 Max, 64g. - $3k
#1. It is possible to get an arbitrarily fast tokens/second number, given you can pick model size.
#2. Llama 1B is roughly GPT-4.
#3. Given Llama 1B runs at 100 tokens/sec, and given performance at a given model size has continued to improve over the past 2 years, we can assume there will eventually be a GPT-4 quality model at 1B.
On my end:
#1. Agreed.
#2. Vehemently disagree.
#3. TL;DR: I don't expect that, at least, the trend line isn't steep enough for me to expect that in the next decade.
However, it has been clear for a long time that Meta is just demolishing any competitor's moats, driving the whole megacorp AI competition to razor-thin margins.
It's a very welcome strategy from a consumer pov, but -- it has to be said -- genius from a business pov. By deciding that no one will win, it can prevent anyone leapfrogging them at a relatively cheap price.
Most web servers can run some number of QPS on a developer laptop, but AWS is a big business, because there are a heck of a lot of QPS across all the servers.
This means that the definitions of "laptop" and "server" are dependent on use. We should instead talk about RAM, GPU and CPU speed which is more useful and informative but less engaging than "my laptop".
Nowt, owt, -- nothing, anything
> LLM generated content need to be verified.
There maybe should be a bright red flashing disclaimer at this point.
Having Slop generations from an LLM is a choice. There are so many tricks to make models genuinely creative just at the sampler level alone.
You're not seeing how the future of the world will develop.
Some people might like slop.
Slop is over-representation of model's stereotypes and lack of prediction variety in cases that need it. Modern models are insufficiently random when it's required. It's not just specific words or idioms, it's concepts on very different abstraction levels, from words to sentence patterns to entire literary devices. You can't fix issues that appear on the latent level by working with tokens. The antislop link you give seems particularly misguided, trying to solve an NLP task programmatically.
Research like [1] suggests algorithms like PPO as one of the possible culprits in the lack of variety, as they can filter out entire token trajectories. Another possible reason is training on outputs from the previous models and insufficient filtering of web scraping results.
And of course, prediction variety != creativity, although it's certainly a factor. Creativity is an ill-defined term like many in these discussions.
DRY does in fact solve repetition issues. You're not using the right settings with it. Set the penalty sky high, like 5+. Yes, that means you're going to have to modify the UI parameters in oobabooga, because they have stupid defaults on what limits you can set the knobs to.
There are several other excellent samplers which deserve high-ranking papers and will get them in due time: constrained beam search, tfs (oldie but goodie), mirostat, typicality, top_a, top-n0, and more coming soon. Don't count out sampler work. It's the next frontier and the least well appreciated.
Also, contrastive search is pretty great. Activation/attention engineering is pretty great, and models can in fact be made to choose their own sampling/decoding settings, even on the fly. We haven't even touched on the value of constrained/structured decoding. You'll probably link a similarly bad paper to the previous one claiming that this too harms creativity. Good thing that folks who actually know what they're doing, i.e. the developers of outlines, pre-bunked that paper already for me: https://blog.dottxt.co/say-what-you-mean.html
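For anyone who hasn't played with this: even without DRY or mirostat (which live in llama.cpp/oobabooga rather than in Hugging Face transformers), the stock transformers API already exposes enough sampler knobs to change the character of the output dramatically. A minimal sketch, with a placeholder model name:

    # Same model, two decoding strategies: penalized nucleus sampling vs
    # contrastive search. Model ID is a placeholder; any causal LM works.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-3.2-1B"  # placeholder
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Write an unusual opening line for a sea story.", return_tensors="pt")

    # 1) Nucleus/typical sampling with penalties to fight repetition.
    sampled = model.generate(
        **inputs, do_sample=True, max_new_tokens=60,
        temperature=0.9, top_p=0.95, typical_p=0.95,
        repetition_penalty=1.2, no_repeat_ngram_size=3,
    )

    # 2) Contrastive search: deterministic, penalizes degenerate repetition.
    contrastive = model.generate(
        **inputs, do_sample=False, max_new_tokens=60,
        penalty_alpha=0.6, top_k=4,
    )

    for out in (sampled, contrastive):
        print(tok.decode(out[0], skip_special_tokens=True), "\n---")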
I'm so incredibly bullish on AI creativity and I will die on the hill that soon AI systems will be undeniably more creative, and better at extrapolation, than most humans.
But that doesn’t necessarily reflect the potential of the underlying technology, which is developing rapidly. Websites were goofy and pointless until Amazon came around (or Yahoo or whatever you prefer).
I guess potential isn’t very exciting or interesting on its own.
Here is my resume. Make it look nice (some design hints).
They can spit out HTML and CSS, but not a Google Doc.
On the other hand, Google results are dominated by SEO spam. You can probably find one usable result on page 10.
The problem is not technology. It's a business model that can support the humans feeding data into the LLM.
Google doc + PDF is likely the most commonly used combination based on what I see in the SEO spam.
Some of them make you watch ads and then allow you to download something that looks like a doc, but you'll find out soon that you downloaded a ppt with an image that you can't edit.
Wow. At this stage, I think people are just searching for excuses to complain about anything that the LLM does NOT do.
If a multi-modal LLM can read a 100 page PDF and answer questions about it or replace a median white collar worker, this should be a relatively trivial task. Suggest some nice fonts, backgrounds and give me something that I can lightly edit and generate a PDF from.
Recognize what they do well (generate simple code in popular languages) while acknowledging where they are weak (non-trivial algorithms, any novel code situation the LLM hasn't seen before, less popular languages).
As with all things LLM, there's a whole lot of undocumented and underappreciated depth to getting decent results.
Code hallucinations are also the least damaging type of hallucinations, because you get fact checking for free: if you run the code and get an error you know there's a problem.
A lot of the time I find pasting that error message back into the LLM gets me a revision that fixes the problem.
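That loop is easy to script, too. A minimal sketch of the idea, where ask_llm is a stand-in for whatever model API you're calling:

    # "Run it, paste the traceback back in, try again" as a loop.
    # ask_llm() is a placeholder: prompt in, Python source out.
    import subprocess, sys, tempfile

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug your model API in here")

    def generate_until_it_runs(task: str, max_rounds: int = 3) -> str:
        prompt = task
        for _ in range(max_rounds):
            code = ask_llm(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=30)
            if result.returncode == 0:
                return code  # ran without raising; logic bugs are still on you
            # Feed the traceback back in and try again.
            prompt = f"{task}\n\nYour last attempt failed with:\n{result.stderr}\nFix it."
        raise RuntimeError("model never produced runnable code")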
This is great when the error is a thrown exception, but less great when the error is a subtle logic bug that only strikes in some subset of cases. For trivial code that only you will ever run this is probably not a big deal—you'll just fix it later when you see it—but for code that must run unattended in business-critical cases it's a totally different story.
I've personally seen a dramatic increase in sloppy logic that looks right coming from previously-reliable programmers as they've adopted LLMs. This isn't an imaginary threat, it's something I now have to actively think about in code reviews.
Where I'm at right now with LLMs is that I find them to be very helpful for greenfield personal projects. Eliminating the blank canvas problem is huge for my productivity on side projects, and they excel at getting projects scaffolded and off the ground.
But as one of the lead engineers working on a million+ line, 10+ year-old codebase, I've yet to see any substantial benefit come from myself or anyone else using LLMs to generate code. For every story where someone found time saved, we have a near miss where flawed code almost made it in or (more commonly) someone eventually deciding it was a waste of time to try because the model just wasn't getting it.
Getting better at manual QA would help, but given the number of times where we just give up in the end I'm not sure that would be worth the trade-off over just discouraging the use of LLMs altogether.
Have you found these things to actually work on large, old codebases given the right context? Or has your success likewise been mostly on small things?
"Here's some example JavaScript code that sends an email through the SendGrid REST API. Write me a python function for sending an email that accepts an email address, subject, path to a Jinja template and a dictionary of template context. It should return true or false for if the email was sent without errors, and log any error messages to stderr"
That prompt is equally effective for a project that's 500 lines or 5,000,000 lines of code.
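For what it's worth, the kind of function that prompt produces looks roughly like this. A sketch only: the payload shape follows SendGrid's v3 mail/send API, and the sender address is a placeholder.

    # Send an email via SendGrid's v3 REST API, rendering the body from a
    # Jinja template. Returns True on success, logs errors to stderr.
    import json, logging, os, sys

    import requests
    from jinja2 import Template

    logging.basicConfig(stream=sys.stderr, level=logging.ERROR)
    log = logging.getLogger(__name__)

    def send_email(to_address: str, subject: str, template_path: str, context: dict) -> bool:
        try:
            with open(template_path) as f:
                body = Template(f.read()).render(**context)
            payload = {
                "personalizations": [{"to": [{"email": to_address}]}],
                "from": {"email": "noreply@example.com"},  # placeholder sender
                "subject": subject,
                "content": [{"type": "text/html", "value": body}],
            }
            resp = requests.post(
                "https://api.sendgrid.com/v3/mail/send",
                headers={"Authorization": f"Bearer {os.environ['SENDGRID_API_KEY']}",
                         "Content-Type": "application/json"},
                data=json.dumps(payload),
                timeout=10,
            )
            resp.raise_for_status()
            return True
        except Exception as exc:
            log.error("Failed to send email: %s", exc)
            return False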
I also use them for code spelunking - you can pipe quite a lot of code into Gemini and ask questions like "which modules handle incoming API request validation?" - that's why I built https://github.com/simonw/files-to-prompt
It's very bad at Factor but pretty good at naming things, sometimes requiring some extra prompting. [generate 25 possible names for this variable...]
1. Stick with popular languages, libraries, etc. with lots of blog articles and example code. The pre-training data is more likely to have patterns similar to what you're building. OpenAI's models were best with Python; C++ was clearly taxing for them.
2. Separate design from coding. Have an AI output a step by step, high-level design for what you’re doing. Look at a few. This used to teach me about interesting libraries if nothing else.
3. Once a design is had, feed it into the model you want to code with. I would hand-make the data structures with stub functions (see the sketch below). I'd tell it to generate a single function. I'd make sure it knew what to take in and return. Repeat for each function.
4. For each block of code, ask it to tell you any mistakes in it and generate a correction. It used to hallucinate on this enough that I only did one or two rounds, made sure I hand-changed the code, and sometimes asked for specific classes of error.
5. Incremental changes. You give it the high-level description, a block of code, and ask it to make one change. Generate new code. Rinse repeat. Keep old versions since it will take you down dead ends at times but incremental is best.
I used the above to generate a number of utilities. I also made a replacement for the ChatGPT application that used the Davinci API. I also made a web proxy with bloat stripping and compression for browsing from low-bandwidth, mobile devices. Best use of incremental modification was semi-automatically making Python web apps async.
Another quick use for CompSci folks. I’d pull algorithm pseudocode out of papers which claimed to improve on existing methods. I’d ask GPT4 to generate a Python version of it. Then, I’d use the incremental change method to adapt it for a use case. One example, which I didn’t run, was porting a pauseless, concurrent GC.
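To make step 3 concrete (an illustration, not my actual code), the hand-made scaffold might look like this, with the model asked to fill in one stub body at a time:

    # Hand-written scaffold: data structures plus stub functions with explicit
    # inputs and outputs, so the model only ever generates one function body.
    from dataclasses import dataclass

    @dataclass
    class Page:
        url: str
        raw_html: str = ""
        text: str = ""

    def fetch(url: str) -> Page:
        """Download the page and return a Page with raw_html populated."""
        raise NotImplementedError  # ask the model for just this body

    def strip_bloat(page: Page) -> Page:
        """Return a copy with text set to the readable content, scripts and ads removed."""
        raise NotImplementedError

    def compress_for_mobile(page: Page, max_bytes: int = 20_000) -> bytes:
        """Return a compressed, low-bandwidth rendering of the page text."""
        raise NotImplementedError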
(Seems every job is fair game according to CTOs. Well, except theirs)
I work out what the edge cases are by writing and rewriting the code. It's in the process of shaping it that I see where things might go wrong. If an LLM can't do that on its own it isn't of much value for anything complicated.
well, sometimes - other times it'll be wrong with no error, or insecure, or inaccessible, and so on
That is at least somewhat a valid point. Good workers know how to get the best out of their tools. And yet, good tools accommodate how their users work, instead of expecting the user to accommodate how the tool works.
One could also say that programmers were sold a misleading bill of goods about how LLMs would work. From what they were told, they shouldn't have to learn how to get the best out of LLMs - LLMs were AI, on the way to AGI, and would just give you everything you needed from a simple prompt.
LLMs are power-user tools. They're nowhere near as easy to use as they look (or as their marketing would have you believe).
Learning to get great results out of them takes a significant amount of work.
The only people pushing that you can BUILD AN APP WITHOUT WRITING A LINE OF CODE are the Twitter AI hypesters. Simon doesn't assert anything of the sort.
LLMs are more-than-sufficient for code snippets and small self-contained apps, but they are indeed far from replacing software engineers.
What models have you tried, and what are you trying to do with them? Give us an example prompt too so we can see how you’re coaxing it so we can rule out skill issue.
And a big strength LLMs have is summarizing things - I’d like to see you summarize the latest 10 arxiv papers relating to prompt engineering and produce a report geared towards non-techies. And do this every 30 mins please. Also produce social media threads with that info. Is this a task you could do yourself, better than LLMs?
P.S my script uses local models - no capacity constraints (apart from VRAM!)
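For the curious, that kind of script is only a few lines. A rough sketch using the public arXiv Atom API, with summarize() standing in for whatever local model call you use:

    # Pull the newest arXiv results for a query and summarize each abstract.
    # summarize() is a placeholder for a local model call (ollama, llama.cpp, ...).
    import feedparser

    ARXIV_URL = ("http://export.arxiv.org/api/query?"
                 "search_query=all:%22prompt+engineering%22"
                 "&sortBy=submittedDate&sortOrder=descending&max_results=10")

    def summarize(text: str) -> str:
        raise NotImplementedError("call your local model here")

    def latest_report() -> str:
        feed = feedparser.parse(ARXIV_URL)
        sections = []
        for entry in feed.entries:
            prompt = "Explain this abstract for a non-technical reader:\n" + entry.summary
            sections.append(entry.title + "\n" + summarize(prompt) + "\n" + entry.link)
        return "\n\n".join(sections)

Wrap that in a cron job or a loop with a sleep and you have the every-30-minutes version.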
Right, but this is the part that is silly and sort of disingenuous and I think built upon a weird understanding of value and productivity.
Doing more constantly isn't inherently valuable. If one human writes a magnificently crafted summary of those papers once and it is promulgated across channels effectively, this is both better and more economical than having an LLM compute one (slightly incorrect) summary for each individual on demand. In fact, all the LLM does in this case is increase the amount of possible lower-quality noise in the space. The one edge an LLM might have at this stage is to generate a summary that accounts for more recent information, thereby getting around the inevitable gradual "out of dateness" of human-authored summaries at time T, but even then, this is not great if the trade-off is to pollute the space with a bunch of ever-so-slightly different variants of the same text.
It's such a weird, warped idea of what productivity is; it's basically the lazy middle-manager's idea of what it means to be productive. We need to remember that not all processes are reducible to their outputs. Sometimes the process is the point, not the immediate output (e.g. education).
Being able to summarise multiple articles quicker than a human can read and digest a single one is obviously more productive. I’m not sure why you’re assuming I’m talking about rewriting the papers to produce slightly different variations? It’s a summary. Concerned about the lack of “insight” or something? Then add a workflow that takes the summaries and use your imagination - maybe ask it to find potential applications in completely different fields? You already have comprehensive summaries (or the full papers in a vector db). Am I missing something?
Also the quality of the summary will be linked to the prompts and the way you go about the process (one-shotting the full paper in the prompt, map reduce, semantically chunked summaries, what model you’re using, its context length etc) as well as your RAG setup. I’m still working on my implementation but it’s simple as fuck and pretty decent in giving me, well, summaries of papers.
I can’t articulate it well enough but your human curation argument sounds to me like someone dismissing Google because anyone can lie online, and the good old Yellow Pages book can never be wrong.
By multiple rewrites, I meant that, to me, at least, it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user when, in some cases, we could much more economically generate one summary once and make it available via distribution channels--to be fair, that is sort of orthogonal to whether or not the "golden" summary is produced by humans or LLMs. I guess this is more of a critique of the current UX and computational expenditure model.
Yes, my whole point about the process being the point sometimes is precisely about lack of insight. It goes back to Searle's Chinese Room argument. A person in a room with a perfect dictionary and grammar reference can productively translate English texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese. Using LLMs for "understanding" is the same. If all you care about is immediate material gain and output, sure, why not; but some of us realize that human beings still move and exist in the world, and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions (the same criticism applies to overreliance on simplistic "answers" from search engines).
>it is silly to spend N compute on producing effectively the same summary on demand for the Mth chatbot user
Why? The compute is there, unused. Why is it silly to use it the way a user wants to? Is your argument more towards our effective use of electrical power across the globe or the quality of the summaries? What if the summaries are produced once and then loaded from some sort of cache - does that make it better in your eyes? I'm trying to understand exactly your point here... please accept my apologies for not being able to understand and please do not take my questions as "gotchas" or anything like that. I genuinely want to know the issue.
>A person in a room with a perfect dictionary and grammar reference can productively translate english texts (input) into Chinese texts (output) just by consulting the dictionary, but we wouldn't claim that this person knows Chinese.
Agreed, because you can't really know a language just from its words - you need grammar rules, historical/cultural context etc - precisely the kinds of things included in an LLM's training dataset. I'd argue the LLM knows the language better than the human in your example.
Again, I'm not sure how all of this is relevant to using LLMs to summarise long papers? I wouldn't have read them in the first place, because I didn't know they existed, and I don't have time to read them fully. So a summary of the latest papers every day is infinitely better to me than just not knowing in the first place. Now if you want to talk about how LLMs can confidently hallucinate facts or disregard things due to inherent bias in the training datasets, then I'm interested, because those are the things that are stopping me from actually trusting the outputs fully. (Note, I also don't trust human output on the internet either, due to inherent bias within all of us.)
>human beings still move and exist in the world and some of us still appreciate that we need to help fashion those human beings into rational ones that are able to use reason to get along, and aren't codependent on the past N years of the internet to answer any and all questions
Do a simple experiment with the people around you. Ask them about something that happened a few years ago and see if they pull up Google or Wikipedia or whatever. I don't think you realise how few and far between the humans you're talking about are nowadays. Everyone, from teens to pensioners, has been affected by brain rot to some degree, whether it's plain disinformation on Facebook, or sweet nothings from their pastor/imam/rabbi, or inaccurate Google search summaries (which is a valid point against LLMs - I'm also disappointed with how bad their implementation is).
And let's not assume most humans are even capable of being rational when the data in their own brains has been biased and manipulated by institutions and politicians in "democracies".
At least there is one silver lining: your comments are evidence that not everyone has suffered that brain rot, and some of us are still out there using tools critically—thanks for a good conversation on this!
Btw, I apologise again if I came across as blunt or rude in our exchange; upon reflection, I think you were actually right about me being somewhat emotionally invested in this (albeit due to that sliver of hope that they can be used for good). Peace be with you.
I don't mean to nitpick, but how good do you really think the output of this would be? Papers are short and usually have many references, I would expect the LLM to basically miss the important subtleties on every paper it's given, and misunderstand and misattribute any terms of art it encounters.
I mean, of course LLMs are good at summarizing: the summaries are probably mostly sort of good, and anything I'm summarizing I won't read myself. But for technical and specific texts, what's the point when you're getting a "maybe correct" retelling? Best case scenario you get a pretty paragraph that's maybe good for an introduction, and worst case you get incorrect information that misinforms you.
I’m using the summaries as a juicier abstract. I’m not taking them as gospel.
I’m working on following references to then add those papers to a vector db for RAG so it can actually go the step beyond. It’s fun!
I'm not sure of the value of this. Papers already have abstracts, rewording them using LLMs is just playing with your food. If you're seeing use out of it that's awesome though.
Isn't that a bit "You're holding it wrong"? I mean, why isn't that the default; did anyone really think one would mainly want bad results out of them?
But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is key to making those models 10x better than they are with a lazy one-liner question. Drop your files into the context window; ask very precise questions explaining the background. They work great for exploring what is at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that would still take you hours). The best LLMs out there (in my case just Claude Sonnet 3.5, I must admit) are able to accelerate you.
These people may not be Software Engineers, but they are coding.
I had made the specific operation generic (moving it out of the struct and into a trait) but forgot to delete it from the struct, so I was calling the incorrect function. Claude pinpointed the cache issue immediately when I just dumped two files into the context and asked it:
somewhere in my codebase I'm triggering a perform() on the editor but the next call on highlight() panics because `Line layout should be cached`
what am I missing? do I need to do something after perform() to re-cache the layout?
At first that seemed to fix the issue, but other errors persisted, so we kept debugging together until we found the root cause. Either way, I knew where to look thanks to its assistance.
A very big surprise is just how much better Sonnet 3.5 is than Haiku. Even the confusingly-more-expensive-Haiku-variant Haiku 3.5, which is more recent than Sonnet 3.5, is still much worse.
https://www2.math.upenn.edu/~ghrist/preprints/LAEF.pdf - this math textbook was written in just 55 days!
Paraphrasing the acknowledgements -
...Begun November 4, 2024, published December 28, 2024.
...assisted by Claude 3.5 sonnet, trained on my previous books...
...puzzles co-created by the author and Claude
...GPT-4o and -o1 were useful in latex configurations...doing proof-reading.
...Gemini Experimental 1206 was an especially good proof-reader
...Exercises were generated with the help of Claude and may have errors.
...project was impossible without the creative labors of Claude
The obvious comparison is to the classic Strang https://math.mit.edu/~gs/everyone/ which took several *years* to conceptualize, write, peer review, revise and publish.
Ok maybe Strang isn't your cup of tea, :%s/Strang/Halmos/g , :%s/Strang/Lipschutz/g, :%s/Strang/Hefferon/g, :%s/Strang/Larson/g ...
Working through the exercises in this new LLMbook, I'm thinking...maybe this isn't going to stand the test of time. Maybe acceleration is not so hot after all.
Great on the surface, but lacks any depth, cohesion, or substance.
Maybe I'm not the target audience, but... that really doesn't make me interested in continuing to read.
I'm agreeing with you.
x+y=1, x+y=2 clearly has no solution since two numbers can’t simultaneously add to both one and two.
x+y=1,2x+2y=2 clearly has infinitely many solutions. There’s only one equation here after canceling the 2, so you can plug in x’s and y’s all day long, no end to it.
x+y=1, 2x+y=1 clearly has exactly one solution (0,1) after elimination.
This example stuck with me so I use it even now. The author/Claude/Gemini/whatever could have just used this simple example instead of “trichotomy of curves through space conjoin through the realm of …” math, not Shakespeare.
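If you want to see the trichotomy fall out mechanically rather than rhetorically, a few lines of sympy reproduce it: an empty set, a one-parameter family, and a single point.

    # The three possibilities for two linear equations in two unknowns.
    from sympy import symbols, linsolve

    x, y = symbols("x y")

    print(linsolve([x + y - 1, x + y - 2], x, y))      # EmptySet: no solution
    print(linsolve([x + y - 1, 2*x + 2*y - 2], x, y))  # {(1 - y, y)}: infinitely many
    print(linsolve([x + y - 1, 2*x + y - 1], x, y))    # {(0, 1)}: exactly one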
To explain this I would first and foremost use a picture, where the 3 cases, parallel, identical, and intersecting, can be intuitively seen (using our visual system, rather than our language system) with merely a glance.
The overuse of the $15 synonyms is almost always a bad idea--you want to use them sparingly, where dropping them in for their subtly different meanings enhances the text. But what is extremely sloppy here is that the possibilities of "no solutions, one solution, infinite solutions" are now being described with a different metaphor for solution here. And by the end of the paragraph, I'm not actually sure what point I'm supposed to take away from this text. (As bad as this paragraph is, the next paragraph is actually far worse.)
Mathematics already has a problem for the general audience with a heavy focus on abstraction that can be difficult to intuit on more concrete objects. Adding florid metaphors to spice up your writing makes that problem worse.
Then I'd have Claude create text. I'd then edit/refine each chapter's text.
Wow, was it unpleasant. It was kinda cool to see all the words put together, but editing the output was a slog.
It's bad enough editing your own writing, but for some reason this was even worse.
There are certain classes of problems that LLMs are good at. Accurately regurgitating all accumulated world knowledge ever is not one, so don’t ask a language model to diagnose your medical condition or choose a political candidate.
But do ask them to perform suitable tasks for a language model! Every day, by automation, I feed the hourly weather forecast to my home ollama server and it builds me a nice, readable, concise weather report. It's super cool!
There are lots of cases like this where you can give an LLM reliable data and ask it to do a language related task and it will do an excellent job of it.
If nothing else it’s an extremely useful computer-human interface.
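For anyone who wants to try it, a minimal version of the idea, assuming a local Ollama server on its default port and some function of your own that fetches the hourly forecast as text:

    # Turn an hourly forecast into a short, readable report with a local model
    # served by Ollama. get_hourly_forecast() is a placeholder for your source.
    import requests

    def get_hourly_forecast() -> str:
        raise NotImplementedError("fetch your hourly forecast text here")

    def weather_report(model: str = "llama3") -> str:
        prompt = ("Write a short, plain-language weather report for today "
                  "from this hourly forecast data:\n" + get_hourly_forecast())
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]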
not to dissuade you from a thing you find useful but are you aware that the national weather service produces an Area Forecast Discussion product in each local NWS office daily or more often that accomplishes this with human meteorologists and clickable jargon glossary?
https://forecast.weather.gov/product.php?site=SEW&issuedby=S...
Anytime you have data and want it explained in a casual way — and it’s not mission critical to be extremely precise — LLMs are going to be a good option to consider.
More useful AGI-like behaviours may be enabled by combining LLMs with other technologies down the line, but we shouldn’t try to pretend that LLMs can do everything nor are they useless.
Honestly they are very decent at it if you give them accurate information with which to make the diagnosis. The typical problem people have is being unable to feed accurate information to the model. They'll cut out parts they don't want to think about, or not put full test results in for consideration.
This is not a diagnosis. Any reasonably capable person can read webmd and apply the symptoms listed and compare them to what the patient describes. This is widely regarded as dangerous because the input data as well as the patient data are limited in ways that can be medically relevant.
So even if you can use it as a good substitute for browsing webmd, it’s still not a substitute for seeing a medical professional. And for the foreseeable future it will not be.
No, I think if we follow the money, we will find the problem.
I actually found 4o+search to be really good at this... Admittedly what I did was more "research these candidates, tell me anything newsworthy, pros/cons, etc" (much longer prompt) and well, it was way faster/patient at finding sources than I ever would've been, telling me things I never would've figured out with <5 minutes of googling each set of candidates (which is what I've done before).
Honestly my big rule for what LLMs are good at is stuff like "hard/tedious/annoying to do, easy to verify" and maybe a little more than that. (I think after using a model for a while you can get a "feel" for when it's likely BSing.)
(o1-preview) LLMs show promise in clinical reasoning but fall short in probabilistic tasks, underscoring why AI shouldn't replace doctors for diagnosis just yet.
"Superhuman performance of a large language model on the reasoning tasks of a physician" https://arxiv.org/abs/2412.10849 [14 Dec 2024]
You feed it a weather report and it responds with a weather report? How is that useful?
I did something similar a while back without LLMs. I enjoy kayaking, but for a variety of reasons [0] it's usually unwieldy to break out of the surf and actually get out into the ocean at my local beach. I eventually started feeding the data into an old-school ML model where I'd manually check the ocean and report on a few factors (breaking waves, unsafe wind magnitude/direction, ...). The model converted those weather/tide reports into signals I cared about, and then my forecast could simply AND all those together and plot them on a calendar.
An LLM is less custom in some sense, but if you have certain routines you care about (e.g., commuting to my last job I'd always avoid the 101 in favor of 280 if there was heavy rain), it's easy to let the computer translate raw weather information into signals you care about (should you take an alternate route, should you alter your schedule, ...).
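The signal-ANDing part is tiny in code; a sketch with made-up thresholds (not my actual numbers):

    # Each raw report becomes a boolean "is this factor OK?" signal,
    # and the day is a go only if every signal is true.
    from dataclasses import dataclass

    @dataclass
    class Conditions:
        breaking_wave_ft: float
        wind_kts: float
        onshore_wind: bool
        tide_ft: float

    def is_paddleable(c: Conditions) -> bool:
        signals = [
            c.breaking_wave_ft < 2.0,               # surf small enough to punch through
            c.wind_kts < 12 or not c.onshore_wind,  # wind manageable
            3.0 <= c.tide_ft <= 5.5,                # enough water over the sand bar
        ]
        return all(signals)

    print(is_paddleable(Conditions(1.5, 8.0, True, 4.2)))  # True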
Off-topic, do you know of a good source of weather covariates? E.g., a report with a 50% chance of rain for 2hr can easily mean light rain guaranteed for 2hr, a guaranteed 1hr of rain sometime in that 2hr period, a 50% chance that a 2hr storm will hit your town or the next town over, or all kinds of things. Does anybody report those raw model outputs?
[0] There isn't any protection from the open ocean (combined with a kayak that's a bit too top-heavy for the task at hand), which doesn't help, but the big problem is a sand bar just off the coast. If the tide isn't just right, even small swells are amplified into large breaking waves, and I don't particularly mind getting dumped upside down onto a sand bar, but I'd really prefer to spend that time in slightly calmer waters.
If your model / chat app has the ability to always inject some kind of pre-prompt, make sure to add something like “please do not jump to writing code. If this was a coding interview and you jumped to writing code without asking questions and clarifying requirements you’d fail”.
At the top of all your source files include a comment with the file name and path. If you have a project on one of these services, add an artifact that is the directory tree (“tree --gitignore” is my goto). This helps “unaided” chats get a sense of what documents they are looking at.
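For example (paths made up), the header comment looks like:

    # src/billing/invoice_generator.py
    # (every source file starts with a comment giving its own name and path)

and the directory-tree artifact is just the pasted output of “tree --gitignore”:

    .
    ├── src
    │   ├── api
    │   │   └── routes.py
    │   └── billing
    │       └── invoice_generator.py
    └── tests
        └── test_invoices.py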
And also, it’s a professional bullshitter so don’t trust it with large scale code changes that rely on some language / library feature you don’t have personal experience with. It can send you down a path where the entire assumption that something was possible turns out to be false.
Does it seem like a lot of work? Yes. Am I actually more productive with the tool than without? Probably. But it sure as shit isn’t “free” in terms of time spent providing context. I think the more I use these models, the more I get a sense of what they are good at and what is going to be a waste of time.
Long story short, prompting is everything. These things aren’t mind readers (and worse they forget everything in each new session)
I don't use it exclusively, but damn does it help in the right places.
My mind generally uses language as little as possible, I have no inner monologue running in the background.
Greatly prefer something deterministic to random bs popping up without the ability of recognizing it.
I don’t like llms but sometimes use them as autocomplete or to generate words, like a template for a letter or boilerplate scripts, never for actual information (à la google).
If someone told me an iPhone 4 is terrible but an iPhone 5 would definitely serve my needs, and then when I get an iPhone 5 they say the same of the 6, do you really want me to believe them a second time? Then a third time? Then a fourth? In the meantime my time and money are wasted.
The point being made by the original comment (with which I agree) was that many criteria-for-usefulness - primarily that of reliability or a lack of hallucination - have remained static; with successive generations of tools being (falsely) claimed to meet them, but then abandoned when the next hype-train comes along.
I certainly agree that _some_ aspects of AI models are indeed improving (often drastically!) over time (speed, price, supported formats, history/context, etc.) - but they still _all_ fall _drastically_ short on the key core requirement that is required in order to make them Actually Useful. "X is better than Y" does not imply "where Y failed to be useful, X now succeeds".
The GP is claiming GPT4o is bad but Sonnet is good. GPT4o is about only 20% cheaper than Sonnet.
If you aren't a coder, it's hard to find much utility in "Google, but it burns a tree whenever you make an API call, and everything it tells you might be wrong". I for one have never used it for anything else. It just hasn't ever come up.
It's great at cheating on homework; kids love GPTs. It's great at cheating in general, in interviews for instance. Or at ruining Christmas: after this year's LLM debacle it's unclear if we'll have another edition of Advent of Code. None of this is the technology's fault, of course, you could say the same about the Internet, phones or what have you, but it's hardly a point in favor either.
And if you are a coder, models like Claude actually do help you, but you have to monitor their output and thoroughly test whatever comes out of them, a far cry from the promises of complete automation and insane productivity gains.
If you are only a consumer of this technology, like the vast majority of us here, there isn't that much of an upside in being an early adopter. I'll sit and wait, slowly integrating new technology in my workflow if and when it makes sense to do so.
Happy new year, I guess.
Other than, y'know, using the new tools. As a programmer-heavy forum, we focus a lot on LLMs' (lack of) correctness. There's more than a little bit of annoyance when things are wrong: it's like being asked to grab the red blanket and then getting into an argument about whether it's actually orange, instead of focusing on what was important, that someone needed the blanket because they were cold.
Most of the non-tech people who use ChatGPT that I've talked to absolutely love it because they don't feel it judges them for asking stupid questions, and they have conversations with it about absolutely everything in their lives, down to which outfit to wear to the party. There are wrong answers to that question as well, but they're far more subjective, and just having another opinion in the room is invaluable. It's just a computer and won't get hurt if you totally ignore its recommendations, and even better, it won't gloat (unless you ask it to) if you tell it later that it was right and you were wrong.
Some people have found upsides for themselves in their lives, even at this nascent stage. No one's forcing you to use one, but your job isn't going to be taken by AI, it's going to be taken by someone else who can outperform you that's using AI.
Clearly said, yet the general sentiment awakens in me a feeling more of gothic horror than bright futurism. I am struck with wonder and worry at the question of how rapidly this stuff will infiltrate the global tech supply chain, and the eventual consequences of misguided trust.
To my eye, too much current AI and related tech are just exaggerated versions of magic 8-balls, Ouija boards, horoscopes, or Weizenbaum's ELIZA. The fundamental problem is people personifying these toys and letting their guard down. Human instincts take over and people effectively social engineer themselves, putting trust in plausible fictions.
It's not just LLMs though. It's been a long time coming, the way modern tech platforms have been exaggerating their capability with smoke and mirrors UX tricks, where a gleaming facade promises more reality and truth than it actually delivers. Individual users and user populations are left to soak up the errors and omissions and convince themselves everything is working as it should.
Someday, maybe, anthropologists will look back on us and recognize something like cargo cults. When we kept going through the motions of Search and Retrieval even though real information was no longer coming in for a landing.
It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.
I’ve written two large applications and about a dozen smaller ones using Claude as an assistant.
I’m a terrible front-end developer and almost none of that work was possible without Claude. The API and AWS deployment were sped up tremendously.
I’ve created unit tests and I’ve read through the resulting code and it’s very clean. One of my core pre-prompt requirements has always been to follow domain-driven design principles, something a novice would never understand.
I also start with design principles and a checklist that Claude is excellent at providing.
My only complaint is you only have a 3-4 hour window before you're cut off for a few hours.
And needing an enterprise agreement to have a walled garden for proprietary purposes.
I was not a fan in Q1. Q2 improved. Q3 was a massive leap forward.
Maybe it was overtrained on react sources, but for me it's pretty useless.
The big annoyance for me is it just makes up APIs that don't exist. While that's useful for suggesting to me what APIs I should add to my own code, it's really pointless if I ask a question like "using libfoo how do I bar" and it tells me "call the doBar() function" which does not exist.
I suspect LLMs work for a lot of front-end and app coding just because code in those fields is insanely overbloated and the value proposition is almost disconnected from the logic. There must be metric tons of typing in those fields, and in those areas LLMs must be useful. They certainly handle paper test questions well.
I’m hitting my 40th year as a professional software developer and architect. I’ve written thousands of blocks of code from scratch. It gets boring.
But then in the 2000s I (and everyone else) started building code generators, often from ERD structures, but also UML designs.
These tools were massively useful and (initially) reduced costs. The future balls of mud problems took over ten years to arrive.
But code generation has always been considered a smart and cost-effective approach to building software.
GenAI has “issues” and those have been exposed. One of my recent revelations is that Claude is best at TypeScript and python. C# (my home turf) is much lower in its skills capacity.
So in the last two months I’ve been building my apps in TypeScript instead of C# and have dramatically increased my productivity.
Claude will definitely fail if it doesn’t have the correct information. A good example is writing Bluesky apps. The docs are a mess and contradictory. But there are up to date docs on GitHub and if you include those in your project with instructions to only use those references, Claude’s hallucinations can be eliminated.
I don’t think AGI is a real possibility in my lifetime, and I do fear the future of software development when no one has actual coding experience, but for us boomers, it’s pretty darn useful.
If someone was an expert React+TypeScript programmer with decent css knowledge the productivity may be a marginal improvement.
But I haven’t been a full-time programmer in ten years.
I built and shipped a Swift app to the App Store, currently generating $10,200 in MRR, exclusively using LLMs.
I wouldn't describe myself as a programmer, and didn't plan to ever build an app, mostly because in the attempts I made, I'd get stuck and couldn't google my way out.
LLMs are the great un-stickers. For that reason alone, they are incredibly useful.
Tragically - admitting ignorance, even with the desire to learn, often has negative social repercussions
The pervasive problem of low student motivation won't be solved by LLMs, though. Human teachers will, I think, still be needed.
All the little nooks of missing knowledge are now very easy to fill in.
Would you mind sharing which app you released?
Though I’d be interested if this was an opinion on “help me write this gnarly C algorithm” or “help me to be productive in <new language>” as I find a big productivity increase from the latter.
EDIT: antirez is the creator of redis, not mvkel.
Can you clarify what you mean?
It just means anyone higher than a senior engineer.
Google has Staff at L6, and their ladder goes up to L11. Apple's equivalent of Staff is ICT5, which is below ICT6 and Distinguished. Amazon has E7-E9 above Staff, if you count E6 as Staff. Netflix very recently departed from their flat hierarchy and even they have Principal above Staff.
Few clarifications:
Amazon labels levels with "L" rather than "E". Engineering levels run from L4 to L10. Weirdly enough, L9 does not exist at Amazon: L8 (Director / Senior Principal Engineer) is promoted directly to L10 (VP / Distinguished Engineer).
Other examples: Claude was able, multiple times, to spot bugs in my C code when I asked for a code review. All bugs I would eventually have found myself, but it's better to fix them ASAP.
Finally, sometimes I put in relevant papers and implementations and ask for variations of a given algorithm among the paper and the implementations around, to gain insight into what people do in practice. Then I engage in discussions about how to improve it. It is never able to come up with novel ideas, but it is often able to recognize when my idea is flawed or when it seems sound.
All this and more helps me to deliver better code. I can venture in things I otherwise would not do for lack of time.
I wonder whether that is some specialised terminology I'm not familiar with - or it just means to decompose the operations (but with an Italian s- for negation)?
I think I’m more amazed by them because I know how they work. They shouldn’t be able to do this, but the fact that they can is absolutely jaw dropping science fiction shit.
DNNs implicitly learn a type theory, which they then reason in. Even though the code itself is new, it’s expressible in the learned theory — so the DNN can operate on it.
Really? ;) I guess you don't believe in the universal approximation theorem?
UAT makes a strong case that by reading all of our text (aka computational traces) the models have learned a human "state transition function" that understands context and can integrate within it to guess the next token. Basically, by transfer learning from us they have learned to behave like universal reasoners.
If there's something that you can prompt with e.g. "here's the proof for Fermat's last theorem" or "here is how you crack Satoshi's private key on a laptop in under an hour" and get a useful response, that's AGI.
Just to be clear, we are nowhere near that point with our current LLMs, and it's possible that we'll never get there, but in principle, if such a thing existed, it would be a next-word predictor while still being AGI.
I’m pretty sure you’re committing a logical fallacy there. Like someone in antiquity claiming “I get annoyed when experienced folks say thunderstorms aren’t the gods getting angry, it’s nature and physical phenomena. But we don’t know how the weather works”. Your lack of understanding in one area does not give you the authority to make a claim in another.
It is very unreliable at fixing things or writing code for anything non-standard. Knowing this, you can easily construct queries that trip them up: notice what it is in your code they key on, then construct an example containing that same thing where it isn't a bug, and they will be wrong every time.
The LLMs are good at finding bugs in code not because they’ve been trained on questions that ask for existing bugs, but because they have built a world model in order to complete text more accurately. In this model, programming exists and has rules and the world model has learned that.
Which means that anything nonstandard … will be supported. It is trivial to showcase this: just base64 encode your prompts and see how the LLMs respond. It’s a good test because base64 is easy for LLMs to understand but still severely degrades the quality of reasoning and answers.
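To make the base64 test concrete, a minimal sketch (the prompt text is just a made-up example):

  import base64

  prompt = "Review this C function for buffer overflows: char buf[8]; strcpy(buf, user_input);"
  encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
  # Paste the encoded string into the chat and ask the model to answer the decoded question.
  print(encoded)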
This is done via translations; LLMs are good at translations, but being able to translate doesn't mean you understand the subject.
And no, I am not wrong here; I've tested this before. For example, if you ask whether a CPU model is faster than a GPU model, it will say the GPU model is faster, even if the CPU is much more modern and faster overall, because it learned that GPU names are "faster" than CPU names; it didn't really understand what "faster" meant there. Exactly what the LLM gets wrong depends on the LLM, of course, and the larger it is the more fine-grained these things are, but in general it doesn't really have much that can be called understanding.
If you don't understand how to break the LLM like this then you don't really understand what the LLM is capable of, so it is something everyone who uses LLM should know.
Regardless of how the base64 processing is done (which is really not something you can speculate much on, unless you've specifically researched it -- have you?), my point is that it does degrade the output significantly while still processing things within a reasonable model of the world. Doing this is a rather reliable way of detaching the ability to speak from the ability to reason.
Also, the number of "factoids" / clauses needed to answer accurately is inversely proportional to the "correctness" of the final answer (on average, when prompt-fuzzed).
This is all because the more complicated/entropic the prompt/expected answer, the less total/cumulative attention has been spent on it.
>What is the second character of the result of the prompt "What is the name of the president of the U.S. during the most fatal terror attack on U.S. soil?"
Of course the humans who created the training set samples didn't create them auto-regressively - the training set samples are artifacts reflecting an external world, and knowledge about it, that the model is not privy to, but the model is limited to minimizing training errors on the task it was given - auto-regressive prediction. It has no choice. The "world model" (patterns) it has learnt isn't some magical grokking of the external world that it is not privy to - it is just the patterns needed to minimize errors when attempting to auto-regressively predict training set continuations.
Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
>Whether these training set predictive patterns result in the model performing as you might hope on an unseen text depends on the similarity of that text to samples in the training set.
> similarity
yes, except the computer can easily 'see' in more than 3 dimensions, with more capability to spot similarities, and can follow lines of prediction (similar to chess) far more than any group of humans can.
That super-human ability to spot similarities and walk latent spaces 'randomly' - yet uncannily - has given rise to emergent phenomena that mimic proto-intelligence.
We have no idea what ideas these tokens have embedded at different layers, and what capabilities can emerge now, or at deployment time later, or given a certain prompt.
The intelligence we see in LLMs is to be expected - we're looking in the mirror. They are trained to copy humans, so it's just our own thought patterns and reasoning being output. The LLM is just a "selective mirror" deciding what to output for any given input.
This is assuming they don't call an external pre-processing decoding tool.
If you didn't see the "analyzing" message then no external tool was called.
Sure if you look at new project x then in totality it's a semi unique combination of code, but breaking it down into chunks that involve a couple lines, or a very specific context then it's all been done before.
I get this sentiment from a lot of AI startups, that they have a product which can do amazing things, but due to its failure modes makes it almost useless as, to use an analogy from self-driving cars, the users have to still constantly pay attention to the road: you don't get a ride from Baltimore to New York where you can do whatever you please, you get a ride where you're constantly babysitting an autonomous vehicle, bored out of your mind, forced to monitor the road conditions and surrounding vehicles, lest the car make a mistake costing you your life.
To take the analogy further: after experimenting with not using LLM tools, I feel that the main difference between the two modes of work is similar to driving a car versus being driven by an autonomous car: you exert less mental effort, not that you get to your destination faster.
Another point of the analogy are things like Waymo. They really can do a great job of driving autonomously. But, they require a legible system of roads and weather conditions. There are LLM systems too that when given a legible system to work in can do a near perfect job.
I drove 3600 km Norway to Spain in 2018 with only adaptive cruise. Then again in 2023 with autonomous highway driving (the kind where you keep a hand on the wheel for failure mode) and it was amaaaazing how big the difference was.
I've been driving a lot in Istanbul lately and I'm not holding my breath for autonomous vehicles any time soon.
Were you using Tesla Autopilot? If I were using Autopilot, I'd have to be constantly watching out for its mistakes, which would probably be equally or more stressful compared to using adaptive cruise.
> And now, at the end of 2024, I’m finally seeing incredible results in the field, things that looked like sci-fi a few years ago are now possible: Claude AI is my reasoning / editor / coding partner lately. I’m able to accomplish a lot more than I was able to do in the past. I often do more work because of AI, but I do better work.
>…
> Basically, AI didn’t replace me, AI accelerated me or improved me with feedback about my work
LLMs are like a pretty smart but overly confident junior engineer, which is what a senior engineer usually has to work with anyway.
An expert actually benefits more from LLMs because they know when they get an answer back that is wrong so they can edit the prompt to maybe get a better answer back. They also have a generally better idea of what to ask. A novice is likely to get back convincing but incorrect answers.
That wouldn't be "working at your level" at the one BigTech company I've worked at, nor even at the 600-person company I work at now.
Not just the development of the code but the entire thing: the code, infra, auth, cc payments, etc.
My experience is that people who claim they build worthwhile software "exclusively" using LLMs are lying. I don't know you and I don't know if you are lying, but I would be willing to bet my paycheck you are.
As an example I could imagine a clothing brand wanting an app that customers can install instead of using their phone browser. $10k/month in that context isn’t as surprising or impressive.
It sounds like they are doing productized consulting, so the relationship is the moat.
See comment above for more context.
The relationship also builds a natural moat.
That's great, but professional programmers are afraid of the future maintenance burden.
(In my experience as an app developer, getting any traction and/or money from your app can be much more difficult than actually building it.)
This. The app I built has maybe 50 downloads despite me trying quite hard to promote it. It's very difficult work, even with the app being completely free of charge (save for a donation button).
$10K MRR isn't much; we're still validating PMF. We're carefully selecting paid customers at this point, not open for wide release, hence my vagueness. Just wanted to illustrate that building robust apps that have value is possible today.
What's the app?!!
My son throwing an irrational tantrum at the amusement park and I can't figure out why he's like that (he won't tell me or he doesn't know himself either) or what I should do? I feed Claude all the facts of what happened that day and ask for advice. Even if I don't agree with the advice, at the very least the analysis helps me understand/hypothesize what's going on with him. Sure beats having to wait until Monday to call up professionals. And in my experience, those professionals don't do a better job of giving me advice than Claude does.
It's weekend, my wife is sick, the general practitioner is closed, the emergency weekend line has 35 people in the queue, and I want some quick half-assed medical guidance that while I know might not be 100% reliable, is still better than nothing for the next 2 hours? Feed all the symptoms and facts to Claude/ChatGPT and it does an okay job a lot of the time.
I've been visiting Traditional Chinese Medicine (TCM) practitioner for a week now and my symptoms are indeed reducing. But TCM paradigm and concepts are so different from western medicine paradigms and concepts that I can't understand the doctor's explanation at all. Again, Claude does a reasonable job of explaining to me what's going on or why it works from a western medicine point of view.
Want to write a novel? Brainstorm ideas with GPT-4o.
I had a debate with a friend's child over the correct spelling of a Dutch word ("instabiel" vs "onstabiel"). Google results were not very clear. ChatGPT explained it clearly.
Just where is this "useless" idea coming from? Do people not have a life outside of coding?
It seems like you trust AI more than people and prefer it to direct human interaction. That seems to be satisfying a need for you that most people don't have.
This feels identical to when I was an early "smart phone" user w/my palm pilot. People would condescend saying they didn't understand why I was "on it all the time". A decade or two later, I'm the one trying to get others to put down their phones during meetings.
My take? Those who aren't using AI continually currently are simply later adopters of AI. Give it a few years - or at most a decade - and the idea of NOT asking 100+ AI queries per day (or per hour) will seem positively quaint.
I don't think you're wrong, I just think a future in which it's all but physically and socially impossible to have a single thought or communication not mediated by software is fucking terrifying.
LLMs are infinitely patient, don't think I am dumb for asking certain things, consider all the information I feed them, are available whenever I need them, have a wide range of expertise, and are dirt cheap compared to professionals.
That they might hallucinate is not a blocker most of the time. If the information I require is critical, I can always double check with my own research or with professionals (in which case the LLM has already primed me with a basic mental model so that I can ask quick, short, targeted questions, which saves the both of us time, and me money). For everything else (such as my curiosity about why TCM works, or the correct spelling of a word), LLMs are good enough.
Have you never seen knowledgeable people get things wrong, and having to verify them?
Did you miss the part where they cost money, and I better come in as prepared as possible?
I really don't get these knee-jerk averse reactions. Are people deliberately reading past my assertions that I double check LLM outputs for everything critical?
We don't know that. They could be laughing their ass off at you without telling you.
You don’t understand how medicine works, at any level.
Yet you turn to a machine for advice, and take it at face value.
I say these things confidently, because I do understand medicine well enough not to seek my own answers. Recently I went to a doctor for a serious condition and every notion I had was wrong. Provably wrong!
I see the same behaviour in junior developers that simply copy-paste in whatever they see in StackOverflow or whatever they got out of ChatGPT with a terrible prompt, no context, and no understanding on their part of the suitability of the answer.
This is why I and many others still consider AIs mostly useless. The human in the loop is still the critical element. Replace the human with someone that thinks that powdered rhino horn will give them erections, and the utility of the AI drops to near zero. Worse, it can multiply bad tendencies and bad ideas.
I’m sure someone somewhere is asking DeepSeek how best to get endangered animals parts on the black market.
So I am curious about how TCM works. So what if an LLM hallucinates there? I am not writing papers on TCM or advising governments on TCM policy. I still follow the doctor's instructions at the end of the day.
For anything really critical I already double check with professionals. As you said, human in the loop is important. But needing human in the loop does not make it useless.
You are letting perfect be the enemy of good. A half-assed tax advice with some hallucinations from an LLM is still useful, because it will prime me with a basic mental model. When I later double check the whole thing with a professional, I will already know what questions to ask and what direction I need to explore, which saves time and money compared to going in with a blank slate.
The other day I had Claude advise me on how to write a letter to a judge to fight a traffic fine. We discussed what arguments to make, from what perspective a judge would see things, and thus what I should plead for. The traffic fine is a few hundred euros: a significant amount, but barely an hour's worth of a real lawyer's fee. It makes absolutely no sense to hire a real lawyer here. If this fails, the worst thing that can happen is that I won't get my traffic fine reimbursed.
There is absolutely nothing wrong with using LLMs when you know their limits and how to mitigate them.
So what if every notion you learned about medicine from LLMs is wrong? You learn why they're wrong, then next time you prompt and double check better, until you learn how to use it for that field in the least hallucinatory way. Your experience also doesn't match mine: the advice I get usually contains useful elements that I then discuss with doctors. Plus, doctors can make mistakes too, and they can fail to consider some things. Twitter is full of stories about doctors who failed to diagnose something but ChatGPT got it right.
Stop letting perfect be the enemy of good. Occasionally needing human in the loop is completely fine.
[1] It isn't actually western, because it's also used in the east, middle-east, south, both sides of every divide, etc... In the same sense, there is no "western chemistry" as an alternative to "eastern alchemy". There's "things that work" versus "things that make you feel slightly better because they're mild narcotics or stimulants... at best."
(I don't want to focus too much on Chinese herbal medicine, because I see the same cargo-culting non-scientific thinking in code development too. I've lost count of the number of times I've seen an n-tier SPA monstrosity developed for something that needed a tiny monolithic web app, but mumble-mumble-best-mumble-practices.)
The Chinese call the practice of truth seeking, in a broader sense (outside of medicine), just "science".
"Western" medicine is also not merely the practice of seeking universal medical truth. It is also a collection of paradigms that have been developed in its long history. Like all paradigms, there are limits and drawbacks: phenomena that do not fit well. Truth seeking tends to be done on established paradigms rather than completely new ones.
The "western" prefix is helpful in contrasting it with TCM, which has a completely different paradigm. Many Chinese, myself included, have the experience that there are all sorts of ailments that are not meaningfully solved by "western" medicine practitioners, but are meaningfully solved by TCM practitioners.
It's like people who proclaim that Linux as a whole is a useless toy because it doesn't run their favorite games or favorite Windows app. They focus on this one flaw and miss all the opportunities.
Many of these people seem to advocate trusting human professionals. Do you have any idea how often human professionals do a half-assed job, and I have to verify them rather than blindly trusting them? The situation is not that much different from LLMs.
Professionals making mistakes do not make them useless. Grandma, with all her armchair expertise, is often right and sometimes wrong, and that does not make her useless either.
Why let perfect be the enemy of good?
Quite the opposite: my trust of Russian / Chinese / USian platforms is low enough that I consider it my duty to publicly shame people who still use them in 2025.
(With some caveats of course; for instance HN is not yet a negative to the world. Yet.)
There's also the question of stickiness of habits: your grandmas are for life, and with human professionals you might have a shallow enough relationship that switching them is relatively easy, while it might be very hard to stop smoking or to stop using GitHub once you've started smoking / created an account.
The date/time that divides my world into before/after is AlphaGo v Lee Sedol game 3 (2016). From that time forward, I don't dismiss out of hand speculations of how soon we can have intelligent machines. Ray Kurzweil date of 2045 is as good as any (and better than most) for an estimate. Like Moore's (and related) Laws, it's not about how but the historical pace of advancements crossing a fairly static point of human capability.
Application coding requires much less intelligence than playing Go at these high levels. The main differences are concise representation and clear final-outcome scoring. LLMs deal quite well with the fuzziness of human communications. There may be a few more pegs to place, but when is, predictably, unknown.
Using a few messages to get them out of "I aim to be direct" AI assistant mode gets much better overall results for the rest of the chat.
Haiku is actually incredibly good at high level systems thinking. Somehow when they moved to a smaller model the "human-like" parts fell away but the logical parts remained at a similar level.
Like if you were taking meeting notes from a business strategy meeting and wanted insights, use Haiku over Sonnet, and thank me later.
I haven’t found anything comparably good for JetBrains IDEs yet, but I’m also not switching to something else as my main editor.
Each task / programming language / query requires trying different LLM models and novel ways of prompting. If it's not work-related (or work pays for the one you use) sending as much of the code as relevant also helps the answers be more useful.
Most of the people I meet who say LLMs are not useful have only tried one (flavor / plugin), do not know how to pre-prompt or prompt, and do not give the tools a chance. They try one or two things, say "yep, it's not good," and give up.
Still hard for me to admit that Prompt Engineering is a profession, but it's the same as Google Fu. Once you learn it you can become an LLM Ninja!
I do not believe LLMs are coming for my job (just yet) but do believe they are going to be able to replace some people, are useful and those that do not use them will be at a disadvantage.
“Be logical,” said the scorpion. “If I stung you I’d certainly drown myself.”
“That’s true,” the frog acknowledged. “Climb aboard, then!” But no sooner than they were halfway across the river, the scorpion stung the frog, and they both began to thrash and drown. “Why on earth did you do that?” the frog said morosely. “Now we’re both going to die.”
“I can’t help it,” said the scorpion. “It’s my nature.”
All the tasks I can think of dealing with on my own computer that would take hours, a) are actually pretty interesting to me and b) would equally well take hours to "provide perfect guidance". The drudge work of programming that I notice comes in blocks of seconds at a time, and the mental context switch to using an LLM would be costlier.
I.e. over time it constitutes a fundamental shift in how we interact with abstractions in computers. The current fundamentals will still remain but they will become increasingly malleable. Details in code will become less important. Architecture will become increasingly important. But at the same time the cost of refactoring or changing architecture will quickly drop.
Any details that are easily lost when passing through an LLM will be details that have the highest maintenance cost. Any important details that can be retained by an LLM can move up and down the ladder of abstraction at will.
Can an LLM based solution maintain software architectures without introducing noise? The answer to that is the difference between somewhat useful and game changing.
and
> a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability.
I still hold that the innovations we've seen as an industry with text will transfer to data from other domains. And there's an odd misbehavior with people that I've now seen play out twice -- back in 2017 with vision models (please don't shove a picture of a spectrogram into an object detector), and today. People are trying to coerce text models to do stuff with data series, or (again!) pictures of charts, rather than paying attention to time-series foundation models which can work directly on the data.[1]
Further, the tricks we're seeing with encoder / decoder pipelines should work for other domains. And we're not yet recognizing that as an industry. For example, Whisper or the emerging video models are getting there, but think about multi-spectral satellite data, or fraud detection (a type of graph problem).
There's lots of value to unlock from coding models. They're just text models. So what if you were to shove an abstract syntax tree in as the data representation, or the intermediate code from LLVM or a JVM or whatever runtime and interact with that?
[1] https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1 - shout-out to some former colleagues!
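As a toy illustration of the AST idea above (not how any current model is actually trained, just a sketch of the data representation), Python's own ast module can flatten a program into a node-type stream a sequence model could consume instead of raw text:

  import ast

  source = "def add(a, b):\n    return a + b\n"
  tree = ast.parse(source)
  # Flatten the tree into a token-like stream of node types.
  tokens = [type(node).__name__ for node in ast.walk(tree)]
  print(tokens)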
> It's a bit sad and confusing that LLMs ("Large Language Models") have little to do with language; It's just historical. They are highly general purpose technology for statistical modeling of token streams. A better name would be Autoregressive Transformers or something.
> They don't care if the tokens happen to represent little text chunks. It could just as well be little image patches, audio chunks, action choices, molecules, or whatever. If you can reduce your problem to that of modeling token streams (for any arbitrary vocabulary of some set of discrete tokens), you can "throw an LLM at it".
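A toy illustration of that point: the "vocabulary" can be anything discrete, e.g. action choices rather than text chunks (the mapping below is made up):

  # Any discrete vocabulary can be mapped to token ids and modeled the same way.
  vocab = {"UP": 0, "DOWN": 1, "LEFT": 2, "RIGHT": 3}
  stream = ["UP", "UP", "LEFT", "DOWN"]
  token_ids = [vocab[t] for t in stream]
  print(token_ids)  # [0, 0, 2, 1]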
Now that alone is not yet an argument against crypto currencies, and one person's frivolous squandering of resources is another person's essential service. But you can't simply point to the free market to absolve yourself of any responsibility for your consumption.
Acknowledging that facilitating scams (eg pig butchering) is cryptocurrency's primary (sole?) use case, I'm willing to look the other way if we end up with the grid we need to address the climate crisis.
The primary use case of crypto is to protect wealth from a greedy, corrupt, money-printing state. Everything else is a sideshow
Merely trading governments for corporations.
> Everything else is a sideshow
Agreed. Crypto is endlessly amusing.
I'm really not well suited to explain this stuff. Here's an article for a general (layperson) audience to help you on your journey. https://www.cbsnews.com/news/cryptocurrency-bitcoin-virtual-...
Happy hunting!
People keep saying this, and there are use cases for which this is definitely the case, but I find the opposite to be just as true in some circumstances.
I'm surprised at how good LLMs are at answering "me be monkey, me have big problem with code" questions. For simple one-offs like "how to do x in Pandas" (a frequent one for me), I often just give Claude a mish-mash of keywords, and it usually figures out what I want.
An example prompt of mine from yesterday, which Claude successfully answered, was "python sha256 of file contents base64 safe for fs path."
With a system prompt to make Claude's output super brief and a command to execute queries from the terminal via Simon Willison's LLM tool, this is extremely useful.
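For reference, the kind of answer that keyword-mash prompt is after looks roughly like this (the function name is mine):

  import base64
  import hashlib

  def digest_for_path(path):
      # SHA-256 of the file contents, URL-safe base64, padding stripped,
      # so the result can be used as a filesystem path component.
      with open(path, "rb") as f:
          digest = hashlib.sha256(f.read()).digest()
      return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

  print(digest_for_path("example.txt"))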
Good communication with LLMs means using the fewest keywords that still make it deducible to the LLM exactly what you want.
I am not sure that is the case, at least with a large number of LLMs. CO-STAR and TIDD-EC are more about structure and explanation than brevity.
Though I do not have a good idea of what _bad_ communication with an LLM is. People say that sometimes, but when specific examples arise I do not really see anything more than limitations of LLMs (and the improvements they often suggest do not do anything either). So it would be good to have some more concrete examples, unless it is about an inability to communicate a problem in general, stemming from an actual inability to _understand_ the problem. Also, a lot changes over time; I think in the past one had to really coddle an LLM ("You are the best expert in Python in the world!") but I am not sure that is that important nowadays.
Bad communication: "My webapp doesn't work"
Good communication: "Nextjs, [pasted error]"
Bad communication is giving irrelevant information, or being too ambiguous, not providing enough or correct detail.
Another example of good communication and efficiency, in my view, is "ts, fn leftpad, no text, code only".
I myself can understand what it means if someone were to prompt it that way, and an LLM can understand such a query across all domains.
Although if I was using Copilot I would just write the bare minimum to trigger the auto complete I want so
const leftPad =
is probably enough.
IME, being forced to write about something or verbally explaining/enumerating things in detail _by itself_ leads to a lot of clarity in the writer's thoughts, irrespective of if there's an LLM answering back.
People have been doing rubber-duck-debugging since long. The metaphorical duck (LLMs in our context), if explained to well, has now started answering back with useful stuff!
I could only ever really jam with 4o.
Makes me wonder if there's personal communication preferences at play here.
But not at exploring what is at the border of knowledge itself. And by converging on the conventional, LLMs actually lead you away from anything that actually extends.
> doing boring tasks for which you can provide perfect guidance
That's true, but you never need an LLM for that. There are wonderful scripts written by wonderful people, provided for free almost all the time, for those who search in the right places. LLM companies benefit/profit off of these without providing anything in return.
They are worse than people who grab FOSS and turn it into overpriced and aggressively marketed business models and services or people who threaten and sue FOSS for being better and free alternatives to their bloated and often "illegally telemetric" services.
> able to accelerate you
True, but you leave too much for data brokers and companies like Meta to abuse and exploit in the future. All that additional "interactional data" will do so much worse to humanity than all those previous data sets did in elections, for example, or pretty much all consumer markets. They will mostly accelerate all these dimwitted Fortune 5000 companies that have sabotaged consumers into way too much dumb shit - way more than is reasonable or "ok". And educated, wealthy and or tech-savvy people won't be able to avoid/evade any of that. Especially when it's paired with meds, drugs, foods, biases, fallacies, priming and so on and all the knowledge we will gain on bio-chemical pathways and human liability to sabotage.
They are great for coders, of course, everyone can be an army of clone-warriors with auto-complete on steroids now and nobody can tell you what to do with all that time that you now have and all that money, which, thanks to all of us but mostly our ancestors, is the default. The problem is the resulting hyper-amplified, augmented financial imbalance. It's gonna fuck our species if all the technical people don't restore some of that balance, and everybody knows what that means and what must be done.
I see much deeper problems. Just to give two examples:
- I asked various AIs for explanations of proofs of some deep (established) mathematical theorems: the explanations were, to my understanding, very hallucinated, and thus worse than "obviously wrong". I also asked for literature references for some deep mathematical theory frameworks: basically all of the references were again hallucinated.
- I asked lots of AIs on https://lmarena.ai/ to write a suitably long text about a political topic that is quite controversial in my country (but does have lots of proponents, even in a very radical formulation, even though most people would not use such a radical formulation in public). All of the LLMs that I checked refused or tried to indoctrinate me that this thesis is wrong. I did not ask the LLM to lecture me, but gave it a concrete task! Society is deeply divided, so if the LLM only spreads the propaganda of its political teaching, it will be useless for many tasks for a very significant share of society.
https://daringfireball.net/2024/12/openai_unimaginable
OpenAI’s board now stating “We once again need to raise more capital than we’d imagined” less than three months after raising another $6.6 billion at a valuation of $157 billion sounds alarmingly like a Ponzi scheme — an argument akin to “Trust us, we can maintain our lead, and all it will take is a never-ending stream of infinite investment.”
Whether they offer the best model or not may not matter if you need a PhD in <subject> to differentiate the response quality between LLMs.
In my limited tests (primarily code) nothing from llama or Gemini have come close, Claude I’m not so sure about.
I have been bashing my head against the wall over the course of the past few days trying to create my (quite complex) dream app.
Most of LLM coding I've done involved in writing code to interface with already existing libs or services and the LLMs are great at that.
I'm hung up on architecture questions that are unique to my app and definitely not something you can google.
This just doesn't hold true for open ai
Anyone who bought in at the ground floor is now rich. Anyone who buys in now is incentivized to try and keep getting more people to buy in so their investment will give a return regardless of if actual value is being created.
The money being invested does not go directly to investors.
It goes to the cost of R&D, which in turn increases the value of openai shares, then the early investors can sell those shares to realize those gains.
The difference between that and a ponzi is that the investment creates value which is reflected in the share price.
No value is created in a Ponzi scheme.
The actual dollar worth of the value generated is what people speculate on.
I do agree it’s a very very thin line.
Aha: So if my future line of Covid Cancer Candy takes off even faster, there's "value" in that, too?
What kind of value, exactly? Does the value of being "the fastest growing product of all time" not at all depend on what kind of product it is?
Yeah, true, not exactly a Ponzi scheme: This has even fewer redeeming qualities.
[1]: Only indirectly, by selling off their investment to that next sucker.
Using this as an opportunity to grind an axe (not your fault, cactusfrog!): I find it clearer when people write "not every X is a Y" than "every X is not a Y", which could be (and would be, literally) interpreted to mean the same thing as "no X is a Y".
Consumer GPUs top out at 24 GB VRAM.
For example, how close does it get to the peak, and what's the median bandwidth during inference? And is that bandwidth, rather than some other clever optimization elsewhere, actually providing the Mac's performance?
Personally, I don't develop HPC stuff on a laptop - I am much more interested in what a modern PC with Intel or AMD and nvidia can do, when maxxed out. But it's certainly interesting to see that some of Apple's arch decisions have worked out well for local LLMs.
The closest in that collection is "A division of responsibilities between LLMs that results in some sort of flow?" - https://lite.datasette.io/?json=https://gist.github.com/simo...
Agents are an abstraction that creates well defined roles for an LLM or LLMs to act within.
It's like object oriented programming for prompts.
I've had PMs believe it can replace all writing of tickets and thinking about the feature, creating completely incomprehensible descriptions and acceptance criteria
I've had Slack messages and emails from people with zero sincerity and classic LLM style and the bs that entails
I've had them totally confidently reply with absolute nonsense about many technical topics
I'm grouchy and already over LLMs
I wish the author qualified this more. How does one develop that skill?
What makes LLMs so powerful on a day to day basis without a large RAG system around it?
Personally, I try LLMs every now and then, but haven’t seen any indication of their usefulness for my day to day outside of being a smarter auto complete.
In my experience, LLM tools are the same, you ask for something basic initially and then iteratively refine the query either via dialog or a new prompt until you get what you are looking for or hit the end of the LLM's capability. Knowing when you've reached the latter is critically important.
* Most existing LLM interfaces are very bad at editing history, instead focusing entirely on appending to history. You can sort of ignore this for one-shot, and this can be properly fixed with additional custom tools, but ...
* By the time you refine your input enough to patch over all the errors in the LLM's output for your sensible input, you're bigger than the LLM can actually handle (much smaller than the alleged context window), so it starts randomly ignoring significant chunks of what you wrote (unlike context-window problems, the ignored parts can be anywhere in the input).
A lot of my most complex LLM interactions take place across multiple sessions - and in some cases I'll even move the project from Claude 3.5 Sonnet to OpenAI o1 (or vice versa) to help get out of a rut.
It's infuriatingly difficult to explain why I decide to do that though!
also nice to interact with an LLM in vim, as the context is the buffer
obviously simon’s llm tool rules. I’ve wrapped it for vim
I feel like I’m good at understanding context. I’ve been working in AI startups over the last 2 years. Currently at an AI search startup.
Managing context for info retrieval is the name of the game.
But for my personal use as a developer, they’ve caused me much headache.
Answers that are subtly wrong in such a way that it took me a week to realize my initial assumption based on the LLM response was totally bunk.
This happened twice. With the yjs library, it gave me half incorrect information that led me to misimplementing the sync protocol. Granted it’s a fairly new library.
And again with the web history api. It said that the history stack only exists until a page reload. The examples it gave me ran as it described, but that isn’t how the history api works.
I lost a week of time because of that assumption.
I’ve been hesitant to dive back in since then. I ask questions every now and again, but I jump off much faster now if I even think it may be wrong.
In the case you were in, I would go out of my way to feed the docs to the LLM, use the LLM to interrogate the docs, and then verify the understanding I got from the LLM with a personal reading of the relevant docs.
You might think it takes just as long, if not longer, to do it my way rather than just reading the docs myself. Sometimes it can. But as you get good at the workflow you find that the time spent finding the relevant docs goes down, and you get an instant plausible interpretation of the docs on top. You can then very quickly produce application code right away, and then docs for the code you write.
- Running micro-benchmarks (using Python in Code Interpreter) - if I have a question about which of two approaches is faster I often use this pattern (a rough sketch of it follows this list): https://simonwillison.net/2023/Apr/12/code-interpreter/
- Building small ad-hoc one-off tools. Many of the examples in https://simonwillison.net/2024/Oct/21/claude-artifacts/ fit that bill, and I have a bunch more in my tools tag here: https://simonwillison.net/tags/tools/ - Geoffrey Litt wrote a great piece the other day about custom developer tools which matches how I think about this: https://www.geoffreylitt.com/2024/12/22/making-programming-m...
- Building front-end prototypes - I use Claude Artifacts for this all the time, if I have an idea for a UI I'll get Claude to spin up an almost instant demo so I can interact with it and see if it feels right. I'll often copy the code out and use it as the starting point for my production feature.
- DSLs like SQL, Bash scripts, jq, AppleScript, grep - I use these WAY more than I used to because 9/10 times Claude gives me exactly what I needed from a single prompt. I built a CLI tool for prompt-driven jq programs recently: https://simonwillison.net/2024/Oct/27/llm-jq/
- Ad-hoc sidequests. This is a pretty broad category, but it's effectively little coding projects which I shouldn't actually be working on at all but I'll let myself get distracted if an LLM can get me there in a few minutes: https://simonwillison.net/2024/Mar/22/claude-and-chatgpt-cas...
- Writing C extensions for SQLite while I'm walking my dog on the beach. I am not a C programmer but I find it extremely entertaining that ChatGPT Code Interpreter, prompted from my phone, can write, compile and test a C extension for SQLite for me: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
- That's actually a good example of a general pattern: I use this stuff for exploratory prototyping outside of my usual (Python+JavaScript) stack all the time. Usually this leads nowhere, but occasionally it might turn into a real project (like this AppleScript example: https://til.simonwillison.net/gpt3/chatgpt-applescript )
- Actually writing code. Here's a Python/Django app I wrote almost entirely with Claude: https://simonwillison.net/2024/Aug/8/django-http-debug/ - again, this was something of a side-project - not something worth spending a full day on but worthwhile if I could get it done in a couple of hours.
- Mucking around with APIs. Having a web UI for exploring an API is really useful, and Claude can often knock those out from a single prompt. https://simonwillison.net/2024/Dec/17/openai-webrtc/ is a good example of that.
There's a TON more, but this probably represents the majority of my usage.
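Re the micro-benchmarks bullet: a rough sketch of the kind of script that pattern produces for a "which of these is faster" question (the two approaches compared here are just placeholders):

  import timeit

  def with_join():
      return ",".join(str(i) for i in range(1000))

  def with_concat():
      s = ""
      for i in range(1000):
          s += str(i) + ","
      return s

  # Time both candidates and compare.
  for fn in (with_join, with_concat):
      print(fn.__name__, timeit.timeit(fn, number=5000))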
I’ll read through these and try again in the new year.
Also ChatGPT has a pretty big context window. Gemini supposedly has the biggest useful context window (~millions of tokens), though I don't have personal experience.
Somebody somewhere needs to provide a threaded interface to an LLM.
Specifically, I’ve been using Kagi Assistant over the past 1.5 months for serious and lengthy searches, and I can’t imagine going back to traditional search.
I’m currently sold on this model of LLM assisted search (where explicit links are provided) over the old Google foo skills I developed during grad school.
Example search topics include deep dives and guidance for my first NAS build, finding new bioinformatics methods, and other random biomedical info.
The tricky problem with LLMs is identifying failures - if you're asking the question, it's implied that you don't have enough context to assess whether it's a hallucination or a good recommendation! One approach is to build ensembles of agents that can check each other's work, but that's a resource-intensive solution.
I'd love to figure this out. I've written more about them than most people at this point, and my goal has always been to help people learn what they can and cannot do - but distilling that down to a concise set of lessons continues to defeat me.
The only way to really get to grips with them is to use them, a lot. You need to try things that fail, and other things that work, and build up an intuition about their strengths and weaknesses.
The problem with intuition is it's really hard to download that into someone else's head.
I share a ton of chat conversations to show how I use them - https://simonwillison.net/tags/tools/ and https://simonwillison.net/tags/ai-assisted-programming/ have a bunch of links to my exported Claude transcripts.
My first stab at trying ChatGPT last year was asking it to write some Rust code to do audio processing. It was not a happy experience. I stepped back and didn't play with LLMs at all for a while after that. Reading your posts has helped me keep tabs on the state of the art and decide to jump back in (though with different/easier problems this time).
Let people work how they want. I wouldn’t not hire someone on the basis of them not using a language server.
The creator of the Odin language famously doesn't use one. He says that he, specifically, is faster without one.
They didn’t say how heavily they weight the question.
(All that said I expect that, soon, experience with the appropriate LLM tooling will be as important as having experience with the language your system is implemented in.)
I can’t use perforce while my company is on git.
But if I do or do not use an LLM to assist me while coding, my team is unaffected.
If someone liked jetbrains, but your team used neovim, would you force them to use neovim?
Though nobody should care if I edited my text files with neovim as long as I still used the same toolchain as everyone else.
If you're "working the way you want to" ie still handrolling all your code, you're going to find my expectations unrealistic, and that is certainly not fair to you.
Recently, I shared a code base with a junior dev and she was surprised with the speed and sophistication of the code. The LLM did 80+% of the "coding".
What was telling was as she was grokking the code (for helping the ~20%), she was surprised at the quality of the code - her use of the LLM did not yield code of similar quality.
I find that the more domain awareness one brings to the table, the better the output is. Basically the clearer one's vision of the end-state, the better the output.
One other positive side-effect of using "LLMs as a junior-dev" for me has been that my ambitions are greater. I want it all - better code, more sophisticated capabilities even for relatively not-important projects, documentation, tests, debug-ability. And once the basic structure is in place, many a time it is trivial to get the rest.
It's never 100%, but even with 80+%, I am faster than ever before, deliver better quality code, and can switch domains multiple times a week and never feel drained.
Sharing best AI hacks within a team will have the same effect as code-reviews do in ensuring consistency. Perhaps an "LLM chat review", especially when something particularly novel was accomplished!
If I was in an environment that didn't allow hosted API models I'd absolutely be looking into the various Llama 3 models or Qwen2.5-Coder-32B.
files-to-prompt . -e py -e md -c | pbcopy
Now I have all the Python and Markdown files from the current project on my clipboard, in Claude's recommended XML-like format (which I find works well with other models too). Then I paste that into the Claude web interface or Google's AI Studio if it's too long for Claude and ask questions there.
Sometimes I'll pipe it straight into my own LLM CLI tool and ask questions that way:
files-to-prompt . -e py -e md -c | \
llm -m gemini-2.0-flash-exp 'which files handle JWT verification?'
I can later start a chat session on top of the accumulated context like this: llm chat -c
(The -c means "continue most recent conversation in the chat".) I haven't actually done many experiments with long context local models - I tend to hit the hosted API models for that kind of thing.
Instead, think of an LLM as the equivalent of giving a human a menial task. You know that they're not 100% reliable, and so you give them only tasks that you can quickly verify and correct.
Abstract that out a bit further, and realize that most managers don't expect their reports to be 100% reliable.
Don't use LLMs where accuracy is paramount. Use it to automate away tedious stuff. Examples for me:
Cleaning up speech recognition. I use a traditional voice recognition tool to transcribe, and then have GPT clean it up. I've tried voice recognition tools for dictation on and off for over a decade, and always gave up because even a 95% accuracy is a pain to clean up. But now, I route the output to GPT automatically. It still has issues, but I now often go paragraphs before I have to correct anything. For personal notes, I mostly don't even bother checking its accuracy - I do it only when dictating things others will look at.
And then add embellishments to that. I was dictating out a recipe I needed to send to someone. I told GPT up front to write any number that appears next to an ingredient as a numeral (i.e. 3 instead of "three"). Did a great job - didn't need to correct anything.
And then there are always the "I could do this myself but I didn't have time so I gave it to GPT" category. I was giving a presentation that involved graphs (nodes, edges, etc). I was on a tight deadline and didn't want to figure out how to draw graphs. So I made a tabular representation of my graph, gave it to GPT, and asked it to write graphviz code to make that graph. It did it perfectly (correct nodes and edges, too!)
Sure, if I had time, I'd go learn graphviz myself. But I wouldn't have. The chances I'll need graphviz again in the next few years is virtually 0.
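For illustration, a tabular representation of a graph and the graphviz DOT it maps to look roughly like this (the node names are made up; in practice GPT was given the table and asked to write the DOT directly):

  # A made-up tabular representation of a graph: (source, target) pairs.
  edges = [("intro", "methods"), ("methods", "results"), ("intro", "results")]

  # The corresponding graphviz DOT text.
  lines = ["digraph G {"]
  lines += ['    "%s" -> "%s";' % (src, dst) for src, dst in edges]
  lines.append("}")
  print("\n".join(lines))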
I've actually used LLMs to do quick reformatting of data a few times. You just have to be careful that you can verify the output quickly. If it's a long table, then don't use LLMs for this.
Another example: I have a custom note taking tool. It's just for me. For convenience, I also made an HTML export. Wouldn't it be great if it automatically made alt text for each image I have in my notes? I would just need to send it to the LLM and get the text. It's fractions of a cent per image! The current services are a lot more accurate at image recognition than I need them to be for this purpose!
Oh, and then of course, having it write Bash scripts and CSS for me :-) (not a frontend developer - I've learned CSS in the past, but it's quicker to verify whatever it throws at me than Google it).
Any time you have a task and lament "Oh, this is likely easy, but I just don't have the time" consider how you could make an LLM do it.
Then why do people keep pushing it for code related tasks?
Accuracy and precision are paramount with code. It needs to express exactly what needs to be done and how.
If the LLM hallucinates something the code won't compile or run.
If the LLM makes a logic error you'll catch it in the manual QA process.
(If you don't have good personal manual QA habits, don't try using LLMs to write your code. And maybe don't hit "accept" on other developers' code reviews either?)
This is an overly simplistic view of software development.
Poorly made abstractions and functions will have knock on effects on future code that can be hard to predict.
Not to mention that code can have side effects that may not affect a given test case, or the code could be poorly optimized, etc.
Just because code compiles or passes a test does not mean it’s entirely correct. If it did, we wouldn’t have bugs anymore.
The usual response to this is something like “we can use the LLM to refactor LLM code if we need” but, in my experience, this leads to very complex, hard to reason about codebases.
Especially if the stack isn’t Python or JavaScript.
Instead of going through a multi-step process to get an LLM to generate it, review it, reject it, and repeat…
I wonder why you reply to these comments, but not to my other one asking what you use LLMs for, where I specifically explained how they failed me.
They don't. You are likely experiencing selection bias. My guess is you work in SW, and so it makes sense that you're the target of those campaigns. The bulk of ChatGPT subscribers are not doing SW, and no one is bugging them to use it for code related tasks.
Obviously people not in the software field wouldn’t care…
If you zero-prompt and copy-paste the first result into your codebase, yeah, the accuracy problem will rear its ugly head real quick.
My programmer mind tells me that "tedious stuff" is where accuracy is the most important.
The problem is: for the tasks that I can give the LLM (or a human) that I can easily verify and correct, the LLM fails at the majority of them. For example:
- programming tasks in my area of expertise (which is more "mathematical" than what is common in SV startups), where I know what a high-level solution has to look like, and where I can ask the LLM to explain the gory details to me. Yes, these gory details are subtle (which is why the task can be menial), but the code has to be right. I can verify this, and the code is not correct.
- getting literature references about more obscure scientific (in particular mathematical) topics. I can easily check whether these literature references (or summaries of these references) are hallucinations - they typically are.
Your second task is not a "task" but a knowledge search. LLMs are not good at search (unless augmented, e.g. with RAG).
You're misrepresenting it here.
The point of that post isn't "look at these incredible projects I've built (proceeds to show simple projects)."
It's "I built 14 small and useful tools in a single week, each taking between 2 and 10 minutes".
The thing that's interesting here is that I can have an LLM kick out a working prototype of a small, useful tool in only a little more time than it takes to run a Google search.
That post isn't meant to be about writing "real production code". I don't know why people are confused over that.
For me, though, the best prompts are always written in a separate text file and pasted in. Follow-up questions are never as good as a detailed initial prompt.
I would imagine that formulating a question well for the problem at hand is a skill, but beyond that I don't think there is anything special about how to ask LLMs a question.
In areas where the LLM is rather useless, no amount of variation in prompting can solve that problem IMO. And if the task is something the LLM is good at, the prompt can be pretty sloppy and it still seems like magic how it understands what you want.
Once that's all done, you basically have a well-structured question you could pass to an underling and have them completely independently work on the project without bugging you. That's the goal. Now, pass that to o1 or Claude, depending on whether it's a general-purpose task (o1) or a code-specific task (Claude), and wait for response. From there, have a conversation or test-and-followup of whatever it spits out, this time with you asking questions. If good enough, done. If not, wrap up whatever useful insights from that line of questioning and put it back into the initial prompt and either re-post it at the end of the conversation or start a fresh conversation.
I find 90% of the time this gets exactly what I'm after eventually. The few other cases are usually because we hit some cycle where the AI doesn't fully know what to change/respond, and it keeps repeating itself when I ask. The trick then is to ask things a different way or emphasize something new. This is usually just a code-specific issue, for general problems it's much better. One other trick is to ask it to take a step back and just tackle the problem in a theoretical/philosophical way first before trying to do any coding or practical solving, and then do that in a second phase (asking o1 to architect code structure and then Claude to implement it is a great combo too). Also if there is any way to break up the problem into smaller pieces which can be tackled one conversation at a time - much better. Just remember to include all relevant context it needs to interface with the overall problem too.
That sounds like a lot, but it's essentially just project management and delegation to somewhat-flawed underlings. The upside is instead of waiting a workweek for them to get back to you, you just have to wait 20 seconds. But it does mean a ton of reading and writing. There are certainly already some meta-prompts where you can get the AI to essentially do this whole process for you and assess itself, but like all automation that means extra ways for things to break too. Let the AI devs cook though and those will be a lot more commonplace soon enough...
[Edit: o1 mostly agrees lol. Some good additional suggestions for systematizing this: https://chatgpt.com/share/6775b85c-97c4-8003-bd31-ee288396ab... ]
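The "o1 architects, Claude implements" split could look roughly like this - a sketch only; the model IDs, prompts, and API access tiers are all assumptions:

    from openai import OpenAI
    import anthropic

    openai_client = OpenAI()
    claude = anthropic.Anthropic()

    brief = open("prompt.txt").read()  # the detailed initial prompt

    # Phase 1: ask the reasoning model for an architecture/plan only.
    plan = openai_client.chat.completions.create(
        model="o1",  # illustrative; availability varies by account
        messages=[{"role": "user",
                   "content": f"Design the code structure for this, no code yet:\n\n{brief}"}],
    ).choices[0].message.content

    # Phase 2: hand the plan plus the brief to a code-focused model to implement.
    code = claude.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative
        max_tokens=4096,
        messages=[{"role": "user",
                   "content": f"Implement this plan:\n\n{plan}\n\nOriginal brief:\n\n{brief}"}],
    ).content[0].text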
For example if someone just takes random information about a topic, organizes it in chronological order and adds empty opinions and preferences to it and does that for years on end - what do you call that?
I'm pretty sure that's been possible for a while. There was an example where Claude's computer use feature ordered pizza for the dev team through DoorDash: https://x.com/alexalbert__/status/1848777260503077146?lang=e...
I don't think the released version of the feature can do it, but it should be possible with today's tech.
In case you're interested, here's a summarized list (thanks, Claude) of the negative/critical things I said about LLMs and the companies that build them in this post: https://gist.github.com/simonw/73f47184879de4c39469fe38dbf35...
We have all quietly started to recognize slop; hopefully we can spot it more easily and prevent it.
Test-driven development (integration or functional tests specifically) for prompt-driven development seems like the way to go.
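A minimal sketch of that in practice: write the functional tests first (for a hypothetical slugify function here), then let the LLM iterate on the implementation until they pass:

    # test_slugify.py - written before asking the LLM for an implementation.
    # Whatever the model produces has to pass these before it gets kept.
    from slugify import slugify  # hypothetical module the LLM is asked to write

    def test_basic():
        assert slugify("Hello, World!") == "hello-world"

    def test_collapses_whitespace():
        assert slugify("  multiple   spaces ") == "multiple-spaces"

    def test_strips_punctuation():
        assert slugify("what's new?") == "whats-new"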
Thank you, Simon.
>LLM prices crashed
This one has me a little spooked. The white knight on this front (DS) has both announced price increases and has had staff poached. There is still the Gemini free tier, which is of course basically impossible to beat (solid & functionally unlimited/free), but it's Google, so I'm reluctant to trust it.
Seriously worried about seeing a regression on pricing in first half of 2025. Especially with the OAI $200 price anchoring.
>“Agents” still haven’t really happened yet
I think that's largely because it's a poorly defined concept, and a true "agent" implies some sort of pseudo-AGI autonomy. This is a definition/expectation issue rather than a technical one, in my mind.
>LLMs somehow got even harder to use
I don't think that's 100% true. An explosion of options is not the same as harder to use. And the guidance for noobs is still pretty much the same as always (llama.cpp or one of the common frontends like text-generation-webui). It's become harder to tell what is good, but not harder to get going.
----
One key theme I think is missing is just how hard it has become to tell what is "good" for the average user. There is so much benchmark shenanigans going on that it's just impossible to tell. I'm literally at the "I'm just going to build my own testing framework" stage. Not because I can do better technically (I can't)...but because I can gear it towards things I care about and I can be confident my DIY sample hasn't been gamed.
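For what it's worth, the DIY version doesn't need to be fancy. A bare-bones sketch - model IDs and the scoring rule are placeholders, it assumes the relevant llm plugins/keys are installed, and the point is simply that the test cases stay private so they can't be gamed:

    import llm

    # Private cases that can't have been trained against.
    CASES = [
        ("Rewrite '2024-03-07' as 'March 7, 2024'", "March 7, 2024"),
        ("What is 17 * 23? Answer with just the number.", "391"),
    ]

    for model_id in ["gpt-4o-mini", "claude-3-5-sonnet-latest"]:  # illustrative
        model = llm.get_model(model_id)
        score = sum(
            expected.lower() in model.prompt(question).text().lower()
            for question, expected in CASES
        )
        print(f"{model_id}: {score}/{len(CASES)}")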
These companies are incentivized to figure out fast and efficient hosting for the models. They don't need to train any models themselves, their value is added entirely in continuing to drive the price of inference down.
Groq and Cerebras are particularly interesting here because WOW they serve Llama fast.
Is it free free? The last time I checked there was a daily request limit - still generous, but limiting for some use cases. Isn't that still the case?
A small number of people with lots of power are essentially deciding to go all in on this technology, presumably because significant gains will mean a long-term reduction in human labor needs, and thus in human labor power. As the article mentions, this also comes at huge expense and environmental impact, in a domain already in crisis that we've neglected. The whole thing becomes especially laughable when you consider that many people are still using these tools to perform tasks that could be performed, with a bit more effort, using existing deterministic tools. Instead we are now opting for a computationally more expensive solution that has a higher margin of error.
I get that making technical progress in this area is interesting, but I really think the lower level workers and researchers exploring the space need to be more emphatic about thinking about socioeconomic impact. Some will argue that this is analogous to any other technological change and markets will adjust to account for new tool use, but I am not so sure about this one. If the technology is really as groundbreaking as everyone wants us to believe then logically we might be facing a situation that isn't as easy to adapt to, and I guarantee those with power will not "give a little back" to the disenfranchised masses out of the goodness of their hearts.
This doesn't even touch on all the problems these tools create when it comes to establishing coherent viewpoints and truth in ostensibly democratic societies, which is another massive can of worms.
This 100%. “Agentic” especially as a buzzword can piss off
My problem is when people use that definition (or any other) without clarifying, because they assume it's THE obvious definition.
This has always been the benchmark; they are not that useful to me. Every time I say this, someone hits me with "yeah, I bet you haven't tried ShitLLM 4.0-pqr". It's very tiring. Your new hyped LLM is nothing but a marginal, overhyped improvement over something that is fundamentally not intelligent.
The money is still flowing, for now, to subsidize that fiasco but as soon as that starts to slow, even just a bit, things are gonna get bumpy real quick. Super excited about this tech but there are dark storm clouds building on the horizon and absent a major “moat” breakthrough it’s gonna get rough soon.
That’s exactly what happened with rideshare companies. It was an amazing new thing, but subsidized in an unsustainable way; then a bunch of companies exited the space when it became a commoditized race to the bottom, and those left let quality slip. Now when you order an Uber, a car shows up that smells bad and has wheels about to fall off. The consumer experience was a lot better when Uber was a VC-subsidized bonanza.
https://www.economist.com/finance-and-economics/2024/08/21/w...
The big challenge is figuring out how to use it. I usually like working at the function level: I figure out the exact function signature I want in Python or JavaScript and then get Claude to implement it for me.
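For example, the prompt can be little more than a signature and docstring like this (a hypothetical example), followed by "implement this":

    def dedupe_events(events: list[dict], *, key: str = "id") -> list[dict]:
        """Return events with duplicates removed, keeping the first occurrence.

        Two events count as duplicates if they share the same value for `key`.
        The order of the surviving events must be preserved.
        """
        ...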
Claude Artifacts are neat too: Claude can build a full HTML+JavaScript UI, and then iterate on it. I use this for interactive UI prototypes and building small tools.
I've published a whole lot of notes on this stuff here: https://simonwillison.net/tags/ai-assisted-programming/
Step 2: write a Slack-style message as if you are discussing the solution with a teammate you have authority over - delegate to get shit done and revise as needed.
Step 3: press enter; if the LLM does something you don't like, delete the history, fix the prompt from step 2 and ask again - rinse and repeat until you have working code.
Step 4: ask for the changes to be written as a bash script that uses cat << EOF heredocs to write every changed file into place, then run the script (see the sketch after these steps).
Step 5: git diff & play-test the changes using functional testing (use your mouse & keyboard to exercise the code paths that changed...).
Step 6: continue prompting & deleting history as needed to refine.
Step 7: commit code to repos
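To make step 4 concrete: the generated script just writes every changed file into place in one shot. The real thing is a bash script full of cat << EOF blocks; here's a rough Python equivalent with placeholder paths and contents:

    # Rough equivalent of the generated "apply changes" script.
    CHANGES = {
        "app/models.py": "...full new contents of the file...",
        "app/views.py": "...full new contents of the file...",
    }

    for path, contents in CHANGES.items():
        with open(path, "w") as f:
            f.write(contents)
    print(f"Wrote {len(CHANGES)} files")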
There were a few interesting papers - the Anthropic one about alignment faking https://www.anthropic.com/news/alignment-faking and the OpenAI o1 system card https://simonwillison.net/2024/Dec/5/openai-o1-system-card/ - and OpenAI continued to push their "instruction hierarchy" idea. Any other big moments?
I'll be honest, I don't follow that side of things very closely (outside of complaining that prompt injection still isn't fixed yet).