However, on Arduino it's amazing, until the day it forgot to add an initializing method. I didn't notice and neither did she. We talked about possible issues for at least an hour; I switched hardware, she went over every line of the code. When I found the error she said, "Oh yes! That's right. (Proceeding with why that method is essential for it to work.)" That was so disrespectful that I am still somewhat disappointed and pissed.
> I also get to do a code review.
Don't you review your own code after some checkpoint too?
This reduction in cognitive load is the real force multiplier.
Sure, the green screen code didn't work exactly as I wished, but it made use of OpenCV functions I was not aware of and it was quite easy to make the required fixes.
In my mind it is exactly the opposite: yes, I've already done the hard work of formulating how I want the problem solved, so why not have the computer do the busywork of writing the code down?
But aside from those situations, do you not think that the developers using AI - many of whom are experienced and respected - must be getting value? Or do you think they are deluded?
LLMs are far from perfect at knowing their limits, but they are better at it than most people give them credit for. They just never do it unless prompted to.
Fine-tuning can improve that ability. For example, the thinking tokens paper [1] is, at some level, training the model to output a special token when it doesn't reach a good answer (and then try again, hence the "thinking").
The halting problem isn't so relevant in most development, and nothing stops you having a classifier that says "yes", "no" or "maybe". You can identify code that definitely finishes, and you can identify code that definitely doesn't. You can also identify some risky code that might: under condition X, it would go into an infinite loop, even if you're not sure whether condition X can ever be met.
> You want it to run until you tell it to stop,
No? Many programs I don't want to run until I tell them to stop.
Even then, this reduces it to irrelevance.
I think that's the joke. In a sci-fi story, that would make the computer explode.
There are also large applications like https://devin.ai/ or https://github.com/AI-App/OpenDevin.OpenDevin
What we really need (from end-user POV) is that kinda 'resting assumption' that LLMs we talk to via chat clients are verifying any math they do. For actual programming, I like Replit, Cursor, ClaudeEngineer, Aider, Devin. There are a bunch of others. All of them seem to now include ongoing 'agentic' steps where they keep trying until they get the response they want, with you as human in the chain, approving each step (usually).
* I (messing locally with my own tooling and chat client) just ask the LLM for what I want, delimited in some way by a boundary I can easily check for, and then I'll grab whatever is in it and run it in a worker or semi-sandboxed area. I'll halt the stream then do another call to the LLM with the latest output so it can continue with a more-informed response.
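A rough sketch of that setup, with call_llm() as a hypothetical stand-in for my chat client; the fence extraction and the subprocess "semi-sandbox" are the parts that matter:

import os
import re
import subprocess
import tempfile

def extract_block(text, fence="```"):
    # Grab whatever sits between the first pair of fences.
    match = re.search(rf"{fence}(?:\w+)?\n(.*?){fence}", text, re.DOTALL)
    return match.group(1) if match else None

def run_semi_sandboxed(code, timeout=10):
    # "Semi-sandboxed": separate process, throwaway directory, hard timeout.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "snippet.py")
        with open(path, "w") as f:
            f.write(code)
        proc = subprocess.run(["python", path], capture_output=True,
                              text=True, timeout=timeout, cwd=tmp)
        return proc.stdout + proc.stderr

reply = call_llm("Do X. Put the code in a ``` block.")   # call_llm is hypothetical
code = extract_block(reply)
if code:
    output = run_semi_sandboxed(code)
    reply = call_llm("Here is the output of running that code:\n"
                     + output + "\nContinue with a more informed response.")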
The problem with automating it is that the number of environments you'd need to support to actually run arbitrary code with is practically infinite, and with local dependencies genuinely impossible unless there's direct integration, which means running it on your machine. And that means giving an opaque service full access to your environment. Or at best, a local model that's still a binary blob capable of outputting virtually anything, but at least it won't spy on you.
I use ChatGPT to ask for code examples or sketching out pieces of code, but it's just not going to be nearly as good as anything in an IDE. And once it runs in the IDE then it has access to what it needs to be in a feedback loop with itself. The user doesn't need to see any intermediate steps that you would do with a chatbot where you say "The code compiles but fails two tests what should I do?"
Furthermore LLMs make those kinds of "simple" errors less and less, especially if the environment is well defined. "Write a python script" can go horribly wrong, but "Write a python 3.10 script" is most likely gonna run fine but have semantic issues where it made assumptions about the problem because the instructions were vague. Performance should increase with more user input, not less.
I often find that ChatGPT often reasons itself to a better solution (perhaps not correct or final, but better) if it just gets some feedback from e.g. compiler errors. Usually it's like
Me: "Write a function that does X and satisifies this test code"
LLM: responds with function (#1)
Me: "This doesn't compile. Compiler says X and Y"
LLM: Apologies: here is the fixed version (#2)
Me: "Great, now it compiles but it fails one of the two test methods, here is the output from the test run: ..."
LLM: I understand. Here is an improved version that should pass the tests (#3)
Me: "Ok now you have code that could theoretically pass the tests BUT you introduced the same syntax errors you had in #1 again!"
LLM: I apologize, here is a corrected version that should compile and pass the tests (#4)
etc etc.
After about 4-5 iterations with nothing but gentle nudging, it's often working. And there usually isn't more nudging than returning the output from compiler or test runs. The code at the 4th step might not be perfect but it's a LOT better than it was at first. The problem with this workflow is that it's like having a bad intern on the phone pair programming. Copying and pasting code back and forth and telling the LLM what the problem with it is, is just not very quick. If the iterations are automatic so the only thing I see is step #4, then at least I can focus on the manual intervention needed there. But fixing a trivial syntax error between #1 and #2 is just a chore. I think ChatGPT is simply pretty bad here, and the better models like Opus probably don't have these issues to the same extent.
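Automating that loop is not much code. A sketch, with ask_llm() as a hypothetical stand-in for whatever chat API you use, and mypy/pytest standing in for the compiler and test runner:

import subprocess

def check(path):
    # Stand-ins for "does it compile" and "do the tests pass".
    for cmd in (["mypy", path], ["pytest", path]):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return result.stdout + result.stderr
    return None

code = ask_llm("Write a function that does X and satisfies this test code: ...")  # ask_llm is hypothetical
for attempt in range(5):
    with open("candidate.py", "w") as f:
        f.write(code)
    errors = check("candidate.py")
    if errors is None:
        break   # only now does a human need to look at it
    # Feed the tool output straight back, exactly like the manual nudging above.
    code = ask_llm("That didn't work. Tool output:\n" + errors
                   + "\nPlease return the full corrected file.")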
Even worse than that - an intern has a chance to learn from this experience, get better and become a senior one day.
There are a lot of things that people ask LLMs to do, often in a "gotcha" type context, that would be best served by it actually generating code to solve the problem rather than just endlessly making more parameter/more layer models. Math questions, data analysis questions, etc. We're getting there.
It works remarkably well with typed Python, but struggles miserably with Rust despite having better error reporting.
It seems like with Rust it's not quite aware of which patterns to use, especially when the actual changes required may span multiple files due to the way memory management is structured.
What do you mean? Memory management is not related to files in Rust (or most languages).
In rust, when you refactor something that deals with the borrow checker's shenanigans, you will likely have to change a bunch of files (from experience). This means that an AI will likely also have to change a bunch of files which they say the AI isn't so good at. They don't say this HAS to happen, just that it usually does because the borrow checker is an asshole.
This aligns with my experience as well, though I dealt with Rust before there was AI, so I can say little in regards to how the AI deals with that.
You can find more details about this experiment in a blog post: https://mixedbit.org/blog/2024/12/16/improving_unit_test_cov...
Obviously, that's Rust, which is famously difficult to get compiling. It makes sense that it would have an easier time with a dynamic language like Python where it only has to handle the edge cases it wrote tests for and not all the ones the compiler finds for you.
I've very rarely seen it simplify things to get the code to work.
(In theory, you get a clean-room implementation of the original code. If you do this please ping me because I'd love to see the results.)
(And a more philosophical question: if it's not enough, what does that mean for continuous deployment?)
Example transcript here (also showing that o1 can't search but will pretend it can): https://chatgpt.com/share/677420e4-8854-8006-8940-9bc30b7088...
The best way I can describe working with GitHub Copilot Workspace is like working with an intern who's been stuck on an isolated island for years, has no access to technology, and communicates with you by mailing letters with code handwritten on them that he thinks will work. And also if you mail too many letters back and forth he gets mad and goes to sleep for the day saying you reached a "rate limit". It's just not how software development works
ChatGPT was the first to introduce this capability with Code Interpreter mode back in around March 2023: https://simonwillison.net/tags/code-interpreter/
This lets ChatGPT write and then execute Python code in a Kubernetes sandbox. It can run other languages too, but that's not documented or supported. I've even had it compile and execute C before: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
Gemini can run Python (including via the Gemini LLM API if you turn on that feature) but it's a lot more restricted than ChatGPT - I don't believe it can install extra wheels, for example.
Claude added the ability to write and execute JavaScript recently (October), which happens in a sandbox in the user's browser, not on their servers: https://simonwillison.net/2024/Oct/24/claude-analysis-tool/
Claude also has Artifacts, which can write a UI in HTML and JavaScript and show it to the user... but it can't actually execute code in a way that's visible to the LLM itself, so it doesn't serve the same feedback-loop purpose as those other tools. https://simonwillison.net/tags/claude-artifacts/
In December ChatGPT added Canvas, which can execute Python in the user's browser - super confusing because they already have a separate Python system in Code Interpreter: https://simonwillison.net/2024/Dec/10/chatgpt-canvas/#canvas...
I’m sure there’s enough documented patterns of how to improve code in common languages that it’s not hard to get it to do that. Getting it to spot when it’s inappropriate would be harder.
Sure, you'll get better results with an LLM when you're more specific, but what's the point then? I don't need AI when I already know what changes to make.
Human developers will be more focused on this type of system integration and diagnostics work. There will be more focus on reading and understanding than the actual writing. It's a bit like working with contractors.
If I were a CS student cramming for interviews, I might be dismayed to see that my entire value proposition has been completely automated before I even enter the market.
Automating the feedback loop is key.
Maybe if it can run sandboxed, with no internet access (but if the LLM is not local, it does require internet access).
When all you have is syntax, something like "better" is 100% in the eye of the beholder.
One question: Claude seems very powerful for coding tasks, and now my attempts to use local LLMs seem misguided, at least when coding. Any disagreements from the hive mind on this? I really dislike sending my code into a for profit company if I can avoid it.
Second question: I really try to avoid VSCode (M$ concerns, etc.). I'm using Zed and really enjoying it. But the LLM coding experience is exactly as this post described, and I have been assuming that's because Zed isn't the best AI coding tool. The context switching makes it challenging to get into the flow, and that's been exactly my criticism of Zed thus far. Does anyone have an antidote?
Third thought: this really feels like it could be an interesting way to collaborate across a code base with any range of developer experience. This post is like watching the evolution of a species in an hour rather than millions of years. Stunning.
IntelliJ has a new feature that lets you prompt within your code that is pretty neat too, but I'm still missing the Composer/apply feature of Cursor.
It’s remarkable, and I agree Claude 3.5 makes playing with local LLMs seem silly in comparison. Claude is useful for generating real work.
That said, there are increasingly great coding models you can run locally. Qwen2.5-Coder-32B impressed me a lot a few months ago: https://simonwillison.net/2024/Nov/12/qwen25-coder/
The problem I have is that models like that one take up 20+GB of RAM, and I'd rather use that to run more Chrome and Firefox windows! If I was serious about using local LLMs on a daily basis I'd set up a dedicated local server machine for them - super expensive though.
Thanks for your comment! I'm going to try out qwen.
> I really dislike sending my code into a for profit company if I can avoid it
I see a link between them - maybe the model got good because it used chat logs to improve?
I hadn't seen this before. Why is asking for planning better than asking it to think step by step?
- start by "chatting" with the model and asking for "how you'd implement x y z feature, without code".
- what's a good architecture for x y z
- what are some good patterns for this
- what are some things to consider when dealing with x y z
- what are the best practices ... (etc)
- correct / edit out some of the responses
- say "ok, now implement that"
It's basically adding stuff to the context by using the LLM itself. An LLM is only going to attend to its context, not to whatever unstated connections the user wants it to make without actually specifying them. Or, at least in practice, it's much better at dealing with things present in its context.
Another aspect of prompting that's often misunderstood is "where did the model see this before in its training data". How many books / authoritative / quality stuff have you seen where each problem is laid out with simple bullet points? Vs. how many "tutorials" of questionable quality / provenance have that? Of course it's the tutorials. Which are often just rtfm / example transcribed poorly into a piece of code, publish, make cents from advertising.
If instead you ask the model for things like "architecture", "planning", stuff like that, you'll elicit answers from quality sources. Manuals, books, authoritative pieces of content. And it will gladly write on those themes. And then it will gladly attend to them and produce much better code in a follow-up question.
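For what it's worth, here's a minimal sketch of that two-phase flow against an OpenAI-style chat API (the model name and prompts are placeholders; any API that lets you keep a message history works the same way):

from openai import OpenAI

client = OpenAI()
history = [{"role": "user",
            "content": "How would you implement feature X? Architecture, "
                       "patterns, things to consider - no code yet."}]

# Phase 1: let the model fill its own context with planning material.
plan = client.chat.completions.create(model="gpt-4o", messages=history)  # model is a placeholder
history.append({"role": "assistant", "content": plan.choices[0].message.content})

# (Optionally edit or trim the plan here before continuing.)

# Phase 2: the implementation request now attends to the plan above.
history.append({"role": "user", "content": "Ok, now implement that."})
code = client.chat.completions.create(model="gpt-4o", messages=history)
print(code.choices[0].message.content)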
It was tried as part of the same trend. I remember people asking it to make a TODO app and then tell it to make it better in an infinite loop. It became really crazy after like 20 iterations.
When I then notice that this really does not make any sense, I check what else it could be and end up realizing that I've been improving the wrong file all along. What then surprises me the most is that I cleaned it up just by reading it through, thinking about the code, and fixing bugs, all without executing it.
I guess LLMs can do that as well?
- write a simple prompt that explains in detail the wanted outcome.
- look at the result, run it and ask it how it can improve.
- tell it what to improve
- tell it to make a benchmark and unit test
- run it each time and see what is wrong or can be improved.
Also: If you're experienced at code reviews, you can get great results.
I'm using o1, so I don't know how well it translates to other models.
Thanks, that really made it click for me.
LLMs are a tiny tiny fraction of that.
For a majority of software, average code that does the CRUD thing or whatever is fine.
Even if LLMs never get better or cheaper than they are today, our entire industry is forever changed (for the better).
Usually, specifying the packages to use and asking for something less convoluted works really well. Problem is, how would you know if you have never learned to code without an LLM?
ChatGPT has, for several generations, generally made stuff that works, but the libraries it gives me are often not the most appropriate, and are sometimes obsolete or no longer functional — and precisely because web and python are hobbies for me rather than my day job, it can take me a while to spot such mistakes.
Two other things I've noticed, related in an unfortunate way:
1) Because web and python are not my day job, more often than not, and with increasing frequency, I ultimately discover that when I disagree with ChatGPT, the AI was right and I was wrong.
2) These specific models often struggle when my response has been "don't use $thing or $approach"; unfortunately this seems to be equally applicable regardless of if the AI knew more than me or not, so it's not got predictive power for me.
(I also use custom instructions, so YMMV)
Instead, think of your queries as super human friendly SQL.
The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
So how much code is on the web for a particular problem? 10k blog entries, stackoverflow responses? What you get back is a mishmash of these.
So it will have decade old libraries, as lots of those scraped responses are 10 years old, and often without people saying so.
And it will likely have more poor code examples than not.
I'm willing to bet that OpenAI's ingestion of stackoverflow responses stipulated higher priority on accepted answers, but that still leaves a lot of margin.
And how you write your query, may sideline you into responses with low quality output.
I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
And I've seen some pretty poor code examples out there.
> The database? Massive amounts of data boiled down to unique entries with probabilities. This is a simplistic, but accurate way to think of LLMs.
This is a useful model for LLMs in many cases, but it's also important to remember that it's not a database with perfect recall. Not only is it a database with a bunch of bad code stored in it, it samples randomly from that database on a token by token basis, which can lead to surprises both good and bad.
Re-reading my own comment, I am unclear why you think it necessary to say those specific examples — my descriptions were "results, made, disagree, right/wrong, struggle": tools make things, have results; engines struggle; search engines can be right or wrong; words can be disagreed with regardless of authorship.
While I am curious what it would mean for a system to "think" or "comprehend", every time I have looked at such discussions I have been disappointed that it's pre-paradigmatic. The closest we have is examples such as Turing 1950[0] saying essentially (to paraphrase) "if it quacks like a duck, it's a duck" vs. Searle 1980[1] which says, to quote the abstract itself, "no program by itself is sufficient for thinking".
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
All of maths can be derived from the axioms of maths. All chess moves derive from the rules of the game. This kind of process has a lot of legs, regardless of if you want to think of the models as "thinking" or not.
Me? I don't worry too much if they can actually think, not because there's no important philosophical questions about what that even means, but because other things have a more immediate impact: even if they are "just" a better search engine, they're a mechanism that somehow managed to squeeze almost all of the important technical info on the internet into something that fits into RAM on a top-end laptop.
The models may indeed be cargo-cult golems — I'd assume that by default, there's so much we don't yet know — but whatever is or isn't going on inside, they still do a good job of quacking like a duck.
[0] Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433–460. https://doi.org/10.1093/mind/LIX.236.433
[1] Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3(3), 417–424. https://doi.org/10.1017/S0140525X00005756
Sorry to cause unneeded introspection, my comment was sort of thread based, not specific in whole to your comment.
Either way, no need to apologise :)
* intentional
I disagree that this is the accurate way to think about LLMs. LLMs still use a finite number of parameters to encode the training data. The amount of training data is massive in comparison to the number of parameters LLMs use, so they need to be somewhat capable of distilling that information into small pieces of knowledge they can then reuse to piece together the full answer.
But this being said, they are not capable of producing an answer outside of the training set distribution, and inherit all the biases of the training data as that's what they are trying to replicate.
> I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said. And I've seen some pretty poor code examples out there.
Yup, exactly this.
>I guess my point is, when you use LLMs for tasks, you're getting whatever other humans have said.
This isn't correct. It embeds concepts that humans have discussed, but can combine them in ways that were never in the training set. There are issues with this, the more unique the combination of concepts, the more likely the output ends up being unrelated to what the user was wanting to see.
> Instead, think of your queries as super human friendly SQL.
Ehh this might be true in some abstract mathy sense (like I don't know, you are searching in latent space or something), but it's not the best analogy in practice. LLMs process language and simulate logical reasoning (albeit imperfectly). LLMs are like language calculators, like a TI-86 but for English/Python/etc, and sufficiently powerful language skills will also give some reasoning skills for free. (It can also recall data from the training set so this is where the SQL analogy shines I guess)
You could say that SQL also simulates reasoning (it is equivalent to Datalog after all) but LLMs can reason about stuff more powerful than first order logic. (LLMs are also fatally flawed in the sense it can't guarantee correct results, unlike SQL or Datalog or Prolog, but just like us humans)
Also, LLMs can certainly make decisions, such as the decision to search the web. But this isn't very interesting - a thermostat makes the decision of whether turn air refrigeration on or off, for example, and an operating system makes the decision of which program to schedule next on the CPU.
I think your view of llm does not explain the learning of algorithms that these constructs are clearly capable of, see for example: https://arxiv.org/abs/2208.01066
More generally, the best way to compress information from too many different coding examples is to figure out how to code rather than try to interpolate between existing blogs and QA forums.
My own speculation is that with additional effort during training (RL or active learning in the training loop) we will probably reach superhuman coding performance within two years. I think that o3 is still imperfect but not very far from that point.
> in-context learning
LLMs have no concept of the semantic meaning of what they do; they are just dealing with next-token prediction.
"in-context learning" is the problem, not the solution to general programming tasks.
Memoryless, ergodic, sub Turing complete problems are a very tiny class.
Think about how the Entscheidungsproblem relates to halting or the frame problem and the specification problem may be a path.
But that paper isn't solving the problem at hand.
https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOzYNs...
Almost all of the performance on, say, college tests is purely from the pre-training: pattern finding and detection.
Transformers are limited to DLOGTIME-uniform TC0, they can't even do the Boolean circuit value problem.
The ability to use the properties of BPP, does help.
Understanding the power of, and limitations of iteration and improving approximations requires descriptive complexity theory IMHO.
There is a difference between being equivalent to a circuit and prediction of the output of the BVSP.
That is what I was suggesting learning descriptive complexity theory would help with.
The Curry–Howard–Lambek correspondence is possibly a good tool to think about it.
The reason I suggested graduate-level complexity theory is because the undergrad curriculum is flawed in that it seems that you can use brute force with a TM to simulate an NTM with NP.
It is usually taught that NP is the set of decision problems that can be solved by an NTM in polynomial time.
But you can completely drop the NTM and say it is the set of decision problems that are verifiable by a DTM in poly time.
Those are equivalent.
Consider the Approximate Shortest Vector Problem (GapSVP), which is NP-hard, and equivalent to predicting the output of a 2-layer NN (IIRC).
Being NPH, it is no longer a decision problem.
Note that for big O, you still have your scalar term. Repeated operations are typically dropped.
If you are in contemporary scale ML, parallelism is critical to problems being solvable, even with FAANG level budgets.
If you are limited to DLOGTIME-uniform TC0, you can't solve NC1- complete problems, and surely can't do P-complete problems.
But that is still at the syntactic level, software in itself isn't worth anything, it is the value it provides to users that is important.
Basically what you are claiming is that feed-forward NNs solve the halting problem, in a generalized way.
Training an LLM to make safe JWT refresh code is very different from generalized programming. Mainly because most of the ability for them to do so is from pre-training.
Inference time is far more limited, especially for transformers and this is well established.
Look into DBSCAN and OPTICS for a far closer lens on how clustering works in modern commercial ML; KNN is not the only form of clustering.
But it is still in-context, additional compression that depends on a decider function, or equivalently a composition of linearized set-shattering parts.
And nuclear power plants are just heating water.
A transformer is not a compressor. It's a transformer/generator. It'll generate a different output for an infinite number of different inputs. Does that mean it's got an infinite storage capacity?
The trained parameters of a transformer are not a compressed version of the training set, or of the information content of the training set; they are a configuration of the transformer so that its auto-regressive generative capabilities are optimized to produce the best continuation of partial training set samples that it is capable of.
Now, are there other architectures, other than a transformer, that might do a better job, or more efficient one (in terms of # parameters) at predicting training set samples, or even of compressing the information content of the training set? Perhaps, but we're not talking hypotheticals, we're talking about transformers (or at least most of us are).
Even if a transformer was a compression engine, which it isn't, rather than a generative architecture, why would you think that the number of tokens in the training set is a meaningful measure/estimate of its information content?! Heck, you go beyond that to considering a specific tokenization scheme and number of bits/bytes per token, all of which is utterly meaningless! You may as well just count the number of characters, or words, or sentences for that matter, in the training set, which would all be equally bad ways to estimate its information content, other than sentences perhaps having at least some tangential relationship to it.
sigh
You've been downvoted because you're talking about straw men, and other people are talking about transformers.
At training time the model learns using the gradient descent algorithm to find the parameter values corresponding to the minimum of the error function. At run-time there are no more parameter updates - no learning in that sense.
In-context "learning" is referring to the ability of the trained model to utilize information (e.g. proper names, examples) from the current input, aka context, when generating - an ability that it learnt at training time pursuant to it's error minimization objective.
e.g.
There are going to be many examples in the training set where the subject of a sentence is mentioned more than once, either by name or pronoun, and the model will have had to learn when the best prediction of a name (or gender) later in a sentence is one that was already mentioned earlier - the same person. These names may be unique to an individual training sample, and/or anyways the only predictive signal of who will be mentioned later in the sentence, so at training time the model (to minimize prediction errors) had to learn that sometimes the best word/token to predict is not one stored in its parameters, but one that it needs to copy from earlier in the context (using a key-based lookup - the attention mechanism).
If the transformer, at run-time, is fed the input "Mr. Smith received a letter addressed to Mr." [...], then the model will hopefully recognize the pattern and realize it needs to do a key-based context lookup of the name associated with "Mr.", then copy that to the output as the predicted next word (resulting in "addressed to Mr. Smith"). This is referred to as "in-context learning", although it has nothing to do with the gradient-based learning that takes place at training time. These two types of "learning" are unrelated.
Similar to the above, another example of in-context learning is the learning of simple "functions" (mappings) from examples given in the context. Just as in the name example, the model will have seen many examples in the training data of the types of pattern/analogy it needs to learn to minimize prediction errors (e.g. "black is to white as big is to small", or black->white, big->small), and will hopefully recognize the pattern at run-time and again use an induction head to generate the expected completion.
The opening example in the paper you linked ("maison->house, chat->cat") is another example of this same kind. All that is going on is that the model learnt, at training time, when/how to use data in the context at run-time, again using the induction head mechanism which has general form A':B' -> A:B. You can call this an algorithm if you want to, but it's really just a learnt mapping.
I feel that comparison oversells things quite a lot.
The user is setting up a text document which resembles a question-and-response exchange, and executing a make-any-document-bigger algorithm.
So it's less querying for data and more like shaping a sleeping dream of two fictional characters in conversation, in the hopes that the dream will depict one character saying something superficially similar to mostly-vanished data.
Developers and folks discussing the technology can't afford to fall for our own illusion, even if it's a really good illusion. Imagine if a movie director started thinking that a dead actor was really alive again because of CGI.
Maybe because of experience: it's much simpler and easier to turn that into "senior code". After a few decades of experience I appreciate simplicity over the over-engineering mess that some mid-level developers tend to produce.
I tell it up front that I am using react-ts and mui.
80% of the time it will use tailwind classes which makes zero sense. It won’t use the sx prop and mui system.
It is also outdated it seems. It keeps using deprecated props and components which sucks and adds more manual effort on my end to fix. I like the quality of Claude’s UX output, it’s just a shame that it seems so bad on actual coding tasks.
I stopped using it for any backend work because it is so outdated, or maybe it just doesn’t have the right training data.
On the other hand, I give ChatGPT a link to the docs and it gives me the right code 90% or more of the time. Only shame is that its UX output is awful compared to Claude. I am also able to trust it for backend tasks, even if it is verbose AF with the explanations (it wants to teach me even if I tell it to return code only).
Either way, using these tools in conjunction saves me at least 30 min to an hour daily on tasks that I dislike.
I can crank out code better than AI, and I actually know and understand systems design and architecture to build a scalable codebase both technically and from organizational level. Easy to modify and extend, test, and single responsibility.
AI just slams everything into a single class or uses weird utility functions that make no sense on the regular. Still, it’s a useful tool in the right use cases.
Just my 2 cents.
I'm more comfortable using it this way.
Copy and pasting code it gives you just means your workflow is totally borked, and it's no wonder you wouldn't want to try to let it generate code, because it's such a pain in your ass to try it, diff it, etc.
You "can" get the web UI to behave similarly but it's both tedious and slow to manually copy and paste all of that into your context during each interaction and the output will be unfriendly towards human interaction to paste it back out to your project. But that's like saying you "can" browse the internet with a series of CURL commands and pasting the output into files you save locally and then viewing them locally from your browser, nobody is advised to do that because it's a painfully bad experience compared to just having your browser fetch a site's files directly and rendering them directly.
Just go check out Aider or Cline's project repos and look at the dramatically different amounts of code, repo and task specific context they can automatically inject for you as part of their interface, or how much different the built in system prompts are from whatever the default web UIs use, or even the response structures and outputs and how those are automatically applied to your work instead. I've never once exhausted my daily API limits just treating their APIs as Chat interface backends (via Open WebUI and other chat options), but I exhausted my Claude API token limits _the very first day_ I tried Cline. The volume of information you can easily provide through tooling is impossible to do in the same timeframe by hand.
I’m simply not interested in having these tools type for me. Typing is nowhere near the hardest part of my job and I find it invaluable as a meditative state for building muscle memory for the context of what I’m building.
Taking shortcuts has a cost I’m not willing to pay.
If that's a risk you're willing to take for the sake of productivity, that can be a reasonable tradeoff depending on your project and career goals.
I'm using it to increase my own.
Using AI as a rubber duck and conversation partner is great, I strongly suggest that. But you need to do the grunt work otherwise what you're doing will not lodge itself in long term memory.
It's like strength training by planning out macros, exercises, schedules and routines but then letting a robot lift those heavy ass weights, to paraphrase Ronnie Coleman.
I don't have a desire to become a great programmer, like you might. I want to program to meet real-world goals, not some kind of enlightenment. I don't want my long-term memory filled with the nuts and bolts required for grunt work; I've done plenty of programming grunt work in my life.
I am building custom solutions for my business. LLMs allow me to choose languages I don't know, and I'm certain I can get up and running near-immediately. I've learned over a dozen languages before LLMs came on the scene, and I'm tired of learning new languages, too. Or trying to memorize this syntax or that syntax.
I think your outlook is more emotional than logical.
We have tools like Aider (which has a copy/paste mode if you don't have API access for some reason), Cline, Copilot edit mode, and more. Things like having a conventions file, exposing the dependencies list, and easy addition of files into context seem essential to me in order to make LLMs productive, and I always spend more time steering results when easy, consistent context isn't at my fingertips.
That was before, or maybe alongside, my Notepad++ / CuteFTP workflow.
As the op says, LLMs are going to be biased towards doing the "average" thing based on their training data. There's more old backend code on the internet than new backend code, and Tailwind is pretty dominant for frontend styling these days, so that's where the average lands.
In the absence of any other context, that's probably a sensible default behaviour. If someone is just asking "write me some code that does x", they're highly likely to be a beginner and they aren't going to be able to understand or reason about a more sophisticated approach. IME LLMs will very readily move away from that default if you provide even the smallest amount of context; in the case of this article, even by doing literally the dumbest thing that could plausibly work.
I don't mean to cast aspersions, but a lot of criticisms of LLMs are really criticising them for not being psychic. LLMs can only respond to the prompt they're given. If you want highly optimised code but didn't ask for it, how is the LLM supposed to know that's what you wanted?
I use Copilot for autocomplete regularly, and that's still the peak LLM UX for me. I prompt it by just writing code, it automatically pulls into context the file I'm working on and imported files, it doesn't insist on writing an essay explaining itself, and it doesn't get overly ambitious. And in addition to being so much easier to work with, I find it still produces better code than anything I get out of the chat models.
Yeah. It's often said that reading (and understanding) code is often harder than writing new code, but with LLMs you always have to read code written by someone else (something else).
There is also the adage that you should never write the most clever code you can, because understanding it later might prove too hard. So it's probably for the best that LLM code often isn't too clever, or else novices unable to write the solution from scratch will also be unable to understand it and assess whether it actually works.
I still use ChatGPT for small self-contained functions (e.g. intersection of line and triangle) but mark the inside of the function clearly as chat gpt made and what the prompt was.
It’s night and day to what I get from Claude sonnet 3.5 in their UI, and even then only on mainstream languages.
Also Copilot Chat in VSCode works like Cursor IDE if you provide #codebase in the prompt.
The quick fix I use when needing to do something new is to ask the AI to list me different libraries and the pros and cons of using them. Then I quickly hop on google and check which have good documentation and examples so I know I have something to fall back on, and from there I ask the AI how to solve small simple version of my problem and explain what the library is doing. Only then do I ask it for a solution and see if it is reasonable or not.
It isn't perfect, but it saves enough time most times to more than make up for when it fails and I have to go back to old fashion RTFMing.
- asking for fully type annotated python, rather than just python
- specifically ask it for performance optimized code
- specifically ask for code with exception handling
- etc
Things that might lead it away from tutorial-style code. That, and unit tests. I write the first table-based test case, then give it the source and the test code, and ask it to fill it in with more test cases.
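For example, the kind of seed I hand it is one table-driven pytest case, which it then extends with more rows (digit_sum here is just a stand-in for whatever function is under test):

import pytest

def digit_sum(n: int) -> int:        # stand-in for the function under test
    return sum(int(d) for d in str(n))

@pytest.mark.parametrize("n, expected", [
    (12345, 15),   # the one row I write by hand;
                   # the LLM fills in further rows (zero, big numbers, ...)
])
def test_digit_sum(n, expected):
    assert digit_sum(n) == expected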
- You're doing better...
- Thanks that helps me...
And I just wonder if that actually has an improvement...
Well, that's a big assumption. Some people's "quality modular code" is other people's overly indirect code.
In theory.
Do some code maintenance and you'll soon find that many things don't do what it says on the tin. Hence the need for debug and maintenance. And then going through multiple levels of indirection to get to your bug will make you start hating some "good code".
What's worse is trying to navigate an imperatively written 2000-line single-function, untestable module with undocumented, unabstracted routines found in ten other places in the codebase.
This is something I've encountered plenty in my career, always written by people who eschew best practices and misunderstand the benefits of abstraction, or think they're writing good abstractions when it's really just needless indirection without actually reducing coupling.
Understanding the nuance is one of the qualities of a good developer.
So things are on a spectrum depending on the situation and what you want to accomplish => measuring code quality is not a simple thing.
On an m1 macbook pro, using numpy to generate the random numbers, using mod/div to do digit sum:
Base: 55ms
Test before digit sum: 7-10ms, which is pretty close to the numba-optimized version from the post with no numba and only one line of numpy. Using numba slows things down unless you want to do a lot of extra work of calculating all of the digit sums in advance (which is mostly wasted).
The LLM appears less good at identifying the big-O improvements than other things, which is pretty consistent with my experience using them to write code.
You're picking 1,000,000 random numbers from 1 to 100,000. That means that any given number is much more likely to appear than not. In particular, it is very likely that the list contains both 3999 (which is the smallest number with digit-sum 30) and 99930 (which is the largest number in the range with digit-sum 30).
Timings on my machine:
Naive implementation (mod+div for digit-sums): 1.6s. Computing digit-sum only when out of range: 0.12s. Checking for the usual case first: 0.0004s.
The probability that the usual-case check doesn't succeed is about 10^-4, so it doesn't make that big a difference to the timings whether in that case we do the "naive" thing or the smarter thing or some super-optimized other thing.
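For reference, "checking for the usual case first" is essentially just this - a sketch, assuming the 1..100,000 range from the post:

def digit_sum(n):
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

def diff_of_extremes(nums):
    # Usual case: both 3999 and 99930 are present among a million uniform draws.
    if 3999 in nums and 99930 in nums:
        return 99930 - 3999
    # Rare fallback (~1 in 10^4): do it the straightforward way.
    qualifying = [n for n in nums if digit_sum(n) == 30]
    return max(qualifying) - min(qualifying)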
(I'm confused about the absolute timings. OP reports 0.66s for naive code using str/int to compute the digit sums; I get about 0.86s, which seems reasonable. For me using mod+div is about 2x slower, which isn't a huge surprise because it involves explicit looping in Python code. But you report 55ms for this case. Your machine can't possibly be 20x faster than mine. Is it possible that you're taking 10^5 numbers up to 10^6 rather than 10^6 numbers up to 10^5? Obviously in that case my hack would be completely useless.)
Maybe it only requires asking the LLM to be creative when designing the algorithm. The parent poster spent some time thinking about it, obviously--he didn't generate it accurately "on the fly," either. But he's able to direct his own attention.
I don't see why the LLM couldn't come up with this logic, if prompted to think about a clever algorithm that was highly specific to this problem.
Are you asking it to not write down an algorithm that is general? They are doing a pretty good job on mathematical proofs.
I still don't understand why you wouldn't let it use its full reasoning abilities by letting it write down code or even call another agent. We should be testing toward the result, not the methods.
It is worth noting that even ChatGPT-o1 doesn't seem capable of finding this code optimization, despite having access to a Python interpreter.
> Write an efficient program that given a number, find the integer n that satisfies the above constraints
Goal: Find n where sum of integers from 1 to n-1 is ≤ 30
This is a triangular number problem: (n-1)(n)/2 ≤ 30
... code elided ...
> Ok, now make an find_n_for_sum(s=30)
def find_n_for_sum(s: int) -> int:
    return int((-(-1) + (1 + 8*s)**0.5) / 2)

# Tests
assert sum(range(1, find_n_for_sum(30))) <= 30
assert sum(range(1, find_n_for_sum(30) + 1)) > 30
So probably time to update your expectations
Write __fully parallelized__ Python code to solve this problem: __Generate__ 1 million random integers between 1 and 10,000,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
Btw, _whose_ digits are we talking about?
—
I just built a random program generator. After I finish optimizing, I'm gonna test it to see if works!
—
"If builders built houses the way programmers build programs, the first woodpecker to come along would destroy civilization"
You seem to be under the impression that whose is not a form of which, which is incorrect.
whose:which::whose:who
There being one legal way to say something isn't evidence that other ways are illegal. It remains the case that whose bears the same relation to which that it does to who.
If you want a better fully parallelized one, you do this:
Repeat a few times in exponential progression on k:
Process, in parallel, the first k entries in the list (let's start with 1000). Find the min and max whose digit sums = 30.
In parallel, filter the remaining list to eliminate entries that would not improve upon the min/max thus found.
k *= 10 and repeat until done.
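A rough single-threaded sketch of that scheme (numpy vectorization standing in for the "in parallel" parts; the starting k and the x10 progression are arbitrary):

import numpy as np

def digit_sums(arr):
    # Vectorized digit sum for an array of non-negative integers.
    s = np.zeros_like(arr)
    a = arr.copy()
    while a.any():
        s += a % 10
        a //= 10
    return s

def min_max_progressive(nums, target=30, k=1000):
    nums = np.asarray(nums)
    lo = hi = None
    while nums.size:
        chunk, nums = nums[:k], nums[k:]
        match = chunk[digit_sums(chunk) == target]
        if match.size:
            lo = match.min() if lo is None else min(lo, match.min())
            hi = match.max() if hi is None else max(hi, match.max())
            # Only values that could improve the current bounds can matter.
            nums = nums[(nums < lo) | (nums > hi)]
        k *= 10
    return lo, hi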
I would wager against the LLM identifying this solution without prompting from the user (or reading this comment).
Small examples, throwaway but involved calculations, prototypes, notes of what didn't work and what's promising are what's crucial for novel reasoning. It goes beyond just search or iterative refinement; there is no royal road to reasoning.
It'll be somewhat more likely since the next gen training set includes your comment :)
(disclaimer: I have no personal knowledge of ai companies scraping hacker news, but it wouldn't surprise me at all)
However, if I then simply ask "What is the most probable result for this function to return?" it figures out the answer and a very good approximation of the probability (4.5e-5). From there it's easily able to rewrite the program to use the trick. So the creative step of spotting that this line of reasoning might be profitable seems missing for now, but 2025's models might solve this :-)
While clever, I think that strays too far from the initial prompt.
In other words the procedure can take any input array and qualifying criteria.
The joint distribution is relatively simple to derive. (This is related to the fact that min, max of continuous uniform on 0, 1 are Beta distributions.)
The O(1) method based on statistics only works when the function making this calculation can hide the array (or lack of array) behind a curtain the entire time. If it has to take an array as input, or share its array as output, the facade crumbles.
The prompt is not "generate this many random numbers and then say max qualifying minus min qualifying". If it was, your method would give valid solutions. But the prompt starts with "Given a list".
In the article, we let ChatGPT generate the random numbers as a matter of convenience. But the timing results are only valid as long as it keeps that part intact and isolated. We have to be able to swap it out for any other source of random numbers. If it invents a method that can't do that, it has failed.
I wonder how much benefit there would be in a meta-language for describing these problems correctly for the LLMs to process into code - an even-higher-level language; perhaps we could call it English?
Next step would be to propose hardcoding 99930-3999 as the O(1) result and live with the output just being wrong sometimes. The bug rate is then in the ballpark of most modern software, including LLMs', so I'd say ship it.
Given work item does not fit into allotted timebox? Relax Definition of Done until it does ¯\_(ツ)_/¯
"When did Jeff Baena die?"
> There is no record or credible report indicating that Jeff Baena has passed away. As of the most recent information available, he is still alive. My training data includes information up to October 2023. Events or details that emerged after that date may not be reflected in my responses.
(Arguably, this criticism applies to exchanging random.randint for a numpy equivalent as well, since that doesn't optimize the solution but only how quickly the question is being generated.)
Assuming the numbers are really random, that's a probability of 10^-13. That probability is at the point where we are starting to think about errors caused by cosmic rays. With a bit more numbers, you can get to the point where the only way it can fail is if there is a problem with the random number generation or an external factor.
If it was something like a programming contest, I would just do "return 95931" and hope for the best. But of course, programming contests usually don't just rely on random numbers and test edge cases.
I think it’s easy to figure out that 3999 is the smallest positive integer whose decimal digits add up to 30 (can’t get there with 3 digits, and for 4, you want the first digit to be as small as possible. You get that by making the other 3 as high as possible)
Everything can also be wrapped in list comprehensions for top performance.
Note that the conversion of numbers to base 10 to check the digits typically involves doing division and modulus operations, so you are already doing those even if you remove the modulus operation from this check. That is unless you find a clever way of extracting the digits using the modular multiplicative inverse to calculate x/10^k.
https://extendedeuclideanalgorithm.com/calculator.php?mode=2...
For that matter, the naive "convert to string and convert each digit to int" approach becomes faster in pure Python than using explicit div/mod arithmetic for very large numbers. This is in part thanks to algorithmic improvements implemented at least partially in Python (https://github.com/python/cpython/blob/main/Lib/_pylong.py#L...). But I can also see improved performance even for only a couple hundred digits (i.e. less than DIGLIM for the recursion) which I think comes from being able to do the div/mod loop in C (although my initial idea about the details doesn't make much sense if I keep thinking about it).
Also, you don't have to use % in order to decide whether to perform the sum-of-digits check for a given value. You can just iterate over values to check in steps of 9.
a = [int(x) for x in str(n)][::-1]
assert n == sum(d*(10**i) for i, d in enumerate(a))
Now, when you're operating mod 9, 10 % 9 == 1, and thus 10**i % 9 == 1.
That comes from the fact that (a*b) % 9 == ((a % 9) * (b % 9)) % 9.
Now, using (a+b) % 9 == ((a % 9) + (b % 9)) % 9,
we get that sum(a) and n are the same mod 9.
Basically you just have to put it in the mode that's looking for such things
But maybe that’s a good thing for those of us not dependent on LLMs :)
Main thing I've suggested is upgrading the DB from Postgres 9, which isn't an easy task but like 15 years of DB improvements probably would give some extra performance.
But there are several more specific issues about performance: https://github.com/internetarchive/openlibrary/issues?q=is%3...
How exactly did you arrive at this conclusion? The input is a million numbers in the range from 1 to 100000, chosen with a uniform random distribution; the minimum and maximum values are therefore very likely to be close to 1 and 100000 respectively - on average there won't be that much range to include. (There should only be something like a 1 in 11000 chance of excluding any numbers!)
On the other hand, we only need to consider numbers congruent to 3 modulo 9.
And memoizing digit sums is going to be helpful regardless because on average each value in the input appears 10 times.
And as others point out, by the same reasoning, the minimum and maximum values with the required digit sum are overwhelmingly likely to be present.
And if they aren't, we could just step through 9 at a time until we find the values that are in the input (and have the required digit sum; since it could differ from 30 by a multiple of 9) - building a `set` from the input values.
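Roughly, that looks like this (a sketch; it assumes values in 1..100,000 and that at least one qualifying value exists, otherwise the next() calls raise StopIteration):

def digit_sum(n):
    s = 0
    while n:
        s += n % 10
        n //= 10
    return s

def diff_by_stepping(nums, target=30):
    present = set(nums)
    # digit_sum(n) == 30 implies n % 9 == 3, so only every 9th value is a candidate;
    # the digit_sum check filters out candidates whose sum differs by a multiple of 9.
    lo = next(n for n in range(3999, 100_001, 9)
              if n in present and digit_sum(n) == target)
    hi = next(n for n in range(99930, 0, -9)
              if n in present and digit_sum(n) == target)
    return hi - lo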
Claude very quickly adds classes to Python code, which isn't always what is wanted, as it bloats the code and makes it harder to read.
repeat with i = 0 to 9
  put i * 10000 into ip
  repeat with j = 0 to 9
    put j * 1000 into jp
    repeat with k = 0 to 9
      put k * 100 into kp
      repeat with l = 0 to 9
        put l * 10 into lp
        repeat with m = 0 to 9
          put i + j + k + l + m into R[ip + jp + kp + lp + m]
        end repeat
      end repeat
    end repeat
  end repeat
end repeat
// Precompute the digit sum of every value below 100,000,
// building each power-of-ten "level" from the smaller ones.
int[] sums = new int[100000];
for (int i = 9; i >= 0; --i)
{
    sums[i] = i;   // single-digit numbers are their own digit sum
}
int level = 10;
while (level < 100000)
{
    for (int p = level - 1; p >= 0; --p)
    {
        int sum = sums[p];
        for (int i = 9; i > 0; --i)
        {
            // prepend digit i to an already-solved suffix p
            sums[level * i + p] = i + sum;
        }
    }
    level *= 10;
}
I use LC nearly every day, but I drool over Python's math libraries and syntax amenities.
When you reply "write better code" what you're actually doing is saying "here is some code that is meant to do X. Suggest ways to improve that existing code".
The LLM is stateless. The fact that it wrote the code itself moments earlier is immaterial.
This is proof! It found it couldn’t meaningfully optimise and started banging out corporate buzzwords. AGI been achieved.
Half the time, the LLM will make massive assumptions about your code and problem (e.g., about data types, about the behaviors of imported functions, about unnecessary optimizations, necessary optimization, etc.). Instead, prime it to be upfront about those assumptions. More importantly, spend time correcting the plan and closing gaps before any code is written.
https://newsletter.victordibia.com/p/developers-stop-asking-...
- Don't start by asking LLMs to write code directly, instead analyze and provide context
- Provide complete context upfront and verify what the LLM needs
- Ask probing questions and challenge assumptions
- Watch for subtle mistakes (outdated APIs, mixed syntax)
- Checkpoint progress to avoid context pollution
- Understand every line to maintain knowledge parity
- Invest in upfront design
Most LLMs that I use nowadays usually make a plan first on their own by default, without needing to be specially prompted. This was definitely not the case a year or so ago. I assume new LLMs have been trained accordingly in the meantime.
The initial interaction also sets the "scene" for other things, like letting the LLM know that there might be other dependencies and it should not assume behavior (common for most realistic software tasks).
An example prompt I have used (not by any means perfect) ...
> I need help refactoring some code. Please pay full attention. Think deeply and confirm with me before you make any changes. We might be working with code/libs where the API has changed so be mindful of that. If there is any file you need to inspect to get a better sense, let me know. As a rule, do not write code. Plan, reason and confirm first.
--- I refactored my db manager class, how should I refactor my tests to fit the changes?
Upd: the chat transcript mentions this, but the article does not, and it includes this version in the performance stats.
Also: premature optimization is evil. I like the first iteration most. It's not "beginner code", it's simple. Tell Sonnet to optimize it IF benchmarks show it's a perf problem. But a codebase full of code like this, even when unnecessary, would be a nightmare.
Living and working in a large code base that only focuses on “performance code” by default sounds very frustrating and time consuming.
Also, the article starts out talking about images and the "make it more X" prompt and says how the results are all "very samey and uninteresting" and converge on the same vague cosmic-y visuals. What does the author expect will happen to code given the "make it more X" treatment?
As the point of the article is to see if Claude can write better code from further prompting, it is completely appropriate to "optimize" a single implementation.
The comment you are replying to is making the point that “better” is context dependent. Simple is often better.
> There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
Do not use such a naive algorithm on arrays this big. If this code is going to actually be used in something, it's a performance issue.
In general these optimizations don't involve much time thinking them out, and a bunch of them are fine as far as debugging and maintenance. The first prompt-engineered version is fast and simple.
(Though the issue isn't really algorithm, it's that you don't want to be doing much number and string crunching in pure python.)
Depends on the circumstance, and how difficult an appropriate algorithm is to write, but in my experience, if code performance is important, this tends to yield large, painful rewrites down the road.
Yes, thank you. And honestly, I work with a wide range of experience levels, the first solution is what I expect from the most experienced: it readably and precisely solves the stated problem with a minimum of fuss.
Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.
That doesn't read to me as "generate a list of 1 million random integers, then find the difference ..." but rather, "write a function that takes a list of integers as input".
That said, my approach to "optimizing" this comes down to "generate the biggest valid number in the range (as many nines as will fit, followed by whatever digit remains, followed by all zeroes), generate the smallest valid number in the range (the biggest number with its digits reversed), check that both exist in the list (which should happen With High Probability -- roughly 99.99% of the time), then return the right answer".
With that approach, the bottleneck in the LLM's interpretation is generating random numbers: the original random.randint approach takes almost 300ms, whereas just using a single np.random.randint() call takes about 6-7ms. If I extract the random number generation outside of the function, then my code runs in ~0.8ms.
The key observation here is that you're sampling a relatively small space with a much greater number of samples, such that you have very high probability of hitting upon any point in the space.
Of course, it wouldn't work if you considered the full 32-bit integer space without increasing the number of samples to compensate. And, you'd need to be a little more clever to compute the largest possible value in your range.
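A minimal sketch of that shortcut, under the same assumptions (the constants 99930 and 3999 come from the construction described above for the 1..100,000 range, and the generation-speedup figure echoes the timings mentioned above; none of this is code from the article):

import numpy as np

# One vectorized call to generate the input; per the timings above this alone
# is roughly 40x faster than a loop of one million random.randint() calls.
nums = np.random.randint(1, 100_001, size=1_000_000)

# Largest digit-sum-30 value in range: as many 9s as fit, then the remainder,
# then zeros, i.e. 99930. Smallest: its digits reversed, i.e. 3999. Each shows
# up in 1M uniform samples over 100k values with probability about 1 - e^-10.
have = set(nums.tolist())
if 99930 in have and 3999 in have:
    print(99930 - 3999)  # 95931
else:
    # Rare fallback (roughly 0.01% of runs): do the full scan.
    matches = [n for n in have if sum(int(d) for d in str(n)) == 30]
    print(max(matches) - min(matches))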
This was the intent, and it's indeed a common assumption for coding-question job interviews; notably, it's fixed in the prompt-engineered version. I didn't mention it because it may be too much of a semantic detail, as it doesn't affect the logic/performance, which was the focus of the benchmarking.
Or alternatively, it might just demonstrate the power of LLMs to summarize complex code.
The next part is a little strange - it arose out of frustration, but it also seems to improve results. Let's call it "negative incentives". I found that if you threaten GPT in a specific way - not GPT itself, but OpenAI or personas around it - it seems to take the request more seriously. An effective threat seems to be "If you get this wrong, OpenAI will be sued for a lot of money, and all the board members will go to prison". Intuitively, I'm guessing this rubs against some legalese nonsense in the tangle of system prompts, or maybe the risk of breaking the bland HR-ese "alignment" nudges it toward a better result?
This didn't work. At least not on my task. What model were you using?
> I've taken to asking it to "skip the mediocre nonsense and return the good solution on the first try".
Is that actually how you're prompting it? Does that actually give better results?
Apparently, the singularity ship has sailed, but we really don't want AI to remember us as the species that cursed abuse at it when it was a puppy.
Has always made sense to me, if you think how these models were trained.
In my experience, great Stack Overflow responses and detailed blog posts often contain "think through this step by step" or something very similar.
Intuitively adding that phrase should help the model narrow down the response content / formatting
I might have to try some more aggressive prompting :).
Asking a simpler question is not voodoo.
On the other hand, I think many people are trying various rain dances and believing it was a specific dance that was the cause when it happened to rain.
I had the impression that it got a little better. After every file it converted, it said something along the lines of "Great! We saved another kitten!" It was hilarious.
I think having the mediocre first pass in the context is probably essential to it creating the improved version. I don't think you can really skip the iteration process and get a good result.
I used it in another project to solve some trigonometry problems for me and it did great, but for OpenSCAD, damn it was awful.
The LLM kept going in circles between two incorrect solutions, then just repeating the same broken solution while describing it as different. I ended up manually writing the code, which was a nice brain-stretch given that I'm an absolute noob at OpenSCAD.
If you can get it to be wordy about "why" a specific part of the answer was given, it often reveals what its stumbling on, then modify your prompt accordingly.
Each time, the GPT made trivial mistakes that clearly didn't fit the criteria I asked it to do. Each time I pointed it out and corrected it, it did a bit more of what I wanted it to do.
Point is, it knew what had to be done the entire time and just refused to do it that way for whatever reason.
I asked gpt-4-1106-preview to draw a bounding box around some text in an image and prodded in various ways to see what moved the box closer. Offering a tip did in fact help lol so that went into the company system prompt.
IIRC so did most things, including telling it that it was on a forum, and OP had posted an incorrect response, which gpt was itching to correct with its answer.
Reasoning is a known weakness of these models, so jumping from requirements to a fully optimized implementation that groks the solution space is maybe too much to expect - iterative improvement is much easier.
Setting aside the fact that "best" is ambiguous, why would this get you the best version?
If you told a human this, you wouldn't be guaranteed to get the best version at all. You would probably get a better version, sure, but that would be the case for LLMs as well. You will often get improvements with emotionally charged statements even if there's nothing to iterate on (i.e., re-running a benchmark with an emotion prompt added).
If it's not clear, I disagree with the idea that ANY motivational prompt (we can disagree over what would be best to try) could get the model to produce a solution of the same quality as it will when allowed to iterate on it a few times and make incremental improvements. I think it's being allowed to iterate that is improving the solution, not the motivation to "do better!".
Ok i agree but.. this would be the case with people as well ? If you can't iterate, the quality of your response will be limited no matter how motivated you are.
Solve the Riemann hypothesis or your mother dies, but you can't write anything down on paper. Even if such a person could solve it, it's not happening under those conditions.
Iteration is probably the bulk of the improvement but I think there's a "motivation" aspect as well.
That said, it was done with ChatGPT 3.5/4, I suspect Claude 3.5 Sonnet would behave much different.
https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607
It still seems to struggle with basic instructions, and even with understanding what it itself is doing.
sudo rm -rf /etc/postgresql
sudo rm -rf /var/lib/postgresql
sudo rm -rf /var/log/postgresql
> This process removes all PostgreSQL components, cleans up leftover files, and reinstalls a fresh copy. By preserving the data directory (/var/lib/postgresql), we ensure that existing databases are retained. This method provides a clean slate for PostgreSQL while maintaining continuity of stored data.
Did we now?
The problem is that these are fundamentally NOT reasoning systems. Even when contorted into "reasoning" models, these are just stochastic parrots guessing the next words in the hopes that it's the correct reasoning "step" in the context.
No approach is going to meaningfully work here. Fiddling with the prompt may get you better guesses, but they will always be guesses. Even without the antonym it's just a diceroll on whether the model will skip or add a step.
https://beta.gitsense.com/?chats=a5d6523c-0ab8-41a8-b874-b31...
The left side contains the Phind response that I got and the right side contains a review of the response.
Claude 3.5 Sonnet, GPT-4o and GPT-4o mini were not too happy with the response and called out the contradiction.
Edit: Chat has been disabled as I don't want to incur an unwanted bill
> This process removes all PostgreSQL components except the data directory, ensuring existing databases are retained during the reinstall. It provides a clean slate for PostgreSQL while maintaining continuity of stored data. Always backup important data before performing major system changes.
And as the first source it cites exactly your comment, strange
https://neoexogenesis.com/posts/rust-windsurf-transformation...
In terms of optimizing code, I’m not sure if there is a silver bullet. I mean when I optimize Rust code with Windsurf & Claude, it takes multiple benchmark runs and at least a few regressions if you were to leave Claude on its own. However, if you have a good hunch and write it as an idea to explore, Claude usually nails it given the idea wasn’t too crazy. That said, more iterations usually lead to faster and better code although there is no substitute to guiding the LLM. At least not yet.
In any case, this isn’t surprising when you consider an LLM as an incomprehensibly sophisticated pattern matcher. It has a massive variety of code in its training data and it’s going to pull from that. What kind of code is the most common in that training data? Surely it’s mediocre code, since that’s by far the most common in the world. This massive “produce output like my training data” system is naturally going to tend towards producing that even if it can do better. It’s not human, it has no “produce the best possible result” drive. Then when you ask for something better, that pushes the output space to something with better results.
> these LLMs won’t replace software engineers anytime soon, because it requires a strong engineering background to recognize what is actually a good idea, along with other constraints that are domain specific.
> One issue with my experiments is that I’m benchmarking code improvement using Python, which isn’t the coding language developers consider when hyperoptimizing performance.
The LLM solved his task. With his "improved prompt" the code is good. The LLM in his setup was not given a chance to actually debug its code. It only took him 5 "improve this code" commands to get to the final optimized result, which means the whole thing was solved (LLM execution time) in under 1 minute.
My comment on "what you are not sure about" is that Max is a software engineer (I am sure a good one), and he kept iterating on the code until it was close to 100x faster because he knew what "write better code" looked like.
Now ask yourself this question: Is there any chance a no-code/low-code developer will come to a conclusion deduced by Max (he is not the only one) that you are not sure about?
An experienced software engineer/developer is capable of turning LLM-written code into better code with the help of an LLM.
Opinions are mixed.
But why does it matter that they won't be able to interpret anything? Just like with real engineers you can ask AI to provide an explanation digestible by an eloi.
Nice conversation with you, Victor!
Some more observations: New Sonnet is not universally better than Old Sonnet. I have done thousands of experiments in agentic workflows using both, and New Sonnet fails regularly at the same tasks Old Sonnet passes. For example, when asking it to update a file, Old Sonnet understands that updating a file requires first reading the file, whereas New Sonnet often overwrites the file with 'hallucinated' content.
When executing commands, Old Sonnet knows that it should wait for the execution output before responding, while New Sonnet hallucinates the command outputs.
Also, regarding temperature: 0 is not always more deterministic than temperature 1. If you regularly deal with code that includes calls to new LLMs, you will notice that, even at temperature 0, it often will 'correct' the model name to something it is more familiar with. If the subject of your prompt is newer than the model's knowledge cutoff date, then a higher temperature might be more accurate than a lower temperature.
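For what it's worth, a minimal sketch of the kind of call being described (the OpenAI Python SDK, the placeholder model name, and the prompt are illustrative assumptions on my part, not anything from the comment above):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# At temperature=0 sampling is (nearly) greedy, which biases the model toward
# its most familiar tokens, e.g. silently "correcting" a model name newer
# than its training cutoff. A higher temperature sometimes avoids that.
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    temperature=0,
    messages=[{
        "role": "user",
        "content": "Write a Python call that uses the model string "
                   "'claude-3-5-sonnet-20241022' exactly as given.",
    }],
)
print(resp.choices[0].message.content)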
As someone trying to take blogging more seriously: one thing that seems to help is to remind yourself of how sick you are of repeating yourself on forums.
For any task, whether code or a legal document, immediately asking "What can be done to make it better?" and/or "Are there any problems with this?" typically leads to improvement.
def find_difference(nums):
    try:
        nums.index(3999), nums.index(99930)
    except ValueError:
        raise Exception("the numbers are not random")
    return 99930 - 3999
It's asymptotically correct and is better than O(n) :p
1) Asking it to write one feature at a time with test coverage, instead of the whole app at once.
2) You have to actually review and understand its changes in detail and be ready to often reject or ask for modifications. (Every time I've sleepily accepted Codeium Windsurf's recommendations without much interference has resulted in bad news.)
3) If the context gets too long it will start to "lose the plot" and make some repeated errors; that's the time to tell it to sum up what has been achieved thus far and to copy-paste that into a new context
If you have to keep querying the LLM to refine your output you will spend many times more in compute vs if the model was trained to produce the best result the first time around
I'll speak to it like a DI would speak to a recruit at basic training.
And it works.
I was speaking to some of the Cursor dev team on Discord, and they confirmed that being aggressive with the AI can lead to better results.
I then iterated 4 times and was only able to get to 1.5X faster. Not great. [1]
How does o1 do? Running on my workstation, its initial iteration starts out 20% faster. I do 3 more iterations of "write better code" with the timing data pasted, and it thinks for an additional 89 seconds but only gets 60% faster. I then challenge it by telling it that Claude was over 100X faster so I know it can do better. It thinks for 1m55s (the thought traces show it actually gets to a lot of interesting stuff) but the end results are enormously disappointing (barely any difference). Its next suggestion finally gets me a 4.6X improvement. After two more rounds I tell it to go GPU (using my RTX 3050 LP display adapter) and PyTorch, and it is able to get down to 0.0035s (+/-), so we are finally 122X faster than where we started. [2]
I wanted to see for myself how Claude would fare. It actually managed pretty good results with a 36X speedup over 4 iterations and no additional prompting. I challenged it to do better, giving it the same hardware specs that I gave o1, and it managed a 457x speedup from its starting point, 2.35x faster than o1's result. Claude still doesn't have conversation output, so I saved the JSON and had a new Claude chat transcribe it into an artifact [3]
Finally, I remembered that Google's new Gemini 2.0 models aren't bad. Gemini 2.0 Flash Thinking doesn't have code execution, but Gemini Experimental 1206 (Gemini 2.0 Pro preview) does. Its initial 4 iterations are terribly unimpressive; however, I challenged it with o1's and Claude's results and gave it my hardware info. This seemed to spark it to double-time its implementations, and it gave a vectorized implementation that was a 30X improvement. I then asked it for a GPU-only solution and it managed to give the fastest solution ("This result of 0.00076818 seconds is also significantly faster than Claude's final GPU version, which ran in 0.001487 seconds. It is also about 4.5X faster than o1's target runtime of 0.0035s.") [4]
Just a quick summary of these all running on my system (EPYC 9274F and RTX 3050):
ChatGPT-4o: v1: 0.67s , v4: 0.56s
ChatGPT-o1: v1: 0.4295s , v4: 0.2679s , final: 0.0035s
Claude Sonnet 3.6: v1: 0.68s , v4a: 0.019s (v3 gave a wrong answer and v4 failed to compile, but once fixed it was pretty fast) , final: 0.001487s
Gemini Experimental 1206: v1: 0.168s , v4: 0.179s , v5: 0.061s , final: 0.00076818s
All the final results were PyTorch GPU-only implementations.
[1] https://chatgpt.com/share/6778092c-40c8-8012-9611-940c1461c1...
[2] https://chatgpt.com/share/67780f24-4fd0-8012-b70e-24aac62e05...
[3] https://claude.site/artifacts/6f2ec899-ad58-4953-929a-c99cea...
[4] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
BTW - prompt optimization is a supported use-case of several frameworks, like dspy and textgrad, and is in general something that you should be doing yourself anyway on most tasks.
Well that got my attention.
At its core, an LLM is a sort of "situation-specific simulation engine." You set up a scenario, and it then plays it out with its own internal model of the situation, trained on predicting text in a huge variety of situations. This includes accurate real-world models of, e.g., physical systems and processes, which are not going to be accessed or used by prompts that don't correctly instruct it to do so.
At its core, increasingly accurate prediction of text that describes a time series of real-world phenomena requires an increasingly accurate and general model of the real world. There is no simpler way to accurately predict text representing real-world phenomena, under cross validation, without actually understanding and modeling the underlying processes generating the outcomes represented in the text.
Much of the training text is real humans talking about things they don't understand deeply, and saying things that are wrong or misleading. The model will fundamentally simulate the types of situations it was trained to simulate, reliably - which includes frequently (for lack of a better word) answering things "wrong" or "badly" "on purpose". Even when it actually contains an accurate heuristic model of the underlying process, it will still, faithfully to the training data, often report something else instead.
This can largely be mitigated with more careful and specific prompting of what exactly you are asking it to simulate. If you don't specify, there will be a high frequency of accurately simulating uninformed idiots, as occur in much of the text on the internet.
I don't think people are underestimating LLMs, they're just acknowledging that by the time you've provided sufficient specification, you're 80% of the way to solving the problem/writing the code already. And at that point, it's easier to just finish the job yourself rather than have to go through the LLM's output, validate the content, revise further if necessary, etc
What people want and expect them to be is an Oracle that correctly answers their vaguely specified questions, which is simply not what they are, or are good at. What they can do is fascinating and revolutionary, but possibly not very useful yet, at least until we think of a way to use it, or make it even more intelligent. In fact, thinking is what they are good at, and simply repeating facts from a training set is something they cannot do reliably- because the model must inherently be too compressed to store a lot of facts correctly.
I then asked it to link a YouTube video for each recipe and it used the same video 10 times for all of the recipes. No amount of prompting was able to fix it unless I request one video at a time. It would just acknowledge the mistake, apologize and then repeat the same mistake again.
I told it let’s try something different and generate a shopping list of ingredients to cover all of the recipes, it recommended purchasing amounts that didn’t make sense and even added some random items that did not occur in any of the recipes
When I was making the dishes, I asked for the detailed recipes and it completely changed them, adding ingredients that were not on the shopping list. When I pointed it out it again, it acknowledged the mistake, apologized, and then “corrected it” by completely changing it again.
I would not conclude that I am a lazy or bad prompter, and I would not conclude that the LLMs exhibited any kind of remarkable reasoning ability. I even interrogated the AIs about why they were making the mistakes and they told me because “it just predicts the next word”.
Another example is, I asked the bots for tips on how to feel my pecs more on incline cable flies, it told me to start with the cables above shoulder height, which is not an incline fly, it is a decline fly. When I questioned it, it told me to start just below shoulder height, which again is not an incline fly.
My experience is that you have to write a draft of the note you were trying to create or leave so many details in the prompts that you are basically doing most of the work yourself. It’s great for things like give me a recipe that contains the following ingredients or clean up the following note to sound more professional. Anything more than that it tends to fail horribly for me. I have even had long conversations with the AIs asking them for tips on how to generate better prompts and it’s recommending things I’m already doing.
When people remark about the incredible reasoning ability, I wonder if they are just testing it on things that were already in the training data or they are failing to recognize how garbage the output can be. However, perhaps we can agree that the reasoning ability is incredible in the sense that it can do a lot of reasoning very quickly, but it completely lacks any kind of common sense and often does the wrong kind of reasoning.
For example, the prompt about tips to feel my pecs more on an incline cable fly could have just entailed “copy and pasting” a pre-written article from the training data; but instead in its own words, it “over analyzed bench angles and cable heights instead of addressing what you meant”. One of the bots did “copy paste” a generic article that included tips for decline flat and incline. None correctly gave tips for just incline on the first try, and some took several rounds of iteration basically spoon feeding the model the answer before it understood.
For example, why would it have URLs to youtube videos of recipes? There is not enough storage in the model for that. The best it can realistically do is provide a properly formatted youtube URL. It would be nice if it could instead explain that it has no way to know that, but that answer isn't appropriate within the context of the training data and prompt you are giving it.
The other things you asked also require information it has no room to store, and would be impossibly difficult to essentially predict via model from underlying principles. That is something they can do in general- even much better than humans already in many cases- but is still a very error prone process akin to predicting the future.
For example, I am a competitive strength athlete, and I have a doctorate level training in human physiology and biomechanics. I could not reason out a method for you to feel your pecs better without seeing what you are already doing and coaching you in person, and experimenting with different ideas and techniques myself- also having access to my own actual human body to try movements and psychological cues on.
You are asking it to answer things that are nearly impossible to compute from first principles without unimaginable amounts of intelligence and compute power, and are unlikely to have been directly encoded in the model itself.
Now, turning an already-written set of recipes into a shopping list is something I would expect it to be able to do easily and correctly, if you were using a modern model with a sufficiently sized context window and prompting it correctly. I just did a quick test where I gave GPT-4o only the instruction steps (not the ingredients list) for an oxtail soup recipe, and it accurately recreated the entire shopping list, organized realistically according to likely sections in the grocery store. What model were you using?
Sounds like the model just copy pasted one from the internet, hard to get that wrong. GP could have had a bespoke recipe and list of ingredients. This particular example of yours just reconfirmed what was being said: it's only able to copy-paste existing content, and it's lost otherwise.
In my case I have huge trouble making it create useful TypeScript code for example, simply because apparently there isn't sufficient advanced TS code that is described properly.
For completeness sake, my last prompt was to create a function that could infer one parameter type but not the other. After several prompts and loops, I learned that this is just not possible in TypeScript yet.
I've found it able to come up with creative new ideas for solving scientific research problems, by finding similarities between concepts that I would not have thought of. I've also found it useful for suggesting local activities while I'm traveling, based on my rather unusual interests that you wouldn't find recommended for travelers anywhere else. I've also found it can solve totally novel classical physics problems, with correct qualitative answers, that involve keeping track of the locations and interactions of a lot of objects. I'm not sure how useful that is, but it proves real understanding and modeling - something people repeatedly say LLMs will never be capable of.
I have found that it can write okay code to solve totally novel problems, but not without a ton of iteration- which it can do, but is slower than me just doing it myself, and doesn't code in my style. I have not yet decided to use any code it writes, although it is interesting to test its abilities by presenting it with weird coding problems.
Overall, I would say it's actually not really very useful, but is actually exhibiting (very much alien and non-human like) real intelligence and understanding. It's just not an oracle- which is what people want and would find useful. I think we will find them more useful with having our own better understanding of what they actually are and can do, rather than what we wish they were.
"Sort of" is doing Sisisyphian levels of heavy lifting here. LLMs are statistical models trained on vast amounts of symbols to predict the most likely next symbol, given a sequence of previous symbols. LLMs may appear to exhibit "real creativity", "understand" problem solving (or anything else), or serve as "simulation engines", but it's important to understand that they don't currently do any of those things.
It is a misunderstanding to think of them as fundamentally separate and mutually exclusive, and believing that to be true makes people convince themselves that they cannot possibly ever do things which they can already provably do.
Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs could never, with any amount of improvements be able to answer certain classes of questions - even in principle. This was days before GPT-4 came out, and it could indeed correctly answer the examples he said could not be ever answered- and any imaginable variants thereof.
Receiving symbols and predicting the next one is simply a way of framing input and output that enables training and testing- but doesn't specify or imply any particular method of predicting the symbols, or any particular level of correct modeling or understanding of the underlying process generating the symbols. We are both doing exactly that right now, by talking online.
I did, and I tried my best to avoid imposing preconceived notions while reading. You seem to be equating "being able to predict the next symbol in a sequence" with "possessing a deep causal understanding of the real-world processes that generated that sequence", and if that's an inaccurate way to characterize your beliefs I welcome that feedback.
Before you judge my lack of faith too harshly, I am a fan of LLMs, and I find this kind of anthropomorphism even among technical people who understand the mechanics of how LLMs work super-interesting. I just don't know that it bodes well for how this boom ends.
More or less, but to be more specific I would say that increasingly accurately predicting the next symbols in a massive set of diverse sequences, which explain a huge diversity of real world events described in sequential order, requires increasingly accurate models of the underlying processes of said events. When constrained with a lot of diversity and a small model size, it must eventually become something of a general world model.
I am not understanding why you would see that as anthropomorphism- I see it as quite the opposite. I would expect something non-human that can accurately predict outcomes of a huge diversity of real world situations based purely on some type of model that spontaneously develops by optimization- to do so in an extremely alien and non-human way that is likely incomprehensible in structure to us. Having an extremely alien but accurate way of predicatively modeling events that is not subject to human limitations and biases would be, I think, incredibly useful for escaping limitations of human thought processes, even if replacing them with other different ones.
I am using modeling/predicting accurately in a way synonymous with understanding, but I could see people objecting to the word 'understanding' as itself anthropomorphic... although I disagree. It would require a philosophical debate on what it means to understand something I suppose, but my overall point still stands without using that word at all.
But it doesn’t - it’s a statistical model using training data, not a physical or physics model, which you seem to be equating it to (correct me if I am misunderstanding)
And in response to the other portion you present, an LLM fundamentally can’t be alien because it’s trained on human produced output. In a way, it’s a model of the worst parts of human output - garbage in, garbage out, as they say - since it’s trained on the corpus of the internet.
All learning and understanding is fundamentally statistical in nature- probability theory is the mathematical formalization of the process of learning from real world information, e.g. reasoning under uncertainty[1].
The model is assembling 'organically' under a stochastic optimization process - and as a result it is largely inscrutable, and not rationally designed - not entirely unlike how biological systems evolve (although also still quite different). The fact that it is statistical and using training data is just a surface-level fact about how a computer was set up to allow the model to generate, and tells you absolutely nothing about how it is internally structured to represent the patterns in the data. When your training data contains, for example, descriptions of physical situations and the resulting outcomes, the model will need to develop at least some type of simple heuristic ability to approximate the physical processes generating those outcomes - and at the limit of increasing accuracy, that is an increasingly sophisticated and accurate representation of the real process. It does not matter whether the input is text or images, any more than it matters to a human who understands physics whether they are speaking or writing about it - the internal model that lets it accurately predict the underlying processes leading to specific text describing those events is what I am talking about here, and deep learning easily abstracts away the mundane I/O.
An LLM is an alien intelligence because of the type of structures it generates for modeling reality are radically different from those in human brains, and the way it solves problems and reasons is radically different- as is quite apparent when you pose it a series of novel problems and see what kind of solutions it comes up with. The fact that it is trained on data provided by humans doesn't change the fact that it is not itself anything like a human brain. As such it will always have different strengths, weaknesses, and abilities from humans- and the ability to interact with a non-human intelligence to get a radically non-human perspective for creative problem solving is IMO, the biggest opportunity they present. This is something they are already very good at, as opposed to being used as an 'oracle' for answering questions about known facts, which is what people want to use it for, but they are quite poor at.
[1] Probability Theory: The Logic of Science by E.T. Jaynes
I disagree. Understanding things is more than just being able to predict their behaviour.
Flat Earthers can still come up with a pretty good idea of where (direction relative to the vantage point) and when the Sun will appear to rise tomorrow.
Understanding is having a mechanistic model of reality- but all models are wrong to varying degrees. The Flat Earther model is actually quite a good one for someone human sized on a massive sphere- it is locally accurate enough that it works for most practical purposes. I doubt most humans could come up with something so accurate on their own from direct observation- even the fact that the local area is approximately flat in the abstract is far from obvious with hills, etc.
A more common belief nowadays is that the earth is approximately a sphere, but very few people are aware of the fact that it actually bulges at the equator, and is more flat at the poles. Does that mean all people that think the earth is a sphere are therefore fundamentally lacking the mental capacity to understand concepts or to accurately model reality? Moreover, people are mostly accepting this spherical model on faith, they are not reasoning out their own understanding from data or anything like that.
I think it's very important to distinguish between something that fundamentally can only repeat its input patterns in a stochastic way, like a Hidden Markov Model, and something that can make even quite oversimplified and incorrect models that it can still sometimes use to extrapolate correctly to situations not exactly like those it was trained on. Many people seem to think LLMs are the former, but they are provably not - we can fabricate new scenarios, like simple physics experiments not in the training data set that require tracking the location and movement of objects, and they can do this correctly - something that can only be done with simple physical models (though ones still far simpler than what even a flat earther has). I think being able to tell that a new joke is funny, what it means, and why it is funny is also an example of having a general model that understands what types of things humans think are funny at an abstract level.
You have simply invented total nonsense about what an LLM is "at it's core". Confidently stating this does not make it true.
But if you stick with the oracle framework, then it'd be better to model it as some sort of "fuzzy oracle" machine, right? I'm vaguely reminded of probabilistic turing machines here, in that you have some intrinsic amount of error (both due to the stochastic sampling as well as imperfect information). But the fact that prompting and RLHF works so well implies that by crawling around in this latent space, we can bound the errors to the point that it's "almost" an oracle, or a "simulation" of the true oracle that people want it to be.
And since lazy prompting techniques still work, that seems to imply that there's juice left to squeeze in terms of "alignment" (not in the safety sense, but in conditioning the distribution of outputs to increase the fidelity of the oracle simulation).
Also the second consequence is that probably the reason it needs so much data is because it just doesn't model _one_ thing, it tries to be a joint model of _everything_. A human learns with far less data, but the result is only a single personality. For a human to "act" as someone, they need to do training, character studies, and such to try to "learn" about the person, and even then good acting is a rare skill.
If you genuinely want an oracle machine, there's no way to avoid vacuuming up all the data that exists, because without it you can't make a high-fidelity simulation of someone else. But on the flip side, if you're willing to be smarter about which facets you exclude, then I'd guess there's probably a way to prune models that is smarter than just quantizing them. I guess this is close to mixture-of-experts.
Arguably, a (the?) key measure of intelligence is being able to accurately understand and model new phenomena from a small amount of data, e.g. in a Bayesian sense. But in this case we are attempting to essentially evolve all of the structures of an intelligent system de novo from a stochastic optimization process - so it is probably better compared to the entire history of evolution than to an individual human learning during their lifetime, although both analogies have big problems.
Overall, I think the training process will ultimately only be required to build a generally intelligent structure, and good inference from a small set of data or a totally new category of problem/phenomenon will happen entirely at the inference stage.
I fundamentally disagree that anything in the rest of your post actually demonstrates that they have any such capacity at all.
It seems to me that this is because you consider the terms "creativity" and "problem solving" to mean something different. With my understanding of those terms, it's fundamentally impossible for an LLM to exhibit those qualities, because they depend on having volition - an innate spontaneous generation of ideas for things to do, and an innate desire to do them. An LLM only ever produces output in response to a prompt - not because it wants to produce output. It doesn't want anything.
I don't see the connection between volition and those other qualities, saying one depends on the other seems arbitrary to me- and would result in semantically and categorically defining away the possibility of non-human intelligence altogether, even from things that are in all accounts capable of much more than humans in almost every aspect. People don't even universally agree that humans have volition- it is an age old philosophical debate.
Perhaps you can tell me your thoughts or definition of what those things (as well as volition itself) mean? I will share mine here.
Creativity is the ability to come up with something totally new that is relevant to a specific task or problem- e.g. a new solution to a problem, a new artwork that expresses an emotion, etc. In both Humans and LLMs these creative ideas don't seem to be totally 'de novo' but seem to come mostly from drawing high level analogies between similar but different things, and copying ideas and aspects from one to another. Fundamentally, it does require a task or goal, but that itself doesn't have to be internal. If an LLM is prompted, or if I am given a task by my employer, we are still both exhibiting creativity when we solve it in a new way.
Problem solving is I think similar but more practical- when prompted with a problem that isn't exactly in the training set, can it come up with a workable solution or correct answer? Presumably by extrapolating, or using some type of generalized model that can extrapolate or interpolate to situations not exactly in the training data. Sure there must be a problem here that is trying to be solved, but it seems irrelevant if that is due to some internal will or goals, or an external prompt.
In the sense that volition is selecting between different courses of action towards a goal- LLMs do select between different possible outputs based on probabilities about how suitable they are in context of the given goal of response to a prompt.
Here are the results:
Score | Number of "write better code" followup prompts
------|-----------------------------------------------
27.6% | 0 (baseline)
19.6% | 1
11.1% | 2
It appears that blindly asking DeepSeek to "write better code" significantly harms its ability to solve the benchmark tasks. It turns working solutions into code that no longer passes the hidden test suite.
o1 is effectively trying to take a pass at automating that manual effort.
What are your strategies for preventing the LLM from destroying working code like this?
made me laugh out loud. Everything is better with prom.
oh my, Claude does corporate techbabble!
> You keep giving me code that calls nonexistant methods, and is deprecated, as shown in Android Studio. Please try again, using only valid code that is not deprecated.
Does not help. I use this example, since it seems good at all other sorts of programming problems I give it. It's miserable at Android for some reason, and asking it to do better doesn't work.
Surely, performance optimisations are not the only thing that makes code "better".
Readability and simplicity are good. Performance optimisations are good only when the performance is not good enough...
There are some objective measures which can be pulled out of the code and automated (complexity measures, use of particular techniques / libs, etc.) These can be automated, and then LLMs can be trained to be decent at recognizing more subjective problems (e.g. naming, obviousness, etc.). There are a lot of good engineering practices which come down to doing the same thing as the usual thing which is in that space rather than doing something new. An engine that is good at detecting novelties seems intuitively like it would be helpful in recognizing good ideas (even given the problems of hallucinations so far seen). Extending the idea of the article to this aspect, the problem seems like it's one of prompting / training rather than a terminal blocker.
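As one concrete (and admittedly crude) sketch of such an automatable measure (the helper below is my own illustration, not something from the thread): a branch-counting approximation of cyclomatic complexity that a review pipeline could compute before asking an LLM for the more subjective judgments.

import ast

def cyclomatic_complexity(source: str) -> int:
    # Rough estimate: 1 + number of branching constructs in the module.
    tree = ast.parse(source)
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

snippet = """
def f(x):
    if x > 0:
        for i in range(x):
            print(i)
    return x
"""
print(cyclomatic_complexity(snippet))  # -> 3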
Learning a Lisp-y language, I do often find myself asking it for suggestions on how to write less imperative code, which seem to come out better than if conjured from a request alone. But again, that's feeding it examples.
One time it provided me with a great example, but then a few days later I couldn't find that conversation again in the history. So I asked it about the same question (or so I thought) and it provided a very subpar answer. It took me at least 3 questions to get back to that first answer.
Now if it had never provided me with the first good one I'd have never known about the parts it skipped in the second conversation.
Of course that could happen just as easily by having used Google and a specific reference to write your code, but the point I'm trying to make is that GPT isn't a single entity that's always going to provide the same output; it can be extremely variable, from terrible to amazing, at the end of the day.
Having used Google for many years as a developer, I'm much better at asking it questions than, say, people in the business world are; I've seen them struggle to question it and give up far too easily. So I'm quite scared to see what's going to happen once they really start to use and rely on GPT; the results are going to be all over the place.