One major limitation the LLM has is that it can't run a profiler on the code, but we can. (This would be a fun thing to do in the future: feed the output of perf to the LLM and say 'now optimize'.)
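Something like the following is roughly all the glue that would take - a rough sketch of my own, not from the post; the binary name and perf flags are placeholders, and the actual LLM call is left out:

    // Rough sketch: run perf on the release binary, then splice its output
    // and the source into a prompt for the model.
    use std::process::Command;

    fn main() -> std::io::Result<()> {
        // "digit_sum" is a placeholder for whatever binary you're profiling
        let perf = Command::new("perf")
            .args(["stat", "--", "./target/release/digit_sum"])
            .output()?;
        // `perf stat` writes its report to stderr
        let profile = String::from_utf8_lossy(&perf.stderr);

        let source = std::fs::read_to_string("src/main.rs")?;
        let prompt = format!(
            "Here is a Rust program:\n{source}\n\nHere is its `perf stat` output:\n{profile}\n\nNow optimize the hot paths."
        );
        println!("{prompt}"); // hand this to whatever LLM API or CLI you use
        Ok(())
    }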
This has been a feature of the Scalene Python profiler (https://github.com/plasma-umass/scalene) for some time (at this point, about 1.5 years) - bring your own API key for OpenAI / Azure / Bedrock; it also works with Ollama. Optimizing Python code to use NumPy or other similar native libraries can easily yield multiple orders of magnitude improvements in real-world settings. We tried it on several of Scalene's success stories (from before the LLM integration); see https://github.com/plasma-umass/scalene/issues/58 - and found that it often automatically yielded the same or better optimizations - see https://github.com/plasma-umass/scalene/issues/554. (Full disclosure: I am one of the principal designers of Scalene.) I'm drafting a blog post that talks about how it works, but for now the documentation will have to do.
As a human performing optimization, one of the steps might be to generate a list of candidate optimizations, and after that you might explore each idea in parallel. By using the LLM iteratively, you're sidestepping that approach and constraining it to pick one optimization first and then build on that. So one cure might be to adjust your prompt and iteration strategy to take into account that there are multiple candidate solutions rather than just one. This leads to more of an agentic / chain approach than a completions approach to the task, though.
The other missing part of this is that each time you optimize the code (even if it's a failure), you learn something more about the optimization. Again, this context seems to be missing when the LLM looks at the code (unless that history is provided in Copilot completions automatically - it probably depends on what approach you're taking there).
I wonder if there are any AI tools out there doing context + planning + history + RAG + test + evaluation + tool use + feedback approaches. I've seen various tools do 2-3 of those. It's not a stretch to think that we might eventually see tools that start to emulate more of the developer inner loops.
LLMs have had truckloads of training on normal code compared to super-optimized code.
That's why you'll see them struggle with things like nginx configs, and with newer technologies where less data is available.
The skipping was obvious to me at first glance, before I read the full blog post. So o1 failing to find it even after prodding for algorithmic improvements is a sign it's not that smart (in an AGI sense).
Note: the initial problem statement is ambiguous. Instead of saying "Given a list of random integers" - which the LLM somewhat reasonably interpreted as parsing from stdin - you should phrase it as "Generate a list of random integers and then..."
Link to ChatGPT session: https://chatgpt.com/share/677d0665-3f9c-8005-aaf8-662dec923d...
https://news.ycombinator.com/item?id=42585375
Aside: the original question begins "Given a list of 1 million random integers...". For me, this suggests that we ought to be benchmarking from the point after the random numbers have been generated. I don't know how easy this benchmarking is to do in Rust, but I know how I would do it in dotnet using something like BenchmarkDotNet.
Given a function that takes a list of n random integers between x and y,
find the difference between the smallest and the largest numbers whose digits sum up to m.
Then define the parameter values for testing and performance evaluation. That should have prevented at least some of the proposed compile-time optimizations. Edit: I'm aware, though, that the performance characteristics might be very different depending on the value of n, especially when it comes to multithreading, so there should be a clause for that (e.g. n usually > 1M).
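For reference, a plain baseline of that function might look something like this (my own sketch, not code from the post):

    // Given the already-generated integers, find max - min among those whose
    // digits sum to `m`. Returns None if no number qualifies.
    fn digit_sum(mut n: u32) -> u32 {
        let mut s = 0;
        while n > 0 {
            s += n % 10;
            n /= 10;
        }
        s
    }

    fn min_max_diff(nums: &[u32], m: u32) -> Option<u32> {
        let mut min = u32::MAX;
        let mut max = 0;
        let mut found = false;
        for &n in nums {
            if digit_sum(n) == m {
                found = true;
                if n < min { min = n; }
                if n > max { max = n; }
            }
        }
        found.then(|| max - min)
    }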
I think it's fair to leave the random generation in or out, really, as long as we're still comparing apples to apples. But I agree with you that it probably would have made the benchmarking clearer had I left it out.
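For anyone curious, keeping the generation out of the measured region is straightforward with the criterion crate. A rough sketch under my own assumptions (the parameter values are placeholders, and min_max_diff stands in for whatever baseline or optimized function is being measured, e.g. the one sketched above), not the post's actual harness:

    // benches/digit_sum.rs - requires criterion and rand as dev-dependencies.
    use criterion::{black_box, criterion_group, criterion_main, Criterion};
    use rand::Rng;
    // use your_crate::min_max_diff; // the function under test

    fn bench_digit_sum(c: &mut Criterion) {
        // Generate the input once, outside the timed closure.
        let mut rng = rand::thread_rng();
        let nums: Vec<u32> = (0..1_000_000).map(|_| rng.gen_range(1..=100_000)).collect();

        // Only the search itself is measured; black_box keeps the compiler
        // from optimizing the inputs away.
        c.bench_function("min_max_diff m=30", |b| {
            b.iter(|| min_max_diff(black_box(&nums), black_box(30)))
        });
    }

    criterion_group!(benches, bench_digit_sum);
    criterion_main!(benches);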
What kind of humans are we comparing LLMs against?
Junior devs? Intermediate devs? Senior devs? Experts?
I think few junior devs would be able to apply the optimizations that were in the blog post. So maybe LLMs are already performing at a level beyond human junior devs, and quickly catching up with the next tier.
I am a little more pessimistic. I would say that LLMs have been hovering above junior for a year or two, with some progress towards intermediate in the context of a few methods/classes, but not much evidence of the wider architectural/systems-level thinking that I'd expect from an intermediate.
What kind of humans are doing that?
How much more productive is it making them?