Btw I thought, from the title, this would be about an AI taught to dismiss anyone's work but their own, blithely hold forth on code they had no experience with, and misinterpret goals and results to fit their preconceived notions. You know, to read code like a Senior Developer.
you mean, as in "code written by someone else == bad code" ?
I usually instruct Claude/ChatGPT/etc. not to generate any code until I tell it to, as they are eager to do so and often box themselves into a corner early on
works pretty well, especially because you can use a more capable model for architecting and a cheaper one to code
In fact I often ask whatever model I’m interacting with to not do anything until we’ve devised a plan. This goes for search, code, commands, analysis, etc.
It often leads to better results for me across the board. But often I need to repeat those instructions as the chat gets longer. These models are so hyped to generate something even if it’s not requested.
On the other hand, I expect that programming languages will keep evolving, and the next generation or so might be designed with LLMs in mind.
For instance, there's a conversation on the Rust lang forum about how to best extract API documentation for processing by an LLM. Will this help? No idea. But it's an interesting experiment nevertheless.
Ultimately, LLMs (like humans) can keep a limited context in their "brains". To use them effectively, we have to provide the right context.
Am I the one missing something here?
More broadly, it's almost impossible nowadays to find what worked for other people in terms of prompting and using LLMs for various tasks within an AI product. Everyone guards this information religiously as a moat. A few open source projects are all you have if you want a jumpstart on how an LLM-based system is productized.
Or any actual "proof" (i.e. source code) that your method is useful? I have seen a hundred articles like this one and, surprise!, no one ever posts source code that would confirm the results.
But where can I get high quality data of codebases, prompts, and expected results? How do I benchmark one codebase output vs another?
Would love any tips from the HN community
h1, h2, h3 {
font-feature-settings: "kern" 1, "liga" 1, "pnum" 1, "tnum" 0, "onum" 1, "lnum" 0, "dlig" 1;
font-variant-ligatures: discretionary-ligatures;
}
https://fonts.google.com/specimen/Lato?preview.text=Reaction...
EDIT: It's called ligatures: https://developer.mozilla.org/en-US/docs/Web/CSS/font-varian.... The CSS for headings on this site turns on some extra ligatures.
(So does `font-feature-settings: "dlig" 1`, which is the low-level equivalent; the site includes both.)
These are ligatures. I got the code to enable them from Kenneth's excellent Normalize-Opentype.css [0]
[0]: https://kennethormandy.com/journal/normalize-opentype-css/
I was probably not alive the last time anyone would have learned that you should read existing code in some kind of linear order, let alone programming. Is that seriously what the author did as a junior, or is it a weirdly stilted way to make an analogy to sequential information being passed into an LLM... which also seems to misunderstand the mechanism of attention if I'm honest
I swear like 90% of people who write about "junior developers" have a mental model of them that just makes zero sense, one they've constructed out of a need to dunk on a made-up guy to make their point
I would read it start to finish. Later on, I learned to read the abstract, then jump to either the conclusion or some specific part of the motivation or results that was interesting. To be fair, I’m still not great at reading these kinds of things, but from what I understand, reading it start to finish is usually not the best approach.
So, I think I agree that this is not really common with code, but maybe this can be generalized a bit.
It really, really depends on who you are and what your goal is. If it's your area, then you can probably skim the introduction and then forensically study methods and results, mostly ignore conclusion.
However, if you're just starting in an area, the opposite parts are often more helpful, as they'll provide useful context about related work.
Academic papers are designed to be read from start to finish. They have an abstract to set the stage, an introduction, a more detailed setup of the problem, some results, and a conclusion in order.
A structured, single-document academic paper is not analogous to a multi-file codebase.
Also: https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...
No, it’s exactly the opposite: when I write papers I follow a rigid template of what a reader (reviewer) expects to see. Abstract, intro, prior/related work, main claim or result, experiments supporting the claim, conclusion, citations. There’s no room or expectation to explain any of the thought process that led to the claim or discovery.
The vast majority of papers follow this template.
Given the variety of responses here, I wonder if some of this is domain specific.
I would not say it should be read start to finish; I often had to read over parts multiple times to understand it.
Reading start to finish is only worth it if you're interested in the gory details, I'm usually not.
I was reading mostly neuroscience papers when I was taught this method as an undergrad (though the details are a bit fuzzy these days).
I’d bet it also varies quite a bit with expertise/familiarity with the material. A newcomer will have a hard time understanding the methodology of a niche paper in neuroscience, for example, but the concepts communicated in the abstract and other summary sections are quite valuable.
The difference is usually papers written that badly don't go into "production"--they don't pass review.
I usually read code top-to-bottom (at least on a first pass) in two ways--both root-to-leaf in the directory/package structure and top-to-bottom in each source file. Only then when I've developed some theory of what it's about do I "jump around" and follow e.g. xref-find-references. This is exactly analogous to how I approach academic papers.
I think the idea that you can't (or shouldn't?) approach code this way is a psychological adaptation to working on extremely badly wrought codebases day in and day out. Because the more you truly understand about them the more depressing it gets. Better just to crush those jira points and not think too much.
I think you're jumping ahead and missing a point that the article itself made: there are indeed bootcamp developers who were taught this way. I have spent quite a number of hours of my life trying to walk some prospective developers back from this mindset.
That said I think that you could write this entire article without dunking on junior developers and I don't consider it particularly well written, but that's a separate issue I guess.
But yeah, having now read the whole thing, I'm mostly taking issue with the writing style, I guess. I find the method they tried interesting, but it's worth noting that it's ultimately just another datapoint for the value of multi-scale analytic techniques when processing most kinds of complex data (which is a great thing to have applied here, don't get me wrong)
Edited the post to improve clarity. Thanks for the writing tip!
"Remember your first day reading production code? Without any experience with handling mature codebases, you probably quickly get lost in the details".
Which looks pretty much accurate. And yes, this includes the (later) implied idea that many juniors would read a PR in some kind of linear order, or at least, not read it in order of importance, or don't know how to properly order their PR code reading. And yes, some just click in the order Github shows the changed files.
Note that for 99% of the industry, "junior dev" is not the same as something like:
"just out of uni person with 12+ years of experience programming since age 10, who built a couple of toy compilers before they were 16, graduated Stanford, and was recently hired at my FAANG team"
It's usually something between that and the DailyWTF fare, often closer to the latter.
> Remember your first day reading production code? You probably did what I did - start at line 1, read every file top to bottom, get lost in the details.
I copied before refreshing, and sure enough that line was modified.
If you want to dive all the way down that rabbit hole, can I recommend you check out the Wikipedia article for the book Literate Programming [1] by Donald Knuth [2].
[1]: https://en.wikipedia.org/wiki/Literate_programming [2]: https://en.wikipedia.org/wiki/Donald_Knuth
The range of (areas of) competence is just so damn vast in our industry that any discussion about the quality of generated code (or code reviews in this case) is doomed. There just isn't a stable, shared baseline for what quality looks like.
I mean really - how on earth can Jonny Startup, who spends his days slinging JS/TS to get his business launched in < a month[1], and Terrence PhD the database engineer, who writes simulation tested C++ for FoundationDB, possibly have a grounded discussion about code quality? Rarely do I see people declaring their priors.
Furthermore, the article is so bereft of detail and gushes so profusely about the success and virtues of their newly minted "senior level" AI that I can't help but wonder if they're selling something...
/rant
[1] Please don't read this as slight against Jonny Startup, his priorities are different
Conversely, if Terrence has only ever worked in high rigour environments, he's unlikely to understand Jonny's perspective when Jonny says that code generation tools are doing amazing "reliable" things.
Again, this isn't meant to be a value judgement against either Jonny or Terrence, more that they don't have shared context & understanding on what and how the other is building, and therefore are going to struggle to have a productive conversation about a magic blackbox that one thinks will take their job in 6 months.
With all the money in the AI space these days, my prior probability for an article extolling the virtues of AI actually trying to sell something is rather high.
I just want a few good unbiased academic studies on the effects of various AI systems on things like delivery time (like are AI systems preventing IT projects from going overtime on a fat-tailed distribution? is it possible with AI to put end to the chapter of software engineering projects going disastrously overtime/overbudget?)
> Remember your first day reading production code? You probably did what I did - start at line 1, read every file top to bottom, get lost in the details.
Now it reads:
> Remember your first day reading production code? Without any experience with handling mature codebases, you probably quickly get lost in the details.
That’s a trivial change to make for a line that did not receive the feedback that the author wanted. If that’s the case, maybe the text was more about saying what people wanted to hear than honestly portraying how to make AI read code better.
Top to bottom left to right is how we read text (unless you are using Arabic or Hebrew!), the analogy was fine IMO. Don’t let one HN comment shake your confidence, while people here may be well intentioned they are not always right.
I've been a lurker on HN ever since I was a kid. I've seen over and over how HN is the most brusque & brutal online community.
But that's also why I love it. Taking every piece of feedback here to learn and improve in the future, and feeling grateful for the thousands of views my article is receiving!
Your article has been very well received, and it wasn't because that one line deceived people into paying attention, it's because the content is good.
I suppose it's not safe to assume that everyone started out like this. But advael is guilty of assuming that nobody started out like this. And on top of that, conveying it in a very negative and critical way. Don't get discouraged.
And, indeed, reading every file from top to bottom was very alien to me as a junior.
I would just try to get to the file where I thought the change needed to be made and start with trial and error. Definitely not checking the core files, much less creating a mental model of the architecture (the very concept of architecture would have been alien to me then).
I would get lost in irrelevant details (because I thought they were relevant), while completely missing the details that did matter.
Actually, the article shows that feeding an AI "structured" source code files instead of just a "flat full set" of files allows the LLM to give better insights
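My guess at what "structured" means in practice, as opposed to dumping every file verbatim into the prompt (the layout below is an assumption, not the article's actual format):
    from pathlib import Path

    def flat_context(root: str) -> str:
        # The "flat full set": every file concatenated into one blob.
        return "\n\n".join(p.read_text() for p in sorted(Path(root).rglob("*.py")))

    def structured_context(root: str) -> str:
        # A map first: path, size, and the first non-empty line of each file,
        # so the model can decide which files to request in full.
        lines = []
        for p in sorted(Path(root).rglob("*.py")):
            first = next((ln.strip() for ln in p.read_text().splitlines() if ln.strip()), "")
            lines.append(f"{p.relative_to(root)} ({p.stat().st_size} bytes): {first}")
        return "\n".join(lines)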
Some of us have been around since before the concept of a “Pull Request” even existed.
Early in my career we used to print out code (on paper, not diffs) and read / have round table reviews in person! This was only like 2 decades ago, too!
It's only when a PR reaches a fairly high complexity (typically a refactoring, rather than a new feature) that I take the effort to sort it any further.
So, yeah, I guess I'm pleading guilty of doing that? But also, in my decades of experience, it works for me. I'm sure that there are other manners of reviewing, of course.
That seems to have been the case: compare the tricks people had to do with GPT-3 to how Claude Sonnet 3.6 performs today.
In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y.[1] The set X is called the domain of the function[2] and the set Y is called the codomain of the function.[3]
Not to call you out specifically, but a lot of people seem to misunderstand AI as being just like any other piece of code. The problem is that, unlike most of the code and functions we write, it's not simply another function, and even worse, it's usually not deterministic. If we both give a function the same input, we should expect the same output. But this isn't the case when we paste text into ChatGPT or something similar.
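To make that concrete, here's a toy sketch of sampling; the candidate tokens and probabilities are invented, the point is just that the same input can map to different outputs:
    import random

    def next_token(prompt: str, temperature: float = 1.0) -> str:
        # Pretend these are the model's probabilities for the next token.
        candidates = {"Paris": 0.8, "Lyon": 0.15, "Nice": 0.05}
        # Temperature scaling: p ** (1/T); random.choices renormalizes the weights.
        weights = [p ** (1.0 / temperature) for p in candidates.values()]
        return random.choices(list(candidates), weights=weights, k=1)[0]

    print(next_token("The capital of France is"))  # may print Paris...
    print(next_token("The capital of France is"))  # ...or Lyon on another run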
So what was the initial prompt? "What's in this file?"
And then you added context and it became context-aware. A bit of an overstatement to call this a "Holy Shit moment".
Also, why "we"? What is "our AI"? And what is "our benchmark script"?
And how big is your codebase? 50k files? 20 files?
This post has very, very little value without a ton of details; it looks like nowadays everything labeled "AI" gets to the front page.
it’s been this way for like a year or more. hype machine gotta hype.
For example, Cursor Agent mode does this out of the box. It literally looks for context before applying features, changes, fixes etc. It will even build, test and deploy your code - fixing any issues it finds along the way.
I haven't tried Cursor yet, but for me Cline does an excellent job. It uses internal mechanisms to understand the code base before making any changes.
Personally I've been very impressed with Cursor Agent mode, I'm using it almost exclusively. It understands the entire codebase, makes changes across files, generates new files, and interacts with terminal input/output. Using it, I've been able to build, test & deploy fullstack React web apps and three.js games from scratch.
Edit: I see some people are disagreeing. I wish they explained how they imagine that would work.
What did they have for lunch? We'll never know.
My coding agent allows you to put any number of named blocks in your code and then mention those in your prompts by name, and the AI understands what code you mean. Here's an example:
In my code:
-- block_begin SQL_Scripts
...some sql scripts...
-- block_end
Example prompt: Do you see any bugs in block(SQL_Scripts)?
Up to you, but several editors have established syntax that any code-trained model will likely have seen plenty of examples of:
vim (set foldmethod=marker and then {{{ begin\n }}} end\n ) https://vimdoc.sourceforge.net/htmldoc/usr_28.html#28.6
JetBrains <editor-fold desc=""></editor-fold> https://www.jetbrains.com/help/idea/working-with-source-code...
Visual Studio (#pragma region) https://learn.microsoft.com/en-us/cpp/preprocessor/region-en... (et al, each language has its own)
UPDATE: In other words it's always "block_begin" "block_end" regardless of what the comment characters are which will be different for different files of course.
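For illustration, a minimal sketch of how a block(...) reference can be resolved before the prompt ever reaches the model; this is a simplified toy, not the agent's actual code:
    import re

    def extract_block(source: str, name: str) -> str | None:
        # Return the lines between "block_begin <name>" and the next "block_end",
        # regardless of the comment characters in front of the markers.
        collecting, body = False, []
        for line in source.splitlines():
            if not collecting and "block_begin" in line and line.split()[-1] == name:
                collecting = True
            elif collecting and "block_end" in line:
                return "\n".join(body)
            elif collecting:
                body.append(line)
        return None

    def expand_prompt(prompt: str, source: str) -> str:
        # Replace every block(Name) mention in the prompt with the named block's text.
        return re.sub(
            r"block\((\w+)\)",
            lambda m: extract_block(source, m.group(1)) or m.group(0),
            prompt,
        )
So "Do you see any bugs in block(SQL_Scripts)?" gets expanded to include the actual SQL before the model sees it.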
At a superficial level, I have no idea what "shared patterns" means or why it logically follows that sharing them would cause a race condition. It also starts out talking about authentication changes, but then cites a PR that modified "retry logic"—without that shared context, it's not clear to me that an auth change has anything to do with retry logic unless the retry is related to retries on authentication failures.
I'm surprised this "senior developer AI reviewer" did not catch this bug...
In a way, we've solved the raw "intelligence" part -- the next token prediction. (At least in certain domains like text.)
But now we have to figure out how to structure that raw intelligence into actual useful thinking patterns. How to take a problem, analyze it, figure out ways of breaking it down, try those ways until you run into roadblocks, then start figuring out some solution ideas, thinking about them more to see if they stand up to scrutiny, etc.
I think there's going to be a lot of really interesting work around that in the next few years. A kind of "engineering of practical thinking". This blog post is a great example of one first step.
My go-to framing is:
1. We've developed an amazing tool that extends a document. Any "intelligence" is in there.
2. Many uses begin with a document that resembles a movie-script conversation between a computer and a human, alternately adding new lines (from a real human) and performing the extended lines that parse out as "Computer says."
3. This illusion is effective against homo sapiens, who are biologically and subconsciously primed to make and experience stories. We confuse the actor with the character with the scriptwriter.
Unfortunately, the illusion is so good that a lot of developers are having problems pulling themselves back to the real world too. It's as if we're trying to teach fashion-sense and embarrassment and empathy to a cloud which looks like a person, rather than changing how the cloudmaker machine works. (The latter also being more difficult and more expensive.)
It's easy to get LLMs to do seemingly amazing things. It's incredibly hard to build something where it does this amazing thing consistently and accurately for all reasonable inputs.
> Analyzing authentication system files:
> - Core token validation logic
> - Session management
> - Related middleware
This hard coded string is doing some very heavy lifting. This isn't anything special until this string is also generated accurately and consistently for any reasonable PR.
OP if you are reading, the first thing you should do is get a variety of codebases with a variety of real world PRs and set up some evals. This isn't special until evals show it producing consistent results.
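A minimal sketch of what I mean by evals, assuming each case is a directory holding the PR diff plus a human-written list of findings any decent review must mention (all the file names and the review_pr() hook are placeholders, not anything from the article):
    import json
    from pathlib import Path

    def score_reviewer(cases_dir: str, review_pr) -> float:
        # Each case directory holds pr.diff (the input) and expected.json with a
        # "must_mention" list of findings written down by a human reviewer.
        hits = total = 0
        for case in sorted(Path(cases_dir).iterdir()):
            diff = (case / "pr.diff").read_text()
            expected = json.loads((case / "expected.json").read_text())
            review = review_pr(diff).lower()   # the LLM pipeline under test
            for finding in expected["must_mention"]:
                total += 1
                hits += finding.lower() in review
        return hits / total if total else 0.0
Run the same cases against both prompting strategies a few times each (the output isn't deterministic) and compare scores. Keyword matching is crude, but it's enough to tell whether a change helped or hurt.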
Any tips on how should I get codebases and real world PRs? Are the ones on popular open source repos on GitHub sufficient? I worry that they don't really accurately reflect real world closed source experience because of the inherent selection bias.
Secondly, after getting all this, how do I evaluate which method gave better results? Should it be done by a human, or should I just plug an LLM to check?
Sigh.
Like cloud-scale, no-code scale, or NoSQL scale? You are confused, which shows that, maybe, you should not be using such tools with the experience that you don't have.
That is the dumbest statement I have heard this week. You should perhaps refrain from commenting, at least until you gain the modicum of intelligence that you currently don't have.
No need to switch to a different repo for a quick test; just make it reproducible on your current repo.
Perhaps if they're some day augmented by formal methods, that might change.
I may accidentally have been inspired by your message when I wrote the following piece yesterday: https://yoric.github.io/post/formal-ai/
As opposed to what, yet another beginner React app? That’s what everyone seems to be testing with but none of the projects I’ve seen are reflective of a production codebase that’s years old and has been touched by a dozen developers.
Throw it at a complicated non-frontend mixed language repo like cxx-qt [1] or something, preferably where the training data doesn’t include the latest API.
That is the reason LLMs in their current shape are pretty useless to me for most tasks.
They happily mix different versions of popular frameworks, so I have to do so much manual work to fix it that I'd rather do it all myself.
Pure (common) math problems, or other domains where the tech did not change so much, like bash scripts or regexes, are where I can use them. But my actual code? Not really. The LLM would need to be trained only on the API version I use, and that is not a thing yet, as far as I am aware.
And I think he left out the most important part: was the answer actually right? The real value of any good dev is that he can provide reasonably accurate analysis with logic and examples. "Could have an error" is more like a compiler warning than the output of a good engineer.
Side note: "broke the benchmark script?" If you have an automated way to qualitatively evaluate the output of an LLM in a reasonably broad context like code reading, that's far bigger a story.
Wouldn't you have the AI annotate it?
Let me spell it out for you. These results. Are. Not. Worthless.
Certainly what you said is correct on what he “should” do to get additional data, but your tonality of implying that the results are utter trash and falsely anthropomorphizing something is wrong.
Why is it wrong? Imagine Einstein got most things wrong in his life. Most things but he did discover special and general relativity. It’s just everything else was wrong. Relativity is still worth something. The results are still worthwhile.
We have an example of an LLM hallucinating. Then we have another example of additional contextual data causing the LLM to stop hallucinating. This is a data point leaving a clue about hallucinations and stopping hallucinations. It’s imperfect but a valuable clue.
My guess is that there’s a million causal factors that cause an LLM to hallucinate and he’s found one.
If he does what he did a multitude of times, for different topics and different problems where contextual data stops a hallucination, then with enough data and categorization of said data we may be able to output statistical data and have insight into what's going on from a statistical perspective. This is just like how we analyze other things that produce fuzzy data, like humans.
Oh no! Am I anthropomorphizing again?? Does that action make everything I said wrong? No, it doesn’t. Humans produce correct data when given context. It is reasonable to assume in many cases LLMs will do the same. I wrote this post because I agree with everything you said but not your tone which implies that what OP did is utterly trivial.
Their comment is "do it consistently, then I'll buy your explanation"
He didn't literally say it, but the comment implies it is worthless, as does yours.
Humans don't "buy it" when they think something is worthless. The tonality is bent this way.
He could have said, “this is amazingly useful data but we need more” but of course it doesn’t read like this at all thanks to the first paragraph. Let’s not hallucinate it into something it’s not with wordplay. The comment is highly negative.
The comparison to how a senior dev would approach the assignment, as a metaphor explaining the mechanism, makes perfect sense to me.
> We are groking how to utilize them.
Indeed.
The fact that these tools have extremely weird and new to the world interfacial quirks is what the discussion is about…
Versus how no publicly-available AI can do it consistently (yet). Although it seems like a matter of time at this point, and then work as we know it changes dramatically.
After some time, humans would gather the background info needed to be more productive, and we need to find out how to copy that.
Humans who make lots of mistakes with confidence that they aren't mistakes usually get fired or steered into a position where they can do the least amount of damage.
It's not that AI needs more background info for this type of thing. It needs the ability to iteratively check its own work and make corrections. This is what humans do better.
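Something like the loop below is what I have in mind; llm() is a placeholder for whatever chat-completion call you use, and the prompts are only illustrative:
    def review_with_self_check(llm, diff: str, max_rounds: int = 3) -> str:
        # llm(prompt) -> str stands in for whatever model call you use.
        review = llm(f"Review this diff and list concrete problems:\n{diff}")
        for _ in range(max_rounds):
            critique = llm(
                "Check the review against the diff. Quote the exact diff lines "
                "that support each claim, or reply OK if every claim is supported.\n\n"
                f"DIFF:\n{diff}\n\nREVIEW:\n{review}"
            )
            if critique.strip().upper() == "OK":
                break
            review = llm(
                "Rewrite the review, dropping or fixing the unsupported claims.\n\n"
                f"CRITIQUE:\n{critique}\n\nREVIEW:\n{review}"
            )
        return review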
Most other related issues with models these days are due to the tokenizer or a poor choice of sampler settings, which is a cheap shot on models.
LLMs can generally only do what they have data on, either from training or from instructions via prompting, it seems.
Keeping instructions reliable as they increase, and testing them, appears to benefit from LLMOps tools like Agenta, etc.
It seems to me like LLMs are reasonably well suited for things that code can't do easily, as well. You can find models on Hugging Face that are great at applying labels and categorization, instead of trying to get a generalized assistant model to do it.
I'm more and more looking at tools like OpenRouter to allow doing each step with the model that does it best, almost functionally where needed, to increase stability.
For now, it seems to be one way to improve reliability dramatically, happy to learn about what others are finding too.
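As a concrete sketch of that routing: OpenRouter exposes an OpenAI-compatible endpoint, so each step is just a different model string (the model IDs below are examples, check what's currently listed):
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
        api_key="YOUR_OPENROUTER_KEY",
    )

    def ask(model: str, system: str, user: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

    task = "Add rate limiting to the login endpoint."
    # A capable (pricier) model for the open-ended planning step...
    plan = ask("anthropic/claude-3.5-sonnet",
               "You are the architect. Produce a step-by-step plan, no code.", task)
    # ...and a cheap model for the narrow labelling/classification step.
    risk = ask("openai/gpt-4o-mini",
               "Label the risk of this plan as low, medium or high. Reply with one word.", plan)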
It seems like a pretty nascent area still where existing tooling in other areas of tech is still figuring itself out in the LLM space.
The end result was quite hilarious I have to say.
Its final verdict was:
End result? It’s a program yellin’, "HELLO WORLD!" Like me at the pub after 3 rum shots. Cheers, matey! hiccup
:D
It's really quite interesting how the LLM comes up with ways to discuss code :)
Are you trying to market a product?
Wait. You fixed your AI by doing traditional programming!?!?!
Transformers process their whole context window in parallel, unlike people who process it serially. There simply is no place that gets processed "first".
I'd love to see anyone who disagrees explain to me how that is supposed to work.
Tech debt is a management problem, not a coding problem. A statement like this undermines my confidence in the story being told, because it indicates the lack of experience of the author.
I'd argue the creation of tech debt is often coding problem. The longevity and persistence of tech debt is a management problem.
sounds like a people problem — which is a management problem.
> I'd argue the creation of tech debt is often coding problem. The longevity and persistence of tech debt is a management problem.
i’d argue the creation of tech debt is more often due to those doing the coding operating under the limitations placed on them. The longevity and persistence of tech debt is just an extension of that.
given an infinite amount of time and money, i can write an ideal solution to a solvable problem (or at least close to ideal, i’m not that good of a dev).
the limitations create tech debt, and they’re always there because infinite resources (time and money) don’t exist.
so tech debt always exists because there’s always limitations. most of the time those resource limitations are decided by management (time/money/people)
but there are language/framework/library limitations which create tech debt too though, which i think is what you might be referring to?
usually those are less common though
It's an easy out to just blame bad management for all the ills of a bad code base, and there's definitely plenty of times I've wanted to take longer to fix/prevent some tech debt and haven't been given the time, but it's self-serving to blame it all on outside forces.
It's also ignoring the times where management is making a justifiable decision to allow technical debt in order to meet some other goal, and the decision that a senior engineer often has to make is which technical debt to incur in order to work within the constraints.
This is a short and sweet article about a very cool real-world result in a very new area of tooling possibilities, with some honest and reasonable thoughts.
Maybe the "Senior vs Junior Developer" narrative is a little stretched, but the substance of the article is great
Can't help but wonder if people are getting mad because they feel threatened
Just the other day I used Cursor and iteratively implemented stories for 70 .vue files in a few hours, while also writing documentation for the components and pages, with the documentation being further fed to Cursor to write many E2Es, something that would've taken me at least a few days if not a week.
When I shared that with some coworkers, they went on a hunt to find all the shortcomings (often petty duplication of mocks, sometimes a missing story scenario, nothing major).
I found it striking as we really needed it and it provides tangible benefits:
- domain and UI stakeholders can navigate stories and think of more cases/feedback with ease from a UX/UI point of view, without having to replicate the scenarios manually through multiple time-consuming, repetitive operations in the actual applications
- documentation proved to be very valuable to a junior who joined us this very January
- E2Es caught multiple bugs in their own PRs in the weeks after
And yet, instead of appreciating the cost/benefit ratio of the solution (something that should characterise a good engineer; after all, that's our job), I was scolded because they (or I) would've done a more careful job, missing that they had never done that in the first place.
I have many such examples, such as automatically providing all the translation keys and translations for a new locale, only to find cherry-picked criticism that this or that could've been spelled differently. Of course it can; what's your job if not being responsible for the localisation? That shouldn't diminish that 95% of the content was correct and provided in a few seconds rather than days.
Why do they do that? I genuinely feel some of them feel threatened; most of it reeks of insecurity.
I can understand some criticism towards those who build and sell hype with cherry-picked results, but I cannot help but find some of the worst critics suffering from Luddism.
I suppose it's simply easier to think of them as scared and afraid of losing their jobs to robots, but the reality is most programmers already know someone who lost their job to a robot that doesn't even exist yet.
I strongly believed in the value provided by setting up stories, writing more documentation and E2Es in the few hours I had, and it did deliver.
Due to the boilerplate-y nature of the task, LLMs proved to be a great fit, leaving me reviewing more than writing thousands of lines of code across almost 80 files in a few hours rather than multiple days.
The fact that the cost/benefit ratio is lost on so many people is appalling but unsurprising in a field that severely lacks the "engineering" part and is thus uneducated to think in those terms.
In my experience, since Cursor doesn't know what a frontend app looks like, nor can it run a browser, the tests it writes are often inane.
Can you tell me what testing stack you use, and how you approach the process of writing large tests for mature codebases with Cursor?
First I had it write stories based on the pages and components. I obviously had to review the work and add more cases.
Then I had it generate a markdown file documenting the purpose, usage, and APIs for those, and combined it with user stories written in our project management tool, which I copy-pasted into different files. It helped that our user stories are written in a Gherkin-like fashion (when/and/or/then), which is computer-friendly.
As most of the components had unique identifiers in the form of data-test attributes, I could further ask it to implement more E2E cases.
Overall I was very satisfied with the cost/benefit ratio.
Stories were the most complicated part, as Cursor tended to redeclare mocks multiple times rather than sharing them, and it wasn't consistent in the API choices it made (Storybook has too many ways to accomplish the same thing).
E2Es with Playwright were the easiest part; the criticism here was that I used data attributes (which users don't see) over elements like text. I very much agree with that, as I myself am a fan of testing the way that users would. The problem is that, as our application is localized, I had to compromise in order to keep the tests parallel and fast: many tests change locale settings, which was interfering, as newly loaded pages had a different locale than expected. I'm not the only one using such attributes for testing; I know it's common practice in big cushy tech too.
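For reference, this is the tradeoff in Playwright terms; the selectors and strings below are made up, and the test-id attribute is switched to our data-test convention instead of Playwright's default data-testid:
    from playwright.sync_api import sync_playwright, expect

    with sync_playwright() as p:
        p.selectors.set_test_id_attribute("data-test")  # our convention, vs the default data-testid
        page = p.chromium.launch().new_page()
        page.goto("http://localhost:3000/login")

        # Locale-independent: survives translation, but doesn't prove the user sees the right label.
        page.get_by_test_id("login-submit").click()

        # Closer to what a user sees, but breaks as soon as the test runs under another locale.
        expect(page.get_by_text("Welcome back")).to_be_visible()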
One thing I want to note: you can't do it in a few prompts; it feels like having to convince the agent to do what you ask it, iteratively.
I'm still convinced of the cost/benefit ratio, and with practice you get better at prompting. You try to get to the result you want by manual editing and chatting, then feed the example result back in to generate more.
Success with current-day LLMs isn't about getting them to output perfect code. Having them do the parts they're good at - rough initial revs - and then iterating from there is more effective. The important metric is code (not LoC, mind you) that gets checked into git/revision control, sent for PR, and merged. Realizing when convincing the LLM to output flawless code ends up taking you in circles and is unproductive, while not throwing away the LLM as a useful tool, is where the sweet spot is.
It was a matter of cost vs benefit ratio, which ultimately resulted in net benefits. Stakeholders like designers and product don't see nor care that some mocks are repeated or that a suboptimal API is used in stories. Customers don't care why the application is broken, they care that it is, and the additional E2Es caught multiple bugs. Juniors would appreciate documentation and stories even if they might be redundant.
I think the biggest fallacy committed in evaluating the benefits of LLMs is in comparing them with the best output humans can generate.
But if those humans do not have the patience, energy or time budget to generate such output (and more often than not they don't) I think one should evaluate leveraging LLMs to lower the required effort and trying to find an acceptable sweet spot, otherwise you risk falling into Luddism.
Even as of 2025, 200 years after Luddism appeared, humans outperform machines at tailoring; that doesn't change the fact that it's thanks to machines lifting humans out of a lot of the repetitive work that we can clothe virtually every human for pennies. Nor has it removed the need for human oversight in tailoring, or that very same role behind higher quality clothes.
> The Luddites were members of a 19th-century movement of English textile workers who opposed the use of certain types of automated machinery due to concerns relating to worker pay and output quality.
I think the push for AI is the modern-day equivalent for software development - to move making programs from being labour intensive to being capital intensive. There are some of us who don't see it as a good thing.
As for my personal perspective - I view AI as the over-confident senior "salesman" programmer - it has no model for what sort of things it cannot do yet when it attempts anything it requires a lot of prompting to get somewhere which looks passable. My values for developing software is reliability and quality - which I had hoped we were going to achieve by further exploring advanced type systems (including effect systems), static analysis, model checkers, etc. Instead the "market" prioritises short-term garbage creation to push sales until the next quarterly cycle. I don't have much excitement for that.
The life situation of your coworkers could vary widely: maybe some are financially insecure, living paycheck to paycheck; maybe some have made a significant purchase and can't afford to lose their job; maybe someone recently had a child and doesn't have the time to make huge investments in their workflow and is afraid of drowning in the rising tide. Maybe they're pushing back against performative busyness, not wanting everyone to feel that to be productive they need to constantly be modifying 100s of vue files.
Maybe they're jealous of you and your cybernetic superpowers; jealousy is a completely normal human feeling. Or maybe you were going about this in an ostentatious manner, appearing to others as tooting your own horn. Maybe there's a competition for promotions and others feel the need to make such political moves like shooting your work down. Maybe this work that you did was a political move.
Technologies are deployed and utilized in certain human contexts, inside certain organizational structures, and so on. Nothing is ever a plain and simple, cold hard cost-benefit analysis.
I think LLMs today, for all their goods and bads, can do some useful work. The problem is that there is still mystery on how to use them effectively. I'm not talking about some pie in the sky singularity stuff, but just coming up with prompts to do basic, simple tasks effectively.
Articles like that are great for learning new prompting tricks and I'm glad the authors are choosing to share their knowledge. Yes, OP isn't saying the last word on prompting, and there's a million ways it could be better. But the article is still useful to an average person trying to learn how to use LLMs more productively.
>the "Senior vs Junior Developer" narrative
It sounds to me like just another case of "telling the AI to explicitly reason through its answer improves the quality of results". The "senior developer" here is better able to triage aspects of the codebase to identify the important ones (and to the "junior" everything seems equally important) and I would say has better reasoning ability.
Maybe it works because when you ask the LLM to code something, it's not really trying to "do a good job", besides whatever nebulous bias is instilled from alignment. It's just trying to act the part of a human who is solving the problem. If you tell it to act a more competent part, it does better - but it has to have some knowledge (aka training data) of what the more competent part looks like.
We segment the project into logical areas based on what the user is asking, then find interesting symbol information and use it to search call chain information which we’ve constructed at project import.
This gives the LLM way better starting context and we then provide it tools to move around the codebase through normal methods you or I would use like go_to_def.
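For a rough idea of what a call-chain index built at project import can look like, here's a Python-only toy using the standard ast module (a simplified illustration, not our actual implementation):
    import ast
    from collections import defaultdict
    from pathlib import Path

    def build_call_index(root: str) -> dict[str, set[str]]:
        # Toy "call chain" index: function name -> names it calls (Python files only,
        # direct calls by name; real tooling also resolves methods, imports, etc.).
        calls: dict[str, set[str]] = defaultdict(set)
        for path in Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    for sub in ast.walk(node):
                        if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                            calls[node.name].add(sub.func.id)
        return calls

    def callers_of(index: dict[str, set[str]], symbol: str) -> list[str]:
        # Which functions reach this symbol? That's the slice worth surfacing as context.
        return [fn for fn, callees in index.items() if symbol in callees]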
We've analyzed a lot of competitor products, and very few have done anything other than a rudimentary project skeleton like Aider, or just directly feeding opened code as context, which breaks down very quickly on large code projects.
We’re very happy with the level of quality we see from our implementation and it’s something that really feels overlooked sometimes by various products in this space.
Realistically, the only other product I know of approaching this correctly with any degree of search sophistication is Cody from SourceGraph which yeah, makes sense.
A lot of people tinkering with AI think it's more complex than it is. If you ask it to ELI5, it will do that.
Often I will say "I already know all that, I'm an experienced engineer and need you to think outside the box and troubleshoot with me. "
It works great.
It's good when it works, it's crap when it doesn't, for me it mostly doesn't work. I think AI working is a good indicator of when you're writing code which has been written by lots of other people before.
This is arguably good news for the programming profession, because that has a big overlap with cases that could be improved by a library or framework, the traditional way we've been trying to automate ourselves out of a job for several decades now.
Try giving them more context on APIs that actually exist, then, as part of the inputs...
If like me you didn't know, apparently this is mostly stylistic, and comes from a historical practice that predates printing. There are other common ligatures such as CT, st, sp and th. https://rwt.io/typography-tips/opentype-part-2-leg-ligatures
Does PR 1234 actually exist? Did it actually modify the retry logic? Does the token refresh logic actually share patterns with the notification service? Was the notification service added last month? Does it use websockets?
https://amistrongeryet.substack.com/p/unhobbling-llms-with-k...
> Our entire world – the way we present information in scientific papers, the way we organize the workplace, website layouts, software design – is optimized to support human cognition. There will be some movement in the direction of making the world more accessible to AI. But the big leverage will be in making AI more able to interact with the world as it exists.
> We need to interpret LLM accomplishments to date in light of the fact that they have been laboring under a handicap. This helps explain the famously “jagged” nature of AI capabilities: it’s not surprising that LLMs struggle with tasks, such as ARC puzzles, that don’t fit well with a linear thought process. In any case, we will probably find ways of removing this handicap
https://github.com/jimmc414/better_code_analysis_prompts_for...
I used this tool to flatten the example repo and PRs into text:
Is this an example of confabulation (hallucination)? It's difficult to tell from the post.