Btw I thought, from the title, this would be about an AI taught to dismiss anyone's work but their own, blithely hold forth on code they had no experience with, and misinterpret goals and results to fit their preconceived notions. You know, to read code like a Senior Developer.
you mean, as in "code written by someone else == bad code" ?
I usually instruct Claude/ChatGPT/etc. not to generate any code until I tell it to, as they are eager to do so and often box themselves into a corner early on
works pretty well, especially because you can use a more capable model for architecting and a cheaper one to code
In fact I often ask whatever model I’m interacting with to not do anything until we’ve devised a plan. This goes for search, code, commands, analysis, etc.
It often leads to better results for me across the board. But often I need to repeat those instructions as the chat gets longer. These models are so hyped to generate something even if it’s not requested.
On the other hand, I expect that programming languages will keep evolving, and the next generation or so might be designed with LLMs in mind.
For instance, there's a conversation on the Rust lang forum about how to best extract API documentation for processing by an LLM. Will this help? No idea. But it's an interesting experiment nevertheless.
Ultimately, LLMs (like humans) can keep a limited context in their "brains". To use them effectively, we have to provide the right context.
Am I the one missing something here?
More broadly, it's almost impossible nowadays to find what worked for other people in terms of prompting and using LLMs for various tasks within an AI product. Everyone guards this information religiously as a moat. A few open source projects are all you have if you want a jumpstart on how an LLM-based system is productized.
Or any actual "proof" (i.e. source code) that your method is useful? I have seen a hundred articles like this one and, surprise!, no one ever posts source code that would confirm the results.
But where can I get high quality data of codebases, prompts, and expected results? How do I benchmark one codebase output vs another?
Would love any tips from the HN community
h1, h2, h3 {
font-feature-settings: "kern" 1, "liga" 1, "pnum" 1, "tnum" 0, "onum" 1, "lnum" 0, "dlig" 1;
font-variant-ligatures: discretionary-ligatures;
}
https://fonts.google.com/specimen/Lato?preview.text=Reaction...
EDIT: It's called ligatures: https://developer.mozilla.org/en-US/docs/Web/CSS/font-varian.... The CSS for headings on this site turns on some extra ligatures.
(So does `font-feature-settings: "dlig" 1`, which is the low-level equivalent; the site includes both.)
These are ligatures. I got the code to enable them from Kenneth's excellent Normalize-Opentype.css [0]
[0]: https://kennethormandy.com/journal/normalize-opentype-css/
I was probably not alive the last time anyone would have learned that you should read existing code in some kind of linear order, let alone programming. Is that seriously what the author did as a junior, or is it a weirdly stilted way to make an analogy to sequential information being passed into an LLM... which also seems to misunderstand the mechanism of attention if I'm honest
I swear like 90% of people who write about "junior developers" have a mental model of them that just makes zero sense, one they've constructed out of a need to dunk on a made-up guy to make their point
I would read it start to finish. Later on, I learned to read the abstract, then jump to either the conclusion or some specific part of the motivation or results that was interesting. To be fair, I’m still not great at reading these kinds of things, but from what I understand, reading it start to finish is usually not the best approach.
So, I think I agree that this is not really common with code, but maybe this can be generalized a bit.
It really, really depends on who you are and what your goal is. If it's your area, then you can probably skim the introduction and then forensically study methods and results, mostly ignore conclusion.
However, if you're just starting in an area, the opposite parts are often more helpful, as they'll provide useful context about related work.
Academic papers are designed to be read from start to finish. They have an abstract to set the stage, an introduction, a more detailed setup of the problem, some results, and a conclusion in order.
A structured, single-document academic paper is not analogous to a multi-file codebase.
Also: https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...
No, it’s exactly the opposite: when I write papers I follow a rigid template of what a reader (reviewer) expects to see. Abstract, intro, prior/related work, main claim or result, experiments supporting the claim, conclusion, citations. There’s no room or expectation to explain any of the thought process that led to the claim or discovery.
The vast majority of papers follow this template.
Given the variety of responses here, I wonder if some of this is domain specific.
I would not say it should be read start to finish; I often had to read over parts multiple times to understand it.
Reading start to finish is only worth it if you're interested in the gory details, I'm usually not.
I was reading mostly neuroscience papers when I was taught this method as an undergrad (though the details are a bit fuzzy these days).
I’d bet it also varies quite a bit with expertise/familiarity with the material. A newcomer will have a hard time understanding the methodology of a niche paper in neuroscience, for example, but the concepts communicated in the abstract and other summary sections are quite valuable.
The difference is usually papers written that badly don't go into "production"--they don't pass review.
I usually read code top-to-bottom (at least on a first pass) in two ways--both root-to-leaf in the directory/package structure and top-to-bottom in each source file. Only then when I've developed some theory of what it's about do I "jump around" and follow e.g. xref-find-references. This is exactly analogous to how I approach academic papers.
I think the idea that you can't (or shouldn't?) approach code this way is a psychological adaptation to working on extremely badly wrought codebases day in and day out. Because the more you truly understand about them the more depressing it gets. Better just to crush those jira points and not think too much.
I think you're jumping ahead and missing a point that the article itself made: there are indeed bootcamp developers who were taught this way. I have spent quite a number of hours of my life trying to walk some prospective developers back from this mindset.
That said I think that you could write this entire article without dunking on junior developers and I don't consider it particularly well written, but that's a separate issue I guess.
But yeah, having now read the whole thing, I'm mostly taking issue with the writing style, I guess. I find the method they tried interesting, but it's worth noting that it's ultimately just another datapoint for the value of multi-scale analytic techniques when processing most kinds of complex data (which is a great thing to have applied here, don't get me wrong)
Edited the post to improve clarity. Thanks for the writing tip!
"Remember your first day reading production code? Without any experience with handling mature codebases, you probably quickly get lost in the details".
Which looks pretty much accurate. And yes, this includes the (later) implied idea that many juniors would read a PR in some kind of linear order, or at least, not read it in order of importance, or don't know how to properly order their PR code reading. And yes, some just click in the order Github shows the changed files.
Note that for 99% of the industry, "junior dev" is not the same as something like:
"just out of uni person with 12+ years of experience programming since age 10, who built a couple of toy compilers before they were 16, graduated Stanford, and was recently hired at my FAANG team"
It's usually something between that and the DailyWTF fare, often closer to the latter.
> Remember your first day reading production code? You probably did what I did - start at line 1, read every file top to bottom, get lost in the details.
I copied before refreshing, and sure enough that line was modified.
If you want to dive all the way down that rabbit hole, can I recommend you check out the Wikipedia article for the book Literate Programming [1] by Donald Knuth [2].
[1]: https://en.wikipedia.org/wiki/Literate_programming [2]: https://en.wikipedia.org/wiki/Donald_Knuth
The range of (areas of) competence is just so damn vast in our industry that any discussion about the quality of generated code (or code reviews in this case) is doomed. There just isn't a stable, shared baseline for what quality looks like.
I mean really - how on earth can Jonny Startup, who spends his days slinging JS/TS to get his business launched in < a month[1], and Terrence PhD the database engineer, who writes simulation tested C++ for FoundationDB, possibly have a grounded discussion about code quality? Rarely do I see people declaring their priors.
Furthermore, the article is so bereft of detail and gushes so profusely about the success and virtues of their newly minted "senior level" AI that I can't help but wonder if they're selling something...
/rant
[1] Please don't read this as slight against Jonny Startup, his priorities are different
Conversely, if Terrence has only ever worked in high rigour environments, he's unlikely to understand Jonny's perspective when Jonny says that code generation tools are doing amazing "reliable" things.
Again, this isn't meant to be a value judgement against either Jonny or Terrence, more that they don't have shared context & understanding on what and how the other is building, and therefore are going to struggle to have a productive conversation about a magic blackbox that one thinks will take their job in 6 months.
With all the money in the AI space these days, my prior probability for an article extolling the virtues of AI actually trying to sell something is rather high.
I just want a few good unbiased academic studies on the effects of various AI systems on things like delivery time (like are AI systems preventing IT projects from going overtime on a fat-tailed distribution? is it possible with AI to put end to the chapter of software engineering projects going disastrously overtime/overbudget?)
> Remember your first day reading production code? You probably did what I did - start at line 1, read every file top to bottom, get lost in the details.
Now it reads:
> Remember your first day reading production code? Without any experience with handling mature codebases, you probably quickly get lost in the details.
That’s a trivial change to make for a line that did not receive the feedback that the author wanted. If that’s the case, maybe the text was more about saying what people wanted to hear than honestly portraying how to make AI read code better.
Top to bottom left to right is how we read text (unless you are using Arabic or Hebrew!), the analogy was fine IMO. Don’t let one HN comment shake your confidence, while people here may be well intentioned they are not always right.
I've been a lurker on HN ever since I was a kid. I've seen over and over how HN is the most brusque & brutal online community.
But that's also why I love it. Taking every piece of feedback here to learn and improve in the future, and feeling grateful for the thousands of views my article is receiving!
Your article has been very well received, and it wasn't because that one line deceived people into paying attention, it's because the content is good.
I suppose it's not safe to assume that everyone started out like this. But advael is guilty of assuming that nobody started out like this. And on top of that, conveying it in a very negative and critical way. Don't get discouraged.
And, indeed, reading every file from top to bottom was very alien to me as a junior.
I would just try to get to the file where I thought the change needed to be made and start with trial and error. Definitely not checking the core files, much less creating a mental model of the architecture (the very concept of architecture would have been alien to me then).
I would get lost in irrelevant details (because I thought they were relevant), while completely missing the details that did matter.
Actually, the article shows that feeding an AI "structured" source code files instead of just a "flat full set" of files allows the LLM to give better insights
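My guess at what "structured" means in practice, as opposed to dumping every file verbatim into the prompt (the layout below is an assumption, not the article's actual format):
    from pathlib import Path

    def flat_context(root: str) -> str:
        # The "flat full set": every file concatenated into one blob.
        return "\n\n".join(p.read_text() for p in sorted(Path(root).rglob("*.py")))

    def structured_context(root: str) -> str:
        # A map first: path, size, and the first non-empty line of each file,
        # so the model can decide which files to request in full.
        lines = []
        for p in sorted(Path(root).rglob("*.py")):
            first = next((ln.strip() for ln in p.read_text().splitlines() if ln.strip()), "")
            lines.append(f"{p.relative_to(root)} ({p.stat().st_size} bytes): {first}")
        return "\n".join(lines)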
Some of us have been around since before the concept of a “Pull Request” even existed.
Early in my career we used to print out code (on paper, not diffs) and read / have round table reviews in person! This was only like 2 decades ago, too!
It's only when a PR reaches a fairly high complexity (typically a refactoring, rather than a new feature) that I take the effort to sort it any further.
So, yeah, I guess I'm pleading guilty of doing that? But also, in my decades of experience, it works for me. I'm sure that there are other manners of reviewing, of course.
That seems to have been the case: compare the tricks people had to do with GPT-3 to how Claude Sonnet 3.6 performs today.
In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y.[1] The set X is called the domain of the function[2] and the set Y is called the codomain of the function.[3]
Not to call you out specifically, but a lot of people seem to misunderstand AI as being just like any other piece of code. The problem is that, unlike most of the code and functions we write, it's not simply another function, and even worse, it's usually not deterministic. If we both give a function the same input, we should expect the same output. But this isn't the case when we paste text into ChatGPT or something similar.
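To make that concrete, here's a toy sketch of sampling; the candidate tokens and probabilities are invented, the point is just that the same input can map to different outputs:
    import random

    def next_token(prompt: str, temperature: float = 1.0) -> str:
        # Pretend these are the model's probabilities for the next token.
        candidates = {"Paris": 0.8, "Lyon": 0.15, "Nice": 0.05}
        # Temperature scaling: p ** (1/T); random.choices renormalizes the weights.
        weights = [p ** (1.0 / temperature) for p in candidates.values()]
        return random.choices(list(candidates), weights=weights, k=1)[0]

    print(next_token("The capital of France is"))  # may print Paris...
    print(next_token("The capital of France is"))  # ...or Lyon on another run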
So what was the initial prompt? "What's in this file?"
And then you added context and it became context-aware. A bit of an overstatement to call this a "Holy Shit moment".
Also, why "we"? What is "our AI"? And what is "our benchmark script"?
And how big is your codebase? 50k files? 20 files?
This post has very, very little value without a ton of details; it looks like nowadays everything labeled "AI" gets to the front page.
it’s been this way for like a year or more. hype machine gotta hype.
For example, Cursor Agent mode does this out of the box. It literally looks for context before applying features, changes, fixes etc. It will even build, test and deploy your code - fixing any issues it finds along the way.
I haven't tried Cursor yet, but for me Cline does an excellent job. It uses internal mechanisms to understand the code base before making any changes.
Personally I've been very impressed with Cursor Agent mode, I'm using it almost exclusively. It understands the entire codebase, makes changes across files, generates new files, and interacts with terminal input/output. Using it, I've been able to build, test & deploy fullstack React web apps and three.js games from scratch.
Edit: I see some people are disagreeing. I wish they explained how they imagine that would work.
What did they have for lunch? We'll never know.
My coding agent allows you to put any number of named blocks in your code and then mention those in your prompts by name, and the AI understands what code you mean. Here's an example:
In my code:
-- block_begin SQL_Scripts
...some sql scripts...
-- block_end
Example prompt: Do you see any bugs in block(SQL_Scripts)?
Up to you, but several editors have established syntax that any code-trained model will likely have seen plenty of examples of:
vim (set foldmethod=marker and then {{{ begin\n }}} end\n ) https://vimdoc.sourceforge.net/htmldoc/usr_28.html#28.6
JetBrains <editor-fold desc=""></editor-fold> https://www.jetbrains.com/help/idea/working-with-source-code...
Visual Studio (#pragma region) https://learn.microsoft.com/en-us/cpp/preprocessor/region-en... (et al, each language has its own)
UPDATE: In other words it's always "block_begin" "block_end" regardless of what the comment characters are which will be different for different files of course.
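For illustration, a minimal sketch of how a block(...) reference can be resolved before the prompt ever reaches the model; this is a simplified toy, not the agent's actual code:
    import re

    def extract_block(source: str, name: str) -> str | None:
        # Return the lines between "block_begin <name>" and the next "block_end",
        # regardless of the comment characters in front of the markers.
        collecting, body = False, []
        for line in source.splitlines():
            if not collecting and "block_begin" in line and line.split()[-1] == name:
                collecting = True
            elif collecting and "block_end" in line:
                return "\n".join(body)
            elif collecting:
                body.append(line)
        return None

    def expand_prompt(prompt: str, source: str) -> str:
        # Replace every block(Name) mention in the prompt with the named block's text.
        return re.sub(
            r"block\((\w+)\)",
            lambda m: extract_block(source, m.group(1)) or m.group(0),
            prompt,
        )
So "Do you see any bugs in block(SQL_Scripts)?" gets expanded to include the actual SQL before the model sees it.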
At a superficial level, I have no idea what "shared patterns" means or why it logically follows that sharing them would cause a race condition. It also starts out talking about authentication changes, but then cites a PR that modified "retry logic"—without that shared context, it's not clear to me that an auth change has anything to do with retry logic unless the retry is related to retries on authentication failures.
I'm surprised this "senior developer AI reviewer" did not catch this bug...
In a way, we've solved the raw "intelligence" part -- the next token prediction. (At least in certain domains like text.)
But now we have to figure out how to structure that raw intelligence into actual useful thinking patterns. How to take a problem, analyze it, figure out ways of breaking it down, try those ways until you run into roadblocks, then start figuring out some solution ideas, thinking about them more to see if they stand up to scrutiny, etc.
I think there's going to be a lot of really interesting work around that in the next few years. A kind of "engineering of practical thinking". This blog post is a great example of one first step.
My go-to framing is:
1. We've developed an amazing tool that extends a document. Any "intelligence" is in there.
2. Many uses begin with a document that resembles a movie-script conversation between a computer and a human, alternately adding new lines (from a real human) and performing the extended lines that parse out as "Computer says."
3. This illusion is effective against homo sapiens, who are biologically and subconsciously primed to make and experience stories. We confuse the actor with the character with the scriptwriter.
Unfortunately, the illusion is so good that a lot of developers are having problems pulling themselves back to the real world too. It's as if we're trying to teach fashion-sense and embarrassment and empathy to a cloud which looks like a person, rather than changing how the cloudmaker machine works. (The latter also being more difficult and more expensive.)
It's easy to get LLMs to do seemingly amazing things. It's incredibly hard to build something where it does this amazing thing consistently and accurately for all reasonable inputs.
> Analyzing authentication system files:
> - Core token validation logic
> - Session management
> - Related middleware
This hard coded string is doing some very heavy lifting. This isn't anything special until this string is also generated accurately and consistently for any reasonable PR.
OP if you are reading, the first thing you should do is get a variety of codebases with a variety of real world PRs and set up some evals. This isn't special until evals show it producing consistent results.
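A minimal sketch of what I mean by evals, assuming each case is a directory holding the PR diff plus a human-written list of findings any decent review must mention (all the file names and the review_pr() hook are placeholders, not anything from the article):
    import json
    from pathlib import Path

    def score_reviewer(cases_dir: str, review_pr) -> float:
        # Each case directory holds pr.diff (the input) and expected.json with a
        # "must_mention" list of findings written down by a human reviewer.
        hits = total = 0
        for case in sorted(Path(cases_dir).iterdir()):
            diff = (case / "pr.diff").read_text()
            expected = json.loads((case / "expected.json").read_text())
            review = review_pr(diff).lower()   # the LLM pipeline under test
            for finding in expected["must_mention"]:
                total += 1
                hits += finding.lower() in review
        return hits / total if total else 0.0
Run the same cases against both prompting strategies a few times each (the output isn't deterministic) and compare scores. Keyword matching is crude, but it's enough to tell whether a change helped or hurt.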
Any tips on how should I get codebases and real world PRs? Are the ones on popular open source repos on GitHub sufficient? I worry that they don't really accurately reflect real world closed source experience because of the inherent selection bias.
Secondly, after getting all this, how do I evaluate which method gave better results? Should it be done by a human, or should I just plug an LLM to check?
Sigh.
Like cloud-scale, no-code scale, or NoSQL scale? You are confused, which shows that, maybe, you should not be using such tools with the experience that you don't have.
That is the dumbest statement I have heard this week. You should perhaps refrain from commenting, at least until you gain the modicum of intelligence that you currently don't have.
No need to switch to a different repo for a quick test; just make it reproducible on your current repo.
Perhaps if they're some day augmented by formal methods, that might change.
I may accidentally have been inspired by your message when I wrote the following piece yesterday: https://yoric.github.io/post/formal-ai/
As opposed to what, yet another beginner React app? That’s what everyone seems to be testing with but none of the projects I’ve seen are reflective of a production codebase that’s years old and has been touched by a dozen developers.
Throw it at a complicated non-frontend mixed language repo like cxx-qt [1] or something, preferably where the training data doesn’t include the latest API.
That is the reason LLMs in their current shape are pretty useless to me for most tasks.
They happily mix different versions of popular frameworks, so I have to do so much manual work to fix it that I'd rather do it all myself.
Pure (common) math problems, or other domains where the tech did not change so much, like bash scripts or regexes, are where I can use them. But my actual code? Not really. The LLM would need to be trained only on the API version I use, and that is not a thing yet, as far as I am aware.
And I think he left out the most important part: was the answer actually right? The real value of any good dev is that he can provide reasonably accurate analysis with logic and examples. "Could have an error" is more like a compiler warning than the output of a good engineer.
Side note: "broke the benchmark script?" If you have an automated way to qualitatively evaluate the output of an LLM in a reasonably broad context like code reading, that's far bigger a story.
Wouldn't you have the AI annotate it?
Let me spell it out for you. These results. Are. Not. Worthless.
Certainly what you said is correct on what he “should” do to get additional data, but your tonality of implying that the results are utter trash and falsely anthropomorphizing something is wrong.
Why is it wrong? Imagine Einstein got most things wrong in his life. Most things but he did discover special and general relativity. It’s just everything else was wrong. Relativity is still worth something. The results are still worthwhile.
We have an example of an LLM hallucinating. Then we have another example of additional contextual data causing the LLM to stop hallucinating. This is a data point leaving a clue about hallucinations and stopping hallucinations. It’s imperfect but a valuable clue.
My guess is that there’s a million causal factors that cause an LLM to hallucinate and he’s found one.
If he does what he did a multitude of times, for different topics and different problems where contextual data stops a hallucination, then with enough data and categorization of said data we may be able to output statistical data and have insight into what's going on from a statistical perspective. This is just like how we analyze other things that produce fuzzy data, like humans.
Oh no! Am I anthropomorphizing again?? Does that action make everything I said wrong? No, it doesn’t. Humans produce correct data when given context. It is reasonable to assume in many cases LLMs will do the same. I wrote this post because I agree with everything you said but not your tone which implies that what OP did is utterly trivial.
Their comment is "do it consistently, then I'll buy your explanation"
He didn't literally say it, but the comment implies it is worthless, as does yours.
Humans don't "buy it" when they think something is worthless. The tonality is bent this way.
He could have said, “this is amazingly useful data but we need more” but of course it doesn’t read like this at all thanks to the first paragraph. Let’s not hallucinate it into something it’s not with wordplay. The comment is highly negative.
The comparison to how a senior dev would approach the assignment, as a metaphor explaining the mechanism, makes perfect sense to me.
> We are groking how to utilize them.
Indeed.
The fact that these tools have extremely weird and new to the world interfacial quirks is what the discussion is about…
Versus how no publicly-available AI can do it consistently (yet). Although it seems like a matter of time at this point, and then work as we know it changes dramatically.
After some time, humans would gather the background info needed to be more productive, and we need to find out how to copy that.
Humans who make lots of mistakes with confidence that they aren't mistakes usually get fired or steered into a position where they can do the least amount of damage.
It's not that AI needs more background info for this type of thing. It needs the ability to iteratively check its own work and make corrections. This is what humans do better.
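Something like the loop below is what I have in mind; llm() is a placeholder for whatever chat-completion call you use, and the prompts are only illustrative:
    def review_with_self_check(llm, diff: str, max_rounds: int = 3) -> str:
        # llm(prompt) -> str stands in for whatever model call you use.
        review = llm(f"Review this diff and list concrete problems:\n{diff}")
        for _ in range(max_rounds):
            critique = llm(
                "Check the review against the diff. Quote the exact diff lines "
                "that support each claim, or reply OK if every claim is supported.\n\n"
                f"DIFF:\n{diff}\n\nREVIEW:\n{review}"
            )
            if critique.strip().upper() == "OK":
                break
            review = llm(
                "Rewrite the review, dropping or fixing the unsupported claims.\n\n"
                f"CRITIQUE:\n{critique}\n\nREVIEW:\n{review}"
            )
        return review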
Most other related issues with models these days are due to the tokenizer or a poor choice of sampler settings, which is a cheap shot on models.
LLMs can generally only do what they have data on, either from training or from instructions via prompting, it seems.
Keeping instructions reliable as they increase, and testing them, appears to benefit from LLMOps tools like Agenta, etc.
It seems to me like LLMs are reasonably well suited for things that code can't do easily, as well. You can find models on Hugging Face that are great at applying labels and categorization, instead of trying to get a generalized assistant model to do it.
I'm more and more looking at tools like OpenRouter to allow doing each step with the model that does it best, almost functionally where needed, to increase stability.
For now, it seems to be one way to improve reliability dramatically, happy to learn about what others are finding too.
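As a concrete sketch of that routing: OpenRouter exposes an OpenAI-compatible endpoint, so each step is just a different model string (the model IDs below are examples, check what's currently listed):
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
        api_key="YOUR_OPENROUTER_KEY",
    )

    def ask(model: str, system: str, user: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

    task = "Add rate limiting to the login endpoint."
    # A capable (pricier) model for the open-ended planning step...
    plan = ask("anthropic/claude-3.5-sonnet",
               "You are the architect. Produce a step-by-step plan, no code.", task)
    # ...and a cheap model for the narrow labelling/classification step.
    risk = ask("openai/gpt-4o-mini",
               "Label the risk of this plan as low, medium or high. Reply with one word.", plan)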
It seems like a pretty nascent area still where existing tooling in other areas of tech is still figuring itself out in the LLM space.
The end result was quite hilarious I have to say.
Its final verdict was:
End result? It’s a program yellin’, "HELLO WORLD!" Like me at the pub after 3 rum shots. Cheers, matey! hiccup
:D
It's really quite interesting how the LLM comes up with ways to discuss code :)
Are you trying to market a product?
Wait. You fixed your AI by doing traditional programming!?!?!
Transformers process their whole context window in parallel, unlike people who process it serially. There simply is no place that gets processed "first".
I'd love to see anyone who disagrees explain to me how that is supposed to work.
Tech debt is a management problem, not a coding problem. A statement like this undermines my confidence in the story being told, because it indicates the lack of experience of the author.
I'd argue the creation of tech debt is often coding problem. The longevity and persistence of tech debt is a management problem.
sounds like a people problem — which is a management problem.
> I'd argue the creation of tech debt is often coding problem. The longevity and persistence of tech debt is a management problem.
i’d argue the creation of tech debt is more often due to those doing the coding operating under the limitations placed on them. The longevity and persistence of tech debt is just an extension of that.
given an infinite amount of time and money, i can write an ideal solution to a solvable problem (or at least close to ideal, i’m not that good of a dev).
the limitations create tech debt, and they’re always there because infinite resources (time and money) don’t exist.
so tech debt always exists because there’s always limitations. most of the time those resource limitations are decided by management (time/money/people)
but there are language/framework/library limitations which create tech debt too though, which i think is what you might be referring to?
usually those are less common though
It's an easy out to just blame bad management for all the ills of a bad code base, and there's definitely plenty of times I've wanted to take longer to fix/prevent some tech debt and haven't been given the time, but it's self-serving to blame it all on outside forces.
It's also ignoring the times where management is making a justifiable decision to allow technical debt in order to meet some other goal, and the decision that a senior engineer often has to make is which technical debt to incur in order to work within the constraints.
This is a short and sweet article about a very cool real-world result in a very new area of tooling possibilities, with some honest and reasonable thoughts.
Maybe the "Senior vs Junior Developer" narrative is a little stretched, but the substance of the article is great
Can't help but wonder if people are getting mad because they feel threatened
Just the other day I used Cursor and iteratively implemented stories for 70 .vue files in a few hours, while also writing documentation for the components and pages, with the documentation being further fed to Cursor to write many E2Es, something that would've taken me at least a few days if not a week.
When I shared that with some coworkers, they went on a hunt to find all the shortcomings (often petty duplication of mocks, sometimes a missing story scenario, nothing major).
I found it striking as we really needed it and it provides tangible benefits:
- domain and UI stakeholders can navigate stories and think of more cases/feedback with ease from a UX/UI point of view, without having to replicate the scenarios manually through multiple time-consuming, repetitive operations in the actual applications
- documentation proved to be very valuable to a junior who joined us this very January
- E2Es caught multiple bugs in their own PRs in the weeks after
And yet, instead of appreciating the cost/benefit ratio of the solution (something that should characterise a good engineer; after all, that's our job), I was scolded because they (or I) would've done a more careful job, missing that they had never done that in the first place.
I have many such examples, such as automatically providing all the translation keys and translations for a new locale, only to find cherry-picked criticism that this or that could've been spelled differently. Of course it can; what's your job if not being responsible for the localisation? That shouldn't diminish that 95% of the content was correct and provided in a few seconds rather than days.
Why do they do that? I genuinely feel some of them feel threatened; most of it reeks of insecurity.
I can understand some criticism towards those who build and sell hype with cherry-picked results, but I cannot help but find some of the worst critics suffering from Luddism.
I suppose it's simply easier to think of them as scared and afraid of losing their jobs to robots, but the reality is most programmers already know someone who lost their job to a robot that doesn't even exist yet.
I strongly believed in the value provided by setting up stories, writing more documentation and E2Es in the few hours I had, and it did deliver.
Due to the boilerplate-y nature of the task, LLMs proved to be a great fit, leaving me reviewing more than writing thousands of lines of code across almost 80 files in a few hours rather than multiple days.
The fact that the cost/benefit ratio is lost on so many people is appalling but unsurprising in a field that severely lacks the "engineering" part and is thus uneducated to think in those terms.
In my experience, since Cursor doesn't know what a frontend app looks like, nor can it run a browser, the tests it writes are often inane.
Can you tell me what testing stack you use, and how you approach the process of writing large tests for mature codebases with Cursor?
First I had it write stories based on the pages and components. I obviously had to review the work and add more cases.
Then I had it generate a markdown file documenting the purpose, usage, and APIs for those, and combined it with user stories written in our project management tool, which I copy-pasted into different files. It helped that our user stories are written in a Gherkin-like fashion (when/and/or/then), which is computer-friendly.
As most of the components had unique identifiers in the form of data-test attributes, I could further ask it to implement more E2E cases.
Overall I was very satisfied with the cost/benefit ratio.
Stories were the most complicated part, as Cursor tended to redeclare mocks multiple times rather than sharing them, and it wasn't consistent in the API choices it made (Storybook has too many ways to accomplish the same thing).
E2Es with Playwright were the easiest part; the criticism here was that I used data attributes (which users don't see) over elements like text. I very much agree with that, as I myself am a fan of testing the way that users would. The problem is that, as our application is localized, I had to compromise in order to keep the tests parallel and fast: many tests change locale settings, which was interfering, as newly loaded pages had a different locale than expected. I'm not the only one using such attributes for testing; I know it's common practice in big cushy tech too.
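For reference, this is the tradeoff in Playwright terms; the selectors and strings below are made up, and the test-id attribute is switched to our data-test convention instead of Playwright's default data-testid:
    from playwright.sync_api import sync_playwright, expect

    with sync_playwright() as p:
        p.selectors.set_test_id_attribute("data-test")  # our convention, vs the default data-testid
        page = p.chromium.launch().new_page()
        page.goto("http://localhost:3000/login")

        # Locale-independent: survives translation, but doesn't prove the user sees the right label.
        page.get_by_test_id("login-submit").click()

        # Closer to what a user sees, but breaks as soon as the test runs under another locale.
        expect(page.get_by_text("Welcome back")).to_be_visible()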
One thing I want to note: you can't do it in a few prompts; it feels like having to convince the agent to do what you ask it, iteratively.
I'm still convinced of the cost/benefit ratio, and with practice you get better at prompting. You try to get to the result you want by manual editing and chatting, then feed the example result back in to generate more.
Success with current-day LLMs isn't about getting them to output perfect code. Having them do the parts they're good at - rough initial revs - and then iterating from there is more effective. The important metric is code (not LoC, mind you) that gets checked into git/revision control, sent for PR, and merged. Realizing when convincing the LLM to output flawless code ends up taking you in circles and is unproductive, while not throwing away the LLM as a useful tool, is where the sweet spot is.
It was a matter of cost vs benefit ratio, which ultimately resulted in net benefits. Stakeholders like designers and product don't see nor care that some mocks are repeated or that a suboptimal API is used in stories. Customers don't care why the application is broken, they care that it is, and the additional E2Es caught multiple bugs. Juniors would appreciate documentation and stories even if they might be redundant.
I think the biggest fallacy committed in evaluating the benefits of LLMs is in comparing them with the best output humans can generate.
But if those humans do not have the patience, energy or time budget to generate such output (and more often than not they don't) I think one should evaluate leveraging LLMs to lower the required effort and trying to find an acceptable sweet spot, otherwise you risk falling into Luddism.
Even as of 2025, 200 years after Luddism appeared, humans outperform machines at tailoring; that doesn't change the fact that it's thanks to machines lifting humans out of a lot of the repetitive work that we can clothe virtually every human for pennies. Nor has it removed the need for human oversight in tailoring, or that very same role behind higher quality clothes.
> The Luddites were members of a 19th-century movement of English textile workers who opposed the use of certain types of automated machinery due to concerns relating to worker pay and output quality.
I think the push for AI is the modern-day equivalent for software development - to move making programs from being labour intensive to being capital intensive. There are some of us who don't see it as a good thing.
As for my personal perspective - I view AI as the over-confident senior "salesman" programmer - it has no model for what sort of things it cannot do yet when it attempts anything it requires a lot of prompting to get somewhere which looks passable. My values for developing software is reliability and quality - which I had hoped we were going to achieve by further exploring advanced type systems (including effect systems), static analysis, model checkers, etc. Instead the "market" prioritises short-term garbage creation to push sales until the next quarterly cycle. I don't have much excitement for that.
The life situation of your coworkers could vary widely: maybe some are financially insecure, living paycheck to paycheck; maybe some have made a significant purchase and can't afford to lose their job; maybe someone recently had a child and doesn't have the time to make huge investments in their workflow and is afraid of drowning in the rising tide. Maybe they're pushing back against performative busyness, not wanting everyone to feel that to be productive they need to constantly be modifying 100s of vue files.
Maybe they're jealous of you and your cybernetic superpowers; jealousy is a completely normal human feeling. Or maybe you were going about this in an ostentatious manner, appearing to others as tooting your own horn. Maybe there's a competition for promotions and others feel the need to make such political moves like shooting your work down. Maybe this work that you did was a political move.
Technologies are deployed and utilized in certain human contexts, inside certain organizational structures, and so on. Nothing is ever a plain and simple, cold hard cost-benefit analysis.
I think LLMs today, for all their goods and bads, can do some useful work. The problem is that there is still mystery on how to use them effectively. I'm not talking about some pie in the sky singularity stuff, but just coming up with prompts to do basic, simple tasks effectively.
Articles like that are great for learning new prompting tricks and I'm glad the authors are choosing to share their knowledge. Yes, OP isn't saying the last word on prompting, and there's a million ways it could be better. But the article is still useful to an average person trying to learn how to use LLMs more productively.
>the "Senior vs Junior Developer" narrative
It sounds to me like just another case of "telling the AI to explicitly reason through its answer improves the quality of results". The "senior developer" here is better able to triage aspects of the codebase to identify the important ones (and to the "junior" everything seems equally important) and I would say has better reasoning ability.
Maybe it works because when you ask the LLM to code something, it's not really trying to "do a good job", besides whatever nebulous bias is instilled from alignment. It's just trying to act the part of a human who is solving the problem. If you tell it to act a more competent part, it does better - but it has to have some knowledge (aka training data) of what the more competent part looks like.
We segment the project into logical areas based on what the user is asking, then find interesting symbol information and use it to search call chain information which we’ve constructed at project import.
This gives the LLM way better starting context and we then provide it tools to move around the codebase through normal methods you or I would use like go_to_def.
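For a rough idea of what a call-chain index built at project import can look like, here's a Python-only toy using the standard ast module (a simplified illustration, not our actual implementation):
    import ast
    from collections import defaultdict
    from pathlib import Path

    def build_call_index(root: str) -> dict[str, set[str]]:
        # Toy "call chain" index: function name -> names it calls (Python files only,
        # direct calls by name; real tooling also resolves methods, imports, etc.).
        calls: dict[str, set[str]] = defaultdict(set)
        for path in Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text(), filename=str(path))
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    for sub in ast.walk(node):
                        if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                            calls[node.name].add(sub.func.id)
        return calls

    def callers_of(index: dict[str, set[str]], symbol: str) -> list[str]:
        # Which functions reach this symbol? That's the slice worth surfacing as context.
        return [fn for fn, callees in index.items() if symbol in callees]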
We've analyzed a lot of competitor products, and very few have done anything other than a rudimentary project skeleton like Aider, or just directly feeding opened code as context, which breaks down very quickly on large code projects.
We’re very happy with the level of quality we see from our implementation and it’s something that really feels overlooked sometimes by various products in this space.
Realistically, the only other product I know of approaching this correctly with any degree of search sophistication is Cody from SourceGraph which yeah, makes sense.
A lot of people tinkering with AI think it's more complex than it is. If you ask it to ELI5, it will do that.
Often I will say "I already know all that, I'm an experienced engineer and need you to think outside the box and troubleshoot with me. "
It works great.
It's good when it works, it's crap when it doesn't, for me it mostly doesn't work. I think AI working is a good indicator of when you're writing code which has been written by lots of other people before.
This is arguably good news for the programming profession, because that has a big overlap with cases that could be improved by a library or framework, the traditional way we've been trying to automate ourselves out of a job for several decades now.
Try giving them more context on APIs that actually exist, then, as part of the inputs...
If like me you didn't know, apparently this is mostly stylistic, and comes from a historical practice that predates printing. There are other common ligatures such as CT, st, sp and th. https://rwt.io/typography-tips/opentype-part-2-leg-ligatures
Does PR 1234 actually exist? Did it actually modify the retry logic? Does the token refresh logic actually share patterns with the notification service? Was the notification service added last month? Does it use websockets?
https://amistrongeryet.substack.com/p/unhobbling-llms-with-k...
> Our entire world – the way we present information in scientific papers, the way we organize the workplace, website layouts, software design – is optimized to support human cognition. There will be some movement in the direction of making the world more accessible to AI. But the big leverage will be in making AI more able to interact with the world as it exists.
> We need to interpret LLM accomplishments to date in light of the fact that they have been laboring under a handicap. This helps explain the famously “jagged” nature of AI capabilities: it’s not surprising that LLMs struggle with tasks, such as ARC puzzles, that don’t fit well with a linear thought process. In any case, we will probably find ways of removing this handicap
https://github.com/jimmc414/better_code_analysis_prompts_for...
I used this tool to flatten the example repo and PRs into text:
Is this an example of confabulation (hallucination)? It's difficult to tell from the post.