• practal 7 days ago |
    I think simpler is better when it comes to structured editing. Recursive teXt has the advantage that it proposes a simple block structure built into the text itself [1]. Of course, you will need to design your language to take advantage of this block structure.

    [1] http://recursivetext.com

    • computably 7 days ago |
      Since Lisp has been around since 1960... Congratulations, you're only about 64 years late.
      • practal 7 days ago |
        No doubt, brackets of course also convey structure. But I think indentation is better for visualising block structure. Inside these blocks, you can still use brackets, and errors like missing opening or closing brackets will not spill over into other blocks.

        And yeah, I am definitely coming for Lisp.

        • llm_trw 7 days ago |
          I wrote a Racket function that would need 35 levels of indentation ten minutes ago. White space isn't coming for lisps until we figure out 12-dimensional displays.
          • cjfd 7 days ago |
            Is a function that would need 35 levels of indentation a good idea? I have seen C code with about 12 levels of indentation and that was not too great.
            • llm_trw 7 days ago |
              What other languages use syntax for, lisps use function application for.

              E.g., the array reference a[0] in algols is a function application in lisps: (vector-ref a 0).

              The same is true for everything else. Semantic white space in such a language is a terrible idea, as everyone eventually finds out.

              • practal 7 days ago |
                Recursive teXt (RX) would be a great fit for Lisp, although I am more interested in replacing Lisp entirely with a simpler language rooted in abstraction logic.

                Note that RX is not like normal semantic white space, but simpler. It is hard-coded into the text without taking its content into consideration. RX is basically nested records of text, encoded as plain text instead of JSON or something like that, which makes it very robust.
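
                As a toy illustration (my own sketch with invented names, not the actual RX spec [1]): indentation alone determines the nesting, each line's content stays uninterpreted text, and what comes out is the same kind of nested record you would otherwise ship as JSON.

    import json

    def parse_blocks(lines):
        """Turn indentation-structured plain text into nested records.
        Content is never inspected - only leading whitespace matters."""
        blocks, i = [], 0
        while i < len(lines):
            if not lines[i].strip():
                i += 1
                continue
            depth = len(lines[i]) - len(lines[i].lstrip())
            j = i + 1
            while j < len(lines) and (not lines[j].strip()
                    or len(lines[j]) - len(lines[j].lstrip()) > depth):
                j += 1
            blocks.append({"line": lines[i].strip(),
                           "children": parse_blocks(lines[i + 1:j])})
            i = j
        return blocks

    src = """\
    section one
        first line
        a nested block
            its content
    section two"""
    print(json.dumps(parse_blocks(src.splitlines()), indent=2))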

                • llm_trw 7 days ago |
                  No it won't be. S-expressions are the simplest way to linearize a tree.

                  Everyone thinks there's something better and the very motivated write an interpreter for their ideal language in lisp.

                  The ideas inevitably have so much jank when used in anger that you always come back to sexp.

                  Now if you discover a way to linearize dags or arbitrary graphs without having to keep a table of symbols I'd love to hear it.

                  • practal 6 days ago |
                    > S-expressions are the simplest way to linearize a tree.

                    S-expressions are one way to linearize a tree.

                    Now, "simple" can mean different things depending on what you are trying to achieve. RX is simpler than s-expressions if you prefer indentation over brackets, and like the robustness that it brings. Abstraction algebra terms are simpler than s-expressions if you want to actually reason about and with them.

                    • llm_trw 6 days ago |
                      In your own examples you're using both brackets and white space to delineate structure. This is complex because you need two parsers to even start working on the input stream and the full parser must know how to switch between them.

                      In short: I get all the pain of semantic white space plus all the pain of lisp s-exps, with the benefits of neither.

                      • kazinator 6 days ago |
                        If you don't allow the indentation based parsing to be nested within a bracket-based expression, it doesn't look too bad. At the top level you have the indentation-based parser. When that sees an open bracket, it recurses to the regular parser which doesn't care about indentation.
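
                        A minimal sketch of that in Python (invented names, toy grammar, and - to keep it short - bracketed expressions are assumed to fit on one line):

    import re

    def read_sexp(src, pos):
        """The bracket parser: entered at an '(', indentation-blind from there on."""
        assert src[pos] == "("
        items, pos = [], pos + 1
        while src[pos] != ")":
            if src[pos].isspace():
                pos += 1
            elif src[pos] == "(":
                item, pos = read_sexp(src, pos)
                items.append(item)
            else:
                m = re.match(r"[^()\s]+", src[pos:])
                items.append(m.group())
                pos += len(m.group())
        return items, pos + 1

    def read_toplevel(src):
        """The indentation parser: one node per line, children on deeper lines.
        A line that opens a bracket is deferred wholesale to read_sexp."""
        nodes = []  # (indent, tree) pairs; a real reader would nest these by depth
        for line in src.splitlines():
            if not line.strip():
                continue
            indent = len(line) - len(line.lstrip())
            text = line.strip()
            tree = read_sexp(text, 0)[0] if text.startswith("(") else text.split()
            nodes.append((indent, tree))
        return nodes

    print(read_toplevel("define f\n    (+ (* a b) (* c d))"))
    # [(0, ['define', 'f']), (4, ['+', ['*', 'a', 'b'], ['*', 'c', 'd']])]
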
                        • llm_trw 6 days ago |
                          This works until you start using macros.
                          • practal 6 days ago |
                            Yes, for a general language (such as abstraction algebra) you would want to allow mixing normal term language and blocks. In RX, that is easy to do: Just reuse the blocks that RX already gives you, while the lines of a block are there for the term language.
                          • kazinator 6 days ago |
                            I'm reasonably sure you could conceal this surface syntax so that macros don't know it exists and work fine. You can call a macro by writing the invocation syntax using the indented format, or by the bracketed format, or a mixture.

                            It's been done before; see Scheme SRFI-110, a.k.a. Sweet Expressions or t-expressions:

                            https://srfi.schemers.org/srfi-110/srfi-110.html

                      • practal 6 days ago |
                        In my examples I use RX for the outer structure, which is unproblematic: RX itself is not complex at all, and parsing it is as easy as parsing brackets.

                        What kind of content you put into the blocks depends on you. How you parse one block is independent of how you parse another block, which means embedding DSLs and so on is painless. You could view the content of a block as RX, but you can also just see it as plain text that you can parse however you choose.

                        This also means if you make a syntax error in one block, that does not affect any other sibling block.

                        The benefit of RX, especially at the outer level, is that all those ugly brackets go away, and all you are left with is clear and pleasing structure. This is especially nice for beginners, but I have been programming for over 30 years now, and I like it much better too.

                        If you don't see that as a benefit, good for you!

              • benatkin 7 days ago |
                An everyday example is the difference between JSON.stringify(data, null, 2) and a pretty printer like IPython's or Deno.inspect. Having only one node per line reduces the amount of data that can be displayed. It's the same with code.
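
                The same contrast in Python, with json.dumps standing in for JSON.stringify and pprint for a pretty printer:

    import json
    from pprint import pprint

    data = {"op": "+", "args": [{"op": "*", "args": ["a", "b"]},
                                {"op": "*", "args": ["c", "d"]}]}

    print(json.dumps(data, indent=2))  # one node per line: ~19 lines of output
    pprint(data, width=60)             # packs nodes onto lines: 3 lines of output
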
          • practal 7 days ago |
            Assuming this amount of indentation is really necessary (which I doubt; maybe a few auxiliary definitions are in order?), obviously only the first few levels would be represented as their own Recursive teXt blocks.
            • llm_trw 7 days ago |
              >obviously only the first few levels would be represented as their own Recursive teXt blocks.

              This is not at all obvious.

              • practal 6 days ago |
                It is obvious to me: nobody would want to represent 35 levels of nesting by indentation. So I would represent the first few (2-4) levels in RX, and the rest by other means, such as brackets. Your language should be designed such that the cutoff is up to you, the writer of the code, and is really just a matter of syntax, not semantics.

                Obviously, I (usually) would not want to write things like

                    + 
                        * 
                            a 
                            b
                        * 
                            c 
                            d
                
                but rather

                    +
                        a * b
                        c * d
                
                or, even better of course,

                    (a * b) + (c * d)
                
                I think of blocks more as representing high-level structure, while brackets are there for low-level and fine-grained structure. As the border between these can be fluid, where you choose the cutoff will depend on the situation and also be fluid.
                • llm_trw 6 days ago |
                  >Your language should be designed such that the cutoff is up to you, the writer of the code, and really just a matter of syntax, not semantics.

                  I have more important things to think about in my code than when I switch between two dialects of the language.

                  Especially since I get no extra expressive power by doing so.

                  >Obviously, I (usually) would not want to write things like

                  Or just use (+ (* a b) (* c d)), which is simpler than any of the examples above. Then I can choose between any of:

                      ( +
                        (* a b)
                        (* c d))
                  
                      ( +
                        ( *
                         a
                         b
                        )
                        ( *
                          c
                          d
                        ) 
                      )
                      
                  Or whatever else you want to do.

                  >As the border between these can be fluid, where you choose the cutoff will depend on the situation and also be fluid.

                  It's only fluid because you've had XX years of infix-notation-induced brain damage to make you think that.

                  • practal 6 days ago |
                    > I have more important things to think about in my code than when I switch between two dialects of the language.

                    Granted. The example I gave was just to demonstrate that switching between the styles is not a problem and can be fluid, if you need it to be.

                    > It's only fluid because you've had XX years of infix notation caused brain damage to make you think that.

                    No, infix is just easier to read and understand, it matches up better with the hardware in our brains for language. If that is different for you, well ... you mentioned brain damage first.

                    • llm_trw 6 days ago |
                      The average programmer has 12 years of education before they have a chance to see superior notations like prefix and postfix.

                      Of course you'll think that the two are weird when you've never had a chance to use them before you're an adult.

                      Much like how adults who are native English speakers see nothing wrong with the spelling, but children and everyone else do.

                  • Dylan16807 6 days ago |
                    > Or just use (+ (* a b) (* c d)) which is simpler that any of the example above. Then I can chose between any of:

                    That's the exact same flexibility but in a different order. It's not simpler.

          • shakna 7 days ago |
            Wisp [0] is the big one talked about in Scheme land, and it wouldn't need 35 levels of indentation. Like the rest of Lisp-world, there's flexibility where you need it.

            For example, this shows a bit of the mix available:

                  for-each
                   λ : task
                       if : and (equal? task "help") (not (null? (delete "help" tasks)))
                            map
                              λ : x
                                  help #f #f x
                              delete "help" tasks
                            run-task task
                   . tasks
            
            [0] https://srfi.schemers.org/srfi-119/srfi-119.html
        • computably 4 days ago |
          > But I think indentation is better for visualising block structure.

          Which is irrelevant, because you can visualize code however you want via editor extensions.

          > And yeah, I am definitely coming for Lisp.

          An endeavor which is simultaneously hopeless and pointless.

          • practal 2 days ago |
            > Which is irrelevant, because you can visualize code however you want via editor extensions.

            Semantically, of course, this does not matter. A block is a block, no matter whether delineated by indentation or brackets. But RX looks better as plain text, and there is much less of a disconnect between RX as plain text and RX as presented in a special editor (extension) for RX than there would be for Lisp.

            > An endeavor which is simultaneously hopeless and pointless.

            Challenge accepted.

  • shakna 7 days ago |
    Trying to use any kind of syntax highlighter with TeX is a pain in the butt. I didn't mean LaTeX there, I mean TeX, which can rewrite its own lexer, and a lot of libraries work by doing so. I move in and out of TeXInfo syntax and it basically just causes most editors to sit there screaming that everything is broken.
    • llm_trw 7 days ago |
      Yes, it's pretty funny when you realise what a tiny corner of the design space of programs most users inhabit, that they think things like LSP are an amazing tool instead of a weekend throwaway project.

      What's even funnier is how much they attack anyone who points this out.

      • foo42 7 days ago |
        Perhaps the "attacks" relate to the condescending tone with which you relate your superior skills.

        I think most people's amazement with LSP relates to the practical benefits of such a project _not_ being thrown away but taken that last 10% (which is 90% of the work) to make it suitable for so many use cases, and to selling people on the idea of doing so.

        • llm_trw 7 days ago |
          What's amazing about LSP isn't the polish, it's that we've hobbled ourselves so much that a tool like it is even useful.

          Only having exposure to the algol family of languages does for your mental capabilities what a sugar-only diet does for your physical capabilities. It used to be the case that all programmers had exposure to assembly/machine code, which broke them out of the worst habits algols instill. No longer.

          Pointing out that the majority of programmers today have the mental equivalent of scurvy is somehow condescending, but the corp selling false teeth along with their sugar buckets is somehow commendable.

          • dzaima 7 days ago |
            Knowing non-algol languages won't make editor actions any less useful for algol-likes. If anything, it'll just make you pretend that you don't need them, and as such you'll end up less productive than you could be.

            And editor actions can be useful for any language which either allows you to edit things or has more than one way to do the same thing (among a bunch of other things), which includes basically everything. Of course editor functionality isn't a thing that'd be 100% beneficial 100% of the time, but it's plenty above 0% if you don't purposefully ignore it.

          • lolinder 6 days ago |
            > Pointing out that the majority of programmers today have the mental equivalent of scurvy is somehow condescending

            You can (and many people do!) say the exact same thing in a different tone and with different word choice and have people nod along in agreement. If you're finding that people consistently react negatively to you when you say it, please consider that it might be because of the way in which you say it.

              I'm one of those who would normally nod along in agreement and write in support, but your comments here make me want to disagree on principle because you come off as unbearably smug.

            • llm_trw 6 days ago |
              >I'm one of those who would normally nod along in agreement and writing in support, but your comments here make me want to disagree on principle because you come off as unbearably smug.

              So much the worse for you.

          • itishappy 6 days ago |
            Kinda feels like this might be an instance of simply holding your tools wrong. What about ASM prevents incremental parsing and structured editing from being useful?

            Some concrete examples for us lesser mortals please.

            • llm_trw 6 days ago |
              The fact that there is no separation between data, addresses and commands?

              The general advice is that you shouldn't mix them, but the general advice today is that you shouldn't use ASM anyway.

      • throw10920 6 days ago |
        This is sneering, where you don't respond to a particular poster's point, but instead attack a fictional group of people that you made up based on the intersection of two different attributes.

        It's against the HN guidelines (https://news.ycombinator.com/newsguidelines.html), boring, unenlightening, not intellectually gratifying, degrades the quality of the site, and takes far less intelligence than the "mental equivalent of scurvy" that you name. Don't do it.

        • llm_trw 6 days ago |
          So is yours:

          > Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

  • sudahtigabulan 7 days ago |
    > it is now clear to me that there is ongoing work on structured editing which either doesn’t know about incremental parsing in general, or Tim’s algorithms specifically. I hope this post serves as a useful advert to such folk

    I'm curious about this unnamed ongoing work (that is unaware of incremental parsing).

    Anyone know what he is referring to?

    • DannyBee 7 days ago |
      I don't know specifically - but even now, I still end up having to explain to people that incremental parsing/lexing (particularly without error recovery) is not hard, it is not really complicated, and, as the author here said, Tim (et al) have made beautiful algorithms that make this stuff easy.

      Heck, incremental lexing is even easy to explain. For each token, track where the lexer actually looked in the input stream to make decisions. Any time that part of the input stream changes, every token that actually looked at the changed portion of the input stream is re-lexed, and if the result changes, keep re-lexing until the before/after token streams sync up again or you run out of input. That's it.
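
      For concreteness, here is a toy version of that in Python. The token spec and the fixed one-character lookahead are invented for the demo; the point is the bookkeeping: each token knows the span it looked at, an edit invalidates exactly the tokens whose spans touch it, and re-lexing stops once the new stream syncs up with the shifted old one.

          import re

          # Toy token spec: word runs, space runs, or single punctuation chars.
          # Each token "looks at" its own span plus one char of lookahead, so a
          # token (start, end, text) examined input indices start..end inclusive.
          TOKEN = re.compile(r"\w+|\s+|.")

          def lex_all(src):
              out, pos = [], 0
              while pos < len(src):
                  m = TOKEN.match(src, pos)
                  out.append((m.start(), m.end(), m.group()))
                  pos = m.end()
              return out

          def incremental_lex(old_tokens, new_src, edit_start, old_edit_end, new_edit_end):
              delta = new_edit_end - old_edit_end
              # Tokens that never looked at the edited region survive as-is...
              prefix = [t for t in old_tokens if t[1] < edit_start]
              # ...and tokens starting after it survive too, shifted by the edit.
              suffix = [(s + delta, e + delta, txt)
                        for (s, e, txt) in old_tokens if s >= old_edit_end]
              pos = prefix[-1][1] if prefix else 0
              relexed, si = [], 0
              while pos < len(new_src):
                  while si < len(suffix) and suffix[si][0] < pos:
                      si += 1  # old tokens swallowed by a longer new token
                  if si < len(suffix) and suffix[si][0] == pos and pos >= new_edit_end:
                      return prefix + relexed + suffix[si:]  # streams synced up
                  m = TOKEN.match(new_src, pos)
                  relexed.append((m.start(), m.end(), m.group()))
                  pos = m.end()
              return prefix + relexed  # ran out of input before syncing

          old, new = "foo bar baz", "foo barbell baz"  # replace bytes 4..7 with "barbell"
          assert incremental_lex(lex_all(old), new, 4, 7, 11) == lex_all(new)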

      You can also make a dumber version that statically calculates the maximum lookahead (and lookbehind, if you support that too) of the entire grammar, or the maximum possible lookahead per token, and uses that instead of tracking the actual lookahead used. In practice, this is often harder than just tracking the actual lookahead used.

      In an LL system like ANTLR, incremental parsing is very similar - since it generates top-down parsers, it's the same basic theory - track what token ranges were looked at as you parse. During an incremental update, only descend into the portions of the parse tree whose looked-at token ranges contain modified tokens.

      Bottom up is trickier. Error recovery is the meaningfully tricky part in all of this.

      Before tree-sitter, I was constantly explaining this stuff to people (I followed the projects that these algorithms came out of - ENSEMBLE, HARMONIA, etc.). Now more people get that there are ways of doing this, but you still run into people who are re-creating things we solved in pretty great ways many years ago.

      • toomim 6 days ago |
        It's really fun seeing my old research groups mentioned decades later. Thanks for writing this up.
      • HexDecOctBin 6 days ago |
        How would you do incremental parsing in a recursive descent parser, especially if you don't control the language? All this research is fine, but most parsers actually in use are handwritten, whether GCC, Clang, or (I believe) MSVC.
  • bgoated01 7 days ago |
    I'm not extremely familiar with the details of incremental parsing, but I have used Cursorless [0], a VSCode extension based on tree-sitter for voice-controlled structured editing, and it is pretty powerful. You can use the structured editing when you want and also normal editing in between. Occasionally the parser will get things wrong and only change/take/select part of a function or what have you, but in general it's very useful, and I tend to miss it now that I am no longer voice coding much. I seem to remember that there was a similar extension for emacs (sans voice control). treemacs, or something? Anyone used that?

    [0] https://www.cursorless.org/

    • codetrotter 7 days ago |
      Does anything similar exist for JetBrains IDEs, but fully open source? (Open source plugin, and open source voice recognition model running locally.)
    • servilio 6 days ago |
      The one I know about is Combobulate [1], which uses tree-sitter but without voice control.

      [1] https://github.com/mickeynp/combobulate

    • setopt 6 days ago |
      Treemacs is not tree-sitter-related, it's just a file tree plug-in.
      • bgoated01 6 days ago |
        You're right, I was thinking of Combobulate, linked in a sibling comment.
  • gushogg-blake 6 days ago |
    I did some work on structural editing a while back, using Tree-sitter to get the AST (abstract syntax tree, the parse tree used for structural edits). I now use that editor as my daily driver, but don't use or feel the need for structural editing commands that often - probably partly out of habit and partly because text edits are just better for most editing.

    I do miss the "wrap" command when using other editors, but it could be implemented reasonably easily without a parse tree. I found that a lot of the structural edits correspond to indentation levels anyway, but the parse tree definitely helps.
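
    For instance, an indentation-only "wrap" might look like this sketch (Python, invented names - the parse-tree version would select the node instead):

        def wrap_block(lines, row, header, tab="    "):
            """Wrap the line at `row`, plus every following line indented deeper
            than it, under `header` - indentation only, no parse tree needed."""
            indent = len(lines[row]) - len(lines[row].lstrip())
            end = row + 1
            while end < len(lines) and (not lines[end].strip() or
                    len(lines[end]) - len(lines[end].lstrip()) > indent):
                end += 1
            body = [tab + ln if ln.strip() else ln for ln in lines[row:end]]
            return lines[:row] + [" " * indent + header] + body + lines[end:]

        src = ["if a:", "    f(x)", "print(x)"]
        print("\n".join(wrap_block(src, 0, "while True:")))
        # while True:
        #     if a:
        #         f(x)
        # print(x)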

    HN post: https://news.ycombinator.com/item?id=29787861

  • etse 6 days ago |
    > why structured editors haven’t taken over the world: most programmers find them so annoying to use in some situations that they don’t view the pros as outweighing the cons.

    Seems like "annoying" refers to a user interface annoyance.

    I'm guessing the following since I couldn't tell what structured editing is like from the article:

    Keyboard entry is immediate, but prone to breaking the program structure. Structured editing through specific commands is an abstraction on top of key entry (or mouse), both of which add a layer of resistance. Another layer might come from having to recall the commands or, if merely recognizing them, having to peruse a list of them, at least while learning.

    What does the developer's experience with incremental parsing feel like?

    • Eliah_Lakhin 6 days ago |
      > What does the developer's experience with incremental parsing feel like?

      It's essentially the experience most of us already have when using Visual Studio, IntelliJ, or any modern IDE on a daily basis.

      The term "incremental parsing" might be a bit misleading. A more accurate (though wordier) term would be a "stateful parser capable of reparsing the text in parts". The core idea is that you can write text seamlessly while the editor dynamically updates local fragments of its internal representation (usually a syntax tree) in real time around the characters you're typing.

      An incremental parser is one of the key components that enable modern code editors to stay responsive. It allows the editor to keep its internal syntax tree synchronized with the user's edits without needing to reparse the entire project on every keystroke. This stateful approach contrasts with stateless compilers that reparse the entire project from scratch.

      This continuous (or incremental) patching of the syntax tree is what enables modern IDEs to provide features like real-time code completion, semantic highlighting, and error detection. Essentially, while you focus on writing code, the editor is constantly maintaining and updating a structural representation of your program behind the scenes.
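
      For a concrete taste, this is roughly what the edit-then-reparse loop looks like with the py-tree-sitter bindings and the tree_sitter_python grammar package (both packages and the exact API shapes here are assumptions on my part; they have shifted between versions, so treat this as a sketch):

          import tree_sitter_python as tspython
          from tree_sitter import Language, Parser

          parser = Parser(Language(tspython.language()))

          old_src = b"def f(x):\n    return x + 1\n"
          tree = parser.parse(old_src)

          # The user changes "1" to "42": first tell the old tree where the edit is...
          new_src = b"def f(x):\n    return x + 42\n"
          tree.edit(start_byte=25, old_end_byte=26, new_end_byte=27,
                    start_point=(1, 15), old_end_point=(1, 16), new_end_point=(1, 17))

          # ...then reparse, handing over the old tree so unchanged subtrees are reused.
          new_tree = parser.parse(new_src, tree)
          for r in tree.changed_ranges(new_tree):
              print("re-examined bytes", r.start_byte, "to", r.end_byte)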

      The article's author suggests an alternative idea: instead of reparsing the syntax tree incrementally, the programmer would directly edit the syntax tree itself. In other words, you would be working with the program's structure rather than its raw textual representation.

      This approach could simplify the development of code editors. The editor would primarily need to offer a GUI for tree structure editing, which might still appear as flat text for usability but would fundamentally involve structural interactions.

      Whether this approach improves the end-user experience is hard to say. It feels akin to graphical programming languages, which already have a niche (e.g., visual scripting in game engines). However, the challenge lies in the interface.

      The input device (the keyboard) was designed for natural text input and has limitations when it comes to efficiently interacting with structural data. In theory, these hurdles could be overcome with time, but for now, the bottleneck is mostly a question of UI/UX design. And as of today, we lack a clear, efficient approach to tackle this problem.

  • quantadev 4 days ago |
    I think almost all of this applies to ordinary word processing as well as computer code. For example, you could have a "block-based" document editor that handles each paragraph of text as a single "entity", like what Jupyter Notebooks does. Documents have major sections, sub-sections, sub-sub-sections, etc., and so they're also inherently organizable as a "Tree Structure" instead of just a monolithic string of characters. Microsoft hasn't really figured this out yet, but it's the future.

    Google just recently figured this out (That Documents need to be Hierarchical):

    https://lifehacker.com/tech/how-to-use-google-docs-tabs

    Also interestingly, both Documents and Code will someday be combined. Imagine a big tree structure that contains not only computer code but associated documentation. Again, probably Jupyter Notebooks is the closest thing to this we have today, because it does incorporate code and text, but afaik it's not fully "Hierarchical", which is the key.

    • WillAdams 4 days ago |
      Or, one could just use Literate Programming:

      http://literateprogramming.com/

      • quantadev 4 days ago |
        Yep that was the original idea, from way back, although I'm sure somebody in the 1960s thought of it too.

        I've had a tree-based block-editor CMS (as my side project) for well over a decade and when Jupyter came out they copied most of my design, except for the tree part, because trees are just hard. That was good, because now when people ask me what my app "is" or "does" I can just say it's mostly like Jupyter, which is easier than starting from scratch with "Imagine if a paragraph was an actual standalone thing...yadda yadda."

  • bjourne 4 days ago |
    I'm a bit skeptical, because real source can be quite messy. Parsing C works in a compiler because, by the time the parser runs, there are neither comments nor preprocessor directives left to consider. You have clean input which you can map into an AST using well-defined rules. But the source code you edit contains both comments and preprocessor directives. So a parser for structured editing has to acknowledge their existence and can't just ignore them, unlike a parser for a compiler. And that is hard, because preprocessor directives and comments can show up almost anywhere. While comments are "mute", directives affect parsing:

        #define FOO }
        int main() {
        FOO
    
    "No one writes code like that!" Actually, they do, and mature C code-bases are full of such preprocessing magic.