You can easily add new string and comment syntaxes to Forth, though. For example, you can add BCPL-style // comments to end of line with this line of code in, I believe, all standard Forths, though I've only tested it in GForth:
: // 10 word drop ; immediate
Getting it to work in block files requires more work but is still only a few lines of code. The standard word \ does this, and see \ decompiles the GForth implementation as:
: \
blk @
IF >in @ c/l / 1+ c/l * >in ! EXIT
THEN
source >in ! drop ; immediate
This kind of thing was commonly done for text editor commands, for example; you might define i as a word that reads text until the end of the line and inserts it at the current position in the editor, rather than discarding it like my // above. Among other things, the screen editor in F83 does exactly that.
So, as with Perl, PostScript, TeX, m4, and Lisps that support readmacros, you can't lex Forth without executing it.
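To make the "can't lex without executing" point concrete, here is a minimal sketch of a Forth-like outer interpreter in Python. It is not real Forth: the class name `ToyForth`, its tiny word set, and the `parse` helper are assumptions made up for illustration. The point it shows is that a word such as `//` or `."` decides for itself how much of the input it consumes, so token boundaries are only known once the word has run.

```python
class ToyForth:
    """Toy outer interpreter (hypothetical, not real Forth): words run as soon as they are scanned."""

    def __init__(self, source):
        self.src = source
        self.pos = 0      # plays the role of >IN
        self.stack = []
        self.out = []
        self.words = {
            "+": lambda f: f.stack.append(f.stack.pop() + f.stack.pop()),
            ".": lambda f: f.out.append(str(f.stack.pop())),
            "//": lambda f: f.parse("\n"),   # comment: discard the rest of the line
            '."': ToyForth.dot_quote,        # string: keep text up to the next "
        }

    def parse(self, delim):
        """Return the input up to delim and move past it (a loose analogue of PARSE)."""
        end = self.src.find(delim, self.pos)
        if end == -1:
            end = len(self.src)
        text = self.src[self.pos:end]
        self.pos = min(end + 1, len(self.src))
        return text

    def dot_quote(self):
        self.pos += 1  # skip the single space after ."
        self.out.append(self.parse('"'))

    def next_word(self):
        while self.pos < len(self.src) and self.src[self.pos].isspace():
            self.pos += 1
        start = self.pos
        while self.pos < len(self.src) and not self.src[self.pos].isspace():
            self.pos += 1
        return self.src[start:self.pos]

    def run(self):
        while True:
            word = self.next_word()
            if not word:
                return self.out
            if word in self.words:
                self.words[word](self)   # the word may consume more input itself
            else:
                self.stack.append(int(word))


demo_source = '1 2 + . // ." never printed"\n." hi there" 3 .'
demo_output = ToyForth(demo_source).run()
```

A static tokenizer that split `demo_source` on whitespace would treat `never` and `printed"` as words; only by running `//` does the interpreter learn that they are comment text.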
$a = <<<EOL
This is
not indented
but this has 4 spaces of indentation
EOL;
But the spaces have to match: if any line has fewer spaces than the closing EOL, it gives an error.
class Foo\u007b}
and this assert will not trigger:
assert
// String literals can have unicode escapes like \u000A!
"Hello World".equals("\u00E4");
public class Foo {
public static void main(String[] args) {
double \u03c0 = 3.14159265;
System.out.println("\u03c0 = " + \u03c0);
}
}
Output:
$ javac Foo.java && java Foo
π = 3.14159265
Of course, nowadays we can simply write this with any decent editor:
public class Foo {
public static void main(String[] args) {
double π = 3.14159265;
System.out.println("π = " + π);
}
}
Support for Unicode escape sequences is a result of how the Java Language Specification (JLS) defines InputCharacter. Quoting from Section 3.4 of JLS <https://docs.oracle.com/javase/specs/jls/se23/jls23.pdf>:
InputCharacter:
UnicodeInputCharacter but not CR or LF
UnicodeInputCharacter is defined as follows in Section 3.3:
UnicodeInputCharacter:
UnicodeEscape
RawInputCharacter
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u {u}
HexDigit:
(one of)
0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
RawInputCharacter:
any Unicode character
As a result the lexical analyser honours Unicode escape sequences absolutely anywhere in the program text. For example, this is a valid Java program:
public class Bar {
public static void \u006d\u0061\u0069\u006e(String[] args) {
System.out.println("hello, world");
}
}
Here is the output:
$ javac Bar.java && java Bar
hello, world
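The translation step implied by the grammar quoted above can be sketched in Python. This is only an illustration, not javac's implementation: the function name `translate_unicode_escapes` is made up, and the backslash bookkeeping is simplified to "a backslash is eligible when preceded by an even number of contiguous backslashes". It runs before any tokenizing, which is why escapes work inside identifiers and fail inside comments alike.

```python
import re

_HEX4 = re.compile(r"[0-9a-fA-F]{4}")


def translate_unicode_escapes(src: str) -> str:
    """Replace \\uXXXX escapes (one or more 'u's allowed) before lexing.

    Hypothetical helper: a simplified reading of the UnicodeEscape grammar.
    """
    out = []
    i = 0
    backslashes = 0  # length of the run of raw backslashes just before position i
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            eligible = backslashes % 2 == 0
            if eligible and i + 1 < len(src) and src[i + 1] == "u":
                j = i + 1
                while j < len(src) and src[j] == "u":   # UnicodeMarker: u {u}
                    j += 1
                digits = src[j:j + 4]
                if not _HEX4.fullmatch(digits):
                    raise SyntaxError("illegal unicode escape")
                out.append(chr(int(digits, 16)))
                i = j + 4
                backslashes = 0   # the produced character starts no further escape
                continue
            backslashes += 1
        else:
            backslashes = 0
        out.append(ch)
        i += 1
    return "".join(out)


translated = translate_unicode_escapes(r"public static void \u006d\u0061\u0069\u006e(String[] args)")
```

Because the function knows nothing about comments or string literals, a malformed escape such as `\u6d` raises an error wherever it appears in the text.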
However, this is an incorrect Java program:
public class Baz {
// This comment contains \u6d.
public static void main(String[] args) {
System.out.println("hello, world");
}
}
Here is the error:
$ javac Baz.java
Baz.java:2: error: illegal unicode escape
// This comment contains \u6d.
^
1 error
Yes, this is an error even if the illegal Unicode escape sequence occurs in a comment!
Source control clients could apply this automatically upon checkin/checkout, so that clients with different platform encodings can work together. Alternatively, IDEs could do this when saving/loading Java source files. That never quite caught on, and the general advice was to stick to ASCII, at least outside comments.
[0] Since JDK 18, the default encoding is UTF-8. This probably also extends to javac, though I haven’t verified it.
[1] https://docs.oracle.com/javase/8/docs/technotes/tools/window...
Hmm, it seems to be alive, but based more on behavior than syntax.
The comment character in Git commit messages can be a problem when you insist on prepending your commits with some "id" and the id starts with `#`. One suggestion was to allow backslash escapes in commit messages since that makes sense to a computer scientist.[1]
But looking at all of this lexical stuff I wonder if makes-sense-to-computer-scientist is a good goal. They invented the problem of using a uniform delimiter for strings and then had to solve their own problem. Maybe it was hard to use backtick in the 70’s and 80’s, but today[2] you could use backtick to start a string and a single quote to end it.
What do C-like programming languages use single quotes for? To quote characters. Why do you need to quote characters? I’ve never seen a literal character which needed an "end character" marker.
Raw strings would still be useful but you wouldn’t need raw strings just to do a very basic thing like make a string which has typewriter quotes in it.
Of course this was for C-like languages. Don’t even get me started on shell and related languages where basically everything is a string and you have to make a single-quote/double-quote battle plan before doing anything slightly nested.
[1] https://lore.kernel.org/git/[email protected]/
[2] Notwithstanding us Europeans that use a dead-key keyboard layout where you have to type twice to get one measly backtick (not that I use those)
https://git-scm.com/docs/git-commit#Documentation/git-commit...
There are surprising layers to this. The reporter in that thread says that git-commit will “happily” accept `#` in commit messages, which is half-true: it will accept it if you don’t edit the message, since the `default` cleanup (that you linked to) will not remove comments if the message is given through things like `-m` and not an editing session. So `git commit -m'#something'` is fine. But then try to do a rebase or cherry-pick or whatever else later, and maybe get a merge commit message with commented-out "conflicted" files. Well, it can get confusing.
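A rough sketch of why a leading `#` is fragile, in Python. This is not Git's code: `cleanup_message` is a hypothetical helper that only mimics the documented behaviour of the `strip` cleanup mode (drop commentary lines, trailing whitespace, leading and trailing blank lines, and runs of blank lines), with the comment character as a parameter.

```python
def cleanup_message(message, comment_char="#", strip_comments=True):
    """Loose imitation of a 'strip'-style commit message cleanup (illustrative only)."""
    lines = [line.rstrip() for line in message.split("\n")]
    if strip_comments:
        # Any line that starts with the comment character is treated as commentary.
        lines = [line for line in lines if not line.startswith(comment_char)]
    cleaned = []
    for line in lines:
        # Drop leading blank lines and collapse consecutive blank lines.
        if line == "" and (not cleaned or cleaned[-1] == ""):
            continue
        cleaned.append(line)
    while cleaned and cleaned[-1] == "":
        cleaned.pop()
    return "\n".join(cleaned)


edited = "#123 fix parser\n\nDetails.\n# Please enter the commit message\n"
lost_subject = cleanup_message(edited)
kept_subject = cleanup_message("#123 fix parser\n\nDetails.\n; template hint\n", comment_char=";")
```

With the default character the `#123` subject line is silently dropped along with the template hint; with a different comment character it survives.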
That's how quoting works by default in m4 and TeX, both defined in the 70s. Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)
In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit. Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!
As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands. Unfortunately, following the Unicode botch theme, the standard failed to define an endianness or minimum width for them, so they're not very useful today. You can use them as enum values if you want to make your memory dumps easier to read in the debugger, and that's about it. I think Microsoft's compiler botched them so badly that even that's not an option if you need your code to run on it.
Unicode does not prescribe the appearance of characters. Although in the code chart¹ it says »neutral (vertical) glyph with mixed usage« (next to »apostrophe-quote« and »single quote«), font vendors have to deal with this mixed usage. And with Unicode the correct quotation marks have their own code points, making it unnecessary to design fonts where the ASCII apostrophe takes their form, but rendering all other uses pretty ugly.
I would regard using ` and ' as paired quotation marks as a hack from times when typographic expression was simply not possible with the character sets of the day.
_________
¹
0027 ' APOSTROPHE
= apostrophe-quote (1.0)
= single quote
= APL quote
• neutral (vertical) glyph with mixed usage
• 2019 ’ is preferred for apostrophe
• preferred characters in English for paired quotation marks are 2018 ‘ & 2019 ’
• 05F3 ׳ is preferred for geresh when writing Hebrew
→ 02B9 ʹ modifier letter prime
→ 02BC ʼ modifier letter apostrophe
→ 02C8 ˈ modifier letter vertical line
→ 0301 $́ combining acute accent
→ 030D $̍ combining vertical line above
→ 05F3 ׳ hebrew punctuation geresh
→ 2018 ‘ left single quotation mark
→ 2019 ’ right single quotation mark
→ 2032 ′ prime
→ A78C ꞌ latin small letter saltillo«
Good point. And it was in m4[1] I saw that backtick+apostrophe syntax. I would have probably not thought of that possibility if I hadn’t seen it there.
[1] Probably on Wikipedia since I have never used it
> Unfortunately Unicode retconned the ASCII apostrophe character ' to be a vertical line, maybe out of a misguided deference to Microsoft Windows, and now we all have to suffer the consequences. (Unless we're using Computer Modern fonts or other fonts that predate this error, such as VGA font ROM dumps.)
I do think the vertical line looks subpar (and I don’t use it in prose). But most programmers don’t seem bothered by it. :|
> In the 70s and 80s, and into the current millennium on Unix, `x' did look like ‘x’, but now instead it looks like dogshit.
Emacs tries to render it like ‘x’ since it uses backtick+apostrophe for quotes. With some mixed results in my experience.
> Even if you are willing to require a custom font for readability, though, that doesn't solve the problem; you need some way to include an apostrophe in your quoted string!
Aha, I honestly didn’t even think that far. Seems a bit restrictive to not be able to use possessives and contractions in strings without escapes.
> As for end delimiters, C itself supports multicharacter literals, which are potentially useful for things like Macintosh type and creator codes, or FTP commands.
I should have made it clear that I was only considering C-likes and not C itself. A language from the C trigraph days can be excused. To a certain extent.
C multicharacter literals are unrelated to trigraphs. Trigraphs were a mistake added many years later in the ANSI process.
git config core.commentchar <char>
This is helpful when you want to use, say, Markdown to have tidily formatted commit messages make up your pull request body too.
By nature, strings contain arbitrary text. Paired delimiters have one virtue over unary: they nest, but this virtue is only evident when a syntax requires that they must nest, and this is not the case for strings. It's but a small victory to reduce the need for some sort of escaping, without eliminating it.
Of the bewildering variety of partial solutions to the dilemma, none fully satisfactory, I consider the `backtick quote' pairing among the worst. Aside from the aesthetic problems, which can be fixed with the right choice of font, the bare apostrophe is much more common in plain text than an unmatched double quote, and the convention does nothing to help.
This comes at the cost of losing a type of string, and backtick strings are well-used in many languages, including by you in your second paragraph. What we would get in return for this loss is, nothing, because `don't' is just as invalid as 'don't' and requires much the same solution. `This is `not worth it', you see', especially as languages like to treat strings as single tokens (many exceptions notwithstanding) and this introduces a push-down to that parse for, again, no appreciable benefit.
I do agree with you about C and character literals, however. The close quote isn't needed and always struck me as somewhat wasteful. 'a is cleaner, and reduces the odds of typing "a" when you mean 'a'.
select'select'select
is a perfectly valid SQL query, at least for Postgres.
Languages' approach to whitespace between tokens is all over the place
so in theory you could notice "```python" or whatever and then start restricting to valid python code. (at least in theory, not sure how feasible/possible it would be in practice w/ their grammar format.)
for code i'm not sure how useful it would be since likely any model that is giving you working code wouldn't be struggling w/ syntax errors anyway?
but i have had success experimentally using the feature to drive fiction content for a game from a smaller llm to be in a very specific format.
i think i just like the idea of restricting LLM output, it has a lot of interesting use cases
TeX with its arbitrarily reprogrammable lexer: how adorable
Presumably this is also true in Python - IIRC the brace-delimited fields within f-strings may contain arbitrary expressions.
More generally, this must mean that the lexical grammar of those languages isn't regular. "Maintaining a stack" isn't part of a finite-state machine for a regular grammar - instead we're in the realm of pushdown automata and context-free grammars.
Is it even possible to support generalized string interpolation within a strictly regular lexical grammar?
Almost certainly not, a fun exercise is to attempt to devise a Pumping tactic for your proposed language. If it doesn’t exist, it’s not regular.
https://en.m.wikipedia.org/wiki/Pumping_lemma_for_regular_la...
`stuff${
}stuff${
}stuff`
so the ${ and } are extra closing and opening string delimiters, leaving the nesting to be handled by the parser.
You need a lexer hack so that the lexer does not treat } as the start of a string literal, except when the parser is inside an interpolation but all nested {} have been closed.
1. https://github.com/cmur2/joe-syntax/blob/joe-4.4/misc/HowItW...
2. https://gist.github.com/irdc/6188f11b1e699d615ce2520f03f1d0d...
The downside is your rulesets tend to get more verbose and are a little bit harder to structure than they might ideally be in other languages more suited towards the purpose, but I actually think that's an advantage, as it's much easier to reason about every production rule when looking at the code.
Console.WriteLine("""""");
Console.WriteLine("""""");
Are 2 triplequoted empty strings and not one "\nConsole.WriteLine(" sixtuplequoted string?
https://learn.microsoft.com/en-us/dotnet/csharp/programming-...
For a multi-line string the quotes have to be on their own line.
I imagine implementing those things took several iterations :)
Unterminated raw string literal.
https://replit.com/@Wei-YenYen/DistantAdmirableCareware#main...
> but TypeScript, Swift, Kotlin, and Scala take string interpolation to the furthest extreme of encouraging actual code being embedded inside strings
Many more languages support that:
C# $"{x} plus {y} equals {x + y}"
Python f"{x} plus {y} equals {x + y}"
JavaScript `${x} plus ${y} equals ${x + y}`
Ruby "#{x} plus #{y} equals #{x + y}"
Shell "$x plus $y equals $(echo "$x+$y" | bc)"
Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
> Tcl
Tcl is funny because comments are only recognized in code, and since it's homoiconic, it's very hard to distinguish code and data. { } are just funny string delimiters. E.g.:
xyzzy {#hello world}
Is xyzzy a command that takes a code block or a string? There's no way to tell. (Yes, that means that the Tcl tokenizer/parser cannot discard comments: only at evaluation time it's possible to tell if something is a comment or not.)
> SQL
PostgreSQL has the very convenient dollar-quoted strings: https://www.postgresql.org/docs/current/sql-syntax-lexical.h... E.g. these are equivalent:
'Dianne''s horse'
$$Dianne's horse$$
$SomeTag$Dianne's horse$SomeTag$
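As a sketch of how a scanner can recognise these forms, here is a small Python helper. The name `sql_string_value` and the tag pattern are assumptions for illustration (the tag rule is approximated as an identifier without `$`); the key trick is the backreference `\1`, which forces the closing tag to equal the opening one.

```python
import re

# $tag$ body $tag$, where the tag may be empty; \1 must repeat the opening tag.
DOLLAR_QUOTED = re.compile(r"\$((?:[A-Za-z_]\w*)?)\$(.*?)\$\1\$", re.DOTALL)


def sql_string_value(literal):
    """Return the text denoted by a quoted SQL string literal (illustrative only)."""
    m = DOLLAR_QUOTED.match(literal)
    if m and m.end() == len(literal):
        return m.group(2)            # no escaping at all inside dollar quotes
    if len(literal) >= 2 and literal[0] == "'" and literal[-1] == "'":
        return literal[1:-1].replace("''", "'")   # doubled quote means one quote
    raise ValueError("not a string literal this sketch understands")


values = [
    sql_string_value("'Dianne''s horse'"),
    sql_string_value("$$Dianne's horse$$"),
    sql_string_value("$SomeTag$Dianne's horse$SomeTag$"),
]
```

All three literals yield the same text, and a body may itself contain `$$` as long as it is wrapped in a different tag.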
my $foo = 5;
my $bar = 'x';
my $quux = "I have $foo $bar\'s: @{[$bar x $foo]}";
print "$quux\n";
This prints out: I have 5 x's: xxxxx
The "@{[...]}" syntax is abusing Perl's ability to interpolate an _array_ as well as a scalar. The inner "[...]" creates an array reference and the outer "@{...}" dereferences it.
For reasons I don't remember, the Perl interpreter allows arbitrary code in the inner "[...]" expression that creates the array reference.
...because it's an array value? Aside from how the languages handle references, how is that part any different from, for example, this in python:
>>> [5 * 'x']
['xxxxx']
You can put (almost) anything there, as long as it's an expression that evaluates to a value. The resulting value is what goes into the array.
Why? Surely it is easier for both the language and the programmer to have a rule for what you can do when constructing references to anonymous arrays, without having to special case whether that anonymous array is or is not in a string (or in any one of the many other contexts in which such a construct may appear in Perl).
my $bar = x;
should give the same result.
Good luck with lexing that properly.
My view on this is that it shouldn’t be interpreted as code being embedded inside strings, but as a special form of string concatenation syntax. In turn, this would mean that you can nest the syntax, for example:
"foo { toUpper("bar { x + y } bar") } foo"
The individual tokens being (one per line):
"foo {
toUpper
(
"bar {
x
+
y
} bar"
)
} foo"
If `+` does string concatenation, the above would effectively be equivalent to: "foo " + toUpper("bar " + (x + y) + " bar") + " foo"
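The token split listed above can be reproduced with a small lexer that keeps a stack of brace depths. This is a sketch in Python for the made-up `"... { expr } ..."` syntax used in this comment, not any real language's lexer; `lex` and its helper are hypothetical names. A `}` resumes a string piece only when every `{` opened inside the current interpolation has been closed.

```python
def lex(src):
    """Tokenize the hypothetical syntax, treating "..{, }..{ and }.." as string pieces."""
    tokens = []
    depth_stack = []   # one entry per open interpolation: unclosed plain '{' inside it
    n = len(src)

    def string_piece(start):
        # Scan from the opening '"' or '}' to the next '"' (end) or '{' (interpolation).
        j = start + 1
        while j < n and src[j] not in '"{':
            j += 2 if src[j] == "\\" else 1   # skip escaped characters
        if j >= n:
            raise SyntaxError("unterminated string")
        return j + 1, src[j] == "{"

    i = 0
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c == '"' or (c == "}" and depth_stack and depth_stack[-1] == 0):
            if c == "}":
                depth_stack.pop()          # this brace closes an interpolation
            end, opens_interpolation = string_piece(i)
            if opens_interpolation:
                depth_stack.append(0)
            tokens.append(src[i:end])
            i = end
        elif c.isalnum() or c == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(src[i:j])
            i = j
        else:
            if depth_stack:                # track ordinary braces inside interpolations
                if c == "{":
                    depth_stack[-1] += 1
                elif c == "}":
                    depth_stack[-1] -= 1
            tokens.append(c)
            i += 1
    return tokens


tokens = lex('"foo { toUpper("bar { x + y } bar") } foo"')
```

The stack is what takes this beyond a plain regular-expression lexer: without it, the scanner cannot tell whether a `}` ends a block in the embedded expression or resumes the string.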
I don’t know if there is a language that actually works that way.
var str = string.Format("{0} is {1} years old.", name, age);
When interpolated strings were introduced later, it was natural to have them compile to this instead of concatenation.
name + " is " + age + " years old."
couldn't compile to exactly the same. (Other than maybe `string.Format` having some additional customizable behavior, I don't know C# that well.)
> just very simple syntactic sugar for normal string concatenation.
Maybe. There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.
> There’s also possibly a string conversion. It seems reasonable to want to disallow implicit string conversion in a concatenation operator context (especially if overloading +) while allowing it in the interpolation case.
Many languages have a string concatenation operator that does implicit conversion to string, while still having a string interpolation syntax like the above. It's kind of my point that both are much more similar to each other than many people seem to realize.
Which is fun, because correct highlighting depends on language version. Haskell has similar problems where different compiler flags require different parsers. Close enough is sufficient for syntax highlighting, though.
Python is also a bit weird because it calls the format methods, so objects can intercept and react to the format specifiers in the f-string while being formatted.
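A minimal illustration of that hook, using a made-up `Temperature` class and a made-up `F` spec: whatever follows the colon in the replacement field is handed to the object's `__format__` method as a plain string, and the object decides what it means.

```python
class Temperature:
    """Hypothetical example type that interprets its own format spec."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, spec):
        # The text after ':' in the f-string arrives here unparsed.
        if spec == "F":
            return f"{self.celsius * 9 / 5 + 32:.1f}°F"
        return f"{self.celsius:.1f}°C"


boiling = Temperature(100)
default_text = f"{boiling}"
fahrenheit_text = f"{boiling:F}"
```

So the part of an f-string after the colon is not Python syntax at all from the lexer's point of view; its meaning depends on the runtime type of the value.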
>>> print(f"foo {"bar"}")
SyntaxError: f-string: expecting '}'
Only this works:
>>> print(f"foo {'bar'}")
foo bar
Python 3.12.7 (main, Oct 3 2024, 15:15:22) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print(f"foo {"bar"}")
foo bar
That should probably not be one token.
> My view on this is that it shouldn’t be interpreted as code being embedded inside strings
I’m not sure exactly what you’re proposing and how it is different. You still can’t parse it as a regular lexical grammar.
How does this change how you highlight either?
Whatever you call it, to the lexer it is a special string, it has to know how to match it, the delimiters are materially different than concatenation.
I might be being dense but I’m not sure what’s formally distinct.
> That should probably not be one token.
It's exactly the point that this is one token. It's a string literal with opening delimiter `"` and closing delimiter `{`, and that whole token itself serves as a kind of opening "brace". Alternatively, you can see `{` as a contraction of `" +`. Meaning, aside from the brace balancing requirement, `"foo {` does the same as `"foo " +` would.
Still alternatively, you could imagine a language that concatenates around string literals by default, similar to how C behaves for sequences of string literals. In C,
"foo" "bar" "baz"
is equivalent to
"foobarbaz"
Similarly, you could imagine a language where
"foo" some_variable "bar"
would perform implicit concatenation, without needing an explicit operator (as in `"foo" + x + "bar"`). And then people might write it without the inner whitespace, as:
"foo"some_variable"bar"
My point is that
"foo{some_variable}bar"
is really just that (plus a condition requiring balanced pairs of braces). You can also re-insert the spaces for emphasis:
"foo{ some_variable }bar"
The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.
> How does this change how you highlight either?
You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals.
Fair enough. The point, as you have acknowledged, being that unlike + you have to treat { specially for balancing (and separately from the “).
> The fact that people tend to think of `{some_variable}` as an entity is sort-of an illusion.
I guess. I just don’t know what being an illusion means formally. It’s not an illusion to the person that has to implement the state machine that balances the delimiters.
> You would highlight the `"...{`, `}...{`, and `}..."` parts like normal string literals (they just use curly braces instead of double quotes at one or both ends), and highlight the inner expressions the same as if they weren't surrounded by such literals
Emacs does it this way FWIW. But I’m not sure how important it is to dictate that the brace can’t be a different color.
In any event, I can agree your design is valid (Kotlin works this way), but I don’t necessarily agree it is any more valid than, say, how Python does it, where there can be format specifiers and implicit conversion to string is performed, unlike with concatenation. I’m not seeing the clear definitive advantage of interpolated strings being an equivalent to concatenation vs some other type of method call.
The other detail is order of evaluation or sequencing. String concat may behave differently. Not sure I agree it is wrong, because at the end of the day it is distinct looking syntax. Illusion or not, it looks like a neatly enclosed expression, and concatenation looks like something else. That they might parse, evaluate or behave different isn't unreasonable.
"$x plus $y equals $((x+y))"
But if I'm not mistaken, it originated in csh.
> "$x plus $y equals $((x+y))"
No, it is specified in POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V...
I did not know that. Today I learned.
Julia as well:
Julia "$x plus $y equals $(x+y)"
There is a record constructor syntax in VHDL using attribute invocation syntax: RECORD_TYPE'(field1expr, ..., fieldNexpr). This means that if the first field of your record is a subtype of a character type, you can get a record construction expression like this one: REC'('0',1,"10101").
Good luck distinguishing between '(' as a character literal and "'", "(" and "'0'" at lexical level.
Haskell.
Haskell has context-free syntax for bracketed ("{-" ... "-}") comments. Lexer has to keep bracketed comment syntax balanced (for every "{-" there should be accompanying "-}" somewhere).
Shell "$x plus $y equals $((expr $x + $y))"
Make :) echo "$(x) plus $(y) equals $(shell echo "$x+$y" | bc)"
I'm guessing this is the reason for the :) but to be clear for anyone else: Make is only doing half of the work, whatever comes after "shell" is being passed to another executable, then make captures its stdout and interpolates that. The other executable is "sh" by default but can be changed to whatever.
Note about Scala's string interpolation. They can be used as pattern match targets.
val s"${a} + ${b}" = "1 + 2";
println(a) // 1
println(b) // 2
puts "This is #{<<HERE.strip} evil"
incredibly
HERE
Just to combine the string interpolation with her concern over Ruby heredocs.
My other favorite evil quirk in Ruby is that whitespace is a valid quote character in Ruby. The string (without the quotes) "% hello " is a quoted string containing "hello" (without the quotes), as "%" in contexts where there is no left operand initiates a quoted string and the next characters indicates the type of quotes. This is great when you do e.g. "%(this is a string)" or "%{this is a string}". It's not so great if you use space (I've never seen that in the wild, so it'd be nice if it was just removed - even irb doesn't handle it correctly)
That's so going in the blog post later today.
puts "hello #{<<onoz.strip} world"
recursion is #{<<onoz.strip}
recursive
onoz
onoz
puts "that was fun"
yields
hello recursion is recursive world
that was fun
and then there's its backtick friend
puts "hello #{<<`onoz`.strip} world"
date -u
onoz
coughs up
hello Sun Nov 3 17:25:32 UTC 2024 world
and for those trying out your percent-space trick, be aware that it only tolerates such a thing in a standalone expression context so
puts (% hello )+" world"
# or
x = % hello #
puts x
because when I tried it "normally" I got
$ /usr/bin/ruby -e 'puts % hello + "world"'
-e:1:in `<main>': undefined local variable or method `hello' for main:Object (NameError)
$ /usr/bin/ruby -v
ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.x86_64-darwin21]
but, at the intersection is "ruby parsing is the 15th circle of hell"
ruby -e 'puts (% #{<<FOO.strip} )+ " world"
hello
FOO
'
Yes, it's roughly limited in use to places where it is not ambiguous whether it would be the start of a quoted string or the modulus operator, and after a method name would be ambiguous.
> but, at the intersection is "ruby parsing is the 15th circle of hell"
It's surprisingly (not this part, anyway) not that hard. You "just" need to create a forward reference, and keep track of heredocs in your parser, and when you come to the end of a line with heredocs pending, you need to parse them and assign them to the variable references you've created.
It is dirty, though, and there are so many dark corners of parsing Ruby. Having written a partial Ruby parser, and being a fan of Wirth-style grammar simplicity while enjoying using Ruby is a dark, horrible place to live in. On the one hand, I find Ruby a great pleasure to use, on the other hand, the parser-writer in me wants to spend eternity screaming into the void in pain.
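The "pending heredocs" bookkeeping described above can be sketched in a few lines of Python. This is a deliberately naive illustration, not a Ruby parser: `resolve_heredocs` is a made-up name, the `<<TAG` regex would also match a left shift, and bodies that themselves open further heredocs (as in the recursive example earlier in the thread) are not handled.

```python
import re

HEREDOC_START = re.compile(r"<<(\w+)")   # naive: ignores <<~, <<-, quoted tags


def resolve_heredocs(source):
    """Pair each code line with the bodies of the heredocs it opened (illustrative only)."""
    lines = source.split("\n")
    result = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        # Heredocs opened on this line are only forward references so far...
        pending = HEREDOC_START.findall(line)
        bodies = []
        # ...and are filled in, in order, once the end of the line is reached.
        for tag in pending:
            body = []
            while i < len(lines) and lines[i] != tag:
                body.append(lines[i] + "\n")
                i += 1
            i += 1   # skip the terminator line
            bodies.append((tag, "".join(body)))
        result.append((line, bodies))
    return result


source = 'puts "This is #{<<HERE.strip} evil"\nincredibly\nHERE\nputs "done"'
resolved = resolve_heredocs(source)
```

The lines between the opening line and the terminator never reach the ordinary tokenizer, which is the part that makes heredocs awkward for a conventional lexer/parser split.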
I wish PG had dollar-bracket quoting where you have to use the closing bracket to close, that way vim showmatch would work trivially. Something like ${...}$.
log.trace($"Entering iteration {i} for customer {c.ID} [{c.ShortName}]");
in a hot loop would call string.Concat every time it was called before the logger could bail out of the method.
C# lets you declare an overload that accepts a `DefaultInterpolatedStringHandler` (or your own custom implementation of the handler pattern) and this overload will take precedence and allow you to delay the building of the string until after you've checked whether logging it is required.
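The same eager-versus-deferred distinction can be shown in Python, as an analogy rather than a description of the C# mechanism. The standard `logging` module formats `%s` arguments only if the record is actually emitted, whereas an f-string is built before the call. The `Expensive` class and the logger name are made up for the demonstration.

```python
import logging


class Expensive:
    """Hypothetical value that counts how often it is converted to a string."""

    def __init__(self):
        self.calls = 0

    def __str__(self):
        self.calls += 1
        return "expensive"


log = logging.getLogger("interp_demo")
log.setLevel(logging.WARNING)            # debug messages are disabled
log.propagate = False
log.addHandler(logging.NullHandler())

value = Expensive()
log.debug(f"eager: {value}")             # string is built, then thrown away
calls_after_eager = value.calls
log.debug("lazy: %s", value)             # level check happens first, no formatting
calls_after_lazy = value.calls
```

Only the f-string call pays for the conversion; the deferred form leaves the counter unchanged because the logger bails out before formatting.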
Which is sort of ironic because learning how to do structural editing on lisps has absolutely been more hurdle than help so far, but I'm sure it'll pay off eventually.
And I think a complex syntax is far easier to read and write than a simple syntax with complex semantics. You also get a faster feedback loop in case the syntax of your code is wrong vs the semantics (which might be undiscovered until runtime).
but for example a language like C has many different syntaxes for different operations, like function declaration or variable or array syntax, or if/switch-case etc etc.
so to know C syntax you need to learn all these different ways to do different things, but in lisp you just need to know how to match parentheses.
But of course you still want to declare variables, or have if/else and switch case. So you instead need to learn the builtin macros (what GP means by semantics) and their "syntax" that is technically not part of the language's syntax but actually is since you still need all those operations enough that they are included in the standard library and defining your own is frowned upon.
1. 1 language to rule them all, fancy syntax
2. Many languages, 1 simple syntax to rule them all
3. Many languages and many fancy syntaxes
Here in the wreckage of the tower of babel, 1. isn't really on the table. But 2. might have benefits because the inhumanity of the syntax need only be confronted once. The cumulative cost of all the competing opinionated fancy syntaxes may be the worst option. Think of all the hours lost to tabs vs spaces or braces vs whitespace.
I don’t think we can have 1 language that satisfies the needs of all people who write code, and thus, we can’t have 1 syntax that does that either.
3 seems the only sensible solution to me, and we have it.
I admire the widget makers, especially those wrangling the gaps between languages. I just wish their work could be made easier.
The main benefit of putting things in the syntax seems to be that many errors would become visually obvious.
1: There have been several attempts at Universal ASTs, including (unsurprisingly) a JVM-centric one from JetBrains https://github.com/JetBrains/intellij-community/blob/idea/24...
With just these four things you will be 95% there, enjoying the fruits of paredit without any complexity — all the remaining tricks you can learn later when you feel like you’re fluent.
<rant> It's not so much the editing itself but the unfamiliarity of the ecosystem. It seems it's a square peg, and I've been crafting a round hole of habits for it:
I guess I should use emacs? How to even configure it such that these actions are available? Or maybe I should write a plugin for helix so that I can be in a familiar environment. Oh, but the helix plugin language is a scheme, so I guess I'll use emacs until I can learn scheme better and then write that plugin. Oh but emacs keybinds are conflicting with what I've configured for zellij, maybe I can avoid conflicts by using evil mode? Oh ok, emacs-lisp, that's a thing. Hey symex seems like it aligns with my modal brain, oh but there goes another afternoon of fussing with emacs. Found and reported a symex "bug" but apparently it only appears in nix-governed environments so I guess I gotta figure out how to report the packaging bug (still todo). Also, I guess I might as well figure out how to get emacs to evaluate expressions based on which ones are selected, since that's one of the fun things you can do in lisps, but there's no plugin for the scheme that helix is using for its plugin language (which is why I'm learning scheme in the first place), but it turns out that AI is weirdly good at configuring emacs so now my emacs config contains most that that plugin would entail. Ok, now I'm finally ready to learn scheme, I've got this big list of new actions to learn: https://countvajhula.com/2021/09/25/the-animated-guide-to-sy.... Slurp, barf, and raise you say? excellent, I'll focus on those.
I'm not actually trying to critique the unfamiliar space. These are all self inflicted wounds: me being persnickety about having it my way. It's just usually not so difficult to use something new and also have it my way.</rant>
Don't do that. ;)
Emacs is a graphical application! Don't use it in the terminal unless you really have to (i.e., you're using it on a remote machine and TRAMP will not do).
> it turns out that AI is weirdly good at configuring emacs
I was just chatting with a friend about this. ChatGPT seems to be much better at writing ELisp than many other languages I've asked it to work with.
Also while you're playing with it, you might be interested in checking out kakoune.el or meow, which provide modal editing in Emacs but with the selection-first ordering for commands, like in Kakoune and Helix rather than the old vi way.
PS: symex looks really interesting! Hadn't seen that one
For example, https://pyret.org/
It really isn’t simple or necessarily uniform.
A CL implementation will implement reading lists, symbols, numbers, arrays, strings, structures, characters, pathnames, ... via reader macros. Additionally the reader implements various forms of control operations: conditional reading, reading and evaluation, circular datastructures, quoting and comments.
This is user programmable and configurable. Most uses will be in the two above categories: data structure syntax and control. For example we could add a syntax for hash tables to s-expressions. An example for a control extension would be to add support for named readtables. For example a Common Lisp implementation could add a readtable for reading s-expressions from Scheme, which has a slightly different syntax.
Reader macros were optimized for implementing s-expressions, thus the mechanism isn't that convenient as a lexer/parser for actual programming languages. It's a bit painful to do so, but possible.
A typical reader macro usage, beyond the usage described above, is one which implements a different token or expression syntax. For example there are reader macros which parse infix expressions. This might be useful in Lisp code where arithmetic expressions can be written in a more conventional infix syntax. The infix reader macro would convert infix expressions into prefix data.
Syntax coloring has to handle context: different rules for material nested in certain ways.
Vim's syntax highlighter lets you declare two kinds of items: matches and regions. Matches are simpler lexical rules, whereas regions have separate expressions for matching the start and end and middle. There are ways to exclude leading and trailing material from a region.
Matches and regions can declare that they are contained. In that case they are not active unless they occur in a containing region.
Contained matches declare which regions contain them.
Regions declare which other regions they contain.
That's the basic semantic architecture; there are bells and whistles in the system due to situations that arise.
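To make the match/region/contained idea concrete, here is a minimal Python sketch. It is not Vim's implementation or API: the names `Match`, `Region` and `highlight` are made up for illustration, and it only models the region-side "contains" list, not items that declare their own containers.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    name: str
    pattern: str
    contained: bool = False   # contained items are inactive at top level


@dataclass
class Region:
    name: str
    start: str
    end: str
    contains: tuple = ()      # names of items active inside this region
    contained: bool = False


def highlight(text, items):
    """Return a list of (group name, matched text) pairs (hypothetical helper)."""
    by_name = {item.name: item for item in items}
    top_level = [item for item in items if not item.contained]
    spans = []
    stack = []                # open regions, innermost last
    pos = 0
    while pos < len(text):
        region = stack[-1] if stack else None
        if region is not None:
            # Inside a region, its end pattern is tried first.
            m = re.compile(region.end).match(text, pos)
            if m:
                spans.append((region.name, m.group()))
                stack.pop()
                pos = m.end()
                continue
            active = [by_name[n] for n in region.contains]
        else:
            active = top_level
        for item in active:
            pattern = item.start if isinstance(item, Region) else item.pattern
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                spans.append((item.name, m.group()))
                if isinstance(item, Region):
                    stack.append(item)
                pos = m.end()
                break
        else:
            # No rule applies: the character belongs to the enclosing region,
            # or is plain text at top level.
            spans.append((region.name if region else "plain", text[pos]))
            pos += 1
    return spans


RULES = [
    Region("string", start='"', end='"', contains=("escape",)),
    Match("escape", r"\\.", contained=True),
    Match("number", r"\d+"),
]
```

With these rules, a digit inside a string stays part of the string, because "number" is not listed in the string region's `contains`, while a backslash escape is only recognised inside the string.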
I don't think even Justine could develop that in an interview, other than as an overnight take home.
This is the "genman" script which takes the raw output of a manpage to HTML converter, and massages it to form the HTML version of the TXR manual:
https://www.kylheku.com/cgit/txr/tree/genman.txr
Everything that is white (not colored) is literal template material. Lisp code is embedded in directives, like @(do ...). In this scheme, TXR keywords appear purple, TXR Lisp ones green. They can be the same; see the (and ...) in line 149, versus numerous occurrences of @(and).
Quasistrings contain nested syntax: see 130 where `<a href ..> ... </a>` contains an embedded (if ...). That could itself contain a quasistring with more embedded code.
TXR's txr.vim and tl.vim syntax definition files are both generated by this:
It’s not only running time, but also ease of implementation.
A good syntax highlighter should do a decent job highlighting both valid and invalid programs (rationale: in most (editor, language) pairs, writing a program involves going through moments where the program being written isn’t a valid program)
If you decide to use an AST, that means you need to have good heuristics for turning invalid programs into valid ones that best mimic what the programmer intended. That can be difficult to achieve (good compilers have such heuristics, but even if you have such a compiler, chances are it isn’t possible to reuse them for syntax coloring)
If this simpler approach gives you most of what you can get with the AST approach, why bother writing that?
Also, there are languages where some programs can’t be perfectly parsed or syntax colored without running them. For those, you need this approach.
Not so sure I’d put money on that opinion ;)
And every Standard ML programmer might find this to be a surprising limitation. The following is a valid Standard ML program:
(* (* Nested (**) *) comment *)
val _ = print "hello, world\n"
Here is the output: $ sml < hello.sml
Standard ML of New Jersey (64-bit) v110.99.5 [built: Thu Mar 14 17:56:03 2024]
- = hello, world
$ mlton hello.sml && ./hello
hello, world
Given how C was considered one of the "expressive" languages when it arrived, it's curious that nested comments were never part of the language.
It is not quite clear to me why the lack of single-line comments is such a surprising limitation. After all, a single-line block comment can easily serve as a substitute. However, there is no straightforward workaround for the lack of nested block comments.
> I’ve never heard someone refer to C as “expressive”, but maybe it was in 1972 when compared to assembly.
I was thinking of Fortran in this context. For instance, Fortran 77 lacked function pointers and offered a limited set of control flow structures, along with cumbersome support for recursion. I know Fortran, with its native support for multidimensional arrays, excelled in numerical and scientific computing but C quickly became the preferred language for general purpose computing.
While very few today would consider C a pinnacle of expressiveness, when I was learning C, the landscape of mainstream programming languages was much more restricted. In fact, the preface to the first edition of K&R notes the following:
"In our experience, C has proven to be a pleasant, expressive and versatile language for a wide variety of programs."
C, Pascal, etc. stood out as some of the few mainstream programming languages that offered a reasonable level of expressiveness. Of course, Lisp was exceptionally expressive in its own right, but it wasn't always the best fit for certain applications or environments.
> And what bearing does the comment syntax have on the expressiveness of a language?
Nothing at all. I agree. The expressiveness of C comes from its grammar, which the language parser handles. Support for nested comments, in the context of C, is a concern for the lexer, so indeed one does not directly influence the other. However, it is still curious that a language with such a sophisticated grammar and parser could not allocate a bit of its complexity budget to support nested comments in its lexer. This is a trivial matter, I know, but I still couldn't help but wonder about it.
I can imagine both pro and con arguments for supporting nested comments, but regardless of what I think, C certainly could have added support for nested comments at any time, and hasn’t, which suggests that there isn’t sufficient need for it. That might be the entire explanation: not even worth a little complexity.
Is it true that many languages had single line comments? Maybe I’m forgetting more, but I remember everything else having single line comments… asm, basic, shell. I used Pascal in the 80s and apparently forgot it didn’t have line comments either?
After C89 was ratified, adding nested comments to C would have risked breaking existing code. For instance, this is a valid program in C89:
#include <stdio.h>
int main() {
    /* /* Comment */
    printf("hello */ world");
    return 0;
}
However, if a later C standard were to introduce nested comments, it would break the above program because then the following part of the program would be recognised as a comment:
/* /* Comment */
    printf("hello */
The above text would be ignored. Then the compiler would encounter the following:
world");
This would lead to errors like undeclared identifier 'world', missing terminating " character, etc.
That is precisely why nested comments would end up breaking the C89 code example I provided above. I elaborate on this further below.
> There’s no reason to assume the comment terminator wouldn’t be ignored in strings.
There is no notion of "comment terminator in strings" in C. At any point of time, the lexer is reading either a string or a comment but never one within the other. For example, in C89, C99, etc., this is an invalid C program too:
#include <stdio.h>
int main() {
    /* Comment
    printf("hello */ world");
    return 0;
}
In this case, we wouldn't say that the lexer is "honoring the comment terminator in a string" because, at the point the comment terminator '*/' is read, there is no active string. There is only a comment that looks like this:
/* Comment
    printf("hello */
The double quotation mark within the comment is immaterial. It is simply part of the comment. Once the lexer has read the opening '/*', it looks for the terminating '*/'. This behaviour would hold even if future C standards were to allow nested comments, which is why nested comments would break the C89 example I mentioned in my earlier HN comment.
> And even today, you can safely write printf("hello // world\n"); without risking a compile error, right?
Right. But it is not clear what this has got to do with my concern that nested comments would break valid C89 programs. In this printf() example, we only have an ordinary string, so obviously this compiles fine. Once the lexer has read the opening quotation mark as the beginning of a string, it looks for an unescaped terminating quotation mark. So clearly, everything until the unescaped terminating quotation mark is a string!
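For anyone who wants to poke at this, here is a rough Python sketch of the lexer behaviour described above: at any moment it is either inside a string or inside a comment, never one within the other. The function name `strip_comments` and the `nested` flag are made up for illustration; it ignores character literals, `//` comments and line splicing, so it is nowhere near a real C lexer.

```python
def strip_comments(src, nested=False):
    """Replace each /* ... */ comment with one space (toy sketch, not a C lexer)."""
    out = []
    i = 0
    n = len(src)
    while i < n:
        ch = src[i]
        if ch == '"':
            # String state: copy through to the unescaped closing quote.
            j = i + 1
            while j < n and src[j] != '"':
                if src[j] == "\\":
                    j += 1          # skip the escaped character
                j += 1
            if j >= n:
                raise SyntaxError('missing terminating " character')
            out.append(src[i:j + 1])
            i = j + 1
        elif src.startswith("/*", i):
            # Comment state: quotes in here are just comment text.
            depth = 1
            i += 2
            while depth > 0:
                if i >= n:
                    raise SyntaxError("unterminated comment")
                if nested and src.startswith("/*", i):
                    depth += 1      # only counted in the hypothetical nested mode
                    i += 2
                elif src.startswith("*/", i):
                    depth -= 1
                    i += 2
                else:
                    i += 1
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


C89_PROGRAM = '''#include <stdio.h>
int main() {
    /* /* Comment */
    printf("hello */ world");
    return 0;
}
'''
```

With `nested=False` the program comes through with its printf line intact. With `nested=True` the comment swallows everything up to the `*/` that used to be inside the string, and the leftover `world");` starts a string that never terminates, so the sketch raises SyntaxError.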
But we did have dummy procedures, which covered one of the important use cases directly, and which could be abused to fake function/subroutine pointers stored in data.
#if 0
This is a
#if 0
nested comment!
#endif
#endif
(unifdef has some evil code to support using C-style preprocessor directives with non-C source, which mostly boils down to ignoring comments. I don’t recommend it!)
Are you sure? I just tried on godbolt and that’s not true with gcc 14.2. I’ve definitely put syntax errors intentionally into #if 0 blocks and had it compile. Are you thinking of some older version or something? I thought the pre-processor ran before the lexer since always…
• program is lexed into preprocessing tokens; comments turn into whitespace
• preprocessor does its thing
• preprocessor tokens are turned into proper tokens; different kinds of number are disambiguated; keywords and identifiers are disambiguated
If you put an unclosed comment inside #if 0 then it won’t work as you might expect.
Is that a joke?
I guess the article could be called Falsehoods Programmers Assume of Programming Language Syntaxes.
do_action() ??!??! handle_error()
It almost looks like special error handling syntax but still remains satisfying once you realize it's an || logical-or expression and it's using short circuiting rules to execute handle_error() only if the action returns zero (i.e. a false value).
An attempt to answer that: In English, mixing ?! at the end of a question is a way of indicating bewilderment. Like "What was that?!"
[1] https://journals.linguisticsociety.org/proceedings/index.php...
FWIW the ??!??! double trigraph as error processing is funny because of the meaning of ?! and various combinations of ? and !. It is funny and it has trigraphs. That’s the whole point.
I gave the question a +1 because I, as previously stated, read it to be genuine curiosity. Maybe a smiley would’ve helped, I don’t know. ¯\_(ツ)_/¯
But I am free to be curious and ask the author why he chose it! We are not computers but human beings! There is no HN rule that says that I cannot be curious and ask a question that arose from a thread but is not connected to it! [1].
[1] https://www.iflscience.com/charles-babbage-once-sent-the-mos...
They chose it because the discussion is about trigraphs. It is not a particularly surprising choice to talk about trigraphs in that context.
> There is no HN rule that says that I cannot be curious and ask a question that arose from a thread but is not connected to it!
I did not write that you did not follow HN rules, just that your question was very strange in the context. I did not downvote you, but I understand why some people did. Your question was a bit passive-aggressive, even if you did not mean it.
Your comments seem to suggest your reading was “one of my favorite trigraphs [to use is]”, which is understandable as a valid interpretation.
This whole chain of comments is a misunderstanding arising from the original comment’s ambiguity.
I think that is uncharitable. The question ("Did you choose the legacy C trigraphs over || for aesthetic purposes?") makes perfect sense to me. I think context makes it reasonably clear that the answer is 'yes,' but that doesn't mean that the question doesn't make sense, only perhaps that it didn't need to be asked.
??! is converted to |
So ??!??! becomes ||, i.e. “or”
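If it helps, the replacement can be sketched in a few lines of Python. The table holds the nine standard C trigraphs. The function name `replace_trigraphs` is made up, and a real compiler does this as an early translation phase, not as a standalone pass like this.

```python
# The nine trigraph sequences defined by the C standard.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(src):
    """Scan left to right, replacing each trigraph with its single character."""
    out = []
    i = 0
    while i < len(src):
        tri = src[i:i + 3]
        if tri in TRIGRAPHS:
            out.append(TRIGRAPHS[tri])
            i += 3
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


print(replace_trigraphs("do_action() ??!??! handle_error()"))
# prints: do_action() || handle_error()
```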
Is this a "no third party dependencies" thing?
But fair enough - fewer dependencies is always nice, especially in C++ (which doesn't have a modern package manager) and in ML where an enormous janky Python installation is apparently a perfectly normal thing to require.
Microsoft is such an odd duck, sometimes, but I'm glad to take advantage of their "good years" while it lasts
And here - https://lobste.rs/s/9huy81/tbsp_tree_based_source_processing...
For new projects I use Chumsky. It's a pure Rust parser which is nice because it means you avoid the generated C, and it also gives you a fully parsed and natively typed output, rather than Tree Sitter's dynamically typed tree of nodes, which means there's no extra parsing step to do.
The main downside is it's more complicated to write the parser (some fairly extreme types). The API isn't stable yet either. But overall I like it more than Tree Sitter.
Perl's syntax is undecidable, because the difference between treating some characters as a comment or as a regex can depend on the type of a variable that is only determined e.g. based on whether a search for a Collatz counterexample terminates, or just, you know, user input.
https://perlmonks.org/?node_id=663393
C++ templates have a similar issue, I think.
There’s a function called intuit_more which works out if $var[stuff] inside a regex is a variable interpolation followed by a character class, or an array element interpolation. Its result can depend on whether something in the stuff has been declared as a variable or not.
But even if you ignore the undecidability, the rest is still ridiculously complicated.
TIL! I went and dug up a citation: https://blog.reverberate.org/2013/08/parsing-c-is-literally-...
Parsing Bash is Undecidable - https://www.oilshell.org/blog/2016/10/20.html
I remember a talk from Larry Wall on Perl 6 (now Raku), where he says this type of thing is a mistake. Raku can be statically parsed, as far as I know.
Morbig: A static parser for POSIX shell - https://scholar.google.com/scholar?cluster=15754961728999604...
(at the time I wrote the post about bash, I hadn't implemented aliases yet)
But it's a little different since it is an intentional feature, not an accident. It's designed to literally reinvoke the parser at runtime. I think it's not that good/useful a feature, and I tend to avoid it, but many people use it.
The world corpus of software would be much better documented if everywhere else had stolen this from Perl. Inline POD is great.
Well, they're not doing themselves any favors by just willy nilly mixing C with "user-facing" defuns <https://emba.gnu.org/emacs/emacs/-/blob/ed1d691184df4b50da6b...>. I was curious if they could benefit from "literate programming" since OrgMode is the bee's knees, but not with that style of coding they can't.
The Python standard library manual is also exemplary, and also necessarily organized differently from the source code.
Maybe parts of it are, but as a concrete example https://docs.python.org/3/library/re.html#re.match is just some YOLO about what, specifically, is the first argument to re.match: string, or compiled expression? Well, it's both! Huzzah! I guess they get points for consistency because the first argument to re.compile is also "both"
But, any idea what type re.compile returns? cause https://docs.python.org/3/library/re.html#re.compile is all "don't you worry about it" versus its re.match friend who goes out of their way to state that it is an re.Match object
Would it have been so hard to actually state it, versus requiring someone to invoke type() to get <class 're.Pattern'>?
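For reference, a small check of the behaviour being discussed, as a sketch assuming a reasonably recent CPython 3 where `re.Pattern` and `re.Match` are exposed by name:

```python
import re

# re.compile returns a compiled pattern object.
pattern = re.compile(r"(\w+) (\w+)")
assert isinstance(pattern, re.Pattern)

text = "Isaac Newton, physicist"

# The first argument to re.match may be a string or a compiled pattern.
from_string = re.match(r"(\w+) (\w+)", text)
from_compiled = re.match(pattern, text)
from_method = pattern.match(text)

for m in (from_string, from_compiled, from_method):
    assert isinstance(m, re.Match)
    assert m.groups() == ("Isaac", "Newton")
```

All three calls find the same match, so the "both" reading of the first argument holds in practice, whatever the manual says about it.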
But this ability to pass a compiled regexp rather than a string can't have been an accidental feature, so I don't know why it isn't documented.
Probably it would be good to have an example of invoking re.match with a literal string in the documentation item for re.match that you linked. There are sixteen such examples in the chapter, the first being re.match(r"(\w+) (\w+)", "Isaac Newton, physicist"), so you aren't going to be able to read much of the chapter without figuring out that you can pass a string there, but all sixteen of them come after that section. A useful example might be:
>>> [s for s in ["", " ", "a ", " a", "aa"] if re.match(r'\w', s)]
['a ', 'aa']
It's easy to make manuals worse by adding too much text to them, but in this case I think a small example like that would be an improvement.
As for what type re.compile returns, the section you linked to says, "Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below." Is your criticism that it doesn't explicitly say that the regular expression object is returned (as opposed to, I suppose, stored in a table somewhere), or that it says "a regular expression object" instead of saying "an re.Pattern object"? Because the words "regular expression object" are a link to the "Regular Expression Objects" section, which begins by saying, "class re.Pattern: Compiled regular expression object returned by re.compile()." To me the name of the class doesn't seem like it adds much value here: to write programs that work using the re module, you don't need to know the name of the class the regular expression objects belong to, just what interface they support.
(It's unfortunate that the class name is documented, because it would be better to rename it to a term that wasn't already defined to mean "a string that can be compiled to a regular expression object"!)
But possibly I've been using the re module long enough that I'm blind to the deficiencies in its documentation?
Anyway, I think documentation questions like this, about gradual introduction, forward references, sequencing, publicly documented (and thus stable) versus internal-only names, etc., are hard to reconcile with the constraints of source code, impossible in most languages. In this case the source code is divided between Python and C, adding difficulty.
A few (admittedly silly) questions about the list:
1. Why no Erlang, Elixir, or Crystal?
Erlang appears to be just at the author's boundary at #47 on the TIOBE index. https://www.tiobe.com/tiobe-index/
2. What is "Shell"? Sh, Bash, Zsh, Windows Cmd, PowerShell..?
3. Perl but no Awk? Curious why, because Awk is a similar but comparatively trivial language. Widely used, too.
To be fair, Awk, Erlang, and Elixir rank low on popularity. Yet m4, Tcl, TeX, and Zig aren't registered in the top 50 at all.
What's the methodology / criteria? Only things the author is already familiar with?
Still a fun article.
I'd wish that purple would stop lending it any credibility.
Of course, for syntax highlighting this is only relevant if you want to highlight the multiplication operator differently from the dereferencing operator, or declarations differently from expressions.
More generally, however, I find it useful to highlight (say) types differently from variables or functions, which in some (most?) popular languages requires full parsing and symbol table information. Some IDEs therefore implement two levels of syntax highlighting, a basic one that only requires lexical information, and an extended one that kicks in when full grammar and type information becomes available.
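A toy Python illustration of why lexical information alone can't settle this: the same token sequence `a * b;` is a pointer declaration if `a` names a type and a multiplication otherwise. The function `classify_statement` and its `type_names` argument are made up for this sketch, which handles only that one statement shape.

```python
import re


def classify_statement(stmt, type_names):
    """Classify a statement of the form 'X * Y;' using a symbol table (toy sketch)."""
    m = re.fullmatch(r"\s*(\w+)\s*\*\s*(\w+)\s*;\s*", stmt)
    if m is None:
        raise ValueError("only the 'X * Y;' shape is handled in this sketch")
    left, right = m.groups()
    if left in type_names:
        # 'a' is a typedef name: this declares 'b' as a pointer to 'a'.
        return ("declaration", f"{right} is a pointer to {left}")
    # Otherwise both are ordinary identifiers and '*' is multiplication.
    return ("expression", f"{left} multiplied by {right}")


print(classify_statement("a * b;", type_names={"a"}))
print(classify_statement("a * b;", type_names=set()))
```

The first call reports a declaration and the second an expression, even though the text being classified is identical. That is the extra information the second, slower highlighting level has to wait for.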
I would expect her to know C quite well, and that's probably an understatement.
Can you mention one editor which does that?
/* given foo.c */
#include <stdio.h>
int main() {
    int a, *b;
    a = 5 * 10;
    b = &a;
    printf("a is %d\n", *b);
}
and then M-x c-ts-mode followed by navigating to each * and invoking M-x treesit-inspect-node-at-point in turn produces, respectively:
(declaration declarator: (pointer_declarator "*"))
right: (binary_expression operator: "*")
arguments: (argument_list (pointer_expression operator: "*"))
1: https://www.emacswiki.org/emacs/Tree-sitter
return (A)*(B);
which depends on A being a type or a variable.
Even parsing is not enough, as it’s possible to redefine what each character does. You can make it do things like “and now K means { and C means }”.
Yes, you can find papers on arXiv that use this god-forsaken feature.
Among the niceties listed here, the one I'd wish for Julia to have would be C#'s "However many quotes you put on the lefthand side, that's what'll be used to terminate the string at the other end". Documentation that talks about quoting would be so much easier to read (in source form) with something like that.
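The "same number of quotes on both ends" rule is simple enough to sketch in Python. This is only a rough model of the idea, not C#'s actual rules: the real thing requires at least three quotes and has extra rules for multi-line literals, which this made-up `read_raw_string` helper ignores.

```python
def read_raw_string(src, start=0):
    """Read a literal opened by a run of N double quotes and closed by the same run.

    Returns (content, index just past the closing quotes). Toy sketch only.
    """
    n = 0
    while start + n < len(src) and src[start + n] == '"':
        n += 1
    if n == 0:
        raise ValueError("expected an opening quote")
    closer = '"' * n
    body_start = start + n
    end = src.find(closer, body_start)   # shorter runs of quotes are plain content
    if end == -1:
        raise ValueError("unterminated string")
    return src[body_start:end], end + n


content, stop = read_raw_string('"""He said "hi" and ""left"".""" rest')
print(content)   # He said "hi" and ""left"".
```

Because any run of fewer than N quotes is ordinary content, documentation about quoting can show quote characters without any escaping.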
#= this one
has a #= nested
comment =# inside of it
and that works fine! =#
It's documented, but you need $250 to spare: https://www.iso.org/standard/59579.html
Also, while doing some digging I found there actually are a number of the standards that are legitimately publicly available https://standards.iso.org/ittf/PubliclyAvailableStandards/in...
The Ruby everyone uses is much better defined by RubySpec etc. via test cases, but that's not complete either.
Maybe I am in favor of the death penalty after all
Fast, but tcc *compiles* C to binary code at 29 MB/s on a really old computer: https://bellard.org/tcc/#speed Should be possible to go much faster but probably not needed
I am surprised Smalltalk and Prolog are in there though.
Edit: same question with answer here https://news.ycombinator.com/item?id=42026554