Getting silly with C, part (void*)2

114 points by justmarc 6 hours ago | 52 comments

sylware 6 hours ago |
C syntax is already way too rich and complex.
We need a C- or µC:
No implicit cast except for literals and void* (explicit compile time/runtime casts), one loop statement (loop{}), no switch/enum/generic/_thread/typeof/etc, no integer promotion, only sized primitive types (u64 s32 f32 etc...), no anonymous code block, real compiler hard/compile time constant declaration, many operators have to go (--,++, a?b:c, etc)... and everything I am forgetting right now (the dangerous struct pack attribute...). But we need inline keywords for memory barriers, atomics for modern hardware architecture programming.
short_sells_poo 5 hours ago |
Sooo, assembly :)
poincaredisk 4 hours ago |
No. Assembly is not portable, not typed and not structured (as in structured programming).
Another use case for microC: most decompilers decompile to a custom C-like language. It pretends to be C, but in reality it is just a representation of a recovered AST that is often, but not always, valid C code. MicroC would be a better target, because it would supposedly have fewer weird edge cases.
9rx 4 hours ago |
> Assembly [...] not typed and not structured (as in structured programming).
That depends on the assembly language. Some have structure constructs, some are typed. Portability is out.
But if you accept a slightly higher abstraction, WebAssembly is portable, typed, and structured.
PittleyDunkin 4 hours ago |
> But if you accept a slightly higher abstraction, WebAssembly is portable, typed, and structured.
Does WebAssembly support AOT compilation to native binaries? I thought it was just a VM.
jcranmer 3 hours ago |
I'd argue that unsafe Rust is a better target here (although I don't know if &raw has made it into stable Rust yet, which you need for doing an address-of that doesn't have a risk of UB like & does). Rust's typesystem is less silly (there's only one multiple-types-for-same-value, namely usize), there's no implicit conversions, the core CFG primitives are a little bit richer (labeled break/continue, although those are now added in C2y), and there's a decent usable intrinsic class for common operators-not-in-C like rotations or popcount.
If your goal is a simple language to compile, well, Rust won't ever fit the bill; but as a target for a decompiler, I think the unsafe subset is worth exploring.
steveklabnik 3 hours ago |
&raw landed in the release just before this latest one.
And even before that release, there were macros to polyfill it, so you could have used those to target an earlier version.
uecker 2 hours ago |
Why would you target a language that is less portable and less stable?
jcranmer 2 hours ago |
For the specific question of decompiling, as noted in the GP comment, you're already forced to decompile to a not-quite-C language because C's semantics just aren't conducive to accurately representing the hardware semantics. Not that Rust doesn't have its own foibles, but I suspect its core semantics are easier to map to than C's semantics without having to resort as much to a not-quite-Rust target.
It's definitely something I would like to see at least explored!
uecker 2 hours ago |
I am pretty sure somebody will do this. My guess is that it will make basically no difference, but the end result is likely less useful...
uecker 3 hours ago |
I don't think any weird edge case is a problem when targeting C. You just do not produce such cases when emitting the code.
Frenchgeek 4 hours ago |
Sphinx C-- maybe?
accelbred 5 hours ago |
Does Zig fit your bill?
glouwbug 5 hours ago |
C really just needs if / else / while / and void functions. Function inputs should be in/out (const type* or type*).
bregma 4 hours ago |
So, FORTRAN IV except for the else.
mystified5016 5 hours ago |
You want assembly with some sugar.
Read up on Forth languages. It's pretty much exactly what you're after.
mananaysiempre 4 hours ago |
Forth is kind of weak dealing with value types of unknown size. For example, suppose you're writing a cross-compiler, and an address on the target machine might take one or two cells on the host machine depending on the host bitness. Now suppose you need to accept a (host-machine) cell and a target-machine address (e.g. an opcode and an immediate) and manipulate them a bit. Trivial in C, definitely possible in Forth, but supremely annoying and the elegance kind of falls apart.
butterisgood 4 hours ago |
Pre-scheme?
wongarsu 3 hours ago |
There is C0, a stripped-down version of C popular in academia [1]. Great for teaching because it's conceptually simple and easy to write a compiler for. But with a couple of additions (like sized primitive types) it might match what you are imagining
1: https://c0.cs.cmu.edu/docs/c0-reference.pdf
GranPC 5 hours ago |
Can anyone explain how the last snippet works?
a12k 5 hours ago |
Basically they execute in order of array initialization, not index order, so it outputs hello cruel world rather than cruel world hello.
andreyv 5 hours ago |
It creates a compound literal [1] of type array of int, and initializes the specified array positions using designated initializers [2] with the results of calls to puts().
Using designated initializers without the = symbol is an obsolete extension.
[1] https://gcc.gnu.org/onlinedocs/gcc/Compound-Literals.html [2] https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
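As a rough illustration of the explanation above, here is a minimal sketch in standard C (using the `=` form of designated initializers). The names `say`, `said` and `build` are made up for this example and are not from the article. Note that ISO C leaves the relative order of the initializer expressions unspecified, so the source-order output is what GCC and Clang happen to do, not a guarantee.

```c
#include <stdio.h>

static int said; /* hypothetical counter: how many initializers have run */

/* Stand-in for puts() that also returns a running call number. */
static int say(const char *s)
{
    puts(s);
    return ++said;
}

/* Builds a compound literal of type int[3] whose elements are
 * initialized by designated initializers written out of index order. */
int build(int out[3])
{
    int *p = (int[]){ [2] = say("hello"), [0] = say("cruel"), [1] = say("world") };

    for (int i = 0; i < 3; i++)
        out[i] = p[i]; /* copy out before the literal's lifetime ends */
    return said;
}
```

Calling `build` stores in each slot the call number of the initializer that filled it, which makes the evaluation order visible without relying on it.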
matheusmoreira 5 hours ago |
Note that this is GNU C, not standard C. GNU has extended the normal C language with features such as forward parameter declarations and numeric ranges in switch cases. Lots of people don't know about these things.
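For anyone who has not seen the case-range extension mentioned above, a small sketch follows. It is GNU C, accepted by GCC and Clang but not by a strictly conforming ISO C compiler; the function name `classify` is just an assumption for the example.

```c
/* GNU C extension: "low ... high" in a case label matches the whole range.
 * The spaces around "..." are recommended so the range parses correctly
 * with integer constants. */
int classify(int c)
{
    switch (c) {
    case '0' ... '9':
        return 1; /* digit */
    case 'a' ... 'z':
        return 2; /* lowercase letter (assumes a contiguous encoding such as ASCII) */
    default:
        return 0; /* anything else */
    }
}
```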
dzaima 5 hours ago |
Note that switch case ranges might be coming in C2y though.
mananaysiempre 5 hours ago |
Also forward parameter declarations, or is that proposal dead?
wahern 4 hours ago |
Basically dead. The main motivation would be to make it easier to use variably modified types in function parameters, where the (length) identifier is declared after the variably modified type, as in
> void foo(int a[m][m], int m)
Currently you can only do:
> void foo(int m, int a[m][m])
The holy grail is being able to update the prototypes of functions like snprintf to something like:
> int snprintf(char buf[bufsiz], size_t bufsiz, const char *, ...);
However, array pointer decay means that foo above is actually:
> void foo(int (*a)[m], int m)
Likewise, the snprintf example above would be little different than the current definition.
There's related syntax, like
> foo (int m, int a[static m])
But a is still just a pointer, and while it can help some static analyzers to detect mismatched buffer size arguments at the call site, the extent of the analysis is very limited as decay semantics effectively prevent tracing the propagation of buffer sizes across call chains, even statically.
There's no active proposal at the moment to make it possible to pass VM arrays (or rather, array references) directly to functions--you can only pass pointers to VM array types. That actually works (sizeof *a == sizeof (int) * m when declaring int (*a)[m] in the prototype), but the code in the function body becomes very stilted with all the syntactical dereferencing--and it's just syntactical as the same code is generated for a function parameter of `int (*a)[m]` as for `int *a` (underneath it's the same pointer value rather than an extra level of memory indirection). There are older proposals but they all lost steam because there aren't any existing implementation examples in any major production C compilers. Without that ability, the value of forward declarations is greatly diminished. Because passing VM array types to functions already requires significant refactoring, most of the WG14 felt it wasn't worth the risk of adopting GCC's syntax when everybody could (and should?) just start declaring size parameters before their respective buffer parameters in new code.
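A small sketch of the "size first" form that works today, assuming a compiler with variably modified type support (GCC and Clang both qualify). The function names `row_bytes` and `trace` are invented for illustration.

```c
#include <stddef.h>

/* m is declared before a, so it is in scope for the array declarator.
 * The parameter still decays: a really has type int (*)[m]. */
size_t row_bytes(int m, int a[m][m])
{
    /* *a is one row of m ints, so this is sizeof(int) * m at run time. */
    return sizeof *a;
}

/* Two-dimensional indexing works because the row length m is part of
 * the pointed-to type. */
long trace(int m, int a[m][m])
{
    long sum = 0;
    for (int i = 0; i < m; i++)
        sum += a[i][i];
    return sum;
}
```

Only the inner dimension survives the decay; the outer `[m]` in the parameter is documentation as far as the type system is concerned.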
Gibbon1 3 hours ago |
That's related to something I would like which is to be able to set the number of elements in an incomplete struct.
struct foo { size_t elements; int data[]; }; foo foo123 = {.elements = array_size(data), .data = {1, 2, 3}};
or struct str { size_t sz; char str[]; };
str s123 = {.sz = strlen(.str), .str = "123"};
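For comparison, this is roughly what has to be written by hand in standard C today, where the element count and the flexible array member are kept in sync manually. It is only a sketch; `foo_new` is a hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>

struct foo {
    size_t elements;
    int data[]; /* flexible array member, must be last */
};

/* Allocates a struct foo with room for n ints and copies them in.
 * Returns NULL on allocation failure; the caller frees the result. */
struct foo *foo_new(const int *src, size_t n)
{
    struct foo *f = malloc(sizeof *f + n * sizeof f->data[0]);
    if (f == NULL)
        return NULL;
    f->elements = n; /* nothing ties this to the real size of data */
    memcpy(f->data, src, n * sizeof f->data[0]);
    return f;
}
```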
uecker 3 hours ago |
Clang and GCC just got the [[counted_by()]] attribute to help protect such structs in the kernel. But yes, native syntax for this would be nice.
uecker 3 hours ago |
I hope it is not "basically" dead. I just resubmitted it at the request of several people.
And yes, for new APIs you could just change the order, but it does help also with legacy APIs. It does even when not using pointers to arrays: https://godbolt.org/z/TM5Mn95qK (I agree that new APIs should pass a pointer to a VLA).
(edited because I am agreeing with most of what you said)
mananaysiempre 3 hours ago |
> everybody could (and should?) just start declaring size parameters before their respective buffer parameters in new code
I know that was a common opinion pre-C23, but it feels like the committee trying to reshape the world to their desires (and their designs). It's a longstanding convention that C APIs accept (address, length) pairs in that order. So changing that will already get you a score of -4 on the Hard to Misuse List[1], for "Follow common convention and you'll get it wrong". (The sole old exception in the standard is the signature of main(), but that's somewhat vindicated by the fact that nobody really needs to call main(); there is a new exception in the standard in the form of Meneide's conversion APIs[2], which I seriously dislike for that reason.)
The reason I was asking is that 'uecker said it was requested at the committee draft stage for C23 by some of the national standards orgs. That's already ancient history of course, but I hoped the idea itself was still alive, specifically because I don't want to end up in the world where half of C APIs are (address, length) and half are (length, address), when the former is one of the few C conventions most everyone agrees on currently.
[1] https://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
[2] https://thephd.dev/_vendor/future_cxx/papers/C%20-%20Restart...
svilen_dobrev 4 hours ago |
hehe. similar to
How to Get Fired Using Switch Statements & Statement Expressions:
https://blog.robertelder.org/switch-statements-statement-exp...
kazinator 3 hours ago |
Without information about how identifiers are declared, you do not know how to parse this:
(A)(B);
It could be a cast of B to type A, or function A being called with argument B.
Or this (like the puts(puts) in the article):
A(B);
Could be a declaration of B as an identifier of type A, or a call to a function A with argument B.
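To make the first ambiguity concrete, here is a sketch in which the same token sequence `(A)(B)` is a cast in one function and a call in the other, depending only on how `A` was declared. All names are invented for the example.

```c
static int twice(int x)
{
    return 2 * x;
}

/* Here A names a type, so (A)(B) is a cast expression. */
int as_cast(void)
{
    typedef unsigned char A;
    int B = 321;
    return (A)(B); /* 321 converted to unsigned char: 321 - 256 = 65 */
}

/* Here A names a function pointer, so (A)(B) is a function call. */
int as_call(void)
{
    int (*A)(int) = twice;
    int B = 321;
    return (A)(B); /* calls twice(321) */
}
```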
Back in 1999 I made a small C module called "sfx" (side effects) which parses and identifies C expressions that could plausibly contain side effects. This is one of the bits provided in a small collection called Kazlib.
This can be used to make macros safer; it lets you write a #define macro that inserts an argument multiple times into the expansion. Such a macro could be unsafe if the argument has side effects. With this module, you can write the macro in such a way that it will catch the situation (albeit at run time!). It's like a valgrind for side effects in macros, so to speak.
https://git.savannah.gnu.org/cgit/kazlib.git/tree/sfx.c
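The hazard being guarded against looks something like the sketch below. This is not the sfx API, just a hypothetical macro `SQUARE` and counter that show an argument with side effects being evaluated twice.

```c
static int calls; /* hypothetical counter of side effects */

/* Has a side effect: bumps the counter and returns its new value. */
static int next_value(void)
{
    return ++calls;
}

/* The argument appears twice in the expansion, so it is evaluated twice. */
#define SQUARE(x) ((x) * (x))

int square_of_next(void)
{
    /* Expands to ((next_value()) * (next_value())): two calls, giving 1 * 2
     * in either evaluation order rather than the square of one value. */
    return SQUARE(next_value());
}

int call_count(void)
{
    return calls;
}
```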
In the sfx.c module, there is a rudimentary C expression parser which has to work in the absence of declaration info. In other words it has to make sense of an input like (A)(B).
I made it so that when the parser encounters an ambiguity, it will try parsing it both ways, using backtracking via exception handling (provided by except.c). When it hits a syntax error, it can backtrack to an earlier point and parse alternatively.
Consider (A)(A+B). When we are looking at the left part (A), that could plausibly be a cast or declaration. In recursive descent mode, we are going left to right and looking at left derivations. If we parse it as a declaration, we will hit a syntax error on the +, because there is no such operator in the declarator grammar. So we backtrack and parse it as a cast expression, and then we are good.
Hard to believe that was 26 years ago now. I think I was just on the verge of getting into Lisp.
I see the sfx.c code assumes it would never deal with negative character values, so it cheerfully uses the <ctype.h> functions without a cast to unsigned char. It's a reasonable assumption there since the inputs under the intended use case would be expressions in the user's program, stringified by the preprocessor. Funny bytes would only occur in a multi-byte string literal (e.g. UTF-8). When I review code today, this kind of potential issue immediately stands out.
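The cast in question is the usual one shown below: the <ctype.h> functions take an int that must be representable as unsigned char (or be EOF), so a plain char holding a byte above 127 has to be converted first. `count_alpha` is a made-up helper for illustration.

```c
#include <ctype.h>

/* Counts alphabetic bytes in s, safely even if char is signed and the
 * string contains bytes above 127 (for example UTF-8 sequences). */
int count_alpha(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (isalpha((unsigned char)*s)) /* cast avoids passing a negative value */
            n++;
    }
    return n;
}
```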
The same exception module is (still?) used in the Ethereal/Wireshark packet capture and analysis tool. It's used to abort "dissecting" packets that are corrupt or truncated.
utopcell 3 hours ago |
C[++] never ceases to amaze me. Just yesterday I saw this, totally valid, code snippet:
void g(); void f() { return g(); }
gpderetta 2 hours ago |
What's wrong with that?
gizmo686 2 hours ago |
In C function prototypes with no arguments are strange.
void g();
Means that g is a function which takes a not-specified number of arguments (but not a variable number of arguments).
Almost always what you want to do is
void g(void)
Which says that g takes 0 arguments.
Having said that, declaring it as g() should work fine, provided that you always invoke it with the correct arguments. If you try invoking it with the wrong arguments, then the compiler will let you, and your program may just break in fun and exciting ways.
Edit: looking closer, it looks like the intent might have been to alias f and g. But, as discussed above, it does so in a way that will break horribly if g expects any arguments.
utopcell 2 hours ago |
Well, at the risk of stating the obvious: `return expr;` is meant to return whatever expr evaluates to.
Here, g()'s return type is void, so there's no value to return and at the same time, f()'s return type is void, so return should not have an expr to begin with.
This statement is effectively equivalent to: "g(); return;".
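Spelled out, the equivalent form looks like the sketch below, written with (void) prototypes so it also sidesteps the unspecified-arguments issue. ISO C has a constraint against a return statement with an expression in a void function, so this is the spelling that compiles cleanly as both C and C++. The counter is only there to make the call observable.

```c
static int g_calls; /* hypothetical counter so the call can be observed */

void g(void)
{
    g_calls++;
}

/* Same effect as "return g();" but without returning an expression. */
void f(void)
{
    g();
    return;
}

int g_call_count(void)
{
    return g_calls;
}
```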
frizlab 2 hours ago |
so?
josephg 2 hours ago |
> so return should not have an expr to begin with.
That's one way to think about it. Another is that void is a type - which is obviously true given you can have void* pointers and functions can return void. In this example, g() returns a void expression, so that's a perfectly fine thing to return from f.
utopcell an hour ago |
void* is leaps-and-bounds different from void. It is an expression, it has storage, you can actually assign something to it. You can't have a variable of type void; "void var;" is meaningless. A function doesn't "return" void. void simply denotes that a function doesn't return anything.
int_19h an hour ago |
You can't have a variable of type void, but you can have an expression of type void - calling a function that returns void does just that (in C++, "throw" is also a void-typed expression).
Narishma 44 minutes ago |
Not the parent but I too don't see what's wrong with that.
dzaima 43 minutes ago |
That's not valid standard C; gcc and clang give a warning with '-pedantic'. It's valid C++ though.
And IMO it's quite a nice feature, useful sometimes for reducing boilerplate in early returns. It's the obvious consequence if you don't treat void as some extremely-special syntax but rather as just another type, perhaps alike an empty struct (though that's not valid C either ¯\_(ツ)_/¯) that's just implicitly returned at the end of a void-returning function, and a "return;" statement implicitly "creates" a value of void.
In fact in Rust (and probably a bunch of other languages that I'm too lazy to remember) void-returning functions are done via returning a 0-tuple.
jwilk 2 hours ago |
First part discussed on HN:
https://news.ycombinator.com/item?id=40835274 (113 comments)
psychoslave 2 hours ago |
Usually I'm more in the camp of "let's preserve everything we can as cultural heritage, yes even that awful Nazi propaganda material" and I'm confident that some distant archeologist (or current close neighbor) will be glad we did.
But as time passes, I'm more and more convinced that wiping out every piece of C that was ever produced would be one of the greatest possible gestures for the future of humanity.
I also have a theory that the universe exists so we can have some opportunities to taste chocolate. Surely in that perspective, even C can be an unfortunate but acceptable byproduct.
adonovan an hour ago |
Remember that C's contemporary languages were either inefficient (e.g. ALGOL 68, PL/1, Lisp), functionally obsolete (e.g. FORTRAN didn't have recursion or heap allocation), or even lower level (Assembly, B). C eliminated the need for assembly in programs that were low level (like OS kernels) or high performance (math, graphics, signal processing), and that was surely a huge improvement in type safety and expressiveness.
chasil an hour ago |
But how interesting would your life become if SQLite ceased to function anywhere around you?
Should this misfortune befall you, please don't get on an airplane (with me).
H8crilA an hour ago |
Too many people still don't understand C's greatest failure: the undefined behavior. Most people assume that if you write past an array then the result may be a program crash; but actually undefined behavior includes other wonderful options such as stealing all your money, encrypting all your files and extorting you for some bitcoin, or even partially destroying a nuclear isotopes processing facility. Undefined really means undefined, theoretically some demons may start flying out of your nose and it would be completely up to spec. If you think that this is justified by "performance gains" or some other nonsense then I really don't know what to tell you!
uecker an hour ago |
I don't get the C hate. The fact that terse syntax can be misused to produce unreadable code in C does not change that I usually find it more readable than more verbose syntax.
mystified5016 an hour ago |
Forward parameter declaration is an insane feature. It makes perfect sense in the context of C's other forward declarations but is just bonkers.
I can't wait to slip this into some production code to confuse the hell out of some intern in a few years
AKluge an hour ago |
An old classic: Remember, C was never intended to be taken seriously. https://www.cs.cmu.edu/~jbruce/humor/unix_hoax.html
GrantMoyer 24 minutes ago |
I should keep this link handy for when people claim C is a simple language. Even without the GNU extensions, the examples here are pretty wretched.
betimsl 2 minutes ago |
Where can I find more about this BASIC compatibility mode? Thnx