Unix core utilities implemented in Haskell
243 points by ocean_moist 8 days ago | 95 comments
  • anardil 4 days ago |
    Wow, hello! This is my repository. I'm happy to answer any questions.
    • aeonik 4 days ago |
      You specify "fast"; can you elaborate on the performance of the collection? How does it compare to the standard core utils?

      Great work, looks amazing.

      • anardil 4 days ago |
        Performance (execution, memory) is generally in the same ballpark as the BSD versions, with some caveats specific to utils that do lots of in place data manipulation.

        cut comes to mind as an example; slicing and dicing lines into fields quickly without a ton of copies isn't easy. Using Streaming.ByteString generally makes a huge difference, but it's extremely difficult to use unless you can get your mind to meld with the types it wants. Picking it up again months later takes some serious effort.

    • faragon 4 days ago |
      Very beautiful implementation of the awk interpreter in less than 600 lines!

      https://github.com/Gandalf-/coreutils/blob/master/Coreutils/...

      • anardil 4 days ago |
        Thank you! This is one of my favorites. User declared variables are next on the todo list, when I get back to it.
        • OskarS 4 days ago |
          It is really gorgeously written Haskell. I’ve only dabbled in Haskell, but you’re really whetting my appetite for digging in deeper.
    • cosmic_quanta 4 days ago |
      Could you speak to the advantages of Haskell's lazy IO? I usually only hear about its disadvantages.
      • habitue 4 days ago |
        I imagine for streaming tools like these it's pretty convenient. You don't have to manage buffers etc.; just write code against a massive string and Haskell takes care of streaming it for you, pulling in more data when needed.

        There are libraries that handle it, but they tend to have weird types; with lazy IO you can just use functions from the Prelude to write a lot of these basic utilities.
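
        For example, a cat -n style filter needs nothing beyond the Prelude; `interact` hands you stdin as one lazy String (a minimal sketch, not code from the repo):

            main :: IO ()
            main = interact numberLines
              where
                -- lines/unlines treat the input as one big string; laziness streams it
                numberLines = unlines . zipWith label [1 :: Int ..] . lines
                label n l   = show n ++ "\t" ++ l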

        • jerf 3 days ago |
          Unfortunately, while that may be the dream, it doesn't work out that way if you want good performance. If you look at the source you'll see that it uses things like https://hackage.haskell.org/package/streaming-bytestring-0.3... a lot.

          For one thing, a "string" in Haskell by default is a linked list of unicode characters, so right out of the gate you've got big performance problems if you want to use strings. The exact way laziness is done has serious performance consequences as well; when dealing with things as small as individual characters, all the overhead looms large on a percentage basis. One of the major purposes of the several variants of ByteString is to bundle the bytes together, but that means you're back to dealing with chunks. Haskell does end up with a nice API that can abstract over the chunks, but it still means you sometimes have to deal with chunks as chunks; if you turn them back into a normal Haskell "string" you lose all the performance advantages.
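
          Roughly (and simplified), the two representations being contrasted look like this; a paraphrase, not the exact library source:

              type String = [Char]   -- Prelude: one cons cell and one boxed Char per character

              -- lazy ByteString, simplified from Data.ByteString.Lazy.Internal:
              data ByteString
                  = Empty
                  | Chunk !StrictByteString ByteString   -- a lazy list of strict, packed chunks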

          It can still come out fairly nice, but if you want performance it is definitely not just a matter of opening a file and pretending you've just got one big lazy string and you can just ignore all the details; some of the details still poke out.

          • habitue 3 days ago |
            I mean, I'm aware of the downsides; the OP asked why someone might use it. Ease of use seems like a reasonable upside.
      • anardil 4 days ago |
        It definitely has some sharp edges. One advantage is skipping computations (and the IO they'd need) that don't end up getting used, which lets you do some clever-looking things / ignore some details. That's hard to take advantage of in practice, I think.

        The other advantage is just deferring IO. For instance in split or tee, you could decide that you need 500 output files and open all the handles together in order to pass them to another function that will consume them. I'd squint at someone who wrote `void process_fds(int fds[500]);`, but here it doesn't matter.
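
        Something like this, roughly (a hypothetical sketch, not the repo's tee): every output handle is opened up front and handed to the consuming code, and the lazily-read input only gets pulled in as the writes demand it:

            import System.IO
            import qualified Data.ByteString.Lazy as BL

            teeTo :: [FilePath] -> IO ()
            teeTo paths = do
                handles  <- mapM (\p -> openFile p WriteMode) paths  -- open everything together
                contents <- BL.getContents                           -- lazy: nothing read yet
                mapM_ (\h -> BL.hPut h contents) handles             -- reads happen as writes demand them
                mapM_ hClose handles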

      • mrkeen 4 days ago |
        If your language doesn't give you laziness, you're reinventing it yourself with strict primitives each time.
        • vacuity 4 days ago |
          On the other hand, when you don't want laziness you really won't like it if it's present anyway.
          • weebull 4 days ago |
            Lazy to strict is reasonably easy to do though. The problem is normally that once one bit goes strict, most other things implicitly do too.

            Strict to lazy is normally a rewrite.
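
            In Haskell, for instance, the lazy-to-strict direction is often a local, one-character change (a generic sketch, nothing from the repo):

                import Data.List (foldl')

                sumLazy, sumStrict :: [Int] -> Int
                sumLazy   = foldl  (+) 0   -- lazy accumulator: builds a chain of thunks
                sumStrict = foldl' (+) 0   -- strict accumulator: constant space

                main :: IO ()
                main = print (sumStrict [1 .. 1000000])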

            • vacuity 3 days ago |
              I think the scope of lazy constructs should usually be far less than that of strict constructs, so it's only in the cases where the librarified lazy abstractions don't fit that you need a rewrite. Lazy to strict isn't hard, but I don't want the performance and cognitive overhead of lazy-by-default.
    • bts 4 days ago |
      Hi! A few years ago I found myself wanting an equivalent of `column` that didn’t strip color codes. After I implemented it in Haskell, I found it was useful to use Nix to force static linking against libraries like gmp to reduce startup time. Perhaps what I ended up doing might be helpful for you too: https://github.com/bts/columnate/blob/master/default.nix
      • anardil 4 days ago |
        Thank you for the suggestion, I'll give this a whirl! I've fussed around with `--ghc-options '-optl-static -fPIC'` and the like in years past without success.
    • Vosporos 4 days ago |
      Fantastic work, thank you so much!
    • anacrolix 3 days ago |
      LOTR fan detected
  • wizerno 4 days ago |
    The author has a few blog posts covering the tee, split, which, and cat commands [1].

    [1] https://www.anardil.net/tag/coreutils.html

  • chasil 4 days ago |
    "On Windows - Symlinking doesn't appear to change the name reported by System.Environment.getProgName, so you'll need to create copies of the binary with different names."

    I have not found this to be the case with "mklink /h".

    • zanie 4 days ago |
      That'd be a hard link though
  • eviks 4 days ago |
    > Symlinking doesn't appear to change the name reported by System.Environment.getProgName, so you'll need to create copies of the binary with different names

    Would hardlinks help (would avoid multiple copies)?
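
    (For context, a multi-call dispatch on the reported program name presumably looks something like this busybox-style sketch; the applet names and stubs are made up:)

        import System.Environment (getArgs, getProgName)

        main :: IO ()
        main = do
            name <- getProgName   -- with a copy or hard link named "cat", this reports "cat"
            args <- getArgs
            case name of
                "cat" -> runCat args
                "wc"  -> runWc args
                _     -> putStrLn ("unknown applet: " ++ name)

        -- stand-in stubs for the sketch
        runCat, runWc :: [String] -> IO ()
        runCat _ = putStrLn "cat: stub"
        runWc  _ = putStrLn "wc: stub"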

    • anardil 4 days ago |
      I'll have to give this a try, thank you for the suggestion!
  • Waterluvian 4 days ago |
    I’m curious if there are any that were especially uncomfortable to implement in Haskell.

    I’m a very very novice “functional programmer” but on occasion I find problems that feel just ridiculous to implement FP style.

    • sfn42 4 days ago |
      I'm not much of a functional programmer either, but generally in the beginning you'll feel this way because you're not used to thinking about problems in a functional way. You're used to solving problems with imperative structures. So you try to implement that in a functional language and it turns out completely stupid because the language isn't made to solve problems that way. But it usually has good ways to solve things.
      • galangalalgol 4 days ago |
        I got stuck trying to have an intuition about what would explode my memory usage due to lazy eval, and I never got a grip on the transition from pseudo-functional at the IO boundaries into real functional internally either.
      • anonzzzies 4 days ago |
        I like FP, array programming, and logic programming (and the overlap) for solving problems that are (considered, or for real; I don't usually know going in) a really bad fit for them, and then implementing them shorter and faster than the imperative version. For some things it obviously does not work, but it's a really nice exercise for when I'm in the gym, walking, cooking, driving and such; the performance- or LoC-optimising parts fit in my head, so I can do the rewrites there and find ways to make them fit. For instance, going from an imperative piece of code to array programming (k/j), you have to find a way to translate the problem into a vector/matrix manipulation problem, and that you can do while doing something else entirely. When you find a good fit, it can be many, many times faster than the original just like that.
    • anardil 4 days ago |
      Definitely. Depending on how long you've spent staring at the contents of /bin/ and /usr/bin/, you'll notice there are definitely some array- or matrix-oriented utils (or options) missing, like column.

      cut comes to mind as a difficult one. In C, you can just hop around the char buffer[] and drop nulls in place for fields, etc., before printing. You could go that way with a Data.Array Char, but that's hard to justify as functional.
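
      A naive functional version is easy enough with ByteString slices; it just builds intermediate lists instead of punching nulls into one buffer (a rough sketch, not the repo's cut):

          import qualified Data.ByteString.Char8 as C

          -- keep only the requested (1-based) tab-delimited fields of a line
          cutFields :: [Int] -> C.ByteString -> C.ByteString
          cutFields wanted line =
              C.intercalate (C.pack "\t")
                  [f | (i, f) <- zip [1 ..] (C.split '\t' line), i `elem` wanted]

          main :: IO ()
          main = C.getContents >>= mapM_ (C.putStrLn . cutFields [1, 3]) . C.lines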

      • Muromec 4 days ago |
        Shouldn't cut be the easiest one to do in functional style?

        You basically map one line of a stream to another with some filtering and joining. Am I missing the part where it's terribly slow and/or not doable in Haskell or something?

  • amelius 4 days ago |
    It would be cool if RFCs were turned into Haskell, so we could use __that__ as the specification instead.

    Same for all the remaining web standards.

    • fuzztester 4 days ago |
      executable specs?
    • johnisgood 4 days ago |
      Opinion: I would prefer OCaml or just C (instead of, say, Rust or Python).
      • bionade24 4 days ago |
        C is way too ambiguous to serve for that purpose.
        • johnisgood 4 days ago |
          Are we referring to reference implementations? I have always found it easier to understand the C code, as Rust and Python tend to obscure too many details behind high-level abstractions.

          C is simple enough to know what is going on, IMO.

          For example:

              numbers = [1, 2, 3, 4, 5]
              squared_numbers = [num * num for num in numbers]
              print(squared_numbers)
          
          vs.

              fn main() {
                  let numbers = vec![1, 2, 3, 4, 5];
                  let squared_numbers: Vec<i32> = numbers.iter().map(|&num| num * num).collect();
          
                  println!("{:?}", squared_numbers);
              }
          
          vs.

              #include <stdio.h>
          
              int main() {
                  int numbers[] = {1, 2, 3, 4, 5};
                  int squared_numbers[5];
              
                  // Squaring each number
                  for (int i = 0; i < 5; i++) {
                      squared_numbers[i] = numbers[i] * numbers[i];
                  }
              
                  // Printing the squared numbers
                  printf("[");
                  for (int i = 0; i < 5; i++) {
                      printf("%d", squared_numbers[i]);
                      if (i < 4) printf(", ");
                  }
                  printf("]\n");
          
                  return 0;
              }
          
          Python and Rust here seem concise and elegant, but they can be more difficult for the unaccustomed to follow because of what they hide. I am sure there are examples that involve even more high-level abstractions, and reference implementations typically seem to use a lot of them. In the case of this Rust code, abstractions (like iterators and closures) can also make the code more challenging to follow for those who are not accustomed to functional programming paradigms.

          What I am trying to say is that the C implementation is more straightforward, IMO.

          • yazzku 4 days ago |
            [num**2 for num in numbers]

            There's no way that C implementation is easier to read than the Python one. There's no high-level abstraction going on there and the list comprehension reads like natural language.

            I agree Rust often gets distracting with unwrapping and lifetime notation.

            • johnisgood 4 days ago |
              I know it is easier to read, and it is not difficult to figure out what is going on (in this case), but many languages do not support list comprehensions (for example), so you have to implement it in a way similar to the C version, which makes the C version more helpful.

              The example above may not highlight what I was trying to, however.

              Perhaps Ada could work even better than C? It is verbose, yes, but you will immediately know what is going on.

              My point is that, in my opinion, verbosity is better than conciseness for reference implementations.

          • SkiFire13 4 days ago |
            > as Rust and Python tend to obscure too many details behind high-level abstractions.

            C, on the other hand, forces you to handle too many low-level details for a reference implementation, IMO. Rust also isn't really good in that regard.

            In a reference implementation what needs to be made explicit is the logic that needs to be implemented, rather than details like memory management. For that I would prefer something like Datalog instead.

            • johnisgood 4 days ago |
              > what needs to be made explicit is the logic that needs to be implemented, rather than details like memory management

              I agree.

          • trealira 4 days ago |
            People who write C nowadays generally try to write it in the most "boring" and easy-to-read way possible, I think. But C can also be hard to understand if it uses a lot of side-effectful expressions and pointer arithmetic. Just look at "How Old School C Programmers Process Arguments", which looks over an excerpt from K&R [0]. To make your code slightly harder to read, while still being plausible, you could change it to use pointer arithmetic, like so:

              // Printing the squared numbers
              printf("[");
              for (int n = 5, *p = squared_numbers; --n >= 0; p++)
                  printf("%d%s", *p, n > 0 ? ", " : "]\n");
            
            
            [0]: https://www.usrsb.in/How-Old-School-C-Programmers-Process-Ar...
            • thfuran 3 days ago |
              > --n >= 0

              But n-->0 is prettier.

      • amelius 4 days ago |
        I personally think Haskell is a better choice because it is purely functional.
  • tightbookkeeper 4 days ago |
    Awesome work!

    I noticed that they are about as long and complex as the C versions. In the early C++/STL days, part of the pitch was reimplementing core utils in a much more concise way. I wonder if this is a result of focusing on that application, or a coincidence of design decisions.

  • mytec 4 days ago |
    I'm used to reading about a given C/C++ program being implemented in Rust, and was delighted to see such an effort in a functional programming language.

    I know little about functional programming languages, but I've always been interested in how languages like Ada and now Rust can help programmers write safer code. I'm curious what advantages a rewrite of a C/C++ app in an FP language provides, and also what advantages an FP language brings in comparison to a language like Rust.

  • Maro 4 days ago |
    I looked at the wc implementation; there's no way it's compatible with the GNU tools in how it handles wide characters, which is heavily tied to the idiosyncratic C implementation. I know because I once wrote a wc implementation in modern C++ for my own education, and that's where most of my time and code complexity went. Also, the C versions are devilishly fast, with all flags, with optimized code branches.

    https://bytepawn.com/tag/wc.html

    • galangalalgol 4 days ago |
      Poking at the Rust coreutils project, it seems like they're all faster than the C versions. The issue is the decades of accumulated features that someone somewhere uses but that seem safe to disregard because so few do.
      • Maro 4 days ago |
        There are millions of lines of shell scripts relying on those flags.
        • chongli 4 days ago |
          I would place the blame for that on the GNU coreutils project. At best, it’s feature creep. At worst, it’s embrace and extend on the POSIX standard.
          • erik_seaberg 3 days ago |
            POSIX doesn't forbid new flags or options. It's up to the author to read the spec and test his portability, or else willingly rely on certain distros. Some GNU tools have had strict modes as a courtesy (they used to jokingly call it POSIX_ME_HARDER).
        • prmoustache 4 days ago |
          And those scripts should check they are indeed calling the gnu coreutils version before they proceed.
          • tankenmate 3 days ago |
            in the real world this would not work; it's putting the cart before the horse.
        • wodenokoto 2 days ago |
          Would they work on BSD/MacOS? It’s not just gnu vs rust out there.
      • Arcuru 4 days ago |
        They're faster? They must have been working on performance then, maybe I should take another look at how that project has been maturing.

        When I last looked 'dd' was significantly slower, though I did make it a bit closer a while back - https://jackson.dev/post/rust-coreutils-dd/

        A lot of the Rust coreutils were written by people learning the language, and the code is often _more_ complicated than the corresponding code for the Unix utils. They don't seem to get enough experienced programmers fixing them up, or enough usage, for me to actually recommend anybody use them, but maybe that's changing.

        • mustache_kimono 3 days ago |
          > When I last looked 'dd' was significantly slower, though I did make it a bit closer a while back - https://jackson.dev/post/rust-coreutils-dd/

          I read your blog. Maybe you should take a look at some of the other utils. I worked on sort and ls and both have long been faster than their GNU equivalents.

          > They don't seem to get enough experienced programmers fixing them up or usage for me to actually recommend anybody use them, but maybe that's changing.

          The issue is obviously compatibility, and the uutils won't be broadly "usable" or stable until it hits 100% PASS/0% FAIL on the GNU test coverage, although the numbers are getting better and better. See: https://raw.githubusercontent.com/uutils/coreutils-tracking/...

          > A lot of the Rust coreutils were written by people learning the language,

          I really, really hate this way of framing the project -- which implicitly sounds like GNU is full of pros. Yes, lots of code was written by people just learning Rust and systems programming. Once it reaches parity on GNU test coverage, which it is rapidly approaching, GNU coreutils would wish it had this kind of enthusiasm behind it.

          > and the code is often _more_ complicated than the corresponding code for the Unix utils.

          I think it really depends. Much of the code is much simpler. For instance -- I think it's much easier to generically do hashing in Rust, and it shows.

          • oguz-ismail 3 days ago |
            Yeah just two more weeks
      • diath 4 days ago |
        ...and that's precisely why they are faster: if you can throw out decades of compatibility flags, you can avoid expensive branching and such. That's why such projects should be called alternatives, not replacements; it's not a replacement if it cannot be seamlessly swapped in while decades' worth of scripts continue working.
        • 7bit 4 days ago |
          When is it time to talk about alternatives replacing the current apps in favor of faster apps? Ten years of rare usage? Twenty? At some point the edge cases should be covered by the scripts, not by the app, where they only make it slower for everyone.

          I get what you're saying, but I don't care what it's called. Some things must die. E.g. Python 3 and the deprecation of Python 2 was a very controversial step, but ultimately the right choice at the right time.

          • kanbankaren 4 days ago |
            > When is it the time to talk about alternatives

            When all standards bodies and governments decide that POSIX is a hindrance which might take a few decades. And that is if they decide.

            • galangalalgol 4 days ago |
              The FBI just said everyone needs a roadmap by 2026 to eliminate C/C++ from critical infrastructure.
              • cozzyd 3 days ago |
                Perhaps by 2126, Linux will no longer have C in it, but that's probably hugely optimistic.
                • niobe 3 days ago |
                  We'll all have uploaded to the Metaverse by then, running purely on beams of light and upvotes.
              • viraptor 3 days ago |
                That document does not have the 2026 deadline. Instead, "At the same time, the authoring agencies acknowledge the commercial reality that transitioning to MSLs will involve significant investments and executive attention. Further, any such transition will take careful planning over a period of years." Also for the roadmap "agencies urge executives of software manufacturers to prioritize using MSLs in their products", not eliminate C. And a lot is about accountability: "The authoring agencies encourage software manufacturers to lead from the top by publicly naming a business executive who will personally drive the elimination of memory safety vulnerabilities from the product line."

                https://www.cisa.gov/sites/default/files/2023-12/The-Case-fo...

              • Maro 3 days ago |
                So they will not use Linux (or any other popular *NIX), nor Windows, by 2026?
          • cozzyd 4 days ago |
            Python 3 created a huge amount of unnecessary effort and made a lot of people hesitant about using python again
            • andrepd 4 days ago |
              Well, it's one of the most popular languages on the planet, so it cannot have gone that badly.
            • jhayward 4 days ago |
              Not a mainstream view.

              Python 3 is a hugely successful language and implementation, and almost no one regrets that it exists aside from a few noisy holdouts, and people who never liked any Python anywhere at any time.

              • cozzyd 4 days ago |
                I don't disagree that python3 is hugely successful, but that doesn't mean the Python 2->3 pain was necessary. Certainly many Python users started using it too recently to remember it though.
              • erik_seaberg 3 days ago |
                python3 had no compatibility mode, so everyone needed 100% of their dependencies to migrate. This was so painful that some teams abandoned their legacy python2 code and reimplemented in languages with better back compat stories.
                • katzenversteher 2 days ago |
                  Python 3 was announced way ahead of time and came with migration tools that worked pretty well. Besides character handling, most of the stuff was pretty similar at the beginning, and I never understood why apparently nobody, or only a few projects, made the switch early on.

                  That led to a chicken-and-egg situation: if you depended on libraries that did not migrate to Python 3, you were stuck on Python 2 as well.

                  I believe being nice to the community and supporting Python 2 for a long time was a mistake. They should have made a hard break and enforced the migration...

              • majormajor 3 days ago |
                I saw a solid dozen Python 2 projects leave Python entirely across a couple companies back during that timeframe.

                The Python ecosystem has been growing overall especially because of the success of things like Pandas, but a lot of backend/fullstack web app programming did move away from it and never looked back.

                (Though you might say the more interesting question is: would they have moved away to things like Node for async/perf or JVM-stuff for "maintainability of old large codebase with lots of devs" issues? Maybe? But at this point Python has added in a lot of things from those languages; but maybe if they'd been there five years earlier with a cleaner upgrade story the migrations wouldn't have happened.)

            • 7bit 3 days ago |
              Factually wrong. Python and JS are the two most used languages on GitHub.
          • Brian_K_White 3 days ago |
            python3 was practically the reason I'll never rely on it for anything.

            But it's not even just that. Even within a major version it changes so much every few months, and between different distros and platforms, that random non-packaged scripts never work when moved from the author's Fedora box to someone else's Debian box, or god forbid BSD or SCO or Windows, or to the same box a year later, due to any number of random library/module differences.

            It's great for the author and miserable for every other user.

            It's ok for writing and using today and not at all for writing to keep.

            It's ok for packaged apps where the package manager ensures that the script has exactly the environment it expects.

            It's ok for controlled environments where the gods at the top have decreed that it is just part of the environment, so, large companies with IT departments and static published environments that everyone must use the same.

            That's a lot of "it's ok"s, and that's why it exists and is used in many places, but none of that changes the fact that it's quite terrible.

            • Toorkit 3 days ago |
              Hear, hear.

              I totally understand why developers love python, but as an end user I dread seeing the .py extension.

        • wrs 4 days ago |
          No problem. Don’t branch in the middle of the algorithm; choose the appropriate optimized algorithm based on the task. That’s why we haven’t had a separate fgrep binary for decades.
      • johnisgood 4 days ago |
        Faster? I do not know. What I do know is that there have been some really crazy vulnerabilities (logic errors) in one (or some) of these utilities.
      • fire_lake 2 days ago |
        This can be solved by Nix: stay on an old version if you need it, but the rest of the world can move forward.
    • anardil 4 days ago |
      You're right! Data.ByteString.Lazy is Word8 under the covers, so wide characters are truncated. tr takes a similar short cut. Swapping to Data.Text would fix that.

      Where simplicity conflicted with compatibility, I've chosen the former so far. Targeting the BSD options and behavior is another example of that. The primary goal is to feel out the data flow for each utility, rather than dig into all the edges.
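
      The gap shows up in character counting: a Word8 view can only count bytes, and getting per-character counts right means actually decoding. A small sketch of the difference (assuming UTF-8 input; not the repo's wc):

          import qualified Data.ByteString.Lazy as BL
          import qualified Data.Text.Lazy as TL
          import qualified Data.Text.Lazy.Encoding as TLE

          main :: IO ()
          main = do
              bytes <- BL.getContents
              print (BL.length bytes)                   -- byte count, what a Word8 view gives you
              print (TL.length (TLE.decodeUtf8 bytes))  -- character count, after a real decode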

    • johnisgood 4 days ago |
      Speaking of fast: https://redman.xyz/git/turbowc/tree/turbowc.c seems to be really fast (it is not a replacement for wc, however).
      • stabbles 3 days ago |
        AVX2 is pretty much ubiquitous; the implementation above is only SSE2.

        I wrote a blog post with the main AVX2 tricks [1], which includes how to deal with repeated whitespace when counting words.

        [1] https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec...

      • codedokode 3 days ago |
        It seems to use hand-written SIMD code. Can't one write normal C code using loops so that the compiler optimizes it into SIMD? I know that GCC can convert some loops into SIMD code.
        • adgjlsfhk1 3 days ago |
          autovectorizers are surprisingly bad
    • MBCook 4 days ago |
      The readme mentions feature parity with the BSD coreutils, so I’m assuming that’s the target. Not GNU.
  • parasti 4 days ago |
    I've seen several coreutils reimplementations just using the name "coreutils" for themselves. I doubt anyone in the original project cares, but it seems in bad taste to me.
    • bqmjjx0kac 4 days ago |
      I mean, it gets the point across and they're not claiming to be "GNU coreutils". Maybe some derivative name would be better for uniqueness.
  • xolve 4 days ago |
    TIL that GitHub usernames can end with a `-`
    • anardil 4 days ago |
      The rules were different in 2014 when I made my account! It's actually quite annoying because lots of 3rd party GitHub integrations puke immediately saying I have an invalid username.
  • hello_computer 4 days ago |
    The original coreutils are very weak on wide-char formatting, item delimiting (i.e. -null / -print0), flag name consistency, discoverability, and overall input/output uniformity. Solving the edge cases rigorously would probably require minor breakages, but would be worthwhile.

    Most coreutil re-workings I’ve seen either double down on JSON output, 24-bit color, sixels, & other lipstick on the pig—without addressing any of the basic breakages we’ve become accustomed to—or they go off in an entirely different direction like nushell.

    There is definitely room for improvement, but it is a thankless job.

  • mhd 4 days ago |
    This reminds me of a similar attempt with Perl[0], done a few… erm, twenty years ago. Might be interesting to do a three-way comparison for a few well-covered cases.

    [0]: https://metacpan.org/release/CWEST/ppt-0.14

  • niobe 3 days ago |
    Read the readme; only one question: why?
  • haskman 3 days ago |
    Great work! As a nitpick, I noticed that there is some contrived indirection going on inside `Coreutils.Util`. Instead of the Utility class, they could have just used a datatype. The current way could be confusing to newcomers looking to learn from this repo (if that was the goal).
    • anardil 3 days ago |
      I'll admit it's mostly this way because I thought ExistentialQuantification sounded cool and wanted to give it a try with classes - this could definitely be tidied up.
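
      For anyone curious, the two styles under discussion look roughly like this (the names are made up, not the actual `Coreutils.Util` definitions):

          {-# LANGUAGE ExistentialQuantification #-}

          -- a class plus an existential wrapper, the ExistentialQuantification route:
          class Util a where
              run :: a -> [String] -> IO ()

          data SomeUtil = forall u. Util u => SomeUtil u

          -- versus a plain record of functions, no extension required:
          data UtilRec = UtilRec
              { utilName :: String
              , utilRun  :: [String] -> IO ()
              }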