The $5000 Compression Challenge (2001)
95 points by ekiauhce a day ago | 67 comments
  • andrei-akopian a day ago |
    I thought he would try to net a profit through probability by requesting that the file be regenerated, hoping to get a compressible one in fewer than 5000/100 = 50 tries. (Though I don't think that would work.)
  • its-summertime a day ago |
    > I think that I did not make a mistake except to believe that you truly intended to attempt to compress the data I was sending you.

    Weird phrase to hear from a digital carnival scam artist.

  • Supermancho a day ago |
    > It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

    It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.

    • recursive 21 hours ago |
      The act of offering a cash prize to prove you wrong kind of makes you look like a jerk. It's assumed that you will use every loophole at your disposal. Seeing the situation reversed so elegantly makes me root for the challenger.
    • ncruces 21 hours ago |
      Who's the bigger troll?

      Putting up the initial challenge, especially with the accompanying commentary, is just as much trolling as that.

    • stavros 21 hours ago |
      As Patrick said, if he had gotten a 1000-byte file and sent back a file and decompressor totalling 999 bytes, nobody would have counted the inode data as part of the solution. Why count it now?
      • Supermancho 21 hours ago |
        Because the solution specifically was for 1 file submitted, 2 files back (compressed file and decompressor). This topic is the next-level trolling that went on. The summary and the FAQ explained why these file metagames aren't useful: they obscure the lack of a compression strategy (e.g. if the inode equivalent were 2 ft of tape) by dispersing information into the filesystem.
        • stavros 21 hours ago |
          > Because the solution specifically was for 1 file submitted, 2 files back (compressed and decompressor).

          Patrick:

          I meant can I send you a decompressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file

          Mike:

          Sure.

    • Timwi 19 hours ago |
      I do agree that this last sentence (the one you quoted) is very unfortunate and doesn't match the wit of the rest. A better closing remark might have been something along the lines of, “I never claimed to have performed any compression. I only claimed to have met the terms of the challenge. I asked if I could send multiple files and the response was ‘sure’. I provided the files; they meet the total file size criterion; and the script correctly reconstructs the original files.”
  • fxtentacle a day ago |
    This guy clearly failed because he didn't actually do any compression; he just abused the filesystem to store parts of the data and then tried to argue that metadata was not data...

    But FYI someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml

    • recursive 21 hours ago |
      That argument was made, and addressed in the emails. How are you defining compression? Actual compression doesn't seem to have been a requirement in the challenge.
    • maxmcd 21 hours ago |
      Is there any more information about this solution?
    • stavros 21 hours ago |
      If Mike didn't want multiple files, he should have said no to multiple files. The fact that he wanted $100 per try shows that he just got outhustled at his own hustle. He should have been more particular about the rules, and upheld the ones he himself set.
      • PaulHoule 19 hours ago |
        Should have charged $100 a file.
    • Timwi 19 hours ago |
      He clearly failed at doing any actual compression, but he did not fail at the challenge. He satisfied every single condition.

      He obviously knows that. He knows exactly what compression is and that his solution is not it. He showed (successfully) that meeting the challenge as it was worded did not require compression, which the setter of the challenge didn't realize.

    • elahieh 17 hours ago |
      I'm rather skeptical of this claim that the data was compressed in 2018 because there is no further information, apart from a hash value given.

      If it's a true claim they must have identified some "non-random" aspect of the original data, and then they could have given more info.

  • johnfn a day ago |
    This is hilarious, but Mike’s behavior makes my blood boil. Switching the rules when fairly beaten is just so scummy!

    Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)

    Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you’ll find some such that the result ends with n 0s (say), so then your result could be something like encrypted + key + n. Typically you’d expect the length of the key to be roughly the number of 0s gained, so you wouldn’t actually net any savings. But I think if you searched long enough, you might “get lucky” and find a key that generated more 0s than the length of the key, so you could theoretically win the challenge. Of course you’d have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn’t seem impossible in the way Mike seems to suggest. Perhaps it just requires an intractable amount of computing power. :)

    I’m not sure if any of that made sense. I think I’m using the wrong terms for things.
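
    Roughly, the key search from the previous paragraph might look like this (a toy sketch: a seeded PRNG keystream stands in for the encryption, and the 32-bit bookkeeping allowance is a guess):

      import random

      def keystream(key: int, n: int) -> bytes:
          rng = random.Random(key)                       # toy "cipher": XOR with a seeded PRNG stream
          return bytes(rng.randrange(256) for _ in range(n))

      def search(data: bytes, tries: int = 1 << 20):
          for key in range(tries):
              ct = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
              zeros = len(ct) - len(ct.rstrip(b"\x00"))  # trailing zero bytes we could drop
              if zeros * 8 > key.bit_length() + 32:      # the dropped zeros outweigh the key + bookkeeping
                  return key, ct[:len(ct) - zeros], zeros
          return None   # in practice: almost always this
      # to decompress: re-append the dropped zero bytes and XOR with keystream(key, ...) again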

  • permo-w a day ago |
    this sounds like a Nigerian Prince scammer set-up
  • ccleve 21 hours ago |
    I wonder if it would have been possible to win the challenge legitimately?

    If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.

    It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.

    The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.

    • dmurray 21 hours ago |
      I think you can make some argument about why this isn't possible at 50:1 odds.

      A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes less entropy than you expected, which happens with probability X where X << 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.

      This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
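
      To put a rough number on it (an upper bound on how many files are coverable, not an exact probability): every (decompressor, payload) pair whose lengths total at most D-1-k bytes can name at most one output file, and there are only about D * 256^(D-1-k) such pairs, versus 256^D equally likely D-byte files:

        D = 3_000_000                   # roughly the size of the challenge file
        for k in (0, 3, 30):            # "the pair comes in more than k bytes under the original"
            bound = D / 256 ** (k + 1)  # fraction of D-byte files that any such pair could reproduce
            print(k, bound, bound < 1 / 50)
        # k=0 gives a vacuous bound (> 1), but by k=3 it is already ~7e-4,
        # far below the 1/50 needed to break even at 50:1 odds.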

      • lambdaone 21 hours ago |
        This is actually an extremely interesting question. 'Weak' files that are more easily compressible than others certainly exist, but with low probability.

        For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.

        Is it possible to quantify this idea of a 'weak' file more accurately?

        • ccleve 21 hours ago |
          I know very little about this, but a little googling suggests that the measure you're looking for is entropy, which has a mathematical definition: https://en.wikipedia.org/wiki/Entropy_(information_theory)
        • l33t7332273 19 hours ago |
          One thing you can do, as the other commenter pointed out, is consider entropy of the file.

          However, this restriction is stronger than the challenge requires. We don’t actually need a file with low entropy; in fact I claim that weak files exist with entropy 8 - epsilon (where 8 bits per byte is the maximum) for every epsilon > 0.

          What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.

          A very simple example would be to take a very large file with maximum entropy and add 200 0's to the end. This would not decrease the entropy of the file much, but it opens the way for a compression algorithm that should be able to save ~100 bytes.
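
          Concretely, a sketch of that case (treating the 0's as zero bytes; the 200-byte run collapses to a 2-byte count):

            import os

            def compress(data: bytes):
                body = data.rstrip(b"\x00")
                return body, len(data) - len(body)      # (payload, count of trailing zero bytes)

            def decompress(body: bytes, n_zeros: int) -> bytes:
                return body + b"\x00" * n_zeros

            blob = os.urandom(1_000_000) + b"\x00" * 200   # near-maximum-entropy data + 200 zeros
            assert decompress(*compress(blob)) == blob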

          • kevinventullo 12 hours ago |
            Note that if this large chunk occurs in the middle of the file, then you will need extra space to encode that position. For example, a random bit string of length 2^n is decently likely to have a run of n zeroes. But this doesn’t help you because you need n bits just to encode where that run happens.
    • kittoes 21 hours ago |
      What if we didn't even attempt to compress the file? Theoretically, there is a random number generator + seed combination that would output the same bytes as the source file.
      • changoplatanero 21 hours ago |
        Neat idea but chances are the length of the seed is equal to the length of the original file.
        • guepe 21 hours ago |
          There is a polynomial expression that will match the file. I wonder if expressing it would be compressible to a lower size than original file? [edit] I’m wrong - not all sequences can be expressed with a polynomial.
          • l33t7332273 18 hours ago |
            This reminds me of a data compression scheme I came up with once:

            Treat an n-bit file as a polynomial over the finite field of characteristic 2, GF(2). Now, some polynomials over this field are irreducible, but many have factors of x and of (x+1). Factor the polynomial into P(x)·x^a·(x+1)^b and just collect these terms, storing only P, a, and b.
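
            A rough sketch of that scheme (bit i of the file is the coefficient of x^i; a real round-trip would also need to record the original bit length to restore leading zeros):

              def div_x_plus_1(p: int) -> int:
                  # divide the GF(2) polynomial p by (x + 1); assumes p(1) == 0, i.e. even popcount
                  q, acc = 0, 0
                  for i in range(p.bit_length() - 1, 0, -1):
                      acc ^= (p >> i) & 1        # acc == p_d ^ ... ^ p_i, which is q_{i-1}
                      q |= acc << (i - 1)
                  return q

              def factor(data: bytes):
                  # write the file's polynomial as P(x) * x**a * (x+1)**b and keep (P, a, b)
                  p = int.from_bytes(data, "big")
                  a = (p & -p).bit_length() - 1 if p else 0   # multiplicity of x = number of trailing zero bits
                  p >>= a
                  b = 0
                  while p > 1 and bin(p).count("1") % 2 == 0: # (x+1) divides p iff p(1) == 0
                      p = div_x_plus_1(p)
                      b += 1
                  return p, a, b

            For a random file, though, a and b are almost always tiny, so storing them costs about as much as they save.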

    • Retr0id 21 hours ago |
      Somewhere (discussed on HN) someone devised a "better-than-perfect" compressor. Most inputs get compressed (smaller than input), except for one input that does not. That one input is cryptographically impossible to find - or something along those lines.

      Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.

      • spencerflem 21 hours ago |
        That's how all compressors work, in that likely files (eg. ASCII, obvious patterns, etc) become smaller and unlikely files become bigger.
        • Dylan16807 19 hours ago |
          > likely files (eg. ASCII, obvious patterns, etc) become smaller

          Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.

          > unlikely files become bigger

          Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.

        • PaulHoule 19 hours ago |
          In some cases it can be certain: ASCII encoded in the usual 8 bits has fat to trim even if it is random within that space.
      • Dylan16807 20 hours ago |
        You could design the opposite, a system where ultra-rare files get compressed a lot but most don't compress.

        But as stated, what you said is not possible. To compress by a single byte, only 1/256 files can be compressible, or you'd have multiple different inputs compressing down to the same output.

  • hyperpape 21 hours ago |
    It’s amazing that this commentary on the value of prediction markets was written in 2001.
  • benchmarkist 21 hours ago |
    Take random bytes from /dev/urandom, encrypt it. There is no way any compressor can reduce the size of that file. I'm pretty sure there is no way the second player can win this game. Actually, the encryption part isn't even necessary. There is no compressor that can reduce the size of a random stream of bytes.
    • cortesoft 21 hours ago |
      There is no general purpose compressor that can achieve compression on all random streams of bytes... but there is a possibility to compress a specific random stream of bytes, if you get lucky
      • lambdaone 21 hours ago |
        Absolutely! Can this probability be quantified?
        • benchmarkist 20 hours ago |
          You'd have to define what you mean by luck which would be equivalent to choosing some pattern that your compressor can recognize in a random stream of bytes and then computing the probability that pattern appeared in some random byte sequence. It's not a winnable game in any practical sense because whatever compressor you choose the probability of getting a sequence from your opponent that will conform to the patterns recognizable by your compressor is essentially 0.
    • Retric 21 hours ago |
      But you don’t need to compress every possible file to make playing such a game a good idea.

      Suppose you win if you shave 1 bit off a sample bit stream and lose if you fail. Your compressor looks at the first 2 bits: if the stream starts with 11, the output starts with a single 1 followed by the rest; otherwise the output is a 0 followed by the entire bit stream. Now you have a 1 in 4 chance of compressing a random string by 1 bit and a 3 in 4 chance of making it longer by 1 bit.

      It’s ultimately not about compression but what odds are worth it.
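
      A minimal sketch of that gamble on bit strings:

        def encode(bits: str) -> str:
            # "11..." -> drop one of the leading 1s (one bit shorter);
            # anything else -> prepend a 0 flag (one bit longer)
            return "1" + bits[2:] if bits.startswith("11") else "0" + bits

        def decode(bits: str) -> str:
            return "11" + bits[1:] if bits[0] == "1" else bits[1:]

        assert decode(encode("110101")) == "110101"   # 6 bits -> 5 bits (the 1-in-4 win)
        assert decode(encode("010101")) == "010101"   # 6 bits -> 7 bits (the 3-in-4 loss)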

    • quuxplusone 19 hours ago |
      The encryption part might actually be harmful: for (silly) example, if you encrypt using the Bacon cipher, which turns every letter into five letters all of which are either "a" or "b". :) You'd need to pick an encryption method that doesn't expand the text at all, not even via a header or footer block. Better to stick with bytes generated uniformly at random, in which case you are correct.

      [1] - https://en.wikipedia.org/wiki/Bacon%27s_cipher

  • omoikane 21 hours ago |
    The original email thread was from 2001, and it gets posted to HN periodically:

    https://news.ycombinator.com/from?site=patrickcraig.co.uk

    For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):

    http://prize.hutter1.net/

    https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)

    • vlovich123 20 hours ago |
      I have a fundamental problem with the Hutter prize stating that intelligence is related to compression & then sponsoring a prize for lossless compression. Intelligence is related to lossy compression. Lossless is mainly a mechanistic act.
      • echoangle 20 hours ago |
        Isn’t the intelligence in lossless compression shown by the scheme you use? Applying the algorithm is the easy part; the proof of intelligence is inventing the algorithm which compresses.
        • AlotOfReading 20 hours ago |
          Hutter put up the prize as a way of getting people to approximate his AIXI algorithm, which he already considers to be the pinnacle of artificial intelligence. That's also why lossy compression isn't interesting: it'd be an approximation of an approximation and not particularly useful for Hutter's purpose.
      • _hark 20 hours ago |
        Say we have some dataset composed of D bytes.

        Next, say I find some predictive model of the data M, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.

        Then, if N + H < D, my model compresses the data.

        It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
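
        As a toy check of the N + H < D criterion with an order-0 byte model (N here is a guessed description cost; for random data like the challenge file the criterion fails, which is the point):

          import math, os
          from collections import Counter

          data = os.urandom(3_000_000)   # stand-in for the 3 MB challenge file
          D = len(data)
          N = 256 * 4                    # hypothetical model description: 4 bytes per symbol count
          H = -sum(c * math.log2(c / D) for c in Counter(data).values()) / 8   # entropy under the model, in bytes
          print(f"D={D}  N+H={N + H:,.0f}", "-> compresses" if N + H < D else "-> does not compress")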

        One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:

        test error <= train error + model complexity

        That is, we don’t expect to do any worse on new (test) data than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross-entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one expected to generalize best.

        • dooglius 19 hours ago |
          This seems to assume that there is a tractable way to encode H efficiently, but this seems very difficult given a model that is focused on understanding the content. Ex: I can easily write a program that can do basic arithmetic, but given say a bitmap scan of elementary school math materials, such a program gives me no easy way to compress that; rather something generic like PNG (that does not know or understand the material) will far outperform.
          • _hark 18 hours ago |
            Great point. This points to the related issue: what do we want to compress? Do we want to compress "the answer", here the arithmetic expression's solution, or do we want to compress the image?

            You can formalize this with rate-distortion theory, by defining a distortion function that says what your objective is. That implies a well-defined complexity relative to that distortion function.
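
            For reference, the standard rate-distortion function that formalization uses (here D is the allowed expected distortion, not the dataset size from upthread):

              R(D) = \min_{p(\hat{x} \mid x) : \mathbb{E}[d(X,\hat{X})] \le D} I(X; \hat{X})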

            Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.

            • Jerrrry 16 hours ago |

                >Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
              
              
              I would use floating-point arithmetic as the example/analogy: one trades off accuracy/precision for exactness/correctness at the extremities. Values near the more granular representations will only be able to be represented by their nearest value. If this is foreseeable, the floating-point implementation can be adjusted to change where the extremities are.

              I used the above analogy, and the following, when articulating the magnitude of near-losslessness that large LLMs have managed to attain, especially when all of humanity's corpus is compressed onto a USB flash drive; the Kolmogorov complexity captured is similar to a masterpiece like the Mona Lisa having to be described in X many brush strokes to achieve Y amount of fidelity.

  • misiek08 21 hours ago |
    I read „decompressor and compressed file”, not „fileS”. There is only one person correct here.
    • stavros 21 hours ago |
      You should read the rest of the thread.
  • moonchild 21 hours ago |
    suppose the data were generated by a csprng with 256 bits of state. then in fact there are just slightly more than 256 bits of entropy there (prng state plus the length of the generated file). only, you'd have a snail's chance in hell of actually finding that seed
    • echoangle 20 hours ago |
      If he really wanted to make sure he kept the money, he could just use a hardware RNG. I don’t know how common they were in 2001 but you could probably even generate some practically unpredictable random numbers using radio noise.
  • stavros 21 hours ago |
    Sorry, if you're trying to hustle people by charging $100 per try and you don't catch the sleight of hand in the "multiple files" question and accept it, then you were beaten at your own game, fair and square.
    • l33t7332273 19 hours ago |
      I feel like if the FAQ requires not using filename shenanigans, then the sleight of hand was illegal the whole way.
      • stavros 11 hours ago |
        He didn't use filenames, he used files, and if that were illegal, Mike shouldn't have accepted it.
  • PaulKeeble 21 hours ago |
    I would not have accepted multiple files or scripting. Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data. To really test compression/decompression the rules need to be a lot tighter. The challenge was met and beaten because the criteria were bad. It leaves open a number of dishonest approaches.

    The decompression program needs to be constrained to itself and the single file to decompress, and cannot utilise anything other than basic operating system functions for the purpose of reading and writing those files. Any bytes the program accesses otherwise should be counted as part of the total.

    It's very hard to constrain this without some form of cheating being possible, especially if you have some years to set it up.

    • echoangle 20 hours ago |
      > Scripts offers up the entire contents of the linux operating system and this could allow you to use operating system files as partial data.

      I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.

      If you’re given a file with 100 bits, there are 2^100 possibilities. If you want to compress to 99 bits, there are only 2^99 possible compressed outputs. So when generating and trying to compress every 100-bit file, you’re not going to be able to compress half of them. The external data doesn’t help you unless you get to modify it after receiving the file to compress. If you can get a patch merged into the Linux kernel, you could use the data, but if the data is fixed, you can’t reference the data with less data than the original file (for some cases at least).

      Edit: actually, thinking about it more: you could use the OS as another few bits of out-of-band information you get for free, which makes it possible again. If you have 8 different OSes to choose from, you could use their differences to compress any file. At least theoretically; in practice the 3 bits you gain probably aren’t enough to make a difference.

      • cozzyd 19 hours ago |
        My decompressor:

        curl ftp://path/to/original.dat

      • PaulKeeble 8 hours ago |
        One possibility I considered is that your script could use an "apt install" to provide out of band data, you could put the information you need into the operating system from the universe repositories/your own package server.
        • echoangle 7 hours ago |
          The computer used for the test probably doesn’t have network connectivity, otherwise you could just download the data with curl.
  • echoangle 20 hours ago |
    I didn’t quite get the method used to „compress“ the data from the article, maybe this rephrasing helps someone:

    You basically split the file every time you encounter a specific character, and your decompressor just joins all the files it finds with the character you split by. If you split at every „X“ character, which might occur 1000 times in the file, the decompressor only needs to be a small script which joins all the files and puts an „X“ between them, which is far less than 1000 bytes. The „trick“ is storing the locations of the Xs you removed in the file sizes of the individual files.
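
    In code, the trick is roughly this (not Patrick's actual script; he reportedly split on the digit 5, and the separator byte here is arbitrary):

      SEP = b"X"   # the byte chosen as the split point (any byte that occurs in the file works)

      def split(original: bytes) -> list[bytes]:
          return original.split(SEP)      # each removed SEP byte is now implied by a file boundary

      def join(pieces: list[bytes]) -> bytes:
          return SEP.join(pieces)         # the whole "decompressor" is just this join

      data = b"aaXbbXcc"
      pieces = split(data)
      assert join(pieces) == data
      assert sum(len(p) for p in pieces) < len(data)   # the "saved" bytes hide in file boundaries/metadata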

  • avmich 18 hours ago |
    I particularly like this:

    > It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.

  • teo_zero 13 hours ago |
    Mike supports $SPORT team the Compressors. He's so sure they are unbeatable that he accepts 50:1 bets against them. Patrick bets $100 that they won't win this year's championship; Mike accepts. Later, the Compressors announce financial troubles and can't pay the fee to enter the championship, which is then won by another team. Patrick claims his $5000. Mike refuses to pay, saying that the Compressors have not been beaten, and that the championship result does not disprove his claim that the Compressors are unbeatable, which was the spirit of the bet from the beginning. Mike offers to refund the $100. Patrick holds that the terms of the bet were met and he deserves the $5000.
  • nojvek an hour ago |
    I’d take this bet.

    > With this offer, you can tune your algorithm to my data.

    One can generate a 1GB file or 10GB file. It is highly likely that there is some form of a repeatable pattern in there to shave off 50-100 bytes by sliding window search.

    Then the decompressor is essentially - at this index, expand this pattern. The compressed file excludes the range of pattern.

    One may not always get such a pattern, but on multiple tries it’s very feasible. Payout just needs one win in 50 tries.

    You could generate a 100GB file. The bigger the file the higher the chance of repeating pattern.

    The challenge is won if compressed_file + decompressor is at least one byte smaller than the original file.

    One could have a self executing decompressor to save some file overhead bytes.
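
    A sketch of that decompressor idea (drop one copy of a duplicated, non-overlapping WINDOW-byte chunk and record where to re-insert it; whether a duplicate long enough to pay for the header ever shows up in truly random data is the crux of the bet):

      WINDOW = 64   # hypothetical pattern length

      def compress(data: bytes):
          seen = {}
          for i in range(len(data) - WINDOW + 1):
              chunk = data[i:i + WINDOW]
              if chunk in seen and seen[chunk] + WINDOW <= i:        # non-overlapping repeat found
                  src, dst = seen[chunk], i
                  return src, dst, data[:dst] + data[dst + WINDOW:]  # header + shortened body
              seen.setdefault(chunk, i)
          return None   # no repeat long enough: this particular file can't be shortened this way

      def decompress(src: int, dst: int, body: bytes) -> bytes:
          return body[:dst] + body[src:src + WINDOW] + body[dst:]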

    • Jerrrry an hour ago |
      good compression = pattern eventually

      maximum compression = indistinguishable from random data

  • dang 29 minutes ago |
    Related. Others?

    The $5000 Compression Challenge - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)

    The $5000 Compression Challenge - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)

    The $5000 Compression Challenge - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)