Weird phrase to hear from a digital carnival scam artist.
It's not interesting to see people play word games with each other in the HN comments, so why is it interesting to see this kind of disingenuous behavior? Trolls have always existed. I don't get what's interesting about it.
Putting up the initial challenge, especially with the accompanying commentary, is just as much trolling as that.
Patrick:
I meant can I send you a decompressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file
Mike:
Sure.
But FYI someone else actually managed to compress that exact same data: https://jasp.net/tjnn/2018/1.xhtml
He obviously knows that. He knows exactly what compression is and that his solution is not it. He showed (successfully) that meeting the challenge as it was worded did not require compression, which the setter of the challenge didn't realize.
If it's a true claim, they must have identified some "non-random" aspect of the original data, in which case they could have given more info.
Of course Patrick’s solution occurred to me immediately once he started talking about splitting up the file. :)
Then I got to wondering if it would be possible to compress a large enough file and save a few bytes, in the spirit of what Mike actually wanted. For instance, if you tried encrypting with a bunch of random keys, you'll find some such that the result ends with n 0s (say), so your result could be something like encrypted + key + n. Typically you'd expect the length of the key to be roughly the number of 0s you gain, so you wouldn't actually net any savings. But I think if you searched long enough, you might "get lucky" and find a key that generated more 0s than the length of the key, so you could theoretically win the challenge. Of course you'd have to somehow embed a program which could decrypt and reassemble everything in a tiny amount of space, but it still doesn't seem impossible in the way Mike suggests. Perhaps it just requires an intractable amount of computing power. :)
I’m not sure if any of that made sense. I think I’m using the wrong terms for things.
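Roughly, the idea above could be sketched like this (a toy, assuming a hash-based keystream stands in for "encryption" and counting only trailing zero bytes; the function names are made up for illustration):

  import hashlib

  def keystream(key: int, n: int) -> bytes:
      # deterministic pseudo-random bytes derived from the key (stand-in for a cipher)
      out, counter = b"", 0
      while len(out) < n:
          out += hashlib.sha256(key.to_bytes(8, "big") + counter.to_bytes(8, "big")).digest()
          counter += 1
      return out[:n]

  def trailing_zero_bytes(b: bytes) -> int:
      count = 0
      for byte in reversed(b):
          if byte != 0:
              break
          count += 1
      return count

  def search_for_lucky_key(data: bytes, tries: int = 1_000_000):
      # you only "win" if the zeros gained exceed the 8 bytes spent storing the key
      best = (None, float("-inf"))
      for key in range(tries):
          enc = bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))
          saved = trailing_zero_bytes(enc) - 8
          if saved > best[1]:
              best = (key, saved)
      return best  # a positive "saved" would be a net win; on average the zeros gained match the key bytes spent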
If a randomly-generated file happened to contain some redundancy through sheer chance, you could hand-craft a compressor to take advantage of it. This compressor would not work in general for random data, but it could work for this one particular case.
It's a bet worth taking, because the payoff, 50:1 ($5,000 to $100), is pretty good. Play the game 50 times and you might get a file you could compress.
The challenge, then, would be for the person offering the bet to generate a really random file that contained no such redundancy. That might not be easy.
A plausible "decompressor" is at least, say, 30 or 100 bytes, so the random file needs to have 30 bytes (240 bits) less entropy than you expected, which happens with probability on the order of 2^-240, vastly smaller than 1/50. Sum over the whole domain of reasonable decompressors, and you still don't get there.
This argument could do with more rigor, but I think it's correct. Give me 100 million to 1 odds, though, and I'll take my chances trying to brute force a compressor.
For example, the all-zeros file is a member of the set of all random 3 megabyte files, and it would certainly be possible to compress that, if by great good fortune you were lucky enough to receive it - albeit something that is unlikely to ever happen in the possible lifetime of the universe, given current physical theories.
Is it possible to quantify this idea of a 'weak' file more accurately?
However, this restriction is too strong for the purposes of this challenge. We don't actually need a file with low entropy; in fact, I claim that weak files exist with entropy 8 - epsilon bits per byte (8 being the maximum) for every epsilon > 0.
What we actually need is a sufficiently large chunk in a file to have low entropy. The largeness is in absolute terms, not relative terms.
A very simple example would be taking a very large file with maximum entropy and adding 200 0s to the end. This would not decrease the entropy of the file much, but it opens the way to a compression algorithm that should be able to save ~100 bytes.
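A minimal sketch of such a hand-crafted compressor, assuming the 200 trailing 0s are zero bytes (the function names are illustrative, not from the thread):

  def compress_trailing_zeros(data: bytes) -> bytes:
      stripped = data.rstrip(b"\x00")
      run = len(data) - len(stripped)
      return stripped + run.to_bytes(4, "big")  # 4-byte footer records the run length

  def decompress_trailing_zeros(blob: bytes) -> bytes:
      run = int.from_bytes(blob[-4:], "big")
      return blob[:-4] + b"\x00" * run

  # 200 trailing zero bytes become a 4-byte footer: a net saving of 196 bytes,
  # minus whatever the decompressor itself costs under the challenge's accounting.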
Treat an n-bit file as a polynomial over GF(2), the field with two elements. Now, there are some irreducible polynomials in this field, but many polynomials have factors of x and of (x+1). Factor the polynomial as P(x)·x^j·(x+1)^k and just collect these terms, storing only P, j, and k.
Unfortunately I can't find the article I'm describing here, maybe someone else can? It was a long time ago so I might be misrepresenting it slightly.
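If I'm reading the idea right, a rough sketch might look like this (the file's bits are read as the coefficient vector of a GF(2) polynomial packed into a Python int; this is entirely my reconstruction, not the original article's code):

  def gf2_divmod(p: int, d: int):
      # carry-less long division over GF(2); bits of p and d are polynomial coefficients
      q, r = 0, p
      while r and r.bit_length() >= d.bit_length():
          shift = r.bit_length() - d.bit_length()
          q ^= 1 << shift
          r ^= d << shift
      return q, r

  def factor_x_and_x_plus_1(poly: int):
      # assumes poly != 0; factors of x are trailing zero coefficients
      j = (poly & -poly).bit_length() - 1
      poly >>= j
      # poly is divisible by (x+1) exactly when it has an even number of 1-bits
      k = 0
      while poly > 1 and bin(poly).count("1") % 2 == 0:
          poly, _ = gf2_divmod(poly, 0b11)  # 0b11 encodes x + 1
          k += 1
      return poly, j, k  # poly is now P(x); store P, j, and k

  # Example: x^5 + x^3 = x^3 * (x+1)^2, encoded as 0b101000
  # factor_x_and_x_plus_1(0b101000) == (1, 3, 2)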
Likely files for a real human workload are like that, but if "most inputs" is talking about the set of all possible files, then that's a whole different ball game and "most inputs" will compress very badly.
> unlikely files become bigger
Yes, but when compressors can't find useful patterns they generally only increase size by a small fraction of a percent. There aren't files that get significantly bigger.
But as stated, what you said is not possible. To compress by a single byte, at most 1/256 of files can be compressible; otherwise you'd have multiple different inputs compressing down to the same output.
Suppose you win if you shave 1 bit off a sample bit stream and lose otherwise. Your compressor looks at the first 2 bits: if the stream starts with 11, output a 1 followed by the rest of the stream; otherwise output a 0 followed by the whole stream. Now you have a 1 in 4 chance of compressing a random string by 1 bit and a 3 in 4 chance of making it longer by 1 bit.
It’s ultimately not about compression but what odds are worth it.
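That scheme in a few lines, on strings of "0"/"1" characters (just a sketch of the coin-flip framing above):

  def encode(bits: str) -> str:
      if bits.startswith("11"):
          return "1" + bits[2:]   # drops a bit: 1-in-4 chance, saves 1 bit
      return "0" + bits           # 3-in-4 chance, costs 1 bit

  def decode(bits: str) -> str:
      return "11" + bits[1:] if bits[0] == "1" else bits[1:]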
https://news.ycombinator.com/from?site=patrickcraig.co.uk
For another compression challenge that is still ongoing, try "500000€ Prize for Compressing Human Knowledge" (also known as "Hutter Prize"):
https://news.ycombinator.com/item?id=37502329 - Hutter Prize for compressing human knowledge (2023-09-13, 215 comments)
Say the dataset is D bytes. Next, say I find some predictive model M of the data, where M is composed of N bytes. Furthermore, let us say that the entropy of the dataset under the model is H bytes.
Then, if N + H < D, my model compresses the data.
It doesn't matter if the model is deterministic or probabilistic: a probabilistic model can be used to (losslessly) compress a dataset with entropy coding.
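A hedged sketch of that criterion, with H computed as the ideal code length under the model (an arithmetic coder gets within a couple of bits of this total); the helper names are mine, not from the thread:

  import math
  from collections import Counter

  def entropy_bytes_under_model(data: bytes, prob) -> float:
      # ideal total code length: sum of -log2 p(symbol) over the data, converted to bytes
      return sum(-math.log2(prob(b)) for b in data) / 8

  def model_compresses(data: bytes, prob, model_size_bytes: int) -> bool:
      D = len(data)
      H = entropy_bytes_under_model(data, prob)
      N = model_size_bytes
      return N + H < D  # the two-part code beats storing the raw data

  def make_byte_frequency_model(data: bytes):
      # toy probabilistic model: empirical byte frequencies
      counts, total = Counter(data), len(data)
      return lambda b: counts[b] / total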
One more argument for compression being equivalent to intelligence: Across many fields of statistical machine learning, there are generalization bounds which have the form:
test error <= train error + model complexity
That is, we don't expect to do any worse on new (test) data, than the sum of the train error and the model complexity (smallest compressed size). Notice that if you interpret the train error as the entropy of the data (i.e. error under a cross entropy loss), then the models which satisfy the statistical generalization bound correspond to those which best compress the data. In other words: the model which produces the shortest description of the data is the one which is expected to generalize best.
You can formalize this with rate--distortion theory, by defining a distortion function that says what your objective is. That implies a well-defined complexity relative to that distortion function.
Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
>Okay to be clear, I've written a paper on exactly this topic which will be announced in a week or so. So you won't find anything on the subject yet, haha. But I use almost exactly this example.
I would use floating-point arithmetic as the example/analogy: one trades off accuracy/precision for exactness/correctness at the extremities. Values near the more granular representations can only be represented by their nearest value. If this is foreseeable, the floating-point implementation can be adjusted to change where those "extremities" are. I used the above analogy, and the following one, when articulating the magnitude of near-losslessness that large LLMs have managed to attain, especially when all of humanity's corpus is compressed onto a USB flash drive; the Kolmogorov complexity captured is similar to a masterpiece like the Mona Lisa having to be described in X many brush strokes to achieve Y amount of fidelity.
The decompression program needs to be constrained to itself and the single file it decompresses, and cannot use anything other than basic operating system functions for reading and writing those files. Any other bytes the program accesses should be counted as part of the total.
It's very hard to constrain this without some form of cheating being possible, especially if you have some years to set it up.
I also thought about this, but when thinking more about the argument by the challenge creator, I understood that the data in the OS doesn’t help you.
If you're given a file with 100 bits, there are 2^100 possibilities. If you want to compress to 99 bits, there are only 2^99 possible compressed outputs. So when generating and trying to compress every 100-bit file, you're not going to be able to compress at least half of them. The external data doesn't help you unless you get to modify it after receiving the file to compress. If you can get a patch merged into the Linux kernel, you could use that data, but if the data is fixed, you can't reference it with less data than the original file (for some cases, at least).
Edit: actually, thinking about it more: you could use the choice of OS as another few bits of out-of-band information you get for free, which makes it possible again. If you have 8 different OSes to choose from, the choice itself carries log2(8) = 3 bits you could exploit. At least theoretically; in practice the 3 bits you gain probably aren't enough to make a difference.
curl ftp://path/to/original.dat
You basically split the file every time you encounter a specific character, and your decompressor just joins all the files it finds, inserting the character you split on between them. If you split at every "X" character, which might occur 1000 times in the file, the decompressor only needs a small script which joins all the files and puts an "X" between them, which is less than 1000 bytes. The "trick" is that the locations of the Xs you removed are stored in the file sizes of the individual files.
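A minimal sketch of that split trick (the piece file names are made up), splitting on the byte "X"; no information is destroyed, it just migrates into the file boundaries, which cost nothing under the challenge's accounting:

  SEP = b"X"

  def split(original_path: str) -> int:
      data = open(original_path, "rb").read()
      pieces = data.split(SEP)              # every separator byte is deleted here
      for i, piece in enumerate(pieces):
          with open(f"piece.{i}", "wb") as f:
              f.write(piece)
      return len(pieces)

  def join(n_pieces: int, output_path: str) -> None:
      pieces = [open(f"piece.{i}", "rb").read() for i in range(n_pieces)]
      with open(output_path, "wb") as f:
          f.write(SEP.join(pieces))         # every deleted "X" comes back between pieces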
> It's not my fault that a file system uses up more space storing the same amount of data in two files rather than a single file.
> With this offer, you can tune your algorithm to my data.
One can generate a 1GB or 10GB file. It is highly likely that there is some form of repeated pattern in there that lets you shave off 50-100 bytes with a sliding-window search.
Then the decompressor is essentially: at this index, expand this pattern. The compressed file excludes the range covered by the pattern.
One may not always get such a pattern, but over multiple tries it's very feasible. The payout just needs one win in 50 tries.
You could generate a 100GB file. The bigger the file, the higher the chance of a repeating pattern.
The challenge is won if compressed_file + decompressor is at least one byte smaller than the original file.
One could use a self-extracting decompressor to save some file-overhead bytes.
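A rough sketch of that sliding-window search, assuming a fixed window size worth referencing (naive and memory-hungry, just to show the shape of the idea; the window length is an assumed parameter):

  def find_repeat(data: bytes, window: int = 64):
      # returns (first_offset, second_offset) of any window-sized chunk seen twice, else None
      seen = {}
      for i in range(len(data) - window + 1):
          chunk = data[i:i + window]
          if chunk in seen:
              return seen[chunk], i
          seen[chunk] = i
      return None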
maximum compression = indistinguishable from random data
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=9163782 - March 2015 (168 comments)
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=5025211 - Jan 2013 (61 comments)
The $5000 Compression Challenge - https://news.ycombinator.com/item?id=4616704 - Oct 2012 (119 comments)