 | Tim Arheit wrote: > On Tue, 11 Jan 2005 19:17:05 -0500, David Bernier > wrote: > > >John Baez wrote: > >[...] > > > >>>Are you still assuming there's 5 bits/char? > >> > >> > >> No, since the data was decimal digits, I was assuming > >> ln(10)/ln(2) or about 3.3 bits per character. > >> > >> But maybe you've solved the puzzle: maybe whatever data > >> compression Bernier used to generate his .zip file > >> was not smart enough to take full advantage of the > >> fact that the data was just numbers! Seems dumb, but... > >> > >> At 5 bits/char we'd get 5 million bits or 625,000 bytes; > >> then some not-terribly-smart program might compress that > >> down to 484,760 bytes and feel proud of itself. > >[...] > > > >Using WinRK 2.0, the 1,000,000-byte file > >of 1,000,000 digits is compressed to a file > >with an amazing 419,732 bytes! > > Not surprising at all given that the million digits are all characters > 0-9. (ie., lots of wasted bits to make it human readable). > > In binary form the original million digits file is only 415,251 > bytes. It can be downloaded from > http://www.datacompression.info/Miscellaneous/AMillionRandomDigits.bin > (at least for now). > > -Tim
That should be 415,241 bytes. Anyway there is a long standing challenge to compress this file. The compressed size has to include the size of the decompression program, but this may packed into an archive along with the compressed file, along the lines of the rules of the Calgary corpus challenge. http://mailcom.com/challenge/ (but without the prize money).
So far we know that the sum of every 50th digit is even (50 bits of redundancy) and there is a slight bias in the distribution of pairs of digits spaced 50 or 100 places apart. This bias is very small, perhaps in the tens of bits, not enough to write a decompressor. I am pretty sure that there are larger biases, enough to meet the challenge, waiting to be discovered.
|
|