knowledge-database (beta)

Current group: comp.compression

Re: too much information!

Re: too much information!  
matmahoney at yahoo.com
From:matmahoney at yahoo.com
Subject:Re: too much information!
Date:12 Jan 2005 09:32:57 -0800

Tim Arheit wrote:
> On Tue, 11 Jan 2005 19:17:05 -0500, David Bernier
> wrote:
>
> >John Baez wrote:
> >[...]
> >
> >>>Are you still assuming there's 5 bits/char?
> >>
> >>
> >> No, since the data was decimal digits, I was assuming
> >> ln(10)/ln(2) or about 3.3 bits per character.
> >>
> >> But maybe you've solved the puzzle: maybe whatever data
> >> compression Bernier used to generate his .zip file
> >> was not smart enough to take full advantage of the
> >> fact that the data was just numbers! Seems dumb, but...
> >>
> >> At 5 bits/char we'd get 5 million bits or 625,000 bytes;
> >> then some not-terribly-smart program might compress that
> >> down to 484,760 bytes and feel proud of itself.
> >[...]
> >
> >Using WinRK 2.0, the 1,000,000-byte file
> >of 1,000,000 digits is compressed to a file
> >with an amazing 419,732 bytes!
>
> Not surprising at all given that the million digits are all
characters
> 0-9. (ie., lots of wasted bits to make it human readable).
>
> In binary form the original million digits file is only 415,251
> bytes. It can be downloaded from
>
http://www.datacompression.info/Miscellaneous/AMillionRandomDigits.bin
> (at least for now).
>
> -Tim


That should be 415,241 bytes. Anyway there is a long standing
challenge to compress this file. The compressed size has to include
the size of the decompression program, but this may packed into an
archive along with the compressed file, along the lines of the rules of
the Calgary corpus challenge.
http://mailcom.com/challenge/ (but without the prize money).

So far we know that the sum of every 50th digit is even (50 bits of
redundancy) and there is a slight bias in the distribution of pairs of
digits spaced 50 or 100 places apart. This bias is very small, perhaps
in the tens of bits, not enough to write a decompressor. I am pretty
sure that there are larger biases, enough to meet the challenge,
waiting to be discovered.
   

Copyright © 2006 knowledge-database   -   All rights reserved