From: John Baez
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 21:06:27 +0000 (UTC)

In article , wrote:
>In article ,
> baez@galaxy.ucr.edu (John Baez) wrote:

>>In article ,
>>David Bernier wrote:

>>>In the 1950's, RAND produced a table of a million
>>>random digits, and another one with
>>>"100,000 Normal Deviates". ( link is below)
I keep on reading this as "100,000 normal deviants", which seems like a description of usenet news. Apparently the librarians thought similarly, since this book was first filed under psychology.
>>>Having downloaded the 1 million digits (line numbers included),
>>>I wrote a program to save them as a file with
>>>12,500 lines at 80 digits per line.
>>>
>>>This got compressed to a 484,760-byte *.zip file.

>>Someone else claimed that these "random" digits were
>>complete crap and could be significantly compressed; let's see
>>if your zip file reflects that.
>>
>>1 million decimal digits should be about 3.3 million bits;
>>divide by 8 and get about 410,000 bytes.
>>
>>Wait a minute - that's LESS than your supposedly compressed
>>file! What's going on? It can't be those line numbers.
>>Am I making some dumb mistake?

>Are you still assuming there's 5 bits/char?
No, since the data was decimal digits, I was assuming ln(10)/ln(2) or about 3.3 bits per character.
But maybe you've solved the puzzle: maybe whatever data compression Bernier used to generate his .zip file was not smart enough to take full advantage of the fact that the data was just numbers! Seems dumb, but...
At 5 bits/char we'd get 5 million bits or 625,000 bytes; then some not-terribly-smart program might compress that down to 484,760 bytes and feel proud of itself.
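For scale, 484,760 bytes for a million digits works out to about 3.88 bits per digit, which sits between the ln(10)/ln(2) figure and 5 bits/char. A rough way to test the "not smart enough" hypothesis is to run a DEFLATE compressor (the method most .zip tools use) over a stand-in file. The sketch below is only an illustration: it uses pseudo-random digits from Python's `random` module rather than the RAND table, leaves out line breaks and line numbers, and the function name and dictionary keys are made up for this example.

```python
import math
import random
import zlib


def digit_compression_report(n_digits=100_000, seed=0):
    """Compare DEFLATE output size with the log2(10) bound for digit text."""
    rng = random.Random(seed)
    # Assumption: pseudo-random ASCII digits stand in for the RAND table.
    text = "".join(rng.choice("0123456789") for _ in range(n_digits))
    raw = text.encode("ascii")             # one byte (8 bits) per digit
    compressed = zlib.compress(raw, 9)     # DEFLATE at maximum effort
    # Smallest whole number of bytes at log2(10) bits per digit.
    bound = math.ceil(n_digits * math.log2(10) / 8)
    return {
        "raw_bytes": len(raw),
        "deflate_bytes": len(compressed),
        "entropy_bound_bytes": bound,
    }


report = digit_compression_report()
for name, size in report.items():
    print(f"{name:>20}: {size:,}")
```

On data like this the DEFLATE size should land above the entropy bound and well below the raw size, which is the pattern the 484,760-byte .zip file shows.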
>Also the data was on cards. There's a different encoding with
>cards that has to do with two fields and one hole punch in each.
>Geography of the holes was how characters were defined.
Oh-oh - now this is getting too complicated for me to understand.
>Also 20,000 cards ain't very many cards. It's only 10 boxes.
That's irrelevant, your honor! What's at stake here is the number of bits, and whether the defendant compressed the data in an ill-advised manner.

From: jmfbahciv at aol.com
Subject: Re: too much information!
Date: Wed, 12 Jan 05 12:44:25 GMT

In article , baez@galaxy.ucr.edu (John Baez) wrote:
>In article , wrote:
>
>>In article ,
>> baez@galaxy.ucr.edu (John Baez) wrote:
>
>>>In article ,
>>>David Bernier wrote:
>
>>>>In the 1950's, RAND produced a table of a million
>>>>random digits, and another one with
>>>>"100,000 Normal Deviates". ( link is below)
>
>I keep on reading this as "100,000 normal deviants", which seems
>like a description of usenet news. Apparently the librarians thought
>similarly, since this book was first filed under psychology.
It's a good place for it. Whenever we talked about this stuff we went nuts.
>
>>>>Having downloaded the 1 million digits (line numbers included),
>>>>I wrote a program to save them as a file with
>>>>12,500 lines at 80 digits per line.
>>>>
>>>>This got compressed to a 484,760-byte *.zip file.
>
>>>Someone else claimed that these "random" digits were
>>>complete crap and could be significantly compressed; let's see
>>>if your zip file reflects that.
>>>
>>>1 million decimal digits should be about 3.3 million bits;
>>>divide by 8 and get about 410,000 bytes.
>>>
>>>Wait a minute - that's LESS than your supposedly compressed
>>>file! What's going on? It can't be those line numbers.
>>>Am I making some dumb mistake?
>
>>Are you still assuming there's 5 bits/char?
>
>No, since the data was decimal digits, I was assuming
>ln(10)/ln(2) or about 3.3 bits per character.
Which gives you four, not five, for max storage needed for _one_ decimal digit.

>
>But maybe you've solved the puzzle: maybe whatever data
>compression Bernier used to generate his .zip file
>was not smart enough to take full advantage of the
>fact that the data was just numbers! Seems dumb, but...
Not really. It's a common error for those who don't breathe binary or octal. If the guy only thought in hex, he'd have other problems. There are lots of "formulas" for extracting characters out of packed bits. RADIX-50 was commonly used. This repeated reference smells of Bardot.. I just felt my brain spaz so this word is incorrect.

>
>At 5 bits/char we'd get 5 million bits or 625,000 bytes;
>then some not-terribly-smart program might compress that
>down to 484,760 bytes and feel proud of itself.
And, if I could write a 2-line FORTRAN program that generated those 5 million bits, I'd have "compressed" the data down to a maximum of 144 characters. I won't complicate your thinking with mumblings about how many extra characters it takes to store these 5 million numbers on media based on electricity.
>
>>Also the data was on cards. There's a different encoding with
>>cards that has to do with two fields and one hole punch in each.
>>Geography of the holes was how characters were defined.
>
>Oh-oh - now this is getting too complicated for me to understand.
Yep. That's how the conversation usually goes as soon as we have to get practical and descend the ivory towers ;-).

>
>>Also 20,000 cards ain't very many cards. It's only 10 boxes.
>
>That's irrelevant, your honor! What's at stake here is the
>number of bits,
You are counting the bits which are place holders.
> ...and whether the defendant compressed the data
>in an ill-advised manner.
I would need to know the method of storage; the method of measuring the size before _and_ after; and I usually want to know how the original was manufactured. It has been my observation that each pass at data modifies it; that's just how people work.
/BAH
Subtract a hundred and four for e-mail.
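
The "four, not five" remark in the post above matches a fixed-width scheme where each decimal digit gets its own 4-bit field, usually called packed BCD. The sketch below is one possible illustration of that idea, not a reconstruction of anything the posters ran. The function names and the use of 0xF as a padding nibble for odd-length input are choices made here.

```python
def pack_bcd(digits: str) -> bytes:
    """Pack decimal digits two per byte, 4 bits each (packed BCD)."""
    if any(c not in "0123456789" for c in digits):
        raise ValueError("input must contain only the characters 0-9")
    nibbles = [int(c) for c in digits]
    if len(nibbles) % 2:
        nibbles.append(0xF)  # assumed pad value: 0xF is never a valid digit
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_bcd(data: bytes) -> str:
    """Recover the digit string, skipping any 0xF padding nibble."""
    out = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):
            if nibble != 0xF:
                out.append(str(nibble))
    return "".join(out)


packed = pack_bcd("0" * 1_000_000)
print(len(packed))  # 4 bits per digit: half a byte for each of the digits
```

At 4 bits per digit a million digits take 500,000 bytes, so a fixed-width packing already beats 5 bits/char (625,000 bytes) but still spends more than the roughly 3.3 bits per digit that the ln(10)/ln(2) figure allows.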

From: Willem
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 22:16:59 +0000 (UTC)

John wrote:
)>>Someone else claimed that these "random" digits were
)>>complete crap and could be significantly compressed; let's see
)>>if your zip file reflects that.
)>>
)>>1 million decimal digits should be about 3.3 million bits;
)>>divide by 8 and get about 410,000 bytes.
415,241 bytes and change, if my calculator is accurate enough.
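
The arithmetic behind these figures is short enough to check directly. The sketch below recomputes the byte counts for a million digits at log2(10), 4 and 5 bits per digit, and also counts the bits needed for the largest possible million-digit number using exact integer arithmetic. The variable and function names are just labels chosen for this example.

```python
import math

N_DIGITS = 1_000_000


def bytes_needed(n_symbols: int, bits_per_symbol: float) -> float:
    """Storage in bytes for n_symbols at a given number of bits each."""
    return n_symbols * bits_per_symbol / 8


entropy_bytes = bytes_needed(N_DIGITS, math.log2(10))  # information-theoretic bound
four_bit_bytes = bytes_needed(N_DIGITS, 4)             # one 4-bit field per digit
five_bit_bytes = bytes_needed(N_DIGITS, 5)             # the 5 bits/char guess

# Exact check: bits in the largest million-digit number, 10**N_DIGITS - 1.
worst_case_bits = (10**N_DIGITS - 1).bit_length()
worst_case_bytes = -(-worst_case_bits // 8)            # ceiling division

print(f"log2(10) bits/digit: {entropy_bytes:,.2f} bytes")
print(f"4 bits/digit:        {four_bit_bytes:,.0f} bytes")
print(f"5 bits/digit:        {five_bit_bytes:,.0f} bytes")
print(f"worst case, exact:   {worst_case_bits:,} bits = {worst_case_bytes:,} bytes")
```

The first line reproduces the "415,241 bytes and change". The exact worst case needs one byte more because the fractional byte has to be rounded up; a particular million-digit string with small leading digits can still fit in fewer bits.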
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements made in the above text. For all I know I might be drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

From: David Bernier
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 19:17:05 -0500

John Baez wrote:
[...]
>>Are you still assuming there's 5 bits/char?
>
>
> No, since the data was decimal digits, I was assuming
> ln(10)/ln(2) or about 3.3 bits per character.
>
> But maybe you've solved the puzzle: maybe whatever data
> compression Bernier used to generate his .zip file
> was not smart enough to take full advantage of the
> fact that the data was just numbers! Seems dumb, but...
>
> At 5 bits/char we'd get 5 million bits or 625,000 bytes;
> then some not-terribly-smart program might compress that
> down to 484,760 bytes and feel proud of itself.
[...]
Using WinRK 2.0, the 1,000,000-byte file of 1,000,000 digits is compressed to a file with an amazing 419,732 bytes!
David Bernier
P.S. It took about 15 seconds with CPU being Athlon 2200+

From: Tim Arheit
Subject: Re: too much information!
Date: 12 Jan 2005 15:55:26 GMT

On Tue, 11 Jan 2005 19:17:05 -0500, David Bernier wrote:
>John Baez wrote:
>[...]
>
>>>Are you still assuming there's 5 bits/char?
>>
>>
>> No, since the data was decimal digits, I was assuming
>> ln(10)/ln(2) or about 3.3 bits per character.
>>
>> But maybe you've solved the puzzle: maybe whatever data
>> compression Bernier used to generate his .zip file
>> was not smart enough to take full advantage of the
>> fact that the data was just numbers! Seems dumb, but...
>>
>> At 5 bits/char we'd get 5 million bits or 625,000 bytes;
>> then some not-terribly-smart program might compress that
>> down to 484,760 bytes and feel proud of itself.
>[...]
>
>Using WinRK 2.0, the 1,000,000-byte file
>of 1,000,000 digits is compressed to a file
>with an amazing 419,732 bytes!
Not surprising at all given that the million digits are all characters 0-9 (i.e., lots of wasted bits to make it human readable).
In binary form the original million digits file is only 415,241 bytes. It can be downloaded from http://www.datacompression.info/Miscellaneous/AMillionRandomDigits.bin (at least for now).
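
One plausible way to get a "binary form" of that size is to read the whole digit string as a single base-10 integer and write it out in base 256. The sketch below shows that idea on a short pseudo-random stand-in string. It is an assumption about the approach, not a description of how the linked .bin file was actually produced, and the function names are invented here.

```python
import math
import random


def digits_to_binary(digits: str) -> bytes:
    """Treat the digit string as one big base-10 integer, emit it in base 256."""
    value = int(digits)
    n_bytes = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(n_bytes, "big")


def binary_to_digits(data: bytes, n_digits: int) -> str:
    """Invert digits_to_binary; n_digits restores any leading zeros."""
    return str(int.from_bytes(data, "big")).zfill(n_digits)


# Assumption: 1,000 pseudo-random digits stand in for the real table. Recent
# Python versions cap int/str conversions at 4,300 digits by default, so a
# million-digit run would first need sys.set_int_max_str_digits(0).
rng = random.Random(1)
sample = "".join(rng.choice("0123456789") for _ in range(1000))
packed = digits_to_binary(sample)

print(len(sample), "digits ->", len(packed), "bytes")
print("bound:", math.ceil(len(sample) * math.log2(10) / 8), "bytes")
```

The digit count has to be stored or known separately, since leading zeros disappear when the string becomes an integer.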
-Tim