
Current group: comp.compression.

Re: too much information!

From: John Baez
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 21:06:27 +0000 (UTC)

In article , wrote:

>In article ,
> baez@galaxy.ucr.edu (John Baez) wrote:

>>In article ,
>>David Bernier wrote:

>>>In the 1950's, RAND produced a table of a million
>>>random digits, and another one with
>>>"100,000 Normal Deviates". ( link is below)

I keep on reading this as "100,000 normal deviants", which seems
like a description of usenet news. Apparently the librarians thought
similarly, since this book was first filed under psychology.

>>>Having downloaded the 1 million digits (line numbers included),
>>>I wrote a program to save them as a file with
>>>12,500 lines at 80 digits per line.
>>>
>>>This got compressed to a 484,760-byte *.zip file.

>>Someone else claimed that these "random" digits were
>>complete crap and could be significantly compressed; let's see
>>if your zip file reflects that.
>>
>>1 million decimal digits should be about 3.3 million bits;
>>divide by 8 and get about 410,000 bytes.
>>
>>Wait a minute - that's LESS than your supposedly compressed
>>file! What's going on? It can't be those line numbers.
>>Am I making some dumb mistake?

>Are you still assuming there's 5 bits/char?

No, since the data was decimal digits, I was assuming
ln(10)/ln(2) or about 3.3 bits per character.

But maybe you've solved the puzzle: maybe whatever data
compression Bernier used to generate his .zip file
was not smart enough to take full advantage of the
fact that the data was just numbers! Seems dumb, but...

At 5 bits/char we'd get 5 million bits or 625,000 bytes;
then some not-terribly-smart program might compress that
down to 484,760 bytes and feel proud of itself.
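The hypothesis above can be tried out directly. What follows is a minimal sketch, not a reconstruction of Bernier's actual run: it assumes Python's zlib module (deflate, the method most .zip tools use) as a stand-in for his zip program, and pseudo-random digits from Python's random module as a stand-in for the RAND table. The function name deflate_vs_bound is made up for illustration.

```python
import math
import random
import zlib

def deflate_vs_bound(n_digits, seed=0):
    """Compress n_digits pseudo-random ASCII digits with deflate.

    Returns (compressed_size_in_bytes, entropy_bound_in_bytes).
    The digits are a stand-in for the RAND table, not the real data.
    """
    rng = random.Random(seed)
    text = "".join(rng.choices("0123456789", k=n_digits)).encode("ascii")
    compressed = zlib.compress(text, 9)      # level 9 = best compression
    bound = n_digits * math.log2(10) / 8     # log2(10) bits per digit
    return len(compressed), bound

size, bound = deflate_vs_bound(1_000_000)
print(f"deflate: {size} bytes, entropy bound: {bound:.0f} bytes")
```

If the digits really are patternless, a general-purpose byte-oriented compressor should land somewhere between the entropy bound and the 1,000,000-byte ASCII size, which is the situation described in the thread.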

>Also the data was on cards. There's a different encoding with
>cards that has to do with two fields and one hole punch in each.
>Geography of the holes was how characters were defined.

Oh-oh - now this is getting too complicated for me to understand.

>Also 20,000 cards ain't very many cards. It's only 10 boxes.

That's irrelevant, your honor! What's at stake here is the
number of bits, and whether the defendant compressed the data
in an ill-advised manner.

From: jmfbahciv at aol.com
Subject: Re: too much information!
Date: Wed, 12 Jan 05 12:44:25 GMT

In article ,
baez@galaxy.ucr.edu (John Baez) wrote:
>In article , wrote:
>
>>In article ,
>> baez@galaxy.ucr.edu (John Baez) wrote:
>
>>>In article ,
>>>David Bernier wrote:
>
>>>>In the 1950's, RAND produced a table of a million
>>>>random digits, and another one with
>>>>"100,000 Normal Deviates". ( link is below)
>
>I keep on reading this as "100,000 normal deviants", which seems
>like a description of usenet news. Apparently the librarians thought
>similarly, since this book was first filed under psychology.

It's a good place for it. Whenever we talked about this stuff
we went nuts.

>
>>>>Having downloaded the 1 million digits (line numbers included),
>>>>I wrote a program to save them as a file with
>>>>12,500 lines at 80 digits per line.
>>>>
>>>>This got compressed to a 484,760-byte *.zip file.
>
>>>Someone else claimed that these "random" digits were
>>>complete crap and could be significantly compressed; let's see
>>>if your zip file reflects that.
>>>
>>>1 million decimal digits should be about 3.3 million bits;
>>>divide by 8 and get about 410,000 bytes.
>>>
>>>Wait a minute - that's LESS than your supposedly compressed
>>>file! What's going on? It can't be those line numbers.
>>>Am I making some dumb mistake?
>
>>Are you still assuming there's 5 bits/char?
>
>No, since the data was decimal digits, I was assuming
>ln(10)/ln(2) or about 3.3 bits per character.

Which gives you four, not five, for max storage needed for _one_
decimal number.
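The "four bits for one decimal digit" figure corresponds to packed BCD, two digits per byte. A minimal sketch follows, assuming the digits arrive as an ASCII string; pack_bcd and unpack_bcd are hypothetical helper names, not anything used in the thread.

```python
import math

def pack_bcd(digits):
    """Pack a string of decimal digits two per byte (4 bits each)."""
    if len(digits) % 2:
        digits += "0"  # pad to an even count; the true length must be kept separately
    return bytes((int(a) << 4) | int(b)
                 for a, b in zip(digits[0::2], digits[1::2]))

def unpack_bcd(data, n_digits):
    """Inverse of pack_bcd; n_digits drops the padding digit if one was added."""
    out = []
    for byte in data:
        out.append(str(byte >> 4))     # high nibble
        out.append(str(byte & 0x0F))   # low nibble
    return "".join(out)[:n_digits]

bits_per_digit = math.ceil(math.log2(10))        # whole bits needed for one digit
print(bits_per_digit, 1_000_000 * bits_per_digit // 8)   # prints: 4 500000
```

So storing each digit in a whole number of bits costs 500,000 bytes for the million digits; getting down to about 3.3 bits per digit requires coding several digits together.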

>
>But maybe you've solved the puzzle: maybe whatever data
>compression Bernier used to generate his .zip file
>was not smart enough to take full advantage of the
>fact that the data was just numbers! Seems dumb, but...

Not really. It's a common error for those who don't
breathe binary or octal. If the guy only thought in hex,
he'd have other problems. There are lots of "formulas"
for extracting characters out of packed bits. RADIX-50
was commonly used. This repeated reference smells of Bardot..
I just felt my brain spaz so this word is incorrect.
>
>At 5 bits/char we'd get 5 million bits or 625,000 bytes;
>then some not-terribly-smart program might compress that
>down to 484,760 bytes and feel proud of itself.

And, if I could write a 2-line FORTRAN program that generated
those 5 million bits, I'd have "compressed" the data down
to a maximum of 144 characters. I won't complicate your thinking
with mumblings about how many extra characters it takes to store
these 5 million numbers on media based on electricity.

>
>>Also the data was on cards. There's a different encoding with
>>cards that has to do with two fields and one hole punch in each.
>>Geography of the holes was how characters were defined.
>
>Oh-oh - now this is getting too complicated for me to understand.

Yep. That's how the conversation usually goes as soon as we
have to get practical and descend the ivory towers ;-).
>
>>Also 20,000 cards ain't very many cards. It's only 10 boxes.
>
>That's irrelevant, your honor! What's at stake here is the
>number of bits,

You are counting the bits which are place holders.

> ...and whether the defendant compressed the data
>in an ill-advised manner.

I would need to know the method of storage; the method of measuring
the size before _and_ after; and I usually want to know how the
original was manufactured. It has been my observation that each
pass at data modifies it; that's just how people work.

/BAH

Subtract a hundred and four for e-mail.

From: Willem
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 22:16:59 +0000 (UTC)

John wrote:
)>>Someone else claimed that these "random" digits were
)>>complete crap and could be significantly compressed; let's see
)>>if your zip file reflects that.
)>>
)>>1 million decimal digits should be about 3.3 million bits;
)>>divide by 8 and get about 410,000 bytes.

415,241 bytes and change, if my calculator is accurate enough.
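For anyone who wants to check that figure, here is the arithmetic as a short Python sketch. It assumes only that each digit is uniformly random and so carries log2(10) bits.

```python
import math

n_digits = 1_000_000
bits = n_digits * math.log2(10)   # about 3.32 bits per decimal digit
bytes_needed = bits / 8

print(f"{bits:,.0f} bits = {bytes_needed:,.2f} bytes")
```

This comes out at roughly 3.32 million bits, or 415,241 bytes and change.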


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

From: David Bernier
Subject: Re: too much information!
Date: Tue, 11 Jan 2005 19:17:05 -0500

John Baez wrote:
[...]

>>Are you still assuming there's 5 bits/char?
>
>
> No, since the data was decimal digits, I was assuming
> ln(10)/ln(2) or about 3.3 bits per character.
>
> But maybe you've solved the puzzle: maybe whatever data
> compression Bernier used to generate his .zip file
> was not smart enough to take full advantage of the
> fact that the data was just numbers! Seems dumb, but...
>
> At 5 bits/char we'd get 5 million bits or 625,000 bytes;
> then some not-terribly-smart program might compress that
> down to 484,760 bytes and feel proud of itself.
[...]

Using WinRK 2.0, the 1,000,000-byte file
of 1,000,000 digits is compressed to a file
with an amazing 419,732 bytes!

David Bernier

P.S. It took about 15 seconds with CPU being Athlon 2200+

From: Tim Arheit
Subject: Re: too much information!
Date: 12 Jan 2005 15:55:26 GMT

On Tue, 11 Jan 2005 19:17:05 -0500, David Bernier
wrote:

>John Baez wrote:
>[...]
>
>>>Are you still assuming there's 5 bits/char?
>>
>>
>> No, since the data was decimal digits, I was assuming
>> ln(10)/ln(2) or about 3.3 bits per character.
>>
>> But maybe you've solved the puzzle: maybe whatever data
>> compression Bernier used to generate his .zip file
>> was not smart enough to take full advantage of the
>> fact that the data was just numbers! Seems dumb, but...
>>
>> At 5 bits/char we'd get 5 million bits or 625,000 bytes;
>> then some not-terribly-smart program might compress that
>> down to 484,760 bytes and feel proud of itself.
>[...]
>
>Using WinRK 2.0, the 1,000,000-byte file
>of 1,000,000 digits is compressed to a file
>with an amazing 419,732 bytes!

Not surprising at all given that the million digits are all characters
0-9 (i.e., lots of wasted bits to make it human readable).

In binary form the original million digits file is only 415,251
bytes. It can be downloaded from
http://www.datacompression.info/Miscellaneous/AMillionRandomDigits.bin
(at least for now).
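One way to get a binary form of about that size is to treat the whole digit string as a single large integer and write out its bytes. This is only a sketch of that idea, and an assumption about how such a file could be built, not a description of how the .bin file above was actually produced. The helper names are made up, and the sample is 1000 pseudo-random digits rather than the RAND data.

```python
import math
import random

def digits_to_bytes(digits):
    """Interpret a decimal digit string as one big integer, stored big-endian."""
    n = int(digits)
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big")

def bytes_to_digits(data, n_digits):
    """Inverse of digits_to_bytes; n_digits restores any leading zeros."""
    return str(int.from_bytes(data, "big")).zfill(n_digits)

# Small sample: recent Python versions limit int <-> str conversion to
# 4300 digits by default (see sys.set_int_max_str_digits).
rng = random.Random(0)
sample = "".join(rng.choices("0123456789", k=1000))
packed = digits_to_bytes(sample)

assert bytes_to_digits(packed, len(sample)) == sample
bound = math.ceil(len(sample) * math.log2(10) / 8)
print(len(sample), "digits ->", len(packed), "bytes; bound", bound)
```

The packed length can never exceed ceil(n * log2(10) / 8) bytes, and for a million digits that ceiling is 415,242 bytes, in line with the figures quoted in this thread.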

-Tim
   
