knowledge-database (beta)

Current group: comp.sources.d

Re: Hash-table versus B-Tree

Re: Hash-table versus B-Tree  
CBFalconer
 Re: Hash-table versus B-Tree  
Alex Fraser
 Re: Hash-table versus B-Tree  
CBFalconer
 Re: Hash-table versus B-Tree  
Alex Fraser
 Re: Hash-table versus B-Tree  
Nick Landsberg
 Re: Hash-table versus B-Tree  
Alex Fraser
 Re: Hash-table versus B-Tree  
Nick Landsberg
 Re: Hash-table versus B-Tree  
CBFalconer
 Re: Hash-table versus B-Tree  
Nick Landsberg
 Re: Hash-table versus B-Tree  
Alex Fraser
 Re: Hash-table versus B-Tree  
CBFalconer
 Re: Hash-table versus B-Tree  
Alex Fraser
 Re: Hash-table versus B-Tree  
CBFalconer
From:CBFalconer
Subject:Re: Hash-table versus B-Tree
Date:Tue, 07 Dec 2004 13:00:47 GMT
"kar1107@gmail.com" wrote:
> KP Bhat wrote:
>
> I assume by B-Tree we also include balanced trees like AVL.
> Hash collision is handled thru a linked list.
>
>> I am trying to compare and contrast the use of Hash-table and
>> B-Trees as indexing mechanisms for record-based files. Here's
>> what I have got so far. Kindly suggest additions/modifications.
>>
>> Hash-table
>> 1. Simple indexing scheme
>> 2. Insert and delete operations very efficient
>
> Delete is not efficient. A potentially long linked list need to
> be traversed in case of collisions.

Not necessarily so. See hashlib, reference later. No linked list
is involved, and the operation is O(1).

>
>> 3. Retrieve operation very efficient
>> 4. Need to provide a hashing function that hashes record keys
>> into some number

This is usually a trivial operation. The trick is to supply a hash
function suitable for the data involved, and good systems can be
quite forgiving. The penalty for a bad function is poorer
performance.

>> 5. Supporting GetNext and GetBulk operations quite expensive

Depends on your definition of these. I have no idea what you have
in mind.

>> 6. Does not scale that well as the size of the data increases

Scales just fine until the data base no longer fits in virtual
memory. Then the B-tree is far superior.

>
> 7. Collision handling can lead to poor searches ("retrieve
> operation"). The complexity is O(n) for search and delete.

Not so. Again, see the hashlib reference at end. Search and
delete times are O(1).

>
>>
>> B-Tree
>> 1. Complex indexing scheme
>> 2. Insert and delete operations slightly expensive due to need
>> to retain the B-Tree property (i.e. splitting/joining nodes)
>> 3. Retrieve operation very efficient
>> 4. No need to provide a hashing function
>> 5. Supporting the GetNext and GetBulk operations very efficient
>> 6. Scales extremely well as the size of the data increases
>
> 7. search, insert, delete always O(log n) (thus supports #6 above)
>
> Once a solid library is implemented for the tree, its good to go
> with tree. You never know you may have to handle large data in
> future.

You can conveniently examine the characteristics of hash systems
via hashlib, which is available under GPL at:



--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
From:Alex Fraser
Subject:Re: Hash-table versus B-Tree
Date:Tue, 7 Dec 2004 15:42:29 -0000
"CBFalconer" wrote in message
news:41B5A6AC.FCE24212@yahoo.com...
> "kar1107@gmail.com" wrote:
> > KP Bhat wrote:
> >
> > I assume by B-Tree we also include balanced trees like AVL.
> > Hash collision is handled thru a linked list.
> >
> >> I am trying to compare and contrast the use of Hash-table and
> >> B-Trees as indexing mechanisms for record-based files. Here's
> >> what I have got so far. Kindly suggest additions/modifications.
> >>
> >> Hash-table
[snip]
> >> 6. Does not scale that well as the size of the data increases
>
> Scales just fine until the data base no longer fits in virtual
> memory. Then the B-tree is far superior.

Why does a B-tree become far superior in that case?

There are only two fundamental advantages to a B-tree that I can think of:
records are effectively sorted, and there's no wasted space. If neither of
these are important then AFAICS a hash table will offer greater performance,
greater scalability (ignoring the space issue), and probably a simpler
implementation. Conversely, if the fact that records are sorted _is_
important - and I think it often is - then a hash table is simply not
practical.

Alex
From:CBFalconer
Subject:Re: Hash-table versus B-Tree
Date:Tue, 07 Dec 2004 18:23:44 GMT
Alex Fraser wrote:
> "CBFalconer" wrote in message
> > "kar1107@gmail.com" wrote:
> > > KP Bhat wrote:
> > >
> > > I assume by B-Tree we also include balanced trees like AVL.
> > > Hash collision is handled thru a linked list.
> > >
> > >> I am trying to compare and contrast the use of Hash-table and
> > >> B-Trees as indexing mechanisms for record-based files. Here's
> > >> what I have got so far. Kindly suggest additions/modifications.
> > >>
> > >> Hash-table
> [snip]
> > >> 6. Does not scale that well as the size of the data increases
> >
> > Scales just fine until the data base no longer fits in virtual
> > memory. Then the B-tree is far superior.
>
> Why does a B-tree become far superior in that case?

Because it does not depend on pointers, which in turn will depend
on loading order, and other outside influences. It is designed to
operate efficiently with only one record per level in memory.

>
> There are only two fundamental advantages to a B-tree that I can
> think of: records are effectively sorted, and there's no wasted
> space. If neither of these are important then AFAICS a hash table
> will offer greater performance, greater scalability (ignoring the
> space issue), and probably a simpler implementation. Conversely,
> if the fact that records are sorted _is_ important - and I think
> it often is - then a hash table is simply not practical.

Sorting is perfectly possible, with virtually no additional space
required, with a hash system. See the examples I provide with
hashlib. A b-tree is intrinsically sorted (more or less) only on
the key. The hashlib system allows you to sort on any field.
Also, the hashlib system wastes little space - the fundamental
table of pointers is maintained between roughly 45 and 90% full,
and non-existant records take no space whatsoever.

It is perfectly possible to maintain a continuously ordered binary
tree within a hash system. This is not the same thing as a sorted
list, however, and will change insertion from O(1) to O(logN).

Access times in a b-tree are O(logN). Most hash table operations
are, or can be, O(1). You have to consider the operations your
system really needs when selecting a method.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
From:Alex Fraser
Subject:Re: Hash-table versus B-Tree
Date:Tue, 7 Dec 2004 22:16:54 -0000
"CBFalconer" wrote in message
news:41B5F129.DE9DAF44@yahoo.com...
> Alex Fraser wrote:
> > "CBFalconer" wrote in message
[snip]
> > > [A hash table] scales just fine until the data base no longer fits in
> > > virtual memory. Then the B-tree is far superior.
> >
> > Why does a B-tree become far superior in that case?
>
> Because it does not depend on pointers, which in turn will depend
> on loading order, and other outside influences.

Why do you need pointers?

Maybe we have different things in mind. See my other post for what I was
thinking of.

[snip]
> Sorting is perfectly possible, with virtually no additional space
> required, with a hash system. See the examples I provide with
> hashlib.

Which example(s)?

> Also, the hashlib system wastes little space - the fundamental
> table of pointers is maintained between roughly 45 and 90% full,
> and non-existant records take no space whatsoever.

Well, space for the pointer only.

Alex
From:Nick Landsberg
Subject:Re: Hash-table versus B-Tree
Date:Tue, 07 Dec 2004 16:51:41 GMT
Alex Fraser wrote:

> "CBFalconer" wrote in message
> news:41B5A6AC.FCE24212@yahoo.com...
>
>>"kar1107@gmail.com" wrote:
>>
>>>KP Bhat wrote:
>>>
>>>I assume by B-Tree we also include balanced trees like AVL.
>>>Hash collision is handled thru a linked list.
>>>
>>>
>>>>I am trying to compare and contrast the use of Hash-table and
>>>>B-Trees as indexing mechanisms for record-based files. Here's
>>>>what I have got so far. Kindly suggest additions/modifications.
>>>>
>>>>Hash-table
>
> [snip]
>
>>>>6. Does not scale that well as the size of the data increases
>>
>>Scales just fine until the data base no longer fits in virtual
>>memory. Then the B-tree is far superior.
>
>
> Why does a B-tree become far superior in that case?

I'll butt in -

If all the data no longer fits into memory, then, with
a hash, you have to go and read it from disk. Consider
overflow blocks in the hash. You will likely have to
read at least 1.5 disk records (with a GREAT hash algorithm),
to get the record you want.

It is likely, with modern machines, that
the whole B-tree can be kept in memory, you look it
(the index) up in the B-tree, and then go and read it
from disk directly. A single I/O.


>
> There are only two fundamental advantages to a B-tree that I can think of:
> records are effectively sorted, and there's no wasted space. If neither of
> these are important then AFAICS a hash table will offer greater performance,
> greater scalability (ignoring the space issue), and probably a simpler
> implementation. Conversely, if the fact that records are sorted _is_
> important - and I think it often is - then a hash table is simply not
> practical.
>
> Alex
>
>

I've done timing tests on an in-memory database (name withheld).
Hashing gives about a 10-15% performance improvement over their
tree indices. As you said, if the sorting of record is
important (e.g. "get next" is a common function), the
trees are the way to go. If not, pay your money and take
your choice.

NPL

--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:Alex Fraser
Subject:Re: Hash-table versus B-Tree
Date:Tue, 7 Dec 2004 22:16:04 -0000
"Nick Landsberg" wrote in message
news:xcltd.1048486$Gx4.219589@bgtnsc04-news.ops.worldnet.att.net...
> Alex Fraser wrote:
> > "CBFalconer" wrote in message
> > news:41B5A6AC.FCE24212@yahoo.com...
[snip]
> >> [A hash table] Scales just fine until the data base no longer fits in
> >> virtual memory. Then the B-tree is far superior.
> >
> > Why does a B-tree become far superior in that case?
>
> I'll butt in -
>
> If all the data no longer fits into memory, then, with
> a hash, you have to go and read it from disk. Consider
> overflow blocks in the hash. You will likely have to
> read at least 1.5 disk records (with a GREAT hash algorithm),
> to get the record you want.
>
> It is likely, with modern machines, that
> the whole B-tree can be kept in memory, you look it
> (the index) up in the B-tree, and then go and read it
> from disk directly. A single I/O.

If you store the hash value and record number in an in-memory hash table,
the chances of needing to read in anything other than the desired record
should be very low. Or am I missing your point?

Move that hash table to disc, and use sequential probing, and most of the
time you'll need two reads. The biggest problem with this is delete
operations, or rather, minimising the expense of rebuilding the hash table
when there have been many delete operations.

Alex
From:Nick Landsberg
Subject:Re: Hash-table versus B-Tree
Date:Wed, 08 Dec 2004 04:13:19 GMT
Alex Fraser wrote:
> "Nick Landsberg" wrote in message
> news:xcltd.1048486$Gx4.219589@bgtnsc04-news.ops.worldnet.att.net...
>
>>Alex Fraser wrote:
>>
>>>"CBFalconer" wrote in message
>>>news:41B5A6AC.FCE24212@yahoo.com...
>
> [snip]
>
>>>>[A hash table] Scales just fine until the data base no longer fits in
>>>>virtual memory. Then the B-tree is far superior.
>>>
>>>Why does a B-tree become far superior in that case?
>>
>>I'll butt in -
>>
>>If all the data no longer fits into memory, then, with
>>a hash, you have to go and read it from disk. Consider
>>overflow blocks in the hash. You will likely have to
>>read at least 1.5 disk records (with a GREAT hash algorithm),
>>to get the record you want.
>>
>>It is likely, with modern machines, that
>>the whole B-tree can be kept in memory, you look it
>>(the index) up in the B-tree, and then go and read it
>>from disk directly. A single I/O.
>
>
> If you store the hash value and record number in an in-memory hash table,
> the chances of needing to read in anything other than the desired record
> should be very low. Or am I missing your point?

From the heuristics which I remember from many years ago,
the average "probes per search" for a good generic hash
function was about 1.5 . Which translates to about 1.5
disk reads per hash, if the hash table is in memory
and the actual data is on disk.

>
> Move that hash table to disc, and use sequential probing, and most of the
> time you'll need two reads. The biggest problem with this is delete
> operations, or rather, minimising the expense of rebuilding the hash table
> when there have been many delete operations.

This depends upon your hash function and how disk is
accessed. If you use a minimalist approach, then
your hash function points to a logical file sector
on disk where the data record can be found, or
where pointers to overflow sectors can be found.
I have never considered rebuilding the hash table
on the fly when the number of records changed
substantilly, but will give it some thought.
I wonder what the tradeoffs are. Would you have
to relocate records on the disk? This would
seem to be counter-productive.

Don't get me wrong, please. I a proponent of hash
tables, but IFF you know what the heck you're doing
ahead of time, e.g. expected number of records
to set the table size.



>
> Alex
>
>


--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:CBFalconer
Subject:Re: Hash-table versus B-Tree
Date:Wed, 08 Dec 2004 05:36:43 GMT
Nick Landsberg wrote:
>
.... snip ...
>
> This depends upon your hash function and how disk is
> accessed. If you use a minimalist approach, then
> your hash function points to a logical file sector
> on disk where the data record can be found, or
> where pointers to overflow sectors can be found.
> I have never considered rebuilding the hash table
> on the fly when the number of records changed
> substantilly, but will give it some thought.
> I wonder what the tradeoffs are. Would you have
> to relocate records on the disk? This would
> seem to be counter-productive.
>
> Don't get me wrong, please. I a proponent of hash
> tables, but IFF you know what the heck you're doing
> ahead of time, e.g. expected number of records
> to set the table size.

Try out my hashlib, on my site below, download section. It is
instrumented so you can tell the average probes per search, etc,
and handles automatic expansion.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
From:Nick Landsberg
Subject:Re: Hash-table versus B-Tree
Date:Thu, 09 Dec 2004 02:05:45 GMT
CBFalconer wrote:

> Nick Landsberg wrote:
>
> ... snip ...
>
>>This depends upon your hash function and how disk is
>>accessed. If you use a minimalist approach, then
>>your hash function points to a logical file sector
>>on disk where the data record can be found, or
>>where pointers to overflow sectors can be found.
>>I have never considered rebuilding the hash table
>>on the fly when the number of records changed
>>substantilly, but will give it some thought.
>>I wonder what the tradeoffs are. Would you have
>>to relocate records on the disk? This would
>>seem to be counter-productive.
>>
>>Don't get me wrong, please. I a proponent of hash
>>tables, but IFF you know what the heck you're doing
>>ahead of time, e.g. expected number of records
>>to set the table size.
>
>
> Try out my hashlib, on my site below, download section. It is
> instrumented so you can tell the average probes per search, etc,
> and handles automatic expansion.
>

I've downloaded it Chuck, and looked at the code, but
work time prohibits me from doing a thorough test.
From what I can see of the code, it's first-class
stuff.

NPL


--
"It is impossible to make anything foolproof
because fools are so ingenious"
- A. Bloch
From:Alex Fraser
Subject:Re: Hash-table versus B-Tree
Date:Fri, 10 Dec 2004 19:47:57 -0000
"Nick Landsberg" wrote in message
news:zbvtd.1051684$Gx4.251966@bgtnsc04-news.ops.worldnet.att.net...
> Alex Fraser wrote:
[snip]
> > If you store the hash value and record number in an in-memory hash
> > table, the chances of needing to read in anything other than the
> > desired record should be very low. Or am I missing your point?
>
> From the heuristics which I remember from many years ago,
> the average "probes per search" for a good generic hash
> function was about 1.5 .

Obviously, this depends on the hash table loading. I think 1.5 is right for
an optimal hash function and 50% load (using a "closed" table, ie no
chaining).

> Which translates to about 1.5 disk reads per hash, if the hash table is
> in memory and the actual data is on disk.

Not with what I attempted to describe. Let me try to explain more fully.

Suppose you have a (good) hash function which produces 32- or 64-bit hash
values. After computing the hash value for the key of the record you are
(say) searching for, you must somehow reduce it to the size of the table (eg
using the modulus operator), and that is where you start your probe
sequence. But, and this is the main point, you don't throw away the original
hash value. Instead, you compare it to the hash values you cunningly stored
in the table (along with the record number). If the range of hash values is
much larger than the size of the table, then quite often you won't need to
read the records when keys don't match: even if they hash to the same slot,
the hash values will probably differ.

You can alternatively consider the idea of storing the hash value in the
table as as a cache. The property relied upon is that if the hash values of
two keys differ, then the keys must differ. By caching the hash value, you
have a cheaper way to prove inequality (and continue the probe sequence,
without examining the record).

[snip]
> I have never considered rebuilding the hash table
> on the fly when the number of records changed
> substantilly, but will give it some thought.
> I wonder what the tradeoffs are. Would you have
> to relocate records on the disk? This would
> seem to be counter-productive.

You can avoid the need to relocate records by using indirection (this
applies if you use a B-tree, too). Basically, you seperate the index (hash
table or B-tree) from the records themselves.

I read somewhere on the web about a hash table implementation which sought
to reduce the pain of rebuilding. Unfortunately, I don't remember how it
worked and I can't find it now, but I wonder if it could be applied to this
situation.

Alex
From:CBFalconer
Subject:Re: Hash-table versus B-Tree
Date:Fri, 10 Dec 2004 23:11:47 GMT
Alex Fraser wrote:
>
.... snip ...
>
> I read somewhere on the web about a hash table implementation which
> sought to reduce the pain of rebuilding. Unfortunately, I don't
> remember how it worked and I can't find it now, but I wonder if it
> could be applied to this situation.

One such is my hashlib, available on the download section of my
page. It doesn't reduce, it eliminates.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
From:Alex Fraser
Subject:Re: Hash-table versus B-Tree
Date:Sat, 11 Dec 2004 07:53:20 -0000
"CBFalconer" wrote in message
news:41BA2802.7D1942B@yahoo.com...
> Alex Fraser wrote:
> ... snip ...
> >
> > I read somewhere on the web about a hash table implementation which
> > sought to reduce the pain of rebuilding. Unfortunately, I don't
> > remember how it worked and I can't find it now, but I wonder if it
> > could be applied to this situation.
>
> One such is my hashlib, available on the download section of my
> page. It doesn't reduce, it eliminates.

The pain I referred to was time complexity: rebuilding the table by adding
each entry to a new hash table then discarding the old table (the obvious
algorithm, as used in hashlib) is O(n).

If there are a large number of inserts and deletes (balanced, so the number
of entries is constant in the long term) the table will typically become
cluttered with deleted entries, and performance will suffer unless the table
is (frequently) rebuilt.

If the number of entries is also large, this can result in a significant
proportion of the time being used for rebuilding. Even if the average
performance of operations is excellent, ie O(1), rebuilding will give some
probably undesirable "bumps" in performance. These are the problems
addressed by what I mentioned I read.

Alex
From:CBFalconer
Subject:Re: Hash-table versus B-Tree
Date:Sat, 11 Dec 2004 10:02:41 GMT
Alex Fraser wrote:
> "CBFalconer" wrote in message
>> Alex Fraser wrote:
>
>> ... snip ...
>>>
>>> I read somewhere on the web about a hash table implementation which
>>> sought to reduce the pain of rebuilding. Unfortunately, I don't
>>> remember how it worked and I can't find it now, but I wonder if it
>>> could be applied to this situation.
>>
>> One such is my hashlib, available on the download section of my
>> page. It doesn't reduce, it eliminates.
>
> The pain I referred to was time complexity: rebuilding the table by
> adding each entry to a new hash table then discarding the old table
> (the obvious algorithm, as used in hashlib) is O(n).

However there are savings available in rebuilding versus inserting.

>
> If there are a large number of inserts and deletes (balanced, so
> the number of entries is constant in the long term) the table will
> typically become cluttered with deleted entries, and performance
> will suffer unless the table is (frequently) rebuilt.

Hashlib uses the same criterion, approximately 90% full, to trigger
rebuilding. However if deletions can be removed and reach a
different criterion, the table does not expand.

>
> If the number of entries is also large, this can result in a
> significant proportion of the time being used for rebuilding. Even
> if the average performance of operations is excellent, ie O(1),
> rebuilding will give some probably undesirable "bumps" in
> performance. These are the problems addressed by what I mentioned
> I read.

You would think so at first glance. While there are 'bumps' in the
access times, they are much smaller than one would expect
a-priori. At any rate, the system has built in instrumentation, so
you can measure all these things out of the box.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
Available for consulting/temporary embedded and systems.
USE worldnet address!
   

Copyright © 2006 knowledge-database   -   All rights reserved