|
|
 | | From: | CBFalconer | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 07 Dec 2004 13:00:47 GMT |
|
|
 | "kar1107@gmail.com" wrote: > KP Bhat wrote: > > I assume by B-Tree we also include balanced trees like AVL. > Hash collision is handled thru a linked list. > >> I am trying to compare and contrast the use of Hash-table and >> B-Trees as indexing mechanisms for record-based files. Here's >> what I have got so far. Kindly suggest additions/modifications. >> >> Hash-table >> 1. Simple indexing scheme >> 2. Insert and delete operations very efficient > > Delete is not efficient. A potentially long linked list need to > be traversed in case of collisions.
Not necessarily so. See hashlib, reference later. No linked list is involved, and the operation is O(1).
> >> 3. Retrieve operation very efficient >> 4. Need to provide a hashing function that hashes record keys >> into some number
This is usually a trivial operation. The trick is to supply a hash function suitable for the data involved, and good systems can be quite forgiving. The penalty for a bad function is poorer performance.
>> 5. Supporting GetNext and GetBulk operations quite expensive
Depends on your definition of these. I have no idea what you have in mind.
>> 6. Does not scale that well as the size of the data increases
Scales just fine until the data base no longer fits in virtual memory. Then the B-tree is far superior.
> > 7. Collision handling can lead to poor searches ("retrieve > operation"). The complexity is O(n) for search and delete.
Not so. Again, see the hashlib reference at end. Search and delete times are O(1).
> >> >> B-Tree >> 1. Complex indexing scheme >> 2. Insert and delete operations slightly expensive due to need >> to retain the B-Tree property (i.e. splitting/joining nodes) >> 3. Retrieve operation very efficient >> 4. No need to provide a hashing function >> 5. Supporting the GetNext and GetBulk operations very efficient >> 6. Scales extremely well as the size of the data increases > > 7. search, insert, delete always O(log n) (thus supports #6 above) > > Once a solid library is implemented for the tree, its good to go > with tree. You never know you may have to handle large data in > future.
You can conveniently examine the characteristics of hash systems via hashlib, which is available under GPL at:
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!
|
|
 | | From: | Alex Fraser | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 7 Dec 2004 15:42:29 -0000 |
|
|
 | "CBFalconer" wrote in message news:41B5A6AC.FCE24212@yahoo.com... > "kar1107@gmail.com" wrote: > > KP Bhat wrote: > > > > I assume by B-Tree we also include balanced trees like AVL. > > Hash collision is handled thru a linked list. > > > >> I am trying to compare and contrast the use of Hash-table and > >> B-Trees as indexing mechanisms for record-based files. Here's > >> what I have got so far. Kindly suggest additions/modifications. > >> > >> Hash-table [snip] > >> 6. Does not scale that well as the size of the data increases > > Scales just fine until the data base no longer fits in virtual > memory. Then the B-tree is far superior.
Why does a B-tree become far superior in that case?
There are only two fundamental advantages to a B-tree that I can think of: records are effectively sorted, and there's no wasted space. If neither of these are important then AFAICS a hash table will offer greater performance, greater scalability (ignoring the space issue), and probably a simpler implementation. Conversely, if the fact that records are sorted _is_ important - and I think it often is - then a hash table is simply not practical.
Alex
|
|
 | | From: | CBFalconer | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 07 Dec 2004 18:23:44 GMT |
|
|
 | Alex Fraser wrote: > "CBFalconer" wrote in message > > "kar1107@gmail.com" wrote: > > > KP Bhat wrote: > > > > > > I assume by B-Tree we also include balanced trees like AVL. > > > Hash collision is handled thru a linked list. > > > > > >> I am trying to compare and contrast the use of Hash-table and > > >> B-Trees as indexing mechanisms for record-based files. Here's > > >> what I have got so far. Kindly suggest additions/modifications. > > >> > > >> Hash-table > [snip] > > >> 6. Does not scale that well as the size of the data increases > > > > Scales just fine until the data base no longer fits in virtual > > memory. Then the B-tree is far superior. > > Why does a B-tree become far superior in that case?
Because it does not depend on pointers, which in turn will depend on loading order, and other outside influences. It is designed to operate efficiently with only one record per level in memory.
> > There are only two fundamental advantages to a B-tree that I can > think of: records are effectively sorted, and there's no wasted > space. If neither of these are important then AFAICS a hash table > will offer greater performance, greater scalability (ignoring the > space issue), and probably a simpler implementation. Conversely, > if the fact that records are sorted _is_ important - and I think > it often is - then a hash table is simply not practical.
Sorting is perfectly possible, with virtually no additional space required, with a hash system. See the examples I provide with hashlib. A b-tree is intrinsically sorted (more or less) only on the key. The hashlib system allows you to sort on any field. Also, the hashlib system wastes little space - the fundamental table of pointers is maintained between roughly 45 and 90% full, and non-existant records take no space whatsoever.
It is perfectly possible to maintain a continuously ordered binary tree within a hash system. This is not the same thing as a sorted list, however, and will change insertion from O(1) to O(logN).
Access times in a b-tree are O(logN). Most hash table operations are, or can be, O(1). You have to consider the operations your system really needs when selecting a method.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!
|
|
 | | From: | Alex Fraser | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 7 Dec 2004 22:16:54 -0000 |
|
|
 | "CBFalconer" wrote in message news:41B5F129.DE9DAF44@yahoo.com... > Alex Fraser wrote: > > "CBFalconer" wrote in message [snip] > > > [A hash table] scales just fine until the data base no longer fits in > > > virtual memory. Then the B-tree is far superior. > > > > Why does a B-tree become far superior in that case? > > Because it does not depend on pointers, which in turn will depend > on loading order, and other outside influences.
Why do you need pointers?
Maybe we have different things in mind. See my other post for what I was thinking of.
[snip] > Sorting is perfectly possible, with virtually no additional space > required, with a hash system. See the examples I provide with > hashlib.
Which example(s)?
> Also, the hashlib system wastes little space - the fundamental > table of pointers is maintained between roughly 45 and 90% full, > and non-existant records take no space whatsoever.
Well, space for the pointer only.
Alex
|
|
 | | From: | Nick Landsberg | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 07 Dec 2004 16:51:41 GMT |
|
|
 | Alex Fraser wrote:
> "CBFalconer" wrote in message > news:41B5A6AC.FCE24212@yahoo.com... > >>"kar1107@gmail.com" wrote: >> >>>KP Bhat wrote: >>> >>>I assume by B-Tree we also include balanced trees like AVL. >>>Hash collision is handled thru a linked list. >>> >>> >>>>I am trying to compare and contrast the use of Hash-table and >>>>B-Trees as indexing mechanisms for record-based files. Here's >>>>what I have got so far. Kindly suggest additions/modifications. >>>> >>>>Hash-table > > [snip] > >>>>6. Does not scale that well as the size of the data increases >> >>Scales just fine until the data base no longer fits in virtual >>memory. Then the B-tree is far superior. > > > Why does a B-tree become far superior in that case?
I'll butt in -
If all the data no longer fits into memory, then, with a hash, you have to go and read it from disk. Consider overflow blocks in the hash. You will likely have to read at least 1.5 disk records (with a GREAT hash algorithm), to get the record you want.
It is likely, with modern machines, that the whole B-tree can be kept in memory, you look it (the index) up in the B-tree, and then go and read it from disk directly. A single I/O.
> > There are only two fundamental advantages to a B-tree that I can think of: > records are effectively sorted, and there's no wasted space. If neither of > these are important then AFAICS a hash table will offer greater performance, > greater scalability (ignoring the space issue), and probably a simpler > implementation. Conversely, if the fact that records are sorted _is_ > important - and I think it often is - then a hash table is simply not > practical. > > Alex > >
I've done timing tests on an in-memory database (name withheld). Hashing gives about a 10-15% performance improvement over their tree indices. As you said, if the sorting of record is important (e.g. "get next" is a common function), the trees are the way to go. If not, pay your money and take your choice.
NPL
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | Alex Fraser | | Subject: | Re: Hash-table versus B-Tree | | Date: | Tue, 7 Dec 2004 22:16:04 -0000 |
|
|
 | "Nick Landsberg" wrote in message news:xcltd.1048486$Gx4.219589@bgtnsc04-news.ops.worldnet.att.net... > Alex Fraser wrote: > > "CBFalconer" wrote in message > > news:41B5A6AC.FCE24212@yahoo.com... [snip] > >> [A hash table] Scales just fine until the data base no longer fits in > >> virtual memory. Then the B-tree is far superior. > > > > Why does a B-tree become far superior in that case? > > I'll butt in - > > If all the data no longer fits into memory, then, with > a hash, you have to go and read it from disk. Consider > overflow blocks in the hash. You will likely have to > read at least 1.5 disk records (with a GREAT hash algorithm), > to get the record you want. > > It is likely, with modern machines, that > the whole B-tree can be kept in memory, you look it > (the index) up in the B-tree, and then go and read it > from disk directly. A single I/O.
If you store the hash value and record number in an in-memory hash table, the chances of needing to read in anything other than the desired record should be very low. Or am I missing your point?
Move that hash table to disc, and use sequential probing, and most of the time you'll need two reads. The biggest problem with this is delete operations, or rather, minimising the expense of rebuilding the hash table when there have been many delete operations.
Alex
|
|
 | | From: | Nick Landsberg | | Subject: | Re: Hash-table versus B-Tree | | Date: | Wed, 08 Dec 2004 04:13:19 GMT |
|
|
 | Alex Fraser wrote: > "Nick Landsberg" wrote in message > news:xcltd.1048486$Gx4.219589@bgtnsc04-news.ops.worldnet.att.net... > >>Alex Fraser wrote: >> >>>"CBFalconer" wrote in message >>>news:41B5A6AC.FCE24212@yahoo.com... > > [snip] > >>>>[A hash table] Scales just fine until the data base no longer fits in >>>>virtual memory. Then the B-tree is far superior. >>> >>>Why does a B-tree become far superior in that case? >> >>I'll butt in - >> >>If all the data no longer fits into memory, then, with >>a hash, you have to go and read it from disk. Consider >>overflow blocks in the hash. You will likely have to >>read at least 1.5 disk records (with a GREAT hash algorithm), >>to get the record you want. >> >>It is likely, with modern machines, that >>the whole B-tree can be kept in memory, you look it >>(the index) up in the B-tree, and then go and read it >>from disk directly. A single I/O. > > > If you store the hash value and record number in an in-memory hash table, > the chances of needing to read in anything other than the desired record > should be very low. Or am I missing your point?
From the heuristics which I remember from many years ago, the average "probes per search" for a good generic hash function was about 1.5 . Which translates to about 1.5 disk reads per hash, if the hash table is in memory and the actual data is on disk.
> > Move that hash table to disc, and use sequential probing, and most of the > time you'll need two reads. The biggest problem with this is delete > operations, or rather, minimising the expense of rebuilding the hash table > when there have been many delete operations.
This depends upon your hash function and how disk is accessed. If you use a minimalist approach, then your hash function points to a logical file sector on disk where the data record can be found, or where pointers to overflow sectors can be found. I have never considered rebuilding the hash table on the fly when the number of records changed substantilly, but will give it some thought. I wonder what the tradeoffs are. Would you have to relocate records on the disk? This would seem to be counter-productive.
Don't get me wrong, please. I a proponent of hash tables, but IFF you know what the heck you're doing ahead of time, e.g. expected number of records to set the table size.
> > Alex > >
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | CBFalconer | | Subject: | Re: Hash-table versus B-Tree | | Date: | Wed, 08 Dec 2004 05:36:43 GMT |
|
|
 | Nick Landsberg wrote: > .... snip ... > > This depends upon your hash function and how disk is > accessed. If you use a minimalist approach, then > your hash function points to a logical file sector > on disk where the data record can be found, or > where pointers to overflow sectors can be found. > I have never considered rebuilding the hash table > on the fly when the number of records changed > substantilly, but will give it some thought. > I wonder what the tradeoffs are. Would you have > to relocate records on the disk? This would > seem to be counter-productive. > > Don't get me wrong, please. I a proponent of hash > tables, but IFF you know what the heck you're doing > ahead of time, e.g. expected number of records > to set the table size.
Try out my hashlib, on my site below, download section. It is instrumented so you can tell the average probes per search, etc, and handles automatic expansion.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!
|
|
 | | From: | Nick Landsberg | | Subject: | Re: Hash-table versus B-Tree | | Date: | Thu, 09 Dec 2004 02:05:45 GMT |
|
|
 | CBFalconer wrote:
> Nick Landsberg wrote: > > ... snip ... > >>This depends upon your hash function and how disk is >>accessed. If you use a minimalist approach, then >>your hash function points to a logical file sector >>on disk where the data record can be found, or >>where pointers to overflow sectors can be found. >>I have never considered rebuilding the hash table >>on the fly when the number of records changed >>substantilly, but will give it some thought. >>I wonder what the tradeoffs are. Would you have >>to relocate records on the disk? This would >>seem to be counter-productive. >> >>Don't get me wrong, please. I a proponent of hash >>tables, but IFF you know what the heck you're doing >>ahead of time, e.g. expected number of records >>to set the table size. > > > Try out my hashlib, on my site below, download section. It is > instrumented so you can tell the average probes per search, etc, > and handles automatic expansion. >
I've downloaded it Chuck, and looked at the code, but work time prohibits me from doing a thorough test. From what I can see of the code, it's first-class stuff.
NPL
-- "It is impossible to make anything foolproof because fools are so ingenious" - A. Bloch
|
|
 | | From: | Alex Fraser | | Subject: | Re: Hash-table versus B-Tree | | Date: | Fri, 10 Dec 2004 19:47:57 -0000 |
|
|
 | "Nick Landsberg" wrote in message news:zbvtd.1051684$Gx4.251966@bgtnsc04-news.ops.worldnet.att.net... > Alex Fraser wrote: [snip] > > If you store the hash value and record number in an in-memory hash > > table, the chances of needing to read in anything other than the > > desired record should be very low. Or am I missing your point? > > From the heuristics which I remember from many years ago, > the average "probes per search" for a good generic hash > function was about 1.5 .
Obviously, this depends on the hash table loading. I think 1.5 is right for an optimal hash function and 50% load (using a "closed" table, ie no chaining).
> Which translates to about 1.5 disk reads per hash, if the hash table is > in memory and the actual data is on disk.
Not with what I attempted to describe. Let me try to explain more fully.
Suppose you have a (good) hash function which produces 32- or 64-bit hash values. After computing the hash value for the key of the record you are (say) searching for, you must somehow reduce it to the size of the table (eg using the modulus operator), and that is where you start your probe sequence. But, and this is the main point, you don't throw away the original hash value. Instead, you compare it to the hash values you cunningly stored in the table (along with the record number). If the range of hash values is much larger than the size of the table, then quite often you won't need to read the records when keys don't match: even if they hash to the same slot, the hash values will probably differ.
You can alternatively consider the idea of storing the hash value in the table as as a cache. The property relied upon is that if the hash values of two keys differ, then the keys must differ. By caching the hash value, you have a cheaper way to prove inequality (and continue the probe sequence, without examining the record).
[snip] > I have never considered rebuilding the hash table > on the fly when the number of records changed > substantilly, but will give it some thought. > I wonder what the tradeoffs are. Would you have > to relocate records on the disk? This would > seem to be counter-productive.
You can avoid the need to relocate records by using indirection (this applies if you use a B-tree, too). Basically, you seperate the index (hash table or B-tree) from the records themselves.
I read somewhere on the web about a hash table implementation which sought to reduce the pain of rebuilding. Unfortunately, I don't remember how it worked and I can't find it now, but I wonder if it could be applied to this situation.
Alex
|
|
 | | From: | CBFalconer | | Subject: | Re: Hash-table versus B-Tree | | Date: | Fri, 10 Dec 2004 23:11:47 GMT |
|
|
 | Alex Fraser wrote: > .... snip ... > > I read somewhere on the web about a hash table implementation which > sought to reduce the pain of rebuilding. Unfortunately, I don't > remember how it worked and I can't find it now, but I wonder if it > could be applied to this situation.
One such is my hashlib, available on the download section of my page. It doesn't reduce, it eliminates.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!
|
|
 | | From: | Alex Fraser | | Subject: | Re: Hash-table versus B-Tree | | Date: | Sat, 11 Dec 2004 07:53:20 -0000 |
|
|
 | "CBFalconer" wrote in message news:41BA2802.7D1942B@yahoo.com... > Alex Fraser wrote: > ... snip ... > > > > I read somewhere on the web about a hash table implementation which > > sought to reduce the pain of rebuilding. Unfortunately, I don't > > remember how it worked and I can't find it now, but I wonder if it > > could be applied to this situation. > > One such is my hashlib, available on the download section of my > page. It doesn't reduce, it eliminates.
The pain I referred to was time complexity: rebuilding the table by adding each entry to a new hash table then discarding the old table (the obvious algorithm, as used in hashlib) is O(n).
If there are a large number of inserts and deletes (balanced, so the number of entries is constant in the long term) the table will typically become cluttered with deleted entries, and performance will suffer unless the table is (frequently) rebuilt.
If the number of entries is also large, this can result in a significant proportion of the time being used for rebuilding. Even if the average performance of operations is excellent, ie O(1), rebuilding will give some probably undesirable "bumps" in performance. These are the problems addressed by what I mentioned I read.
Alex
|
|
 | | From: | CBFalconer | | Subject: | Re: Hash-table versus B-Tree | | Date: | Sat, 11 Dec 2004 10:02:41 GMT |
|
|
 | Alex Fraser wrote: > "CBFalconer" wrote in message >> Alex Fraser wrote: > >> ... snip ... >>> >>> I read somewhere on the web about a hash table implementation which >>> sought to reduce the pain of rebuilding. Unfortunately, I don't >>> remember how it worked and I can't find it now, but I wonder if it >>> could be applied to this situation. >> >> One such is my hashlib, available on the download section of my >> page. It doesn't reduce, it eliminates. > > The pain I referred to was time complexity: rebuilding the table by > adding each entry to a new hash table then discarding the old table > (the obvious algorithm, as used in hashlib) is O(n).
However there are savings available in rebuilding versus inserting.
> > If there are a large number of inserts and deletes (balanced, so > the number of entries is constant in the long term) the table will > typically become cluttered with deleted entries, and performance > will suffer unless the table is (frequently) rebuilt.
Hashlib uses the same criterion, approximately 90% full, to trigger rebuilding. However if deletions can be removed and reach a different criterion, the table does not expand.
> > If the number of entries is also large, this can result in a > significant proportion of the time being used for rebuilding. Even > if the average performance of operations is excellent, ie O(1), > rebuilding will give some probably undesirable "bumps" in > performance. These are the problems addressed by what I mentioned > I read.
You would think so at first glance. While there are 'bumps' in the access times, they are much smaller than one would expect a-priori. At any rate, the system has built in instrumentation, so you can measure all these things out of the box.
-- Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net) Available for consulting/temporary embedded and systems. USE worldnet address!
|
|
|