|
|
 | | From: | Dylan | | Subject: | what is the correct data structure to use with this problem? | | Date: | Tue, 18 Jan 2005 08:43:48 +0000 |
|
|
 | I have an algorithm that must iterate through a rulebase as quickly as possible to eliminate all rules that are not applicable within the current context. Each rule consists of one or more boolean context queries (P1, P2, .... Pn) many of which are duplicated in other rules. The following is a very simple example of a rulebase containing 3 rules:
R1 = {P1, P5, P7} R2 = {P1, P4, P5, P9, P11} R3 = {P1, P2, P7, P11}
This is a very simple example (the rulebases I'm working with contain many more rules with hundreds or thousands of different context queries) but I hope it clarifies.
Each of the context queries is a function object that queries the world state and returns a Boolean value. Some of these queries contain lengthy calculations so I want to avoid duplication and make as few function calls as possible.
I'm thinking that some sort of tree can be utilized to reduce the amount of function call overhead but I cannot imagine how it would look.
I would like to know if I'm thinking along the correct lines and if so, what type of tree I should use. I'd also be interested in any other ideas you guys may have for representing this kind of rulebase in a way that allows it to be iterated through quickly.
Thanks for any help.
|
|
 | | From: | spinoza1111 at yahoo.com | | Subject: | Re: what is the correct data structure to use with this problem? | | Date: | 20 Jan 2005 19:17:30 -0800 |
|
|
 | It would be a structure containing data, a left node index and a right node index.
The top level would contain no data, a left node index either null or pointing to the first real node (the top of the tree) and a null right node. This design takes care of the need to represent an empty tree.
The traditional way to represent is an array of structures and if deletions are unnecessary this is OK.
But if deletes are necessary and you are in an OO environment, an array isn't needed. Instead create and delete objects representing nodes as the triplet (data, left node, right node). Here left node and right node are not numeric indexes as they are in arrays but object references (they are not "pointers").
Creating an empty tree:
tree = new node(null, null, null)
Adding a member to an empty tree:
tree.leftNode = new node(memberData, null, null)
Adding a left node to node:
node.leftNode = new node(memberData, null, null)
Deletions are a bit tricky since it's not enough to delete a node. This simply breaks all meaningful communication with the node's descendants resulting in clutter and eventual crash. Instead, you have to recursively delete the node and its descendants, going all the way to the outermost child, deleting him and then deleting his ancestors one after the other.
However, your problem doesn't seem to involve this.
The key in an OO environment is that you no longer are responsible for managing a memory space and in this memory space, simulating a tree...with all the nonsense that involves, of restricting the size of the tree to a hard coded constant and pissing me off when your software does not scale. The OO environment allows you to make each node including the top a proxy for the rest of the structure, whose management is responsibility of the OO heap manager.
I know many computer science profs still want you to emulate a tree as an array but this is nonsense.
It may be necessary to have an n-way and not binary tree:
new node(data, null, null ... null)
That's cool. Any such tree can be proven equivalent to a binary tree by adding dummy nodes but in actual practice it seems to be better to use an n-way tree. Sure, the nodes are an array, but it is a fixed array, and, in typical environments, a fixed array of four byte unsigned pointers.
Of course, in your problem, it might be a variable length array and here this could be problematic. The solution is to make the tree node structure two way.
Any time you have to add node 3, you instead set the right node to a new node whose left node node 2 and whose left node is node 3, and set the right node of its parent to the new node. The data at the new node becomes something amusing such as Nothing, null, 0, or whussup.
|
|
 | | From: | spinoza1111 at yahoo.com | | Subject: | Re: what is the correct data structure to use with this problem? | | Date: | 19 Jan 2005 04:23:42 -0800 |
|
|
 | Dylan, the simplest idea would be to have a lookaside table accessed on a hash or direct basis by the unique number or name of each rule. Each entry would have in it true, false, or not evaluated as the value of query Px.
In the example you present, P1, P5 and P7 would all be evaluated in evaluating R1. I assume from the implied logic of what you present that IFF at least one is False, you proceed to the next rule. A further optimization would be to abandon the evaluation "lazily" at the first false query and move to the next, retaining the fact that the false rule is, well, false.
Suppose P1 is True but P5 is False. P7 remains Unknown but P5 is in the lookaside as False. You save time in each rule evaluating P1 assuming your direct or hash table algorithm is efficient and also in rule 2.
This without an elaborate "tree".
A further refinement would be to "optimize" the rules by sorting them. Sort them column-wise from "easy" to "hard" and sort them rowwise by length...the number of queries in a row, which, I fancy, would correspond to the difficulty of the row. Or, if you like, assign each row a weight, the sum of the hardness of each query.
Of course, this would require a separate data base giving the complexity of each query UNLESS the "source code" of each query is available at run time.
Of course, if the query values CHANGE, my lookaside table is useless. But if they can CHANGE you are in deep doo doo anyway for you have to go to the user, and say, Mister User, shall I use the first value, the last value, or what?
No, it's probably the case that you can evaluate each query once and record its Boolean value in a hash table keyed on query name, or a direct access table keyed on query number.
A tree, properly constructed, would give you the theoretical maximum performance. A lookaside would give you good enough for government work performance.
Dylan wrote: > I have an algorithm that must iterate through a rulebase as quickly as > possible to eliminate all rules that are not applicable within the > current context. Each rule consists of one or more boolean context > queries (P1, P2, .... Pn) many of which are duplicated in other rules. > The following is a very simple example of a rulebase containing 3 > rules: > > R1 = {P1, P5, P7} > R2 = {P1, P4, P5, P9, P11} > R3 = {P1, P2, P7, P11} > > This is a very simple example (the rulebases I'm working with contain > many more rules with hundreds or thousands of different context > queries) but I hope it clarifies. > > Each of the context queries is a function object that queries the > world state and returns a Boolean value. Some of these queries contain > lengthy calculations so I want to avoid duplication and make as few > function calls as possible. > > I'm thinking that some sort of tree can be utilized to reduce the > amount of function call overhead but I cannot imagine how it would > look. > > I would like to know if I'm thinking along the correct lines and if > so, what type of tree I should use. I'd also be interested in any > other ideas you guys may have for representing this kind of rulebase > in a way that allows it to be iterated through quickly. > > Thanks for any help.
|
|
 | | From: | Dylan | | Subject: | Re: what is the correct data structure to use with this problem? | | Date: | Wed, 19 Jan 2005 16:12:55 +0000 |
|
|
 | Hi Spinoza,
Thanks for your input. Most of your suggestions are ones I've already considered and some are already in place (such as complexity rating and lazy evaluation).
I really do need this to be as fast as possible, which is why I think a tree is best... I'm just stumped with how the tree should look...
On 19 Jan 2005 04:23:42 -0800, spinoza1111@yahoo.com wrote:
>Dylan, the simplest idea would be to have a lookaside table accessed on >a hash or direct basis by the unique number or name of each rule. Each >entry would have in it true, false, or not evaluated as the value of >query Px. > >In the example you present, P1, P5 and P7 would all be evaluated in >evaluating R1. I assume from the implied logic of what you present that >IFF at least one is False, you proceed to the next rule. A further >optimization would be to abandon the evaluation "lazily" at the first >false query and move to the next, retaining the fact that the false >rule is, well, false. > >Suppose P1 is True but P5 is False. P7 remains Unknown but P5 is in the >lookaside as False. You save time in each rule evaluating P1 assuming >your direct or hash table algorithm is efficient and also in rule 2. > >This without an elaborate "tree". > >A further refinement would be to "optimize" the rules by sorting them. >Sort them column-wise from "easy" to "hard" and sort them rowwise by >length...the number of queries in a row, which, I fancy, would >correspond to the difficulty of the row. Or, if you like, assign each >row a weight, the sum of the hardness of each query. > >Of course, this would require a separate data base giving the >complexity of each query UNLESS the "source code" of each query is >available at run time. > >Of course, if the query values CHANGE, my lookaside table is useless. >But if they can CHANGE you are in deep doo doo anyway for you have to >go to the user, and say, Mister User, shall I use the first value, the >last value, or what? > >No, it's probably the case that you can evaluate each query once and >record its Boolean value in a hash table keyed on query name, or a >direct access table keyed on query number. > >A tree, properly constructed, would give you the theoretical maximum >performance. A lookaside would give you good enough for government work >performance. > >Dylan wrote: >> I have an algorithm that must iterate through a rulebase as quickly >as >> possible to eliminate all rules that are not applicable within the >> current context. Each rule consists of one or more boolean context >> queries (P1, P2, .... Pn) many of which are duplicated in other >rules. >> The following is a very simple example of a rulebase containing 3 >> rules: >> >> R1 = {P1, P5, P7} >> R2 = {P1, P4, P5, P9, P11} >> R3 = {P1, P2, P7, P11} >> >> This is a very simple example (the rulebases I'm working with contain >> many more rules with hundreds or thousands of different context >> queries) but I hope it clarifies. >> >> Each of the context queries is a function object that queries the >> world state and returns a Boolean value. Some of these queries >contain >> lengthy calculations so I want to avoid duplication and make as few >> function calls as possible. >> >> I'm thinking that some sort of tree can be utilized to reduce the >> amount of function call overhead but I cannot imagine how it would >> look. >> >> I would like to know if I'm thinking along the correct lines and if >> so, what type of tree I should use. I'd also be interested in any >> other ideas you guys may have for representing this kind of rulebase >> in a way that allows it to be iterated through quickly. >> >> Thanks for any help.
|
|
 | | From: | Casey Hawthorne | | Subject: | Re: what is the correct data structure to use with this problem? | | Date: | Wed, 19 Jan 2005 08:11:04 GMT |
|
|
 | Isn't "parsing" of rule-bases what Prolog is good at?
There are compiled versions of Prolog for faster results.
-- Regards, Casey
|
|
 | | From: | Dylan | | Subject: | Re: what is the correct data structure to use with this problem? | | Date: | Wed, 19 Jan 2005 09:57:07 +0000 |
|
|
 | On Wed, 19 Jan 2005 08:11:04 GMT, Casey Hawthorne wrote:
>Isn't "parsing" of rule-bases what Prolog is good at? > >There are compiled versions of Prolog for faster results.
I have no idea Casey and I don't have the time to learn.
|
|
|