I am trying to implement a unranked boolean retrieval. For this, I need to construct a tree and perform a DFS to retrieve documents. I have the leaf nodes but I am having difficulty to construct the tree.
Eg: query = OR ( AND (maria sharapova) tennis)
Result:
OR | | AND tennis | | maria sharapova
I traverse the tree using DFS and calculate the boolean equivalent of certain document ids to identify the required document from the corpus. Can someone help me with the design of this using python? I have parsed the query and retrieved the leaf nodes for now.
EDIT: I am new here, so apologies for lacking clarity. I am basically trying to build a very naive search engine. So, the user enters any boolean query like: OR ( AND (maria sharapova) tennis). I have a corpus of wikipedia documents that gets displayed to the user depending on the query you type.
Till now, I have parsed the query to retrieve individual operators (like OR, AND, etc). And, the individual search terms(maria, tennis, etc). The code for parsing is just a function that would basically group all the operators and query terms as typed. i.e (maria sharapova), (tennis), OR, AND. I parsed this function this way so as to create a tree bottom-up. Now, using the inverted lists for the corresponding keywords like tennis, maria, sharapova, etc I perform the boolean operation with the inverted list to get a certain "documentid". This documentid is then passed to an API which would then retrieve the correct wikipedia page.
Just to explain the topic in more detail, please refer to this document for more information about my problem in hand: http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/boolean.pdf
To insert into a tree we use the same node class created above and add a insert class to it. The insert class compares the value of the node to the parent node and decides to add it as a left node or a right node. Finally the PrintTree class is used to print the tree.
The root of the tree (5) is on top. Python does not have built-in support for trees.
Trees are non-linear data structures that represent nodes connected by edges. Each tree consists of a root node as the Parent node, and the left node and right node as Child nodes.
To create a tree in Python, we first have to start by creating a Node class that will represent a single node. This Node class will contain 3 variables; the first is the left pointing to the left child, the second variable data containing the value for that node, and the right variable pointing to the right child.
First if you want a fancy syntax of your query language to support many operators, range queries or wildcard, you definitely should refer to lex/yacc solution as Joran pointed out.
Second, from the lecture slides you posted, I think you care more about how to implement the boolean query model than constructing a tree in python. Then you don't need to worry about the query itself. Suppose the query is well formatted as below:
"OR ( AND ( maria sharapova ) tennis )"
That is, you have space between operator (AND/OR) and keywords/parenthesis. Then you only need two stacks (without using DFS on tree-data-structure) to parse the query and get the combined search results from them.
The first stack holds the operators (AND/OR) and the operands (e.g., maria, tennis). You treat the parenthesis as open/close condition to process the current operands on top of the stack. You only process the search operation when you see a close parenthesis )
.
The second stack holds the current search results.
Let's do a step-by-step demo using the above example. You scan the query from left to right.
Step 1. You push the "OR" operator into the stack.
+ +
+ +
+ OR +
+ + + + + + + + +
Step 2. You see an open parenthesis (
, just skip it.
Step 3. You push the "AND" operator into your stack. Now the stack looks like below:
+ +
+ AND +
+ OR +
+ + + + + + + + +
Step 4. You skip another (
.
Step 5. You push "maria" to your stack.
Step 6. You push "sharapova" to your stack. Now the stack looks like below:
+ sharapova +
+ maria +
+ AND +
+ OR +
+ + + + + + + + +
Step 7. You see a close parenthesis )
. Now it's time to do the first operation. You pop all items on top of the stack until you see an operator. Pop the operator as well to get the current operator. Now you process the search for "sharapova" and "maria" separately and combine the search results using the operator "AND". Assume for "maria", you get 3 doc ids: [1, 2, 3]
. For "sharapova", you get another 5 doc ids: [2, 3, 8, 9, 10]
. After you combine the results with "AND", you have [2,3]
in the
second stack which holds the current search results. The current situation looks like below: on the right it is the result buffer.
+ + + +
+ + + +
+ + + +
+ OR + + [2,3] +
+ + + + + + + + + + + + + + +
Step 8. You push tennis to the stack.
+ + + +
+ + + +
+ tennis + + +
+ OR + + [2,3] +
+ + + + + + + + + + + + + + +
Step 9. You see another close parenthesis )
. Again, you pop all items on top of the stack until you see "OR". You start search using "tennis" and suppose you get resulting doc ids: [3, 5, 7]
. At this time, you combine this result with the previous results in your buffer using operator "OR", so that finally get doc ids: [2,3,5,7]
.
My sample code is here. Note I simulate the searching and returning doc ids by randomly sample len(word)
integers.
The printout from the code shows step-by-step, how the system looks like before processing the current query item (1st column), the status of the result buffer(2nd column), the items in stack (3rd column) and the immediate search result (4th column).
A list of lists is a natural way to represent trees in Python (without creating classes):
>>> query = ['OR', ['AND', 'maria', 'sharapova'], 'tennis']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With