Task
I want to use Python for doing full text searches of XML data.
Example data
<elements>
<elem id="1">some element</elem>
<elem id="2">some other element</elem>
<elem id="3">some element
<nested id="1">
other nested element
</nested>
</elem>
</elements>
Basic functionality
The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.
Ideal functionality
The solution should be flexible and scalable. I am looking for possible combinations of these features:
Question
I don't expect a solution with all of the ideal functionality; I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what is out there: which libraries and approaches you would usually use for this, and what their pros and cons are.
EDIT: Thanks for the answers so far, I added detail and started a bounty.
XML's hierarchical structure makes it easy to apply the concept of object interfaces to documents—it's quite simple to build application-specific objects directly from the information stream, given mappings from element names to object types.
To read an XML file using ElementTree, first import the xml.etree.ElementTree module under the name ET (a common convention). Then pass the filename of the XML file to ET.parse() to parse it, and call getroot() on the returned tree to get the root (top-level) element.
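As a minimal sketch of those steps, the snippet below parses the example data from the question. A StringIO object stands in for a file on disk (an assumption made only to keep the example self-contained); ET.parse() accepts a filename in the same position.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# The example data from the question.
SAMPLE = """<elements>
<elem id="1">some element</elem>
<elem id="2">some other element</elem>
<elem id="3">some element
<nested id="1">
other nested element
</nested>
</elem>
</elements>"""

# ET.parse() takes a filename or a file-like object; StringIO plays the file here.
tree = ET.parse(StringIO(SAMPLE))
root = tree.getroot()

# Iterating over an Element yields its direct children only.
child_ids = [child.get("id") for child in root]
print(root.tag, child_ids)
```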
To iterate over all nodes, use the iter method. ElementTree.iter() walks the whole tree starting at the root; calling iter() on an Element walks that element and all of its descendants, so root.iter() yields the same nodes. The root is an Element like any other in the tree and only knows its own attributes and children; the ElementTree object wraps it and represents the document as a whole.
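Applied to the question's search for "other", iteration might look like the sketch below. It only checks each element's own leading text (the text attribute), so text that follows a child element (stored in tail) is ignored; that is a simplification, not a complete full-text search.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<elements>
<elem id="1">some element</elem>
<elem id="2">some other element</elem>
<elem id="3">some element
<nested id="1">
other nested element
</nested>
</elem>
</elements>"""

tree = ET.ElementTree(ET.fromstring(SAMPLE))

# tree.iter() visits every element in document order, root included.
# el.text may be None for elements without leading text, hence the "or".
matches = [
    (el.tag, el.get("id"))
    for el in tree.iter()
    if "other" in (el.text or "")
]
print(matches)
```

Because the check uses text only, elem 3 itself is not reported; its nested child is. Use "".join(el.itertext()) instead if an ancestor should also count as a match.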
Not sure if that will be enough for your needs, but lxml supports regular expressions in XPath (meaning: you can use XPath 1.0 plus the EXSLT extension functions for regular expressions).
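A rough sketch of the lxml approach follows. It assumes lxml is installed (it is a third-party package); the function name regex_search and the exact XPath expression are illustrative choices, not something taken from this thread. The re prefix is bound to the EXSLT regular-expressions namespace and the pattern is passed in as an XPath variable.

```python
try:
    from lxml import etree  # third-party; assumed to be installed
except ImportError:
    etree = None

EXSLT_RE = "http://exslt.org/regular-expressions"

SAMPLE = """<elements>
<elem id="1">some element</elem>
<elem id="2">some other element</elem>
<elem id="3">some element
<nested id="1">
other nested element
</nested>
</elem>
</elements>"""


def regex_search(xml_text, pattern):
    """Return (xpath, id) for elements with a text node matching pattern."""
    if etree is None:
        raise RuntimeError("lxml is not installed")
    root = etree.fromstring(xml_text)
    tree = root.getroottree()
    # Select each elem and each of its descendants that has at least one
    # directly owned text node matching the regular expression.
    hits = root.xpath(
        "/elements/elem/descendant-or-self::*[text()[re:test(., $pattern)]]",
        namespaces={"re": EXSLT_RE},
        pattern=pattern,
    )
    # getpath() builds an absolute XPath for each matching element.
    return [(tree.getpath(el), el.get("id")) for el in hits]


if etree is not None:
    print(regex_search(SAMPLE, r"\bother\b"))
```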
Compared to the feature list that was added later:
The only way I can see to satisfy all your requests would be to load your files into a native XML database that supports "real" full-text search (probably via XQuery Full Text) and use that. (I can't help you much further with that; maybe Sedna, which seems to have a Python API and seems to support full-text search?)
I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/
What you can do is store a "document" in Solr for each <elem /> in your XML. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.
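To illustrate the idea of one indexed document per elem, here is a toy in-memory inverted index in plain Python. It is not Solr and uses no Solr API; build_index is a hypothetical helper that only shows the shape of "index the text, search the index, read back the id".

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<elements>
<elem id="1">some element</elem>
<elem id="2">some other element</elem>
<elem id="3">some element
<nested id="1">
other nested element
</nested>
</elem>
</elements>"""


def build_index(xml_text):
    """Map each lowercased word to the ids of the <elem> documents containing it."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for elem in root.findall("elem"):
        # itertext() yields the element's own text plus that of all descendants,
        # so nested content is indexed under the enclosing elem's id.
        text = " ".join(elem.itertext())
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(elem.get("id"))
    return index


index = build_index(SAMPLE)
print(sorted(index["other"]))  # ['2', '3']
```

A real search engine adds tokenization rules, stemming, ranking, and persistence on top of this basic structure.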
select="/elements/elem//*[contains(.,'other')]"
see also xpath: find a node that has a given attribute whose value contains a string
I'd recommend the following two:
Use XPath 2.0. It supports regular expressions.
Or,
Use XQuery and XPath (2.0) Full Text, which has even more powerful features.