
Full text searching XML data with Python: best practices, pros & cons

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

  • search nested elements (infinite depth)
  • search attributes
  • search for sentences and paragraphs
  • search using wildcards
  • search using fuzzy matching
  • return precise matching info
  • good search speed for large XML files

Question

I don't expect a single solution with all of the ideal functionality; I'll have to combine different existing pieces and code the rest myself. But first I would like to know what is out there: which libraries and approaches you would usually use for this, and what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

lecodesportif asked Apr 26 '11 13:04



4 Answers

I'm not sure whether this will be enough for your needs, but lxml supports regular expressions in XPath: you can use XPath 1.0 plus the EXSLT extension functions for regular expressions.
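As a rough sketch of what that looks like with lxml (the sample data is the one from the question; the helper name `regex_search` and the pattern are only illustrative assumptions):

```python
from lxml import etree

XML = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>"""

# Namespace URI of the EXSLT regular-expression functions that lxml ships with.
EXSLT_RE = "http://exslt.org/regular-expressions"

# re:test(., pattern, flags) is evaluated against the string value of each
# element, which includes the text of all of its descendants.
FIND = etree.XPath(
    "/elements/elem/descendant-or-self::*[re:test(., $pattern, 'i')]",
    namespaces={"re": EXSLT_RE},
)


def regex_search(root, pattern):
    """Return (xpath, id attribute) for every element matching the regex."""
    tree = root.getroottree()
    return [(tree.getpath(el), el.get("id")) for el in FIND(root, pattern=pattern)]


if __name__ == "__main__":
    for path, elem_id in regex_search(etree.fromstring(XML), "oth.r"):
        print(path, elem_id)
```

Because the test runs against the full string value, an ancestor of a matching element (here elem 3) is reported along with the nested element itself. If only the innermost match is wanted, the result needs an extra filtering step.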

Compared to the feature list that was added later:

  • search nested elements (infinite depth): yes
  • search attributes: yes
  • search for sentences and paragraphs: partly. If "paragraphs" are actual XML elements, then yes. But "sentences" as such, no.
  • search using wildcards: yes (regular expressions)
  • search using fuzzy matching: no (assuming stemming, synonyms and so on...)
  • return precise matching info: yes
  • good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.
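The "fuzzy matching: no" item above is about stemming and synonyms. A much cruder, similarity-based approximation is possible with the standard library's difflib. This is only a sketch under my own assumptions (the function name `fuzzy_search`, the word tokenisation and the 0.8 cutoff are all arbitrary choices), and it is not a substitute for a real full-text engine:

```python
import difflib
import re
import xml.etree.ElementTree as ET

XML = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>"""


def own_text(elem):
    """Text that belongs directly to elem, excluding text inside child elements."""
    return (elem.text or "") + "".join(child.tail or "" for child in elem)


def fuzzy_search(root, word, cutoff=0.8):
    """Return (tag, id, matched word) for elements with a word similar to `word`."""
    hits = []
    for elem in root.iter():
        tokens = re.findall(r"\w+", own_text(elem).lower())
        close = difflib.get_close_matches(word.lower(), tokens, n=1, cutoff=cutoff)
        if close:
            hits.append((elem.tag, elem.get("id"), close[0]))
    return hits


if __name__ == "__main__":
    # "othr" is a deliberate misspelling of "other".
    print(fuzzy_search(ET.fromstring(XML), "othr"))
```

This compares every word of every element on each query, so it only makes sense for small files.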

The only way I see to satisfy all of your requests would be to load your files into a native XML database that supports "real" full-text search (probably via XQuery Full Text) and use that. I can't help you much further with that; maybe Sedna, which seems to have a Python API and seems to support full-text search?

Steven answered Oct 23 '22 11:10


I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/

What you can do is store a "document" in Solr for each <elem /> in your XML. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.
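To illustrate the idea of "one document per <elem />, then look up ids in an index", here is a toy in-memory inverted index in plain Python. It is not Solr and does not use Solr's API; the function names `build_index` and `search` are made up for this sketch:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

XML = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>"""


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(root):
    """Map each word to the set of <elem> ids whose text (nested text included) contains it."""
    index = defaultdict(set)
    for elem in root.iter("elem"):
        for token in tokenize("".join(elem.itertext())):
            index[token].add(elem.get("id"))
    return index


def search(index, query):
    """Return the ids of the documents that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


if __name__ == "__main__":
    index = build_index(ET.fromstring(XML))
    print(sorted(search(index, "other")))
    print(sorted(search(index, "nested element")))
```

The index is built once and each query is then a few dictionary lookups, which is the reason an indexed engine stays fast as the data grows. A real engine adds analysis (stemming, stop words), ranking and persistence on top of this.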

Max answered Oct 23 '22 13:10


select="/elements/elem/descendant-or-self::*[contains(., 'other')]"

See also: "xpath: find a node that has a given attribute whose value contains a string"
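Note that contains(., ...) is evaluated against an element's whole string value, descendants included, so the parent of a matching nested element is reported as well. The standard library's ElementTree only implements a subset of XPath and has no contains(), but a plain substring search over each element's own text can be sketched as follows (the helper name `find_text` and the way the paths are built are my own assumptions):

```python
import xml.etree.ElementTree as ET

XML = """<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>"""


def find_text(root, needle):
    """Yield (path, id attribute) for each element whose own text contains needle."""

    def walk(elem, path):
        # Own text = text before the first child plus the tails after each child.
        own = (elem.text or "") + "".join(child.tail or "" for child in elem)
        if needle in own:
            yield path, elem.get("id")
        counts = {}  # per-tag position counter, used to build indexed path steps
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    yield from walk(root, f"/{root.tag}")


if __name__ == "__main__":
    for path, elem_id in find_text(ET.fromstring(XML), "other"):
        print(path, elem_id)
```

For the sample data this reports elem 2 and the nested element inside elem 3, which is the basic functionality asked for in the question.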

vartec answered Oct 23 '22 11:10


I'd recommend the following two:

Use XPath 2.0. It supports regular expressions.

Or,

Use XQuery and XPath (2.0) Full Text, which has even more powerful features.

Dimitre Novatchev answered Oct 23 '22 12:10