Full text searching XML data with Python: best practices, pros & cons

Tags:

Task

I want to use Python for doing full text searches of XML data.

Example data

<elements>
  <elem id="1">some element</elem>
  <elem id="2">some other element</elem>
  <elem id="3">some element
    <nested id="1">
    other nested element
    </nested>
  </elem>
</elements>

Basic functionality

The most basic functionality I want is that a search for "other" in an XPath ("/elements/elem") returns at least the value of the ID attribute for the matching element (elem 2) and nested element (elem 3, nested 1) or the matching XPaths.

Ideal functionality

The solution should be flexible and scalable. I am looking for possible combinations of these features:

search nested elements (infinite depth)
search attributes
search for sentences and paragraphs
search using wildcards
search using fuzzy matching
return precise matching info
good search speed for large XML files

Question

I don't expect a solution with all of the ideal functionality, I'll have to combine different existing functionalities and code the rest myself. But first I would like to know more about what there is out there, which libraries and approaches you would usually use for this, what their pros and cons are.

EDIT: Thanks for the answers so far, I added detail and started a bounty.

713

asked Apr 26 '11 13:04

lecodesportif

4 Answers

Not sure if that will be enough for your needs, but lxml has support for regular expressions in xpath (meaning: you can use xpath 1.0 plus the EXSLT extension functions for regular expressions)

Compared to the feature list that was added later:

search nested elements (infinite depth): yes
search attributes: yes
search for sentences and paragraphs: no. Assuming that "paragraphs" are actual xml elements, then yes. But "sentences" as such, no.
search using wildcards: yes (regular expressions)
search using fuzzy matching: no (assuming stemming, synonyms and so on...)
return precise matching info: yes
good search speed for large XML files: yes, except when your files are so extremely large that you would actually need a fulltext index to get good speed anyway.

The only way to satisfy all your request that I see, would be to load your files into a native xml database that supports "real" fulltext search (via XQuery Fulltext probably) and use that. (can't help you much further with that, maybe Sedna, which seems to have a python API and seems to supports fulltext search?)

145

answered Oct 23 '22 11:10

Steven

I think you would be best served using a full text search engine like Solr: http://lucene.apache.org/solr/

What you can do is store a "document" in solr for each <elem /> in your xml. You can store any data you like in the document. Then you can search against the index and grab the id field stored in the matching documents. This will be very fast for a large set of documents.

answered Oct 23 '22 13:10

Max

select="/elements/elem//[contains(.,'other')]"

see also xpath: find a node that has a given attribute whose value contains a string

answered Oct 23 '22 11:10

vartec

I'd recommend the following two:

Use XPath 2.0. It supports regular expressions.

Or,

Use XQuery and XPath (2.0) Full Text, which has even more powerful features.

answered Oct 23 '22 12:10

Dimitre Novatchev

Related questions
                            
                                ImportError: cannot import name 'app' from 'mypackage' (unknown location)
                            
                                Confusion about URI path to configure SQLite database
                            
                                ReadTimeout: HTTPSConnectionPool(host='', port=443): Read timed out. (read timeout=10)
                            
                                Why does setting a descriptor on a class overwrite the descriptor?
                            
                                Django — async_to_sync vs asyncio.run
                            
                                Django - Annotate multiple fields from a Subquery
                            
                                Error Installing streamlit It says "ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly"
                            
                                When will most libraries be Python 3 compliant? [closed]
                            
                                Credit card payments and notifications on the Google App Engine
                            
                                How do I make text wrapping match current indentation level in vim?
                            
                                Python OCR library or handwritten character recognition engine [closed]
                            
                                Python: Separating the GUI process from the core logic process
                            
                                What is returned by wave.readframes?
                            
                                Subclass builtin List
                            
                                Best resources for learning PyGame? [closed]
                            
                                Google App Engine or Django? [closed]
                            
                                Python finding substring between certain characters using regex and replace()
                            
                                Edit configuration file through python
                            
                                Architecture for providing different linear algebra back-ends
                            
                                Python: How can I check the number of pending tasks in a multiprocessing.Pool?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Full text searching XML data with Python: best practices, pros & cons

Tags:

python

search

xml

full-text-search

xpath