
Algorithms or Patterns for reading text

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email the prices to our client each day, and of course the emails are all formatted differently. It is impossible to have any of the companies change their format - they will not do it.

Some look sort of like this:

    This is example text that could be many lines long...

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59

Others look sort of like this:

    PRODUCT       PRICE    + / -
    ------------  -------- -------
    Location 1
    1             2007.30 +048.20
    2             2022.50 +048.20

    Maybe some multiline text here about a holiday or something...

    Location 2
    1             2017.30 +048.20
    2             2032.50 +048.20

Currently we have an individual parser written for each company's email format. But these formats change slightly, and fairly often. We can't count on the prices being on the same row or column each time.

It's trivial for us to look at the emails and determine which price goes with which product at which location, but not so much for our code. So I'm trying to find a more flexible solution and would like your suggestions about what approaches to take. I'm open to anything from regex to neural networks. I'll learn whatever I need to make this work; I just don't know what I need to learn. Is this a lexing/parsing problem? Is it more similar to OCR?

The code doesn't have to figure out the formats all on its own. The emails fall into a few main 'styles' like the ones above. We really need the code to just be flexible enough that a new product line or whitespace or something doesn't make the file unparsable.

Thanks for any suggestions about where to start.

Scott Saunders asked Aug 07 '09 15:08


1 Answer

I think this problem is well suited to a proper parser generator. Regular expressions are difficult to test and debug when they go wrong. I would pick a parser library that is simple enough to use as if it were part of the language.

For this type of task I would go with pyparsing. It builds recursive-descent parsers directly in Python, so you get the power of a real parser without having to define a difficult grammar, and it has very good helper functions. The code is easy to read too.

from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums

aaa ="""    This is example text that could be many lines long...
             another line

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

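# Skip ahead to each "Location" header. In place of "Location" there could be
# any other kind of match, such as a regular expression.
result = (
    SkipTo("Location").suppress()
    + OneOrMore(Word(alphas) + Word(nums))  # labels: "Location 1", "Product 2"
    + OneOrMore(Word(nums + "$."))          # price tokens: "$20.99"
)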

all_results = OneOrMore(Group(result))

parsed = all_results.parseString(aaa)

for block in parsed:
    print(block)

This prints one list of tokens per location block:

['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']

You can group things as you want, but for simplicity I have just returned flat lists. Whitespace is ignored by default, which makes things a lot simpler.
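As a rough sketch of what that grouping could look like, the grammar below nests the product numbers and the prices in their own `Group`s and then pairs them up into a dictionary. The names `SAMPLE`, `block`, `all_blocks` and `parse_prices` are made up for this illustration, and it assumes that every block has a "Location N" header, a row of "Product N" labels and a matching row of dollar prices.

```python
from pyparsing import Group, OneOrMore, Regex, SkipTo, Suppress, Word, nums

# Hypothetical sample in the first email style from the question.
SAMPLE = """\
    Header text to ignore

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59
"""

# "Product 2" -> keep only the number; the keyword itself is suppressed.
product = Suppress("Product") + Word(nums)
# A dollar amount such as "$20.99".
price = Regex(r"\$\d+\.\d{2}")

# One block: skip free text, read the location number, then the product
# numbers and the prices, each collected in its own nested group.
block = Group(
    SkipTo("Location").suppress()
    + Suppress("Location")
    + Word(nums)
    + Group(OneOrMore(product))
    + Group(OneOrMore(price))
)
all_blocks = OneOrMore(block)


def parse_prices(text):
    """Return {location: {product: price}} for the column-style email."""
    prices = {}
    # parseString is the classic name; pyparsing 3 also offers parse_string.
    for loc, products, amounts in all_blocks.parseString(text):
        prices["Location " + loc] = {
            "Product " + p: float(a.lstrip("$"))
            for p, a in zip(products, amounts)
        }
    return prices


if __name__ == "__main__":
    for location, by_product in parse_prices(SAMPLE).items():
        print(location, by_product)
```

Because the products and prices are matched by position within each block, extra blank lines or a fourth product column should not break this, as long as the two rows stay the same length.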

I do not know if there are equivalents in other languages.
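The second email style in the question could probably be handled the same way. This is only a sketch under some assumptions: each data row is a product number, a price and a signed change, and the word "Location" does not appear in the free text between blocks. The names `SAMPLE`, `row`, `block` and `parse_table` are invented for the example.

```python
from pyparsing import Group, OneOrMore, Regex, SkipTo, Suppress, Word, nums

# Hypothetical sample in the second email style from the question.
SAMPLE = """\
    PRODUCT       PRICE    + / -
    ------------  -------- -------
    Location 1
    1             2007.30 +048.20
    2             2022.50 +048.20

    Maybe some multiline text here about a holiday or something...

    Location 2
    1             2017.30 +048.20
    2             2032.50 +048.20
"""

# One table row: product number, price, signed change.
product = Word(nums)
price = Regex(r"\d+\.\d{2}")
change = Regex(r"[+-]\d+\.\d{2}")
row = Group(product + price + change)

# Skip headers and free text, then read a "Location N" line and its rows.
block = Group(
    SkipTo("Location").suppress()
    + Suppress("Location")
    + Word(nums)
    + Group(OneOrMore(row))
)
all_blocks = OneOrMore(block)


def parse_table(text):
    """Return {location: {product: (price, change)}} for the table-style email."""
    result = {}
    for loc, rows in all_blocks.parseString(text):
        result["Location " + loc] = {
            prod: (float(p), float(c)) for prod, p, c in rows
        }
    return result


if __name__ == "__main__":
    for location, by_product in parse_table(SAMPLE).items():
        print(location, by_product)
```

The table header, the dashed rule and the holiday note are all skipped by `SkipTo`, so they can change or disappear without the grammar needing an update.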

David Raznick answered Oct 21 '22 01:10