Processing malformed text data with machine learning or NLP

I'm trying to extract data from a few large text files containing entries about people. The problem is that I cannot control the format in which the data arrives.

It is usually in a format like this:

LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012

Firstname Lastname 2001 Some text that I don't care about

Lastname, Firstname blah blah ... January 25, 2012 ...

Currently, I am using a huge regex that splits all kindaCamelcase words and all words that have a month name tacked onto the end, plus a lot of special cases for names. Then I use more regexes to extract the many possible combinations of name and date.
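
To give a rough idea of the splitting step, here is a much-simplified sketch (not the actual regex; the `MONTHS` list and the `split_fused` name are made up for this example):

```python
import re

# Hypothetical month list for illustration only.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# A lowercase letter or ')' directly followed by an uppercase letter
# marks a spot where two words were run together.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z)])(?=[A-Z])")

# A month name glued onto the end of the preceding word.
MONTH_BOUNDARY = re.compile(r"(?<=[A-Za-z)])(?=(?:%s)\b)" % "|".join(MONTHS))


def split_fused(text):
    """Insert a space at each fused-word boundary."""
    text = MONTH_BOUNDARY.sub(" ", text)
    text = CAMEL_BOUNDARY.sub(" ", text)
    return text


print(split_fused("(Maybe a Nickname)Why is this text hereJanuary, 25, 2012"))
```

Even this small version shows the weakness: a legitimate name such as "McDonald" would be split into "Mc Donald", which is where the special cases start piling up.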

This seems sub-optimal.

Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?

I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.

Ideally, I'd like to do something like this to train a parser (with many input/output pairs):

training_data = (
    'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
    ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)

Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.


asked Jan 25 '12 21:01

Blender

1 Answer

I ended up implementing a fairly complicated series of regexes that covered every case I ran into. The patterns were written with text-based "filters" (placeholders such as $date) that were substituted with the appropriate regexes when the parser loaded.

If anyone's interested in the code, I'll edit it into this answer.

Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:

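class Replacer(object):
    """Callable passed to re.sub(); turns a $placeholder into a regex group."""

    def __call__(self, match):
        group = match.group(0)  # the full placeholder, e.g. '$LASTNAME'

        # Every Matcher regex starts with '(' and that first character is
        # dropped here so a different group opener can be prepended.
        if group[1:].lower().endswith('_nm'):
            # Placeholders ending in '_nm' become non-capturing groups.
            return '(?:' + Matcher(group).regex[1:]
        else:
            # Everything else becomes a named capturing group.
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]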

Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:

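# Assumed definitions: the date pattern below uses these two lists,
# but the original snippet did not show them.
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
months_short = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
                'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

class Matcher(object):
    # Each regex must start with '(' because Replacer replaces that first
    # character with its own group opener.
    # Note: '|' is not needed inside a character class (it would match a
    # literal pipe), so it has been removed.
    name_component = r"([A-Z][A-Za-z'\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z'\-]+|[A-Z]{2,})"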

    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year

    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age

    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal

    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

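    def __init__(self, match=''):
        # An all-caps placeholder such as $LASTNAME selects the *_upper variant.
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]  # drop the leading '$'

        if match.endswith('_instant'):
            match = match[:-8]

        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            # Assumption: the body of this branch was missing from the post.
            # Matching a single capital letter (an initial such as $I) is a guess.
            self.regex = r'([A-Z])'
        elif 'year' in match:
            self.regex = self.year
        else:
            # Anything unrecognized is treated as a name component.
            self.regex = getattr(self, 'name_component' + capitalized)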
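
As a quick way to see what the date pattern accepts, it can be tried in isolation. This is only a sketch: the month lists are assumed (the snippet above does not show them), and `find_date` is a helper name invented for this check.

```python
import re

# Assumed month lists; not shown in the original snippet.
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
months_short = [name[:3] for name in months]

# Mixed-case names followed by their upper-case versions.
names = '|'.join(months + months_short)
names = names + '|' + names.upper()

# Same three alternatives as Matcher.date, split across lines for readability:
#   1. month day, year      2. day month year      3. numeric d/m/y or d-m-y
date = (
    r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}'
    r'|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}'
    r'|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'
).format(names)


def find_date(text):
    """Return the first date-like substring, or None if there is none."""
    found = re.search(date, text)
    return found.group(1) if found else None


for sample in ['blah blah ... January 25, 2012 ...',
               '12 JAN 2012',
               '1/25/12',
               'hereJanuary, 25, 2012']:
    print(repr(sample), '->', find_date(sample))
```

The last sample returns None because the pattern does not allow a comma directly after the month name, so inputs like that need another alternative or a cleanup pass first.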

Finally, there's the generic Pattern object:

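import re

class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []

        # A leading '!' marks the text as a raw regex; otherwise regex
        # metacharacters are escaped. An explicit `escape` argument wins.
        if escape is None:
            escape = not self.text.startswith('!')

        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            # Drop the '!' marker if it is present.
            self.regex = self.text[1:] if self.text.startswith('!') else self.text

        # Number of $placeholders in the pattern.
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))

        # Expand each $placeholder into a regex group, then let any run of
        # whitespace in the pattern match any run of whitespace in the input.
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        # Only keep matches that start within the first `max_depth`
        # fraction of the text.
        results = []
        length = float(len(text))

        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())

        return results

    def match(self, text):
        # Build a real list: on Python 3, map() returns a lazy object that is
        # always truthy, so the old emptiness check never fired.
        return [(m.groupdict(), m.start()) for m in re.finditer(self.regex, text)]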

It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a pattern like this:

$LASTNAME, $FirstName $I. said on $date

into a compiled regex with named capturing groups.
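
For anyone who just wants the gist, here is a stripped-down, self-contained sketch of the same idea. It is not the code above: the `FILTERS` table, `pick_filter` and `compile_template` are names invented for this example, and the filter regexes are deliberately much simpler than the real ones.

```python
import re

# Simplified stand-ins for the Matcher patterns (assumptions for this sketch).
FILTERS = {
    'upper_name': r"[A-Z][A-Z'\-]+",
    'name': r"[A-Z][A-Za-z'\-]+",
    'initial': r"[A-Z]",
    'date': r"[A-Z][a-z]+\.? [0-9]{1,2},? [0-9]{4}",
}


def pick_filter(name):
    """Choose a filter regex from the placeholder's name and capitalization."""
    if name.lower() == 'date':
        return FILTERS['date']
    if len(name) == 1:
        return FILTERS['initial']
    if name.isupper():
        return FILTERS['upper_name']
    return FILTERS['name']


def compile_template(template):
    # Escape regex metacharacters in the literal text; '$' is left alone.
    escaped = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', template)
    # Replace each $placeholder with a named capturing group.
    pattern = re.sub(
        r'\$([A-Za-z0-9_]+)',
        lambda m: '(?P<%s>%s)' % (m.group(1), pick_filter(m.group(1))),
        escaped,
    )
    # Any whitespace run in the template matches any whitespace run in the input.
    pattern = re.sub(r'\s+', r'\\s+', pattern)
    return re.compile(pattern)


regex = compile_template('$LASTNAME, $FirstName $I. said on $date')
match = regex.search('SMITH, John Q. said on January 25, 2012')
print(match.groupdict())
```

The point of the design is that the template stays readable while the ugly regex fragments live in one table, so supporting a new input layout usually means writing another one-line template rather than another regex.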


answered Oct 14 '22 08:10

Blender


Tags: python, parsing, machine-learning, nlp
