
Processing malformed text data with machine learning or NLP

I'm trying to extract data from a few large text files containing entries about people. The problem is that I cannot control the format in which the data comes to me.

It is usually in a format like this:

LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012

Firstname Lastname 2001 Some text that I don't care about

Lastname, Firstname blah blah ... January 25, 2012 ...

Currently, I am using a huge regex that splits words that have run together in kindaCamelcase, splits off month names tacked onto the end of a word, and handles a lot of special cases for names. Then I use more regexes to extract the many possible combinations of name and date.
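
To make the splitting step concrete, here is a minimal sketch of what it might look like. It is not the regex from the question (which is not shown); `MONTHS`, `CAMEL_BOUNDARY`, `FUSED_MONTH` and `split_fused` are hypothetical names, and the rules are deliberately naive.

```python
import re

# Assumption: English month names, title-cased, as in the sample lines.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Zero-width boundary: a lowercase letter or ')' directly followed by an uppercase letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z)])(?=[A-Z])")

# Zero-width boundary: any letter or ')' directly followed by a whole month name.
FUSED_MONTH = re.compile(r"(?<=[A-Za-z)])(?=(?:%s)\b)" % "|".join(MONTHS))


def split_fused(text):
    """Insert a space wherever two tokens appear to have been glued together."""
    text = FUSED_MONTH.sub(" ", text)
    return CAMEL_BOUNDARY.sub(" ", text)


cleaned = split_fused("LASTNAME, Firstname (Maybe a Nickname)Why is this text hereJanuary, 25, 2012")
```

A rule this simple also splits legitimate names such as "McDonald", which is one reason the special cases pile up.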

This seems sub-optimal.

Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?

I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.

Ideally, I'd like to do something like this to train a parser (with many input/output pairs):

training_data = (
  'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
   ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)

Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.

asked Jan 25 '12 21:01 by Blender




1 Answer

I ended up implementing a somewhat complicated series of exhaustive regexes that covered every case I needed. The patterns are written as text-based "filters" whose placeholders are substituted with the appropriate regexes when the parser loads.

If anyone's interested in the code, I'll edit it into this answer.


Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:

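class Replacer(object):
    """Callable for re.sub: turns one $placeholder into a regex group."""

    def __call__(self, match):
        group = match.group(0)

        # Every Matcher regex starts with '('. Strip it and reopen the group,
        # non-capturing for names ending in "_nm", named otherwise.
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]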
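
For anyone who has not passed a callable to `re.sub` before, here is a small self-contained sketch of the same idea. It is a simplification, not the code above: `WORD` stands in for whatever `Matcher` would return, and `expand` is a hypothetical function name.

```python
import re

# Assumption: one fixed sub-pattern instead of a per-placeholder Matcher lookup.
WORD = r"[A-Z][A-Za-z'\-]+"


def expand(match):
    """Turn one $placeholder into a named or non-capturing group."""
    name = match.group(0)[1:]            # drop the leading '$'
    if name.lower().endswith('_nm'):
        return '(?:' + WORD + ')'        # matched, but not captured
    return '(?P<' + name + '>' + WORD + ')'


# re.sub calls expand() once per placeholder and inserts its return value as-is.
regex = re.sub(r'\$[A-Za-z0-9_]+', expand, '$last, $first $junk_nm')
found = re.search(regex, 'Smith, John Quincy said something')
```

Here `found.groupdict()` contains only `last` and `first`, because the `_nm` placeholder became a non-capturing group.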

Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:

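# Assumed definitions: the class below uses `months` and `months_short`,
# but the original post does not show them.
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
months_short = [m[:3] for m in months]


class Matcher(object):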
    name_component =    r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"

    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year

    # Try the three-digit alternative first so that "105" is not matched as "10".
    age = r'(1[01][0-9]|[1-9][0-9])'
    age_upper = age

    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal

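    # Month names in title case and upper case, joined into one alternation.
    _month_names = '|'.join(months + months_short)
    date = (
        r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}'  # January 25th, 2012
        r'|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}'                                  # 25 January 2012
        r'|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'               # 1/25/2012
    ).format(_month_names + '|' + _month_names.upper())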
    date_upper = date

    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]

    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]

        if match.endswith('_instant'):
            match = match[:-8]

        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
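        elif len(match) == 1:
            # The body of this branch is missing from the post. Assumption: a
            # one-letter placeholder such as $I stands for a single initial.
            self.regex = r'([A-Z])'
        elif 'year' in match: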
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)

Finally, there's the generic Pattern object:

import re


class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []

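        # Default: escape literal text unless the pattern starts with '!'
        # (raw regex). An explicit `escape` argument is respected as given.
        escape = not self.text.startswith('!') if escape is None else escape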

        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]

        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))

        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)

    def search(self, text):
        return re.search(self.regex, text)

    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))

        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())

        return results

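    def match(self, text):
        # Return a real list of (named groups, start offset) pairs. A bare
        # map object is always truthy in Python 3, so build the list directly;
        # it is simply empty when nothing matches.
        return [(m.groupdict(), m.start()) for m in re.finditer(self.regex, text)]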
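
The `max_depth` argument of `findall` keeps only matches that start within the first `max_depth` fraction of the text. A standalone sketch of that filter, using a hypothetical function name and the year regex from `Matcher` as sample input:

```python
import re

YEAR = r'(1[89][0-9]{2}|20[0-9]{2})'


def findall_within(regex, text, max_depth=1.0):
    """Collect groups from matches that start in the first `max_depth` fraction of `text`."""
    if not text:
        return []  # avoid dividing by zero on empty input
    results = []
    for m in re.finditer(regex, text):
        if m.start() / len(text) < max_depth:
            results.extend(m.groups())
    return results


line = '1999 Some filler text that mentions 2012 near the end'
early = findall_within(YEAR, line, max_depth=0.5)   # only matches in the first half
everything = findall_within(YEAR, line)             # no depth limit
```

This is handy when the interesting field (a name or a year) is expected near the start of an entry and later matches are probably noise.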

It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:

$LASTNAME, $FirstName $I. said on $date

into a compiled regex with named capturing groups.
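
As a rough end-to-end illustration of that conversion, here is a heavily simplified sketch. It is not the full source: `PIECES`, `piece_for` and `compile_template` are hypothetical names, and the sub-patterns are much smaller than the ones in `Matcher`.

```python
import re

# Assumption: tiny stand-ins for the Matcher attributes.
PIECES = {
    'initial': r'[A-Z]',
    'name': r"[A-Z][A-Za-z'\-]+",
    'name_upper': r"[A-Z][A-Z'\-]+",
    'date': r'[A-Z][a-z]+\.? [0-9]{1,2},? [0-9]{4}',
}


def piece_for(name):
    """Pick a sub-pattern from the placeholder's name and capitalization."""
    if name.lower() == 'date':
        return PIECES['date']
    if len(name) == 1:
        return PIECES['initial']
    return PIECES['name_upper'] if name.isupper() else PIECES['name']


def compile_template(template):
    # 1. Escape regex metacharacters in the literal parts ('$' is left alone).
    regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', template)
    # 2. Replace each $placeholder with a named capturing group.
    regex = re.sub(
        r'\$([A-Za-z0-9_]+)',
        lambda m: '(?P<%s>%s)' % (m.group(1), piece_for(m.group(1))),
        regex,
    )
    # 3. Let any run of whitespace in the template match any run in the text.
    regex = re.sub(r'\s+', r'\\s+', regex)
    return re.compile(regex)


pattern = compile_template('$LASTNAME, $FirstName $I. said on $date')
hit = pattern.search('SMITH, John  Q. said on January 25, 2012')
```

`hit.groupdict()` then holds one entry per placeholder, keyed by the placeholder name.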

answered Oct 14 '22 08:10 by Blender