I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase
words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like it's OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
The term text processing refers to the automation of analyzing electronic text. This allows machine learning models to get structured information about the text to use for analysis, manipulation of the text, or to generate new text.
What is NLP (Natural Language Processing)? NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.
NLP algorithms are typically based on machine learning algorithms. Instead of hand-coding large sets of rules, NLP can rely on machine learning to automatically learn these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a collection of sentences), and making a statistical inference.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
def __call__(self, match):
group = match.group(0)
if group[1:].lower().endswith('_nm'):
return '(?:' + Matcher(group).regex[1:]
else:
return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher
class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
year = r'(1[89][0-9]{2}|20[0-9]{2})'
year_upper = year
age = r'([1-9][0-9]|1[01][0-9])'
age_upper = age
ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
ordinal_upper = ordinal
date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
date_upper = date
matchers = [
'name_component',
'year',
'age',
'ordinal',
'date',
]
def __init__(self, match=''):
capitalized = '_upper' if match.isupper() else ''
match = match.lower()[1:]
if match.endswith('_instant'):
match = match[:-8]
if match in self.matchers:
self.regex = getattr(self, match + capitalized)
elif len(match) == 1:
elif 'year' in match:
self.regex = getattr(self, 'year')
else:
self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern
object:
class Pattern(object):
def __init__(self, text='', escape=None):
self.text = text
self.matchers = []
escape = not self.text.startswith('!') if escape is None else False
if escape:
self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
else:
self.regex = self.text[1:]
self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
self.regex = re.sub(r'\s+', r'\\s+', self.regex)
def search(self, text):
return re.search(self.regex, text)
def findall(self, text, max_depth=1.0):
results = []
length = float(len(text))
for result in re.finditer(self.regex, text):
if result.start() / length < max_depth:
results.extend(result.groups())
return results
def match(self, text):
result = map(lambda x: (x.groupdict(), x.start()), re.finditer(self.regex, text))
if result:
return result
else:
return []
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With