Regexp for Tokenizing English Text

Tags: regex, text, nlp

What would be the best regular expression for tokenizing an English text?

By an English token, I mean an atom consisting of the maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in a programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: although English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity unless they appear in the middle of \w+. So "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].
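
The two acceptable outputs described above can be written down as a pair of patterns. This is only a sketch of the stated requirement using Python's standard `re` module. The names `WORD_WITH_JOINERS`, `WORD_ONLY`, and `tokenize` are made up for illustration, and treating only `-` and `'` as word-internal punctuation is an assumption.

```python
import re

# Assumption: only hyphens and apostrophes count as punctuation that may
# appear in the middle of a word; all other punctuation is dropped.
WORD_WITH_JOINERS = re.compile(r"\w+(?:[-']\w+)*")
# Stricter variant: punctuation always splits tokens.
WORD_ONLY = re.compile(r"\w+")


def tokenize(text, keep_joiners=True):
    """Return lowercased tokens, ignoring punctuation outside words."""
    pattern = WORD_WITH_JOINERS if keep_joiners else WORD_ONLY
    return pattern.findall(text.lower())


print(tokenize("Hello, world."))                              # ['hello', 'world']
print(tokenize("You are good-looking."))                      # ['you', 'are', 'good-looking']
print(tokenize("You are good-looking.", keep_joiners=False))  # ['you', 'are', 'good', 'looking']
```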

OTZ asked Sep 13 '10 19:09


2 Answers

Treebank Tokenization

Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.

You can find a sed script with the appropriate regular expressions to get this tokenization here.

Software Packages

However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer.
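
As a rough illustration of the NLTK route, the sketch below wraps `TreebankWordTokenizer` in a small helper. It assumes NLTK is installed (for example via `pip install nltk`); the helper name `treebank_tokens` is hypothetical. As far as I know this tokenizer is purely rule-based, so it should not need any extra corpus download.

```python
def treebank_tokens(text):
    """Tokenize one sentence in Penn Treebank style using NLTK."""
    # Imported inside the function so NLTK is only required when the
    # helper is actually called.
    from nltk.tokenize import TreebankWordTokenizer

    return TreebankWordTokenizer().tokenize(text)
```

Calling `treebank_tokens("They'll save and invest more.")` should split the contraction and the final period into separate tokens, in PTB style, which is different from the punctuation-free output the question asks for. Filtering out non-word tokens afterwards is one way to reconcile the two.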

dmcer answered Oct 28 '22 08:10


You probably shouldn't try to use a regular expression for tokenizing English text. In English, some characters have several different meanings, and you can only know which is right by understanding the context in which they are found, which requires understanding the meaning of the text to some extent. Examples:

  • The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
  • The period could mark the end of a sentence or it could signify an abbreviation. In some cases it fulfils both roles simultaneously.
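
Both ambiguities are easy to reproduce. The following is a small sketch using Python's standard `re` module; the two patterns and the sample sentences are my own illustrations, not part of any library.

```python
import re

# Naive token pattern: runs of word characters and apostrophes.
text = "He said 'the dogs' and then filled the dogs' bowls."
tokens = re.findall(r"[\w']+", text)
# The closing quote and the possessive apostrophe yield the same token,
# so the pattern alone cannot tell which role the character plays.
print(tokens.count("dogs'"))  # 2

# Naive sentence splitter: break after any period followed by whitespace.
passage = "I met Dr. Smith at 5 p.m. He left."
sentences = re.split(r"(?<=\.)\s+", passage)
# "Dr." is only an abbreviation, yet it triggers a split; "p.m." is an
# abbreviation that also ends the sentence, so that split is correct.
print(sentences)  # ['I met Dr.', 'Smith at 5 p.m.', 'He left.']
```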

Try a natural language parser instead. For example, you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example though - there are many other NLP libraries you could use.

Mark Byers answered Oct 28 '22 08:10