
Regular expression that will extract sentences from a text file

Tags:

regex

php

I need a regular expression that will extract sentences from a text file. Example text:

Consider, for example, the Asian tsunami disaster that happened in the end of 2004. A query to Google News (http://news.google.com) returned more than 80,000 online news articles about this event within one month (Jan.17 through Feb.17, 2005). information by mr. Kahana.

Here's my code:

$re = '/(?<=[.!?]|[.!?][\'"])\s+/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

But the last sentence is still split into "information by mr." and "Kahana." How can I solve this? Thank you :)

bruine asked Oct 15 '12 03:10


1 Answer

You Can't Do This with Regular Expressions

English does not follow strict formatting rules, so regular expressions alone cannot reliably do what you want. What you are really looking for is something like a natural language processing (NLP) tool.

Unless this is critical to your program, I suggest you instead determine the following things:

  • What is an acceptable level of error? Nothing you do will be perfect. But if it works 80% of the time, is that okay? 90%? 99%? How critical is this to you or your client?
  • Where is the text coming from? For example, a textbook will most likely be written differently than people's Twitter feeds. You can do research and make exceptions based on what you see in the actual text you are using.
  • What am I doing with the text? If you are just indexing things like keywords, then it doesn't matter (as much) if you get the sentences split correctly. It's all about tuning the program to get the appropriate output for this specific purpose.

My recommendation is to use trial and error to get your error rate down as much as possible. Run your program on a large set of text, and keep adding exceptions until you get an acceptable error rate. If, however, you need more than a couple dozen rules or so, you will probably just want to rethink the problem.
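
As a rough illustration of the "keep adding exceptions" approach, here is a minimal sketch. It reuses the split pattern from the question, then glues a piece back onto the next one whenever it ends in a known abbreviation. The function name splitSentences and the $abbreviations list are hypothetical and only for illustration; you would grow the list from what you see in your own text.

```php
<?php
// Hypothetical exception list: abbreviations that should NOT end a sentence.
// This is an assumption for illustration; extend it from your own data.
$abbreviations = ['mr', 'mrs', 'ms', 'dr', 'prof'];

function splitSentences(string $text, array $abbreviations): array
{
    // Same rule as in the question: split on whitespace that follows
    // . ! or ? (optionally followed by a closing quote).
    $pieces = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    $buffer = '';
    foreach ($pieces as $piece) {
        // Re-join with a single space (the original whitespace is not preserved).
        $buffer = ($buffer === '') ? $piece : $buffer . ' ' . $piece;

        // If the buffer ends in "<word>." and <word> is a known abbreviation,
        // treat the split as a false positive and keep accumulating.
        if (preg_match('/\b(\w+)\.$/', $buffer, $m)
            && in_array(strtolower($m[1]), $abbreviations, true)) {
            continue;
        }

        $sentences[] = $buffer;
        $buffer = '';
    }

    // Flush whatever is left (e.g. text that ends with an abbreviation).
    if ($buffer !== '') {
        $sentences[] = $buffer;
    }

    return $sentences;
}
```

With the example text from the question, "information by mr. Kahana." should stay in one piece. Note the trade-off: a sentence that genuinely ends with one of the listed abbreviations will now be merged with the next one, which is exactly the kind of residual error you have to decide whether you can live with.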

In short, PHP and Regular Expressions aren't meant for this because English is funky. So either live with adding exceptions to get a small(er) error rate, or rethink the point altogether.

cegfault answered Oct 20 '22 20:10