
find some sentences

Tags: regex, ruby, nlp

I'd like to find a good way to extract the first few (say, two) sentences from a text. Which is better: a regexp or the split method? Any ideas?

As requested by Jeremy Stein, here are some examples.

Examples:

Input:

The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference. If we were just creating comments for an Article we’d have an integer field called article_id in the model to store the foreign key, but in this case we’re going to need something more abstract.

First two sentences:

The first thing to do is to create the Comment model. We’ll create this in the normal way, but with one small difference.

Input:

Mr. T is one mean dude. I'd hate to get in a fight with him.

First two sentences:

Mr. T is one mean dude. I'd hate to get in a fight with him.

Input:

The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.

First two sentences:

The D.C. Sniper was executed by lethal injection at a Virginia prison. Death was pronounced at 9:11 p.m. ET.

Input:

In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.

First two sentences:

In her concluding remarks, the opposing attorney said that "...in this and so many other instances, two wrongs won’t make a right." The jury seemed to agree.

As you can see, it's not so easy to pick out the first two sentences of a text. :(
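
To make the difficulty concrete, here is a minimal Ruby sketch of the naive split approach. The helper name `naive_first_sentences` is hypothetical, and the pattern is just one obvious guess at a sentence boundary (a terminator followed by whitespace), not a recommended solution.

```ruby
# Naive approach: treat every ".", "!" or "?" followed by whitespace
# as a sentence boundary. (Hypothetical helper, for illustration only.)
def naive_first_sentences(text, count = 2)
  text.split(/(?<=[.!?])\s+/).first(count).join(" ")
end

puts naive_first_sentences("Mr. T is one mean dude. I'd hate to get in a fight with him.")
# prints "Mr. T is one mean dude."
```

The period after "Mr" is counted as a boundary, so the "first two sentences" are really the two halves of the first sentence.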

Alexey Poimtsev asked Nov 11 '09 11:11





1 Answer

As you've noticed, sentence tokenizing is a bit trickier than it might first seem, so you may as well take advantage of existing solutions. The Punkt sentence tokenizing algorithm is popular in NLP, and there is a good implementation in the Python Natural Language Toolkit (NLTK), whose authors describe how to use it here. They also describe another approach here.

There are probably other implementations around, or you could read the original paper describing the Punkt algorithm: Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

You can also read another Stack Overflow question about sentence tokenizing here.
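
If pulling in a full tokenizer is not an option, a rough rule-based fallback can be sketched in plain Ruby. To be clear, this is not Punkt and not any library's API: the abbreviation list, the boundary pattern, and the names `split_sentences`, `first_sentences` and `abbreviation?` are all assumptions made for illustration. It handles the examples in the question, but it will misjudge a sentence that genuinely ends in an abbreviation or initialism ("... at 9 p.m. The next day ...").

```ruby
# Hypothetical, deliberately short abbreviation list; extend it for real text.
ABBREVIATIONS = %w[mr mrs ms dr prof sr jr st vs etc].freeze

# Candidate boundary: one or more terminators, optional closing quotes or
# brackets, whitespace, then (lookahead) an optional opening quote followed
# by an uppercase letter or a digit.
BOUNDARY = /[.!?]+["'”’)\]]*(\s+)(?=["'“‘(\[]?[A-Z0-9])/

# True when the word before the terminator suggests the period is not a
# sentence end: a listed abbreviation ("Mr") or an initialism ("D.C", "p.m").
def abbreviation?(word)
  word = word.sub(/\A[^A-Za-z0-9]+/, "")
  ABBREVIATIONS.include?(word.downcase) ||
    word.match?(/\A(?:[A-Za-z]\.)*[A-Za-z]\z/)
end

def split_sentences(text)
  sentences = []
  start = 0 # index where the current sentence begins
  pos = 0   # index where the next boundary search begins
  while (m = BOUNDARY.match(text, pos))
    # Word immediately before the terminator of this candidate boundary.
    last_word = text[start...m.begin(0)][/\S+\z/].to_s
    unless abbreviation?(last_word)
      # Keep the terminator and closing quotes, drop the whitespace.
      sentences << text[start...m.begin(1)]
      start = m.end(0)
    end
    pos = m.end(0)
  end
  rest = text[start..-1].to_s.strip
  sentences << rest unless rest.empty?
  sentences
end

def first_sentences(text, count = 2)
  split_sentences(text).first(count).join(" ")
end

puts first_sentences("Mr. T is one mean dude. I'd hate to get in a fight with him.", 1)
# prints "Mr. T is one mean dude."
```

The sketch only rejects boundaries it can recognize from a fixed list and a simple initialism pattern, which is exactly the kind of hand-maintained rule that Punkt replaces by learning abbreviations from the text itself.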

nedned answered Sep 20 '22 07:09
