
How to get sentence number from input?

It seems hard to detect sentence boundaries in a text. Punctuation marks like . ! ? may be used to delimit sentences, but not very accurately, since there are ambiguous abbreviations such as U.S.A, Prof. or Dr. I am studying the TPerlRegEx library and the Regular Expressions Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects a sentence.

What would be a comparatively accurate expression using TPerlRegEx in Delphi?

Thanks

Warren asked Apr 20 '11 15:04


1 Answer

First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:

He said: "It's OK!"

Is it one sentence or two? A general answer is irrelevant. Decide whether you want to interpret it as one or two sentences, and proceed accordingly.

Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).
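
A minimal sketch of that character scan, written in Python for brevity (the same loop should port to Delphi with little change). The function name `split_sentences` and the choice of `.`, `!` and `?` as terminators are assumptions made for illustration, not part of any library:

```python
def split_sentences(text):
    """Split text at . ! ? when followed by whitespace or end of string."""
    terminators = ".!?"  # assumed set of sentence-ending characters
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in terminators:
            continue
        at_end = i + 1 == len(text)
        # A terminator only counts when whitespace or the end of the text follows.
        if at_end or text[i + 1].isspace():
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

With this rule the inner periods of U.S.A never trigger a split, but an abbreviation such as Dr. followed by a space still does, so the scan alone is not enough.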

For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary, perhaps editable by your users, since each language will have its own set of common abbreviations.
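
One way such a dictionary could plug into a character scan is sketched below in Python. The names `DEFAULT_ABBREVIATIONS` and `split_with_abbreviations` are hypothetical, and the starter set is only an example that a real application would load from user-editable, per-language data:

```python
# Hypothetical starter set, stored in lower case with the trailing period.
DEFAULT_ABBREVIATIONS = frozenset({"prof.", "dr.", "mr.", "mrs."})

def split_with_abbreviations(text, abbreviations=DEFAULT_ABBREVIATIONS):
    """Split at . ! ? followed by whitespace/end, skipping known abbreviations."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        # Ignore terminators that are glued to the next character (e.g. U.S.A).
        if i + 1 < len(text) and not text[i + 1].isspace():
            continue
        # Look at the word that ends at this terminator.
        words = text[start:i + 1].split()
        last_word = words[-1].lower() if words else ""
        if ch == "." and last_word in abbreviations:
            continue  # treat "Dr." and friends as part of the sentence
        sentence = text[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Passing a different set as `abbreviations` is how a language-specific or user-maintained list would be swapped in.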

Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same kind of difference applies to double and single quotes: some languages don't use single quotes at all, and sometimes they are indistinguishable from apostrophes. Your rules may well have to be language-specific, at least in part.

In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about

Most people called him Professor Jones, but to me he was simply The Prof.

Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be

Most people called him Professor Jones, but to me he was simply Prof. Bill.
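
The ambiguity can be made concrete with a small Python check. The helper `shape_after` is hypothetical, and the follow-up sentence "Everyone else disagreed." is invented here only to give the first example a continuation:

```python
import re

def shape_after(text, abbreviation="Prof."):
    """Describe what follows the first occurrence of the abbreviation."""
    idx = text.find(abbreviation)
    if idx == -1:
        return None
    rest = text[idx + len(abbreviation):]
    match = re.match(r"\s+(\w)", rest)
    if match is None:
        return "end of text"
    return "capitalized word" if match.group(1).isupper() else "other word"

# The abbreviation ends the sentence here...
ends_sentence = ("Most people called him Professor Jones, "
                 "but to me he was simply The Prof. Everyone else disagreed.")
# ...but sits in the middle of the sentence here.
inside_sentence = ("Most people called him Professor Jones, "
                   "but to me he was simply Prof. Bill.")

# Both calls report the same local context, so a rule that only looks at
# the next word cannot tell the two cases apart.
same_context = shape_after(ends_sentence) == shape_after(inside_sentence)
```

Telling these apart needs more than the characters around the period, which is why any such splitter stays an approximation.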
Marek Jedliński answered Oct 03 '22 17:10
