
How to improve detection of sentences in Sphinx?

Sphinx can search for words that occur within the same sentence. For example, take the following text:

Вася молодец, съел огурец, т.к. проголодался. Такие дела.

(Roughly: "Vasya did well, he ate a cucumber, since he was hungry. So it goes.")

If I search

молодец SENTENCE огурец

I find this text. If I search

молодец SENTENCE проголодался

I can't find this text, because the dots in the abbreviation т.к. (short for "так как", meaning "because") are treated as the end of a sentence.

As far as I can see, the set of sentence delimiters is hardcoded in Sphinx's sources.

My question is: how can I improve sentence detection? The best option for me would be to use Yandex's Tomita parser or another NLP library with smarter sentence detection.

Nick asked Sep 12 '16 08:09


1 Answer

Split the text into sentences with Yandex's Tomita parser. The result is the text with its sentences separated by "\n", one sentence per line.

In each sentence, delete every ".", "!" and "?" except the last one.

Build the Sphinx index from this preprocessed data.
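
The second step could be sketched in Python roughly as follows. This is only an illustration and makes some assumptions: the sentence splitter's output is already available as a string with one sentence per line, and the function names `strip_inner_terminators` and `preprocess` are made up for this example. It does not call Tomita itself.

```python
import re

# Characters treated as sentence terminators in this sketch.
TERMINATORS = ".!?"


def strip_inner_terminators(sentence: str) -> str:
    """Remove '.', '!' and '?' from a sentence, keeping only a final one."""
    sentence = sentence.strip()
    if not sentence:
        return sentence
    # Keep the last character aside if it is a terminator.
    last = sentence[-1] if sentence[-1] in TERMINATORS else ""
    body = sentence[:-1] if last else sentence
    # Drop every remaining terminator inside the sentence.
    body = re.sub(r"[.!?]", "", body)
    return body + last


def preprocess(split_text: str) -> str:
    """Clean text that already has one sentence per line ("\n"-separated)."""
    cleaned = (strip_inner_terminators(line) for line in split_text.split("\n"))
    # Skip empty lines so they do not end up in the indexed data.
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    # Assumed splitter output for the text from the question.
    split_text = "Вася молодец, съел огурец, т.к. проголодался.\nТакие дела."
    print(preprocess(split_text))
```

Note that deleting the dots turns "т.к." into the single token "тк", so a search for the abbreviation itself would have to use that form.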

Nick answered Sep 23 '22 15:09