I am trying to split text extracted from a PDF page into sentences, but it is much more difficult than I had anticipated. There are a lot of special cases to consider, such as initials, decimals, quotations, etc., which contain periods but do not necessarily end the sentence.
I was curious whether anyone here is familiar with an NLP library for C or C++ that could help with this task, or could just offer any advice.
Thank you for any help.
This is a problem called sentence boundary disambiguation. The Wikipedia page for it lists a few libraries, but I'm not sure if any of them are easily callable from C.
You can find many papers on the theory of sentence boundary disambiguation. The Unicode Standard also defines a simple sentence boundary detection algorithm, in Unicode Standard Annex #29, Unicode Text Segmentation.
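If you want to experiment with the UAX #29 rules from C++, ICU (ICU4C) implements them through its BreakIterator class. Below is a minimal sketch, assuming ICU is installed; keep in mind that the default Unicode rules are deliberately simple and will still mis-split around many abbreviations.

// Minimal sketch: UAX #29 sentence segmentation via ICU4C's BreakIterator.
// Build roughly as: g++ sbd_icu.cpp -licuuc
#include <unicode/brkiter.h>
#include <unicode/unistr.h>
#include <unicode/locid.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createSentenceInstance(icu::Locale::getUS(), status));
    if (U_FAILURE(status)) {
        std::cerr << "could not create sentence iterator\n";
        return 1;
    }

    icu::UnicodeString text("It cost $3.50. \"Is that all?\" she asked. Yes.");
    it->setText(text);

    // Walk the sentence boundaries and print each span.
    int32_t start = it->first();
    for (int32_t end = it->next(); end != icu::BreakIterator::DONE;
         start = end, end = it->next()) {
        icu::UnicodeString sentence;
        text.extractBetween(start, end, sentence);
        std::string utf8;
        sentence.toUTF8String(utf8);
        std::cout << "[" << utf8 << "]\n";
    }
    return 0;
}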
Sentence boundary disambiguation (SBD) is a central problem in the field of NLP. Unfortunately, the tools I've found and used in the past aren't written in C (it's not the favourite language for string-based tasks unless speed is a major issue).
Pipeline
If at all possible, I'd create a simple pipeline. On a Unix system this shouldn't be a problem, and even if you're on Windows, a scripting language should let you fill in the gaps. This means the SBD stage can be the best tool for the job, not merely the only SBD you could find for language Z. For example,
./pdfconvert | SBD | my_C_tool > ...
This is the standard way we do things where I work, and unless you have stricter requirements than you've stated, it should be fine.
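If the SBD stage writes one sentence per line (a common convention, though check what your chosen tool actually emits), the C/C++ end of the pipeline only needs to read lines from standard input. A rough sketch of that end:

// Sketch of my_C_tool's input handling, assuming the upstream SBD stage
// emits one sentence per line on standard output.
#include <iostream>
#include <string>

int main() {
    std::string sentence;
    while (std::getline(std::cin, sentence)) {
        if (sentence.empty()) continue;  // skip blank separator lines, if any
        // ... real per-sentence processing would go here ...
        std::cout << "sentence: " << sentence << '\n';
    }
    return 0;
}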
Tools
As for the tools you can use, OpenNLP (discussed below) is one option, and the Wikipedia page on sentence boundary disambiguation mentioned in the other answer lists several more.
Models and Training
Now, some of these tools may give you good results out of the box, and some may not. OpenNLP, for instance, ships with a pre-trained model for English sentence detection, which may work for you. However, if your domain is significantly different from the one the tools were trained on, they may not perform well. For example, a model trained on newspaper text may be very good at that task but terrible at letters.
As such, you may want to train the SBD tool by giving it examples. Each of the tools should document this process, but I will warn you, it may be a bit of work. It involves running the tool on document X, going through and manually fixing any incorrect splits, and giving the correctly split document X back to the tool to train on. Depending on the size of the documents and the tool involved, you may need to do this for one or a hundred documents until you get a reasonable result.
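To give a concrete flavour: OpenNLP's command line includes a SentenceDetectorTrainer whose training data is plain text with one sentence per line. The file names below are placeholders and the exact flags vary between OpenNLP versions, so treat this as a sketch rather than an exact invocation:

opennlp SentenceDetectorTrainer -lang en -encoding UTF-8 -data my-domain.sent -model my-domain-sent.bin

Here my-domain.sent would be your hand-corrected text (one sentence per line) and my-domain-sent.bin the resulting model you'd load instead of the stock English one.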
Good luck, and if you've any questions feel free to ask.