
details on the following Natural Language Processing terms?

Named Entity Extraction (extract ppl, cities, organizations)
Content Tagging (extract topic tags by scanning doc)
Structured Data Extraction
Topic Categorization (taxonomy classification by scanning the doc, e.g. with a Bayesian classifier)
Text extraction (HTML page cleaning)

Are there libraries that I can use to do any of the above NLP functions?

I don't really feel like forking out cash to AlchemyAPI.

wefwgeweg asked Jan 22 '23 05:01



2 Answers

There are actually plenty of freely available open-source natural language processing packages out there. Here's a brief list, organized by what language the toolkit is implemented in:

  • Python: Natural Language Toolkit NLTK
  • Java: OpenNLP, GATE, and Stanford's JavaNLP
  • .NET: SharpNLP

If you're uncertain which one to go with, I would recommend starting with NLTK. The package is reasonably easy to use and has great documentation online, including a free book.

You should be able to use NLTK to easily accomplish the NLP tasks you've listed, e.g. named entity recognition (NER), extracting tags for documents, and document categorization.
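To make the document categorization task a bit more concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not NLTK code: the `NaiveBayes` class, the `tokenize` helper, and the tiny training set are all made up for illustration. In practice you would probably reach for NLTK's own `nltk.NaiveBayesClassifier` instead of hand-rolling this.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Hypothetical multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> number of training docs
        self.word_counts = defaultdict(Counter)   # label -> word -> count
        self.vocab = set()                        # every word seen in training

    def train(self, labeled_docs):
        for text, label in labeled_docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Work in log space: log prior + sum of log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
TRAINING = [
    ("the team won the match with a late goal", "sports"),
    ("the striker scored twice in the final game", "sports"),
    ("the new phone has a faster processor", "tech"),
    ("the software update fixes a security bug", "tech"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("a late goal decided the game"))
```

With real documents you would want far more training data per category, plus some stop-word filtering, before trusting the labels.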

What the Alchemy people call structured data extraction looks like it's just HTML scraping that is robust against changes to the underlying HTML as long as the page still visually renders the same way. So, it's not really an NLP task.

For the extraction of text from HTML, just use boilerpipe. It's fast, good, and free.
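If you just want to see what HTML page cleaning means in code, here is a rough sketch using only Python's standard `html.parser` module. To be clear, this is not how boilerpipe works (boilerpipe is a Java library with a much smarter approach); it simply drops the contents of tags that usually hold non-article material. The `TextExtractor` class, the `extract_text` function, and the `SKIP` tag set are assumptions made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside the SKIP tags."""

    # Tags whose contents are assumed to be boilerplate (a crude heuristic).
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside at least one skipped tag
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string with the skipped tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


SAMPLE = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<nav>Home | About</nav><p>Main article text.</p>"
    "<script>var x = 1;</script><footer>Copyright</footer>"
    "</body></html>"
)

print(extract_text(SAMPLE))
```

A tag blacklist like this breaks down on pages that build their layout out of plain `div` elements, which is exactly the case a dedicated tool like boilerpipe is meant to handle.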

dmcer answered Feb 02 '23 08:02



The Apache UIMA project was originally created by IBM and provides an NLP framework much like GATE. There are various annotators out there that are built for UIMA.

Thien answered Feb 02 '23 08:02
