
Text tokenization with Stanford NLP : Filter unrequired words and characters

I use Stanford NLP to tokenize strings in my classification tool. I want to keep only meaningful words, but the output also contains non-word tokens (such as ---, >, .) and unimportant words such as am, is, to (stop words). Does anybody know a way to filter these out?

dmitrievanthony asked May 03 '15 20:05




2 Answers

For Stanford CoreNLP, there is a stopword annotator that can be plugged in as a custom annotator; it provides the functionality to remove the standard stopwords. You can also define custom stopwords as needed (e.g. ---, <, .).

You can see the example here:

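   import java.util.List;
   import java.util.Properties;
   import edu.stanford.nlp.ling.CoreAnnotations;
   import edu.stanford.nlp.ling.CoreLabel;
   import edu.stanford.nlp.pipeline.Annotation;
   import edu.stanford.nlp.pipeline.StanfordCoreNLP;

   Properties props = new Properties();
   props.setProperty("annotators", "tokenize, ssplit, stopword");
   // Map the custom "stopword" annotator name to its implementing class.
   props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");

   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(example); // example: the input text (a String)
   pipeline.annotate(document);
   List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);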

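In the example above, "tokenize, ssplit, stopword" is the list of annotators to run, and the customAnnotatorClass.stopword property maps the stopword annotator name to the StopwordAnnotator class.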

Hope this helps!

Nishu Tayal answered Sep 21 '22 18:09


This is a very domain-specific task that we don't perform for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.

Here's an example list of English stopwords.
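
As a rough sketch of the regular expression filter plus stopword filter described above, something like the following could work. It is only an illustration: TokenFilter, the WORD pattern, and the tiny stopword set are hypothetical and not part of CoreNLP, and it assumes "meaningful" means alphabetic tokens. It operates on plain strings so it runs without CoreNLP; in a real pipeline the strings would come from calling word() on each CoreLabel produced by the tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper (not part of CoreNLP): post-filters tokenizer output.
class TokenFilter {

    // Assumption: a "word" is a run of letters, optionally joined by inner
    // apostrophes or hyphens (e.g. "don't", "state-of-the-art").
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    // Tiny illustrative stopword set; swap in a full list for real use.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "i", "am", "is", "are", "to", "of", "and"));

    // Returns the lowercased tokens that look like words and are not stopwords.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (WORD.matcher(lower).matches() && !STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for tokenizer output; with CoreNLP, collect CoreLabel.word() values.
        List<String> tokens = Arrays.asList(
                "I", "am", "---", ">", "happy", ".", "to", "classify", "text");
        System.out.println(filter(tokens)); // prints [happy, classify, text]
    }
}
```

Lowercasing before the lookup keeps the stopword set small; if case matters for your classifier, compare on the lowercased form but keep the original token instead.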

Jon Gauthier answered Sep 24 '22 18:09