With existing supervised text categorization techniques, why don't we use the Named Entities (NEs) in the text as features for training and testing? Do you think we can improve precision by using NEs as features?
Entities can be names of people, organizations, locations, times, quantities, monetary values, percentages, and more. With named entity recognition, you can extract key information to understand what a text is about, or merely use it to collect important information to store in a database.
Named Entity Recognition and Classification (NERC) is a process of recognizing information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions from unstructured text.
Named entities are persons, locations, organizations, time expressions, and so on. A POS tagger assigns a grammatical category to each individual word, whereas NER groups words into spans (often several words long) and assigns each span an entity type. The output of POS tagging is often used as an input for NER.
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It involves identifying key information in the text and classifying it into a set of predefined categories. An entity is basically a thing that is consistently talked about or referred to in the text. NER is a subtask of NLP.
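To make "identify, then classify into predefined categories" concrete, here is a minimal sketch. It is not a real NER model: it assumes a tiny hand-made lookup table (the hypothetical `GAZETTEER` below) in place of a trained recognizer, and the labels `PERSON` and `WORK_OF_ART` are chosen only for illustration.

```python
import re

# Hypothetical gazetteer: a hand-made lookup table standing in for a trained NER model.
GAZETTEER = {
    "Lionel Messi": "PERSON",
    "Jesse Eisenberg": "PERSON",
    "The Social Network": "WORK_OF_ART",
}

def find_entities(text):
    """Return (start, surface, label) tuples for every gazetteer match, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), name, label))
    return sorted(found)

text = "Lionel Messi was invited to the premiere of The Social Network."
entities = find_entities(text)
for start, surface, label in entities:
    print(start, surface, label)
```

A trained recognizer would replace the lookup with a statistical model, but the output has the same shape: spans of text, each with a category.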
It depends a lot on the domain you are working in. You have to define the features based on the domain. Say you are working on a learning-to-rank problem in a search engine, generating a dynamic rank: NEs are unlikely to give you any benefit there. It largely depends on the domain you are working in and on the output category labels defined for the supervised learning task.
Now say you are classifying documents as pertaining to Soccer, Movies, Politics and so on. In this case named entities can work. I will give you an example: say you are using a neural network that categorizes documents into Soccer, Movie, Politics, etc. Now a document comes in: "Lionel Messi was invited to attend the premiere of 'The Social Network'; also present were the cast and crew, including Jesse Eisenberg, Andrew Garfield and Justin Timberlake." Here most of the named entities (input features) are associated with movies, so the connection between them and Movie (the output label) will be stronger, and the document will be classified as a document on Movie.
Another example: say our document is "Tom Cruise is portraying the character of Lionel Messi in the movie 'The Last Soccer Game'." Here comes the benefit: say your neural network has learnt that when an actor and a footballer appear together in one document, there is a high probability of it being about a movie. Again, it depends on the data and the training; it may be the other way round too (but that is what learning is all about: looking at past data).
So my answer would be: try it out. Nobody is stopping you from using named entities as features. It might help for the domain that you are working in.
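As a rough sketch of what "named entities as input features" could look like, the snippet below adds entity features to ordinary bag-of-words counts. The function name `build_features` and the feature-name prefixes are hypothetical, and the entity list is assumed to come from whatever NER step you already have.

```python
from collections import Counter

def build_features(tokens, entities):
    """Combine bag-of-words counts with named-entity features.

    tokens   -- list of word strings
    entities -- list of (surface, label) pairs, assumed to come from an NER step
    """
    features = Counter(f"word={t.lower()}" for t in tokens)
    for surface, label in entities:
        features[f"ne_type={label}"] += 1  # how many entities of each type occur
        features[f"ne={surface}"] += 1     # the specific entity, kept as one unit
    return dict(features)

tokens = "Tom Cruise is portraying the character of Lionel Messi".split()
entities = [("Tom Cruise", "PERSON"), ("Lionel Messi", "PERSON")]
features = build_features(tokens, entities)
print(features)
```

A dictionary like this can then be vectorized (for example with scikit-learn's `DictVectorizer`) and passed to any classifier. Keeping "Tom Cruise" as a single feature is what lets the model learn associations with a specific person rather than with the separate words "tom" and "cruise".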