
How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.

Here are a couple of lines:

1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“@Zach_Paull: When did green skittles change from lime to green apple? #notafan” @Skittles
1|@dtfcdvEric: @MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|@STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx  @SuryaRay

Here is the total data set: http://pastebin.com/eJuEb4eB

I need to build a model that distinguishes tweets about Apple (Inc.) from the rest.

I'm not looking for a general overview of machine learning; I'm looking for an actual model in code (Python preferred).

SAL asked Jun 27 '13 20:06



1 Answer

What you are looking for is called Named Entity Recognition (NER). It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, after being trained on text in which the named entities are labeled.

Essentially, it looks at the content and context of the word (looking back and forward a few words) to estimate the probability that the word is a named entity.
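To make "content and context" concrete, here is a minimal sketch of the kind of window features such a model might see for one token. The function name `context_features`, the feature keys, and the `<PAD>` marker are made up for illustration; this is not Stanford NER's actual feature set, only the general idea of looking a few words back and forward.

```python
def context_features(tokens, i, window=2, pad="<PAD>"):
    """Collect the token at position i plus its neighbours within `window`.

    Hypothetical helper for illustration only; real NER toolkits use
    much richer feature sets.
    """
    features = {"word": tokens[i]}
    for offset in range(1, window + 1):
        left = i - offset
        right = i + offset
        # Use a padding marker when the window runs off either end.
        features[f"prev{offset}"] = tokens[left] if left >= 0 else pad
        features[f"next{offset}"] = tokens[right] if right < len(tokens) else pad
    return features


tokens = "I was eating an apple over at Apple headquarters".split()
fruit = context_features(tokens, 4)    # the lowercase "apple"
company = context_features(tokens, 7)  # the capitalized "Apple"
print(fruit)
print(company)
```

The two occurrences get different neighbours ("an ... over" versus "at ... headquarters"), which is the signal a trained model can pick up on.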

Good software can also look at other features of words, such as their length or shape (like "Vcv" if the word starts with vowel-consonant-vowel).
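A rough sketch of a vowel/consonant shape feature like the "Vcv" example, keeping letter case so that "Apple" and "apple" get different shapes. The name `word_shape` and the exact mapping (v/c for letters, d for digits) are assumptions for illustration; the shape functions built into real NER libraries are defined differently.

```python
VOWELS = set("aeiou")


def word_shape(word):
    """Map each character to v/c (vowel/consonant), preserving case.

    Digits become "d"; any other character is kept as-is.
    Hypothetical scheme, chosen to match the "Vcv" example.
    """
    shape = []
    for ch in word:
        if ch.isalpha():
            symbol = "v" if ch.lower() in VOWELS else "c"
            shape.append(symbol.upper() if ch.isupper() else symbol)
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


for word in ["Apple", "apple", "iOS", "7"]:
    print(word, word_shape(word), len(word))
```

The capitalized company name and the lowercase fruit differ only in the first symbol of the shape, which is one of several cues a classifier can combine with the surrounding words.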

A very good library (GPL) is Stanford's NER

Here's the demo: http://nlp.stanford.edu:8080/ner/

Some sample text to try:

I was eating an apple over at Apple headquarters and I thought about Apple Martin, the daughter of the Coldplay guy

(the 3class and 4class classifiers get it right)

Neil McGuigan answered Oct 09 '22 07:10
