 

Text Categorization in R

My objective is to automatically route feedback emails to the appropriate division.
My fields are FNUMBER, CATEGORY, SUBCATEGORY and Description.
I have the last six months of data in this format, where the entire email is stored in Description along with its CATEGORY and SUBCATEGORY.

I have to analyse the Description column and find the keywords for each category/subcategory, so that when the next feedback email arrives it is automatically assigned to a category and subcategory based on the keywords derived from the historical data.

I have imported an XML file into R for text categorization and converted it into a data frame with the required fields. I have 23017 records for a particular month; I have listed only the first twenty records as a data frame below.

I have more than 100 categories and subcategories.
I am new to the text mining concept; however, with the help of SO and the tm package, I have tried the code below:

step1 <-  structure(list(FNUMBER = structure(1:20, .Label = c(" 20131202-0885 ", 
"20131202-0886 ", "20131202-0985 ", "20131202-1145 ", "20131202-1227 ", 
"20131202-1228 ", "20131202-1235 ", "20131202-1236 ", "20131202-1247 ", 
"20131202-1248 ", "20131202-1249 ", "20131222-0157 ", "20131230-0668 ", 
"20131230-0706 ", "20131230-0776 ", "20131230-0863 ", "20131230-0865 ", 
"20131230-0866 ", "20131230-0868 ", "20131230-0874 "), class = "factor"), 
    CATEGORY = structure(c(9L, 14L, 11L, 6L, 10L, 12L, 7L, 11L, 
    13L, 13L, 6L, 1L, 2L, 5L, 4L, 8L, 8L, 3L, 11L, 11L), .Label = c(" BVL-Vocational Licence (VL) Investigation ", 
    " BVL - Bus Licensing ", " Corporate Transformation Office (CTO) ", 
    " CSV - Customer Service ", " Deregistration - Transfer/Split/Encash Rebates ", 
    " ENF - Enforcement Matters ", " ENF - Illegal Parking  ", 
    " Marina Coastal Expressway ", " PTQ - Public Transport Quality ", 
    " Road Asset Management ", " Service Quality (SQ) ", " Traffic Management & Cycling ", 
    " VR - Issuance/disputes of bookings by vendors ", " VRLSO - Update Owner's Particulars "
    ), class = "factor"), SUBCATEGORY = structure(c(2L, 15L, 
    5L, 1L, 3L, 14L, 6L, 12L, 8L, 8L, 18L, 17L, 11L, 10L, 16L, 
    7L, 9L, 4L, 13L, 12L), .Label = c(" Abandoned Vehicles ", 
    " Bus driver behaviour ", " Claims for accident ", " Corporate Development ", 
    " FAQ ", " Illegal Parking ", " Intra Group (Straddling Case) ", 
    " Issuance/disputes of bookings by vendors ", " MCE ", " PARF (Transfer/Split/Encash) ", 
    " Private bus related matters ", " Referrals ", " Straddle Cases (Across Groups) ", 
    " Traffic Flow ", " Update Owner Particulars ", " Vehicle Related Matters ", 
    " VL Holders (Complaint/Investigation/Appeal) ", " Warrant of Arrrest "
    ), class = "factor"), Description = structure(c(3L, 1L, 2L, 
    9L, 4L, 7L, 8L, 6L, 5L, 3L, 1L, 2L, 9L, 4L, 7L, 8L, 6L, 5L, 
    7L, 8L), .Label = c(" The street is the ONLY road leading to & exit for vehicles and buses to (I think) four temples and, with the latest addition of 8B, four (!!) industrial estate.", 
    "Could you kindly increase the frequencies for Service 58. All my colleagues who travelled AVOID 58!!!\nThey would rather take 62-87 instead of 3-58", 
    "I saw bus no. 169A approaching the bus stop. At that time, the passengers had already boarded and alighted from the bus.", 
    "I want to apologise and excuse about my summon because I dont know can't park my motorcycle at the double line when I friday prayer ..please forgive me", 
    "Many thanks for the prompt action. However please note that the rectification could rather short term as it's just replacing the bulb but without the proper cover to protect against the elements.PS. the same job was done i.e. without installing a cover a few months back; and the same problem happen again.", 
    "Placed in such a manner than it cannot be seen properly due to the background ahead; colours blend.There is not much room angle to divert from 1st lane to 2nd lane. The outer most cone covers more than 1st lane", 
    "The vehicle GX3368K was observed to be driving along PIE towards Changi on 28th November 2013, 3:48pm without functioning braking lights during the day.", 
    "The vehicle was behaving suspiciously with many sudden brakes - which caused vehicles behind to do heavy \"jam brakes\" due to no warnings at all (no brake lights).", 
    "We have received a feedback regarding the back lane of the said address being blocked up by items.\nKindly investigate and keep us in the loop on the actions taken while we look into any fire safety issues on this case again."
    ), class = "factor")), .Names = c("FNUMBER", "CATEGORY", 
"SUBCATEGORY", "Description"), class = "data.frame", row.names = c(NA, 
-20L))  

dim(step1)
names(step1)
library(tm)
m <- list(ID = "FNUMBER", Content = "Description")
myReader <- readTabular(mapping = m)
txt <- Corpus(DataframeSource(step1), readerControl = list(reader = myReader))

summary(txt)
txt <- tm_map(txt, content_transformer(tolower))  # wrap base functions in content_transformer() for tm >= 0.6
txt <- tm_map(txt,removeNumbers)
txt <- tm_map(txt,removePunctuation)
txt <- tm_map(txt,stripWhitespace)
txt <- tm_map(txt,removeWords,stopwords("english"))
txt <- tm_map(txt,stemDocument)


tdm <- TermDocumentMatrix(txt,
                      control = list(removePunctuation = TRUE,
                                     stopwords = TRUE))
tdm
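As an aside, the same cleanup steps can be approximated in base R without tm. This is only a rough sketch (the stopword list here is a tiny illustrative subset, not the full stopwords("english") list), but it makes explicit what each tm_map call is doing:

```r
# Minimal base-R text cleanup mirroring the tm pipeline above:
# lowercase, strip numbers and punctuation, squeeze whitespace, drop stopwords.
clean_text <- function(x, stopwords = c("the", "a", "an", "is", "to", "and", "was", "no")) {
  x <- tolower(x)
  x <- gsub("[0-9]+", "", x)         # removeNumbers
  x <- gsub("[[:punct:]]+", "", x)   # removePunctuation
  x <- gsub("\\s+", " ", trimws(x))  # stripWhitespace
  words <- strsplit(x, " ")[[1]]
  paste(words[!words %in% stopwords], collapse = " ")
}

clean_text("The bus no. 169A was late!!")
# -> "bus late"
```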

UPDATE: I have now got the most frequently occurring keywords over the whole dataset:

library(reshape2)  # needed for melt()
tdm3 <- removeSparseTerms(tdm, 0.98)
TDM.dense <- as.matrix(tdm3)
TDM.dense <- melt(TDM.dense, value.name = "count")
TDM_Final <- aggregate(count ~ Terms, data = TDM.dense, FUN = sum)
colnames(TDM_Final) <- c("Words", "Word_Freq")
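One way to turn overall frequencies like these into per-category keyword lists is to repeat the count within each CATEGORY and keep the top terms. A base-R sketch on hypothetical toy data (the real pipeline would feed in the cleaned descriptions, and the column names are assumed to match the data frame above):

```r
# Toy data: already-cleaned descriptions with their categories.
df <- data.frame(
  CATEGORY    = c("Bus", "Bus", "Parking"),
  Description = c("bus late bus frequency", "bus driver rude", "illegal parking fine"),
  stringsAsFactors = FALSE
)

# Term frequencies within each category; the top-n terms become that
# category's keyword list.
top_keywords <- function(df, n = 3) {
  lapply(split(df$Description, df$CATEGORY), function(docs) {
    terms <- unlist(strsplit(docs, " "))
    freq  <- sort(table(terms), decreasing = TRUE)
    names(head(freq, n))
  })
}

taxonomy <- top_keywords(df)
taxonomy$Bus  # "bus" comes first (frequency 3); ties follow in table order
```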

I am stuck after this. I am not sure how to get:

1. The relevant keywords (unigrams, bigrams and trigrams) for each category/subcategory, thereby generating a taxonomy list (keywords with category/subcategory).

2. When the next feedback email arrives, how to classify it into categories and subcategories (there are 100+ categories) based on the keyword taxonomy list generated in the step above.

3. Or, if my understanding and approach above are not correct, advice on other possible options.

I have gone through materials on the internet (I could only find examples classifying text into two classes, not more than that), but I am not able to proceed further. I am new to text mining in R, so excuse me if this is very naive.

Any help or starting point would be great.

asked Mar 10 '14 by Prasanna Nandakumar


1 Answer

I'll give a brief answer here because your question is a little vague.

The code below quickly creates a TDM of 2-grams for each CATEGORY.

library(RWeka)
library(SnowballC)

# Create a function that produces an 'nvalue'-gram TDM for the data frame
# passed in as 'dat' (rather than reaching out to step1 in the global
# environment, which is fragile and best avoided).
makeNgramFeature <- function(dat, nvalue){

  tokenize <- function(x){NGramTokenizer(x, Weka_control(min = nvalue, max = nvalue))}

  m <- list(ID = "FNUMBER", Content = "Description")
  myReader <- readTabular(mapping = m)
  txt <- Corpus(DataframeSource(dat), readerControl = list(reader = myReader))

  txt <- tm_map(txt, content_transformer(tolower))
  txt <- tm_map(txt, removeNumbers)
  txt <- tm_map(txt, removePunctuation)
  txt <- tm_map(txt, stripWhitespace)
  txt <- tm_map(txt, removeWords, stopwords("english"))
  txt <- tm_map(txt, stemDocument)

  tdm <- TermDocumentMatrix(txt,
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE,
                                           tokenize = tokenize))
  return(tdm)
}

# 'all' is a list holding one TDM per category. You could create a 'cascade'
# of by() calls, or build a unique list of category/sub-category pairs to analyse.
all <- by(step1, INDICES = step1$CATEGORY, FUN = function(x){makeNgramFeature(x, 2)})

The resulting list 'all' is a little ugly. You can run names(all) to look at the categories. I'm sure there is a cleaner way to solve this, but hopefully this gets you going on one of the many correct paths...
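For the second part of the question (assigning a new email to a category), a naive keyword-overlap score against the per-category keyword lists can serve as a transparent baseline before moving to a proper multiclass model trained on the labelled history (e.g. naive Bayes or an SVM on the document-term matrix, via packages such as e1071 or caret). A base-R sketch with a hypothetical two-category taxonomy:

```r
# Hypothetical taxonomy: category -> keyword vector (in practice these would
# be derived from the per-category TDMs built above).
taxonomy <- list(
  "ENF - Illegal Parking"  = c("illegal", "parking", "summon", "motorcycle"),
  "PTQ - Public Transport" = c("bus", "frequency", "driver", "service")
)

# Score a new email by counting how many of its tokens appear in each
# category's keyword list, then return the best-scoring category.
classify_email <- function(text, taxonomy) {
  tokens <- strsplit(tolower(gsub("[[:punct:]]+", " ", text)), "\\s+")[[1]]
  scores <- vapply(taxonomy, function(kw) sum(tokens %in% kw), numeric(1))
  names(which.max(scores))
}

classify_email("Please increase the frequency of bus service 58", taxonomy)
# -> "PTQ - Public Transport"
```

With 100+ categories a trained classifier will generally beat raw keyword matching, but the overlap score is easy to inspect and debug, which helps when building the taxonomy in the first place.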

answered Oct 12 '22 by slimCity