I have a dataset with a column that is currently being treated as a factor with 1000+ levels. These are values for the column. I would like to clean up this data. Some values are strings like "-18 + 5 = -13" and "5 - 18 = -13", I would like the clustering to group these differently than say "R3no4".
Is this possible in R? I looked at the natural language processing task view http://cran.r-project.org/web/views/NaturalLanguageProcessing.html but I need to be pushed in the right direction.
the dataset is from the kdd 2010 cup I would like to create meaningful new columns from this column to aid in creating a predictive model. for example it would be nice to know if the string contains a certain operation, or if it contains no operations and instead is describing the problem.
my data frame looks like this:
str(data1)
'data.frame': 809694 obs. of 19 variables:
$ Row : int 1 2 3 4 5 6 7 8 9 10 ...
$ Anon.Student.Id : Factor w/ 574 levels "02i5jCrfQK","02ZjVTxC34",..: 7 7 7 7 7 7 7 7 7 7 ...
$ Problem.Hierarchy : Factor w/ 138 levels "Unit CTA1_01, Section CTA1_01-1",..: 80 80 80 80 80 80 80 80 80 80 ...
$ Problem.Name : Factor w/ 1084 levels "1PTB02","1PTB03",..: 377 377 378 378 378 378 378 378 378 378 ...
$ Problem.View : int 1 1 1 1 2 2 3 3 4 4 ...
$ Step.Name : Factor w/ 187539 levels "-(-0.24444444-y) = -0.93333333",..: 116742 177541 104443 64186 58776 58892 153246 153078 45114 163923 ...
I'm most interested in the Step.Name feature, since it contains the greatest number of unique factor values.
and some example values for step name:
[97170] (1+7)/4 = x
[97171] (1-sqrt(1^2-4*2*-6))/4 = x
[97172] (1-sqrt(1^2-(-48)))/4 = x
[97173] (1-sqrt(1-(-48)))/4 = x
[97174] (1-sqrt(49))/4 = x
[97175] (1-7)/4 = x
[97176] x^2+15x+44 = 0
[97177] a-factor-node
[97178] b-factor-node
[97179] c-factor-node
[97180] num1-factor-node
[97181] num2-factor-node
[97182] den1-factor-node
[97183] (-15?sqrt((-15)^2-4*1*44))/2 = x
[97184] (-15+sqrt((-15)^2-4*1*44))/2 = x
[97185] (-15+sqrt((-15)^2-176))/2 = x
[97186] (-15+sqrt(225-176))/2 = x
[97187] (-15+sqrt(49))/2 = x
[97188] (-15+7)/2 = x
[97189] (-15-sqrt((-15)^2-4*1*44))/2 = x
[97190] (-15-sqrt((-15)^2-176))/2 = x
[97191] (-15-sqrt(225-176))/2 = x
[97192] (-15-sqrt(49))/2 = x
[97193] (-15-7)/2 = x
[97194] 2x^2+x = 0
[97195] a-factor-node
[97196] b-factor-node
[97197] c-factor-node
[97198] num1-factor-node
[97199] num2-factor-node
[97200] den1-factor-node
[97201] (-1?sqrt((-1)^2-4*2*0))/4 = x
[97202] (-1+sqrt((-1)^2-4*2*0))/4 = x
[97203] (-1+sqrt((-1)^2-0))/4 = x
[97204] (-1+sqrt((-1)^2))/4 = x
[97205] (-1+1)/4 = x
[97206] (-1-sqrt((-1)^2-4*2*0))/4 = x
[97207] (-1-sqrt((-1)^2-0))/4 = x
[97208] (-1-sqrt((-1)^2))/4 = x
[97209] (-1-1)/4 = x
[97210] x^2-6x = 0
[97211] a-factor-node
[97212] b-factor-node
Clustering is just scoring each instance in a data array according to some metric, sorting the data array according to this calculated score, then slicing into some number of segments, assigning a label each one.
In other words, you can cluster any data for which you can formulate some meaningful function to calculate similarity of each data point w/r/t the others; this is usually referred to as a similarity metric.
There are a lot of these, but only a small subset of them are useful to evaluate strings. Of these, perhaps the most commonly used is Levenshtein Distance (aka Edit Distance).
This metric is expressed as an integer, and it increments one unit (+1) for each 'edit'--inserting, deleting, or changing a letter--required to transform one word into another. Summing those individual edits (one for each letter) gives you the Levenshtein Distance.
The R Package vwr includes an implementation:
> library(vwr)
> levenshtein.distance('cat', 'hat')
hat
1
> levenshtein.distance('cat', 'catwalk')
catwalk
4
> levenshtein.distance('catwalk', 'sidewalk')
sidewalk
4
> # using a data set supplied with the vmr library
> EW = english.words
> ew1 = sample(EW, 20) # random select 20 words from EW
> # the second argument is a vector of words, returns a vector of distances
> dx = levenshtein.distance('cat', ew1)
> dx
furriers graves crooned cursively gabled caparisons drainpipes
8 5 6 8 5 8 9
patricians medially beholder chirpiness fluttered bobolink lamentably
8 7 8 9 8 8 8
depredations alights unearthed thimbles supersede dissembler
10 6 7 8 9 10
While Levenshtein Distance can be used to cluster your data, whether it should be used for your data is a question i'll leave to you (i.e., the primary use case for L/D is clearly pure text data).
(Perhaps the next-most-common similarity metric that operates on strings is Hamming Distance. Hamming Distance (unlike Levenshtein) requires that the two strings be of equal length, hence it won't work for your data.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With