
Group categories in R according to first letters of a string?

I have a dataset loaded in R, and one of its columns contains text. The text is not unique (several rows can have the same value), but it encodes a specific condition of the row: the first 3-5 letters of the field identify the group the row belongs to. Let me explain with an example.

Here are 3 rows, showing only the id and the column I need to group by:

ID   TEXTFIELD
1    VGH2130
2    BFGF2345
3    VGH3321

Given the example above, I would like to add a new column to the data frame that holds the group, like this:

ID   TEXTFIELD   NEWCOL
1    VGH2130     VGH
2    BFGF2345    BFGF
3    VGH3321     VGH

To determine which groups can appear in this new column, I would like to supply a vector of the possible groups, since every row belongs to one of them (for example groups <- c("VGH", "BFGF", ......)).

Can anyone shed some light on how to do this efficiently? I would like to avoid a for loop, since I have millions of rows and it would take ages.

heythatsmekri asked Apr 23 '15 13:04


2 Answers

You can also try str_extract from the stringr package:

> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
  ID TEXTFIELD group
1  1   VGH2130   VGH
2  2  BFGF2345  BFGF
3  3   VGH3321   VGH
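
For readers who prefer not to depend on stringr, the same idea can be sketched in base R with regexpr() and regmatches(). This is only a minimal sketch under stated assumptions: the sample data frame is rebuilt here from the question's example, and the pattern is anchored with ^ on the assumption that the group is always the leading run of letters. Both functions are vectorised, so no explicit loop is involved.

```r
# Sample data rebuilt from the question's example (illustrative only)
data <- data.frame(ID = 1:3,
                   TEXTFIELD = c("VGH2130", "BFGF2345", "VGH3321"),
                   stringsAsFactors = FALSE)

# regexpr() locates the leading run of letters in every element at once;
# it returns -1 for elements that do not start with a letter
m <- regexpr("^[A-Za-z]+", data$TEXTFIELD)

# regmatches() returns only the matched pieces, so fill the matching rows
# and leave the rest as NA
data$group <- NA_character_
data$group[m != -1] <- regmatches(data$TEXTFIELD, m)

print(data)
```

Rows whose TEXTFIELD does not start with a letter end up with NA in the new column, which makes unexpected values easy to spot.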
Prasanna Nandakumar answered Nov 15 '22 04:11


If df is your data.frame, you can try:

df$NEWCOL <- gsub("([A-Z]+)\\d+.*", "\\1", df$TEXTFIELD)

> df
#  ID TEXTFIELD NEWCOL
#1  1   VGH2130    VGH
#2  2  BFGF2345   BFGF
#3  3   VGH3321    VGH
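
If the set of possible groups is known in advance, as the question mentions, one option is to build the pattern from that vector and store the result as a factor. The following is a rough sketch, not a definitive solution: the groups vector, the extra "XX99" row, and the helper names by_length and prefix are made up for illustration, and the group names are assumed to contain only letters (no regex metacharacters).

```r
# Sample data from the question plus one made-up row that fits no group
df <- data.frame(ID = 1:4,
                 TEXTFIELD = c("VGH2130", "BFGF2345", "VGH3321", "XX99"),
                 stringsAsFactors = FALSE)

# Hypothetical vector of known groups
groups <- c("VGH", "BFGF")

# Put longer names first so a group that is a prefix of another
# cannot shadow the longer one, then build "^(BFGF|VGH)"
by_length <- groups[order(nchar(groups), decreasing = TRUE)]
pattern <- paste0("^(", paste(by_length, collapse = "|"), ")")

# Vectorised match: one call over the whole column, no loop
m <- regexpr(pattern, df$TEXTFIELD)
prefix <- rep(NA_character_, nrow(df))
prefix[m != -1] <- regmatches(df$TEXTFIELD, m)

# A factor keeps the known groups as levels; unmatched rows stay NA
df$NEWCOL <- factor(prefix, levels = groups)

print(df)
```

Storing the column as a factor also tends to make later steps such as table(df$NEWCOL) or split(df, df$NEWCOL) straightforward.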
Cath answered Nov 15 '22 03:11