I have a dataset loaded in R, and one of its columns contains text. The text is not unique (several rows can share the same value), but it encodes a specific condition of the row: the first 3-5 letters of the field identify the group the row belongs to. Let me explain with an example.
Here are 3 rows, showing only the ID and the column I need to group by:
ID........... TEXTFIELD
1............ VGH2130
2............ BFGF2345
3............ VGH3321
Given the example above, I would like to create a new column in the data frame that holds the group, such as:
ID........... TEXTFIELD........... NEWCOL
1............ VGH2130............. VGH
2............ BFGF2345............ BFGF
3............ VGH3321............. VGH
To determine the groups that go into this new column, I would like to define a vector with the possible groups, since every row belongs to one of them (for example groups <- c("VGH", "BFGF", ......)).
Can anyone shed some light on how to do this efficiently? I would like to avoid a for loop, since I have millions of rows and it would take ages.
You can also try str_extract from the stringr package:
> library(stringr)
> data$group <- str_extract(data$TEXTFIELD, "[A-Za-z]+")
> data
ID TEXTFIELD group
1 1 VGH2130 VGH
2 2 BFGF2345 BFGF
3 3 VGH3321 VGH
You can try this, if df is your data.frame:
df$NEWCOL <- gsub("([A-Z]+)\\d+.*", "\\1", df$TEXTFIELD)
> df
# ID TEXTFIELD NEWCOL
#1 1 VGH2130 VGH
#2 2 BFGF2345 BFGF
#3 3 VGH3321 VGH
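Neither approach above uses the list of known groups mentioned in the question. If that list is available, one option is to build a single anchored pattern from it and match the whole column in one vectorised call. This is only a minimal sketch: the sample df and the groups vector below are stand-ins, and it assumes the prefixes are plain letters with no regex metacharacters.

```r
# Sample data mirroring the question (stand-in for the real data frame)
df <- data.frame(ID = 1:3,
                 TEXTFIELD = c("VGH2130", "BFGF2345", "VGH3321"),
                 stringsAsFactors = FALSE)

# Known group prefixes (assumed complete; extend as needed)
groups <- c("VGH", "BFGF")

# Put longer prefixes first so they are tried before shorter ones
groups <- groups[order(nchar(groups), decreasing = TRUE)]

# One anchored alternation, e.g. ^(BFGF|VGH)
pattern <- paste0("^(", paste(groups, collapse = "|"), ")")

# regexpr() is vectorised over the column; it returns -1 where nothing matches
m <- regexpr(pattern, df$TEXTFIELD, perl = TRUE)

# Rows without a known prefix stay NA instead of getting a wrong group
df$NEWCOL <- NA_character_
df$NEWCOL[m > 0] <- regmatches(df$TEXTFIELD, m)
```

Unlike the purely letter-based patterns, this leaves NA in NEWCOL for any row whose text does not start with one of the listed prefixes, which makes unexpected values easy to spot with is.na(df$NEWCOL).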