Based on this question: Split text string in a data.table columns, I was wondering whether there is an efficient method to split the text string conditionally based on the contents of the row.
Suppose I have the following table:
Comments Eaten
001 Testing my computer No
0026 Testing my fridge No
Testing my car Yes
and I would to have this:
ID Comments Eaten
001 Testing my computer No
0026 Testing my fridge No
NA Testing my car Yes
Where NA is empty.
Is this possible in data.table?
The comment should have an ID, but since this is optional, I only want to extract the ID's if and only if the comment starts with a number.
This could be done using tidyr
's extract
function which allows you to specify a regex pattern:
tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
# ID Comments Eaten
#1: 001 Testing my computer No
#2: 0026 Testing my fridge No
#3: NA Testing my car Yes
You can add the argument convert = TRUE
if you want the extracted columns to be converted to a more sensible type.
Another option using only base R and data.table would be
dt[grepl("^\\d+", Comments), # check if start with ID (subset)
`:=`(ID = sub("^(\\d+).*", "\\1",Comments), # extract ID from comments
Comments = sub("^(\\d+)", "",Comments)) # delete ID from Comments
]
Though in this case the tidyr syntax seems a little easier to me. There may also be a way using data.table's tstrsplit
function with a fancy lookaround regex.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With