Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Conditional string split in data.table in R

Based on this question: Split text string in a data.table columns, I was wondering whether there is an efficient method to split the text string conditionally based on the contents of the row.

Suppose I have the following table:

Comments                  Eaten
001 Testing my computer   No
0026 Testing my fridge    No
Testing my car            Yes

and I would to have this:

ID   Comments             Eaten
001  Testing my computer  No
0026 Testing my fridge    No
NA   Testing my car       Yes

Where NA is empty.

Is this possible in data.table?

The comment should have an ID, but since this is optional, I only want to extract the ID's if and only if the comment starts with a number.

like image 812
Snowflake Avatar asked Mar 09 '23 04:03

Snowflake


1 Answers

This could be done using tidyr's extract function which allows you to specify a regex pattern:

tidyr::extract(dt, Comments, c("ID", "Comments"), regex = "^(\\d+)?\\s?(.*)$")
#     ID            Comments Eaten
#1:  001 Testing my computer    No
#2: 0026   Testing my fridge    No
#3:   NA      Testing my car   Yes

You can add the argument convert = TRUE if you want the extracted columns to be converted to a more sensible type.


Another option using only base R and data.table would be

dt[grepl("^\\d+", Comments),                     # check if start with ID (subset)
   `:=`(ID = sub("^(\\d+).*", "\\1",Comments),   # extract ID from comments
        Comments = sub("^(\\d+)", "",Comments))  # delete ID from Comments
]

Though in this case the tidyr syntax seems a little easier to me. There may also be a way using data.table's tstrsplit function with a fancy lookaround regex.

like image 63
talat Avatar answered Mar 20 '23 17:03

talat