
How do I split a string with tidyr::separate in R and retain the values of the separator string?

Tags: r, tidyr, stringr

I have a data set:

crimes<-data.frame(x=c("Smith", "Jones"), charges=c("murder, first degree-G, manslaughter-NG", "assault-NG, larceny, second degree-G"))

I'm using tidyr::separate to split the charges column on a match with "G,":

crimes<-separate(crimes, charges, into=c("v1","v2"), sep="G,")

This splits my columns, but removes the separator "G,". I want to retain the "G," in the resulting column split.

My desired output is:

 x         v1                       v2
 Smith     murder, first degree-G   manslaughter-NG
 Jones     assault-NG               larceny, second degree-G

Any suggestions welcome.

TDog asked Apr 13 '16 00:04



2 Answers

In the templates below, replace <yourRegexPattern> with your own regular expression.

If you want the 'sep' in the left column (look behind)

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?<=<yourRegexPattern>)")

If you want the 'sep' in the right column (look ahead)

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=<yourRegexPattern>)")
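
As a minimal sketch of the two templates above, here they are applied to a made-up data frame. The names `df`, `code`, `a` and `b` are hypothetical and only serve the illustration; the separator is a single "-".

```r
library(tidyr)

# Hypothetical data: each value contains exactly one "-"
df <- data.frame(code = c("abc-123", "de-45"), stringsAsFactors = FALSE)

# Lookbehind: the split point is just AFTER the "-",
# so the "-" stays at the end of the left column
left <- separate(df, code, into = c("a", "b"), sep = "(?<=-)")

# Lookahead: the split point is just BEFORE the "-",
# so the "-" becomes the start of the right column
right <- separate(df, code, into = c("a", "b"), sep = "(?=-)")

print(left)
print(right)
```

Because a lookaround matches a position rather than any characters, nothing is consumed by the split, which is why the separator survives in one of the two columns.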

Also note that when you separate a word from a run of digits (e.g. August1990 into August and 1990), the lookahead matches before every digit. Add extra="merge" so that only the first match is used as a split point and the remaining digits stay together.

Example:

dataframe %>% separate(column_to_sep, into = c("newCol1", "newCol2"), sep="(?=[[:digit:]])", extra="merge")
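
A runnable version of that example might look like the sketch below. The data frame `dates` and the column names `stamp`, `month` and `year` are assumptions made up for the illustration.

```r
library(tidyr)

# Hypothetical data: a word immediately followed by a four-digit year
dates <- data.frame(stamp = c("August1990", "May2016"),
                    stringsAsFactors = FALSE)

# "(?=[[:digit:]])" matches the position before EVERY digit.
# extra = "merge" limits the split to the first match, so the
# digits after it are kept together in the last column.
out <- separate(dates, stamp, into = c("month", "year"),
                sep = "(?=[[:digit:]])", extra = "merge")

print(out)
```

Without extra = "merge", the default extra = "warn" would split at every digit, keep only the first two pieces, and warn about the discarded ones.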

Cameron answered Sep 24 '22 21:09


UPDATE

This is what you asked for. Keep in mind that your data is not tidy: both V1 and V2 hold more than one variable in each column.

A<-separate(crimes,charges,into=c("V1","V2"),sep = "(?<=G,)")
A
      x                      V1                        V2
1 Smith murder, first degree-G,           manslaughter-NG
2 Jones             assault-NG,  larceny, second degree-G
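
The lookbehind on "G," keeps the comma at the end of V1 and leaves the following space at the start of V2. If the goal is the exact output shown in the question (no trailing comma, no leading space), one possible variation is to look behind for the "G" only and let the ", " itself be consumed as the separator. This is a sketch of that idea, not part of the original answer; the result name `exact` is made up.

```r
library(tidyr)

# The data from the question
crimes <- data.frame(
  x = c("Smith", "Jones"),
  charges = c("murder, first degree-G, manslaughter-NG",
              "assault-NG, larceny, second degree-G"),
  stringsAsFactors = FALSE
)

# Split on ", " only where it is preceded by "G".
# The "G" is kept (it is inside the lookbehind); the ", " is dropped.
exact <- separate(crimes, charges, into = c("v1", "v2"), sep = "(?<=G), ")

print(exact)
```

The commas inside "murder, first degree-G" and "larceny, second degree-G" are left alone because they are not preceded by a "G".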

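An easier way to keep the "G" or "NG" is to use sep=", ", as alistaire suggested. This only works when no charge contains a comma of its own; the output below assumes simpler data of the form "murder-G, manslaughter-NG" and "assault-NG, larceny-G".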

A<-separate(crimes, charges, into=c("v1","v2"), sep = ', ')

This gives

      x         v1              v2
1 Smith   murder-G manslaughter-NG
2 Jones assault-NG       larceny-G

If you want to keep separating your data.frame (this time on the "-"):

separate(A, v1, into = c("v3","v4"), sep = "-")

that gives

      x      v3 v4              v2
1 Smith  murder  G manslaughter-NG
2 Jones assault NG       larceny-G

You'll need to do that again for the v2 column. I don't know whether you want to keep separating; please post your expected output so I can make my answer more specific.
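
As a rough sketch, both columns could be split in sequence like this. It assumes the simpler data without commas inside a charge; the names `simple`, `step1`, `step2`, `step3`, `charge1`, `verdict1`, `charge2` and `verdict2` are made up for the illustration.

```r
library(tidyr)

# Assumed simplified data: no commas inside an individual charge
simple <- data.frame(
  x = c("Smith", "Jones"),
  charges = c("murder-G, manslaughter-NG", "assault-NG, larceny-G"),
  stringsAsFactors = FALSE
)

# Step 1: one column per charge
step1 <- separate(simple, charges, into = c("v1", "v2"), sep = ", ")

# Step 2: split the first charge into charge and verdict
step2 <- separate(step1, v1, into = c("charge1", "verdict1"), sep = "-")

# Step 3: the same for the second charge
step3 <- separate(step2, v2, into = c("charge2", "verdict2"), sep = "-")

print(step3)
```

Each call replaces the column it splits with the new columns in the same position, so the result has the columns x, charge1, verdict1, charge2 and verdict2.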

Matias Andina answered Sep 23 '22 21:09