I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow: <pre class="prettyprint"><code>sasdic <- data.frame(a=c( '@1 ANO_CENSO 5. /*Ano do Censo*/', '@71 TP_SEXO $Char1. /*Sexo*/', '@72 TP_COR_RACA $Char1. /*Cor/raça*/', '@74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' )) </code></pre> I would like for the: <ul> <li>first number ([0-9]+) to go to variable "int_pos"</li> <li>the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"</li> <li>The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?</li> <li>Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", "" should be disconsidered)</li> </ul> This would be the result <pre class="prettyprint"><code>d <- data.frame(int_pos=c(1,72,72,74), var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'), x=c('5','Chart1','$Char1','4'), label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') ) </code></pre> I tryed to construct a regular expression for this. This is what I got so far: <pre class="prettyprint"><code>sasdic %>% extract(a, c('int_pos','var_name','x','label'), "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") -> d </code></pre> above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

In the regex used, we are matchng one more more punctuation characters (<code>[[:punct:]]+</code>) i.e. <code>@</code> followed by capturing the numeric part (<code>(\\d+)</code> - this will be our first column of interest), followed by one or more white-space (<code>\\s+</code>), followed by the second capture group (<code>\\S+</code> - one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (<code>\\s+</code>), then we capture the third group (<code>([[:alum:]$]+)</code> - i.e. one or more characters that include the alpha numeric along with <code>$</code> so as to match <code>$Char1</code>), next we match one or more characters that are not a letter (<code>[^A-Za-z]+</code>- this should get rid of the space and <code>*</code>) and the last part we capture one or more characters that are not <code>*</code> (<code>([^*]+)</code>. <pre class="prettyprint"><code>sasdic %>% extract(a, into=c('int_pos', 'var_name', 'x', 'label'), "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)") # int_pos var_name x label #1 1 ANO_CENSO 5 Ano do Censo #2 71 TP_SEXO $Char1 Sexo #3 72 TP_COR_RACA $Char1 Cor/raça #4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem </code></pre>

Break string into several columns using tidyr::extract regex

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:

sasdic <- data.frame(a=c(
  '@1   ANO_CENSO   5.  /*Ano do Censo*/',
  '@71  TP_SEXO $Char1. /*Sexo*/',
  '@72  TP_COR_RACA $Char1. /*Cor/raça*/',
  '@74  FK_COD_PAIS_ORIGEM  4.  /*Código País de origem*/' ))

I would like for the:

first number ([0-9]+) to go to variable "int_pos"
the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", "" should be disconsidered)

This would be the result

d <- data.frame(int_pos=c(1,72,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','Chart1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )

I tryed to construct a regular expression for this. This is what I got so far:

sasdic %>% extract(a, c('int_pos','var_name','x','label'), 
                   "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") 
             -> d

above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

How do I split a string into multiple columns in R?

To split a column into multiple columns in the R Language, We use the str_split_fixed() function of the stringr package library. The str_split_fixed() function splits up a string into a fixed number of pieces.

How do I extract multiple columns in R?

To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.

Which function in Tidyr package is used to split a single column into multiple columns?

separate( ) function: This can be accomplished using the separate() function which turns a single character column into multiple columns.

In the regex used, we are matchng one more more punctuation characters ([[:punct:]]+) i.e. @ followed by capturing the numeric part ((\\d+) - this will be our first column of interest), followed by one or more white-space (\\s+), followed by the second capture group (\\S+ - one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (\\s+), then we capture the third group (([[:alum:]$]+) - i.e. one or more characters that include the alpha numeric along with $ so as to match $Char1), next we match one or more characters that are not a letter ([^A-Za-z]+- this should get rid of the space and *) and the last part we capture one or more characters that are not * (([^*]+).

sasdic %>% 
      extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
   "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")

#  int_pos           var_name      x                 label
#1       1          ANO_CENSO      5          Ano do Censo
#2      71            TP_SEXO $Char1                  Sexo
#3      72        TP_COR_RACA $Char1              Cor/raça
#4      74 FK_COD_PAIS_ORIGEM      4 Código País de origem

This is another option, though it uses the data.table package instead of tidyr:

library(data.table)
setDT(sasdic)

# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]                   
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]                                            

sasdic

#                    label int_pos           var_name       x
# 1:          Ano do Censo       1          ANO_CENSO      5.
# 2:                  Sexo      71            TP_SEXO $Char1.
# 3:              Cor/raça      72        TP_COR_RACA $Char1.
# 4: Código País de origem      74 FK_COD_PAIS_ORIGEM      4.

This assumes that the "remaining parts" (aside from the label) are space-separated.

This could also be done in one block (which is what I would do):

sasdic[, c("a","label","int_pos","var_name","x") := {
  x   = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^@","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]

Break string into several columns using tidyr::extract regex

Tags:

regex

r

tidyr

LucasMation

People also ask

2 Answers

akrun

Frank

Recent Activity

Donate For Us

Break string into several columns using tidyr::extract regex

Tags:

regex

r

tidyr

LucasMation

People also ask

2 Answers

akrun

Frank

Related questions

Recent Activity

Donate For Us