Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Break string into several columns using tidyr::extract regex

Tags:

regex

r

tidyr

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:

sasdic <- data.frame(a=c(
  '@1   ANO_CENSO   5.  /*Ano do Censo*/',
  '@71  TP_SEXO $Char1. /*Sexo*/',
  '@72  TP_COR_RACA $Char1. /*Cor/raça*/',
  '@74  FK_COD_PAIS_ORIGEM  4.  /*Código País de origem*/' )) 

I would like for the:

  • first number ([0-9]+) to go to variable "int_pos"
  • the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
  • The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
  • Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this). All other intermidiate caracters (blank spaces, ".", "/", "" should be disconsidered)

This would be the result

d <- data.frame(int_pos=c(1,72,72,74),
                var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
                x=c('5','Chart1','$Char1','4'),
                label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )

I tryed to construct a regular expression for this. This is what I got so far:

sasdic %>% extract(a, c('int_pos','var_name','x','label'), 
                   "([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label") 
             -> d

above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

like image 239
LucasMation Avatar asked Jan 15 '16 15:01

LucasMation


People also ask

How do I split a string into multiple columns in R?

To split a column into multiple columns in the R Language, We use the str_split_fixed() function of the stringr package library. The str_split_fixed() function splits up a string into a fixed number of pieces.

How do I extract multiple columns in R?

To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.

Which function in Tidyr package is used to split a single column into multiple columns?

separate( ) function: This can be accomplished using the separate() function which turns a single character column into multiple columns.


2 Answers

In the regex used, we are matchng one more more punctuation characters ([[:punct:]]+) i.e. @ followed by capturing the numeric part ((\\d+) - this will be our first column of interest), followed by one or more white-space (\\s+), followed by the second capture group (\\S+ - one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (\\s+), then we capture the third group (([[:alum:]$]+) - i.e. one or more characters that include the alpha numeric along with $ so as to match $Char1), next we match one or more characters that are not a letter ([^A-Za-z]+- this should get rid of the space and *) and the last part we capture one or more characters that are not * (([^*]+).

sasdic %>% 
      extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
   "[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")

#  int_pos           var_name      x                 label
#1       1          ANO_CENSO      5          Ano do Censo
#2      71            TP_SEXO $Char1                  Sexo
#3      72        TP_COR_RACA $Char1              Cor/raça
#4      74 FK_COD_PAIS_ORIGEM      4 Código País de origem
like image 121
akrun Avatar answered Sep 21 '22 11:09

akrun


This is another option, though it uses the data.table package instead of tidyr:

library(data.table)
setDT(sasdic)

# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]                   
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]                                            

sasdic

#                    label int_pos           var_name       x
# 1:          Ano do Censo       1          ANO_CENSO      5.
# 2:                  Sexo      71            TP_SEXO $Char1.
# 3:              Cor/raça      72        TP_COR_RACA $Char1.
# 4: Código País de origem      74 FK_COD_PAIS_ORIGEM      4.

This assumes that the "remaining parts" (aside from the label) are space-separated.


This could also be done in one block (which is what I would do):

sasdic[, c("a","label","int_pos","var_name","x") := {
  x   = tstrsplit(a, "/\\*|\\*/")
  x1s = tstrsplit(gsub("^@","",x[[1]])," +")
  c(list(NULL), x1s, x[2])
}]
like image 34
Frank Avatar answered Sep 23 '22 11:09

Frank