I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:
sasdic <- data.frame(a=c(
'@1 ANO_CENSO 5. /*Ano do Censo*/',
'@71 TP_SEXO $Char1. /*Sexo*/',
'@72 TP_COR_RACA $Char1. /*Cor/raça*/',
'@74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' ))
I would like for the:
This would be the result
d <- data.frame(int_pos=c(1,72,72,74),
var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
x=c('5','Chart1','$Char1','4'),
label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )
I tryed to construct a regular expression for this. This is what I got so far:
sasdic %>% extract(a, c('int_pos','var_name','x','label'),
"([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label")
-> d
above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.
To split a column into multiple columns in the R Language, We use the str_split_fixed() function of the stringr package library. The str_split_fixed() function splits up a string into a fixed number of pieces.
To pick out single or multiple columns use the select() function. The select() function expects a dataframe as it's first input ('argument', in R language), followed by the names of the columns you want to extract with a comma between each name.
separate( ) function: This can be accomplished using the separate() function which turns a single character column into multiple columns.
In the regex used, we are matchng one more more punctuation characters ([[:punct:]]+
) i.e. @
followed by capturing the numeric part ((\\d+)
- this will be our first column of interest), followed by one or more white-space (\\s+
), followed by the second capture group (\\S+
- one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (\\s+
), then we capture the third group (([[:alum:]$]+)
- i.e. one or more characters that include the alpha numeric along with $
so as to match $Char1
), next we match one or more characters that are not a letter ([^A-Za-z]+
- this should get rid of the space and *
) and the last part we capture one or more characters that are not *
(([^*]+)
.
sasdic %>%
extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
"[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")
# int_pos var_name x label
#1 1 ANO_CENSO 5 Ano do Censo
#2 71 TP_SEXO $Char1 Sexo
#3 72 TP_COR_RACA $Char1 Cor/raça
#4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem
This is another option, though it uses the data.table package instead of tidyr:
library(data.table)
setDT(sasdic)
# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]
# remove leading "@", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^@","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]
sasdic
# label int_pos var_name x
# 1: Ano do Censo 1 ANO_CENSO 5.
# 2: Sexo 71 TP_SEXO $Char1.
# 3: Cor/raça 72 TP_COR_RACA $Char1.
# 4: Código País de origem 74 FK_COD_PAIS_ORIGEM 4.
This assumes that the "remaining parts" (aside from the label) are space-separated.
This could also be done in one block (which is what I would do):
sasdic[, c("a","label","int_pos","var_name","x") := {
x = tstrsplit(a, "/\\*|\\*/")
x1s = tstrsplit(gsub("^@","",x[[1]])," +")
c(list(NULL), x1s, x[2])
}]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With