Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting a data frame in rows according to a pattern in a column

Tags:

regex

dataframe

r

i have a data frame like so;

mydf=data.frame(Authors=c("A","B","C"), ID=c("1","2","3"), Adresses=c("[XYZ, DEF] Ege Univ, Izmir, Turkey","[Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain; [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland","[Zeek, Umt] Kastamonu Univ, Kast, Turkey; [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey"))

it seems like this:

data frame before

i want to split it according to pattern in Adresses column like this:

enter image description here

Here the pattern is something like this: [ ] ; But the last record of the cell (or if the cell has only one record ) doesn't have a semicolon as you can see from the first picture.

i tried with tidyr, dplyr, regex in r and also this strsplit(as.character(mydf[,3]), "[[(.*)]](.*);") pattern but it didn't work. Any help will be appreciated.

like image 700
gc7 Avatar asked Dec 13 '25 22:12

gc7


2 Answers

In base R, we can split the column into a list of vectors and then replicate the rows of the data based on the lengths of the list and update the 'Adresses' by unlisting the list

lst1 <- strsplit(mydf$Adresses, ";\\s*(?=\\[)", perl = TRUE)
mydf2 <-  transform(mydf[rep(seq_len(nrow(mydf)), lengths(lst1)),],
    Adresses = unlist(lst1))
row.names(mydf2) <- NULL

-output

> mydf2
  Authors ID                                                                                               Adresses
1       A  1                                                                     [XYZ, DEF] Ege Univ, Izmir, Turkey
2       B  2 [Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain
3       B  2                              [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland
4       C  3                                                               [Zeek, Umt] Kastamonu Univ, Kast, Turkey
5       C  3                                                         [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey
like image 152
akrun Avatar answered Dec 16 '25 14:12

akrun


You could use the ; (?=\\[)-regex, looking for a semi-colon and a space before a bracket.

E.g. with tidyr:

library(tidyr)

mydf |>
  separate_rows(Adresses, sep = "; (?=\\[)")

Output:

# A tibble: 5 × 3
  Authors ID    Adresses                                                                                              
  <chr>   <chr> <chr>                                                                                                 
1 A       1     [XYZ, DEF] Ege Univ, Izmir, Turkey                                                                    
2 B       2     [Vil, Beat; Fern, Alm; Pro-Pas, Ram; Fevfz, Jes; Saur, Mari] INIA CSIC, Dept Genet Anim, Madrid, Spain
3 B       2     [Penza, Carna; Housen, Rosie] Univ Edigh, Roxbn Inst, Edinburgh, Scotland                             
4 C       3     [Zeek, Umt] Kastamonu Univ, Kast, Turkey                                                              
5 C       3     [Kalu, Sear] Ege Univ, Fac Engn, Izmir, Turkey  
like image 45
harre Avatar answered Dec 16 '25 13:12

harre



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!