
How to pull out specific rows from two data frames with different dimensions and produce multiple .csv files?

Tags:

sqlite

r

Data frame one.

  df1 <- structure(list(trial_id = c(2022L, 2023L, 2123L, 2184L, 3883L, 
4434L), ctri_number = c("CTRI/2018/02/011794 ", "CTRI/2017/08/009517 ", 
"CTRI/2019/05/019036 ", "CTRI/2017/12/010935 ", "CTRI/2017/09/009746 ", 
"CTRI/2016/06/007055 "), name = c("National Institute of Allergy and Infectious Diseases NIAIDMaryland USA", 
"Jawaharlal Nehru Medical College", "KLEU Ayurveda Pharmacy", 
"Amgen Inc", "Dr Arunkumar", "ALVAS EDUCATION FOUNDATION"), type_of_sponsor = c("' Government funding agency '", 
"' Government medical college '", "' Research institution '", 
"' Pharmaceutical industry-Global '", " Other [Self sponsored] '", 
"' Private hospital/clinic '"), address = c("' USA '", "' Jawaharlal Nehru Medical College, Aligarh Muslim University, Aligarh-202001 '", 
"' KLEU Ayurveda Pharmacy, Khasbhag, Belgaum, Karnataka '", "' One Amgen Center Drive\n\n\nThousand Oaks, CA USA\n\n\n91320 '", 
"' Room no 32 ,Department of Periodontics , Government Dental college , Trivandrum '", 
"' ALVAS EDUCATION FOUNDATION ALVAS COLLEGE OF PHYSIOTHERAPY\n\n\nMoodabidri - 574227\n\n\nSouth Canara District\n\n\nKarnataka '"
)), row.names = c(NA, 6L), class = "data.frame")

Data frame two.

    df2 <- structure(list(distinctOrganizations = c("A AMMU", "A and U tibbia college and hospital", 
"A Arumuga kani", "A KIREETI", "AAMIR ZUBAIR SHAIKH", "Aansu Susan Varghese"
)), row.names = c(NA, 6L), class = "data.frame")

For every value in the distinctOrganizations column of data frame two, I have to pull out the rows of data frame one whose name column matches that value.

Each value should produce its own .csv file.

How can I achieve this?


Possible outcome: a CSV file similar to the image.

The image shows a CSV file that contains only the rows related to AIIMS and its variants. I need a separate CSV file for each such name.

classy_BLINK asked Oct 26 '22 10:10



1 Answer

First of all: your example data don't produce any matches (df2 doesn't contain any of the names in your example df1).
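A quick way to confirm this is to intersect the two name columns. The sketch below copies the names from the question into two plain vectors; df1_names and df2_names are illustrative names, not objects from the question.

```r
# Names copied from the question's example data
df1_names <- c("National Institute of Allergy and Infectious Diseases NIAIDMaryland USA",
               "Jawaharlal Nehru Medical College", "KLEU Ayurveda Pharmacy",
               "Amgen Inc", "Dr Arunkumar", "ALVAS EDUCATION FOUNDATION")
df2_names <- c("A AMMU", "A and U tibbia college and hospital", "A Arumuga kani",
               "A KIREETI", "AAMIR ZUBAIR SHAIKH", "Aansu Susan Varghese")

# Names that occur in both vectors (exact, case-sensitive comparison)
matches <- intersect(df1_names, df2_names)
length(matches)  # 0: no name appears in both
```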

If I understood your question correctly, you could use:

library(dplyr)
library(purrr)
library(readr)

df1 %>% 
  inner_join(df2, by = c("name" = "distinctOrganizations")) %>% 
  split(f = .$name) %>% 
  walk(~write_csv(.x, paste0(unique(.x$name), ".csv")))

  1. We use an inner_join to remove all rows from df1 that don't have a match in df2.
  2. Then we split the resulting data.frame by name, creating a new data.frame for each (distinct) organization.
  3. Finally, we use purrr's walk function to write a .csv file for each of these organizations. This produces .csv files like Amgen Inc.csv or ALVAS EDUCATION FOUNDATION.csv.
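If you prefer to avoid the extra packages, the same three steps can be sketched in base R. This is only an illustration under a few assumptions: df1 and df2 here are reduced toy versions of the data, the files go to a temporary directory (out_dir is an illustrative name), and %in% is used instead of a join, which gives the same rows as long as df2 holds each organization only once.

```r
# Toy stand-ins for df1 and df2 (values taken from the question, columns reduced)
df1 <- data.frame(
  trial_id = c(2184L, 4434L, 2023L),
  name = c("Amgen Inc", "ALVAS EDUCATION FOUNDATION",
           "Jawaharlal Nehru Medical College"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  distinctOrganizations = c("Amgen Inc", "ALVAS EDUCATION FOUNDATION", "A KIREETI"),
  stringsAsFactors = FALSE
)

# 1. Keep only the rows of df1 whose name occurs in df2
matched <- df1[df1$name %in% df2$distinctOrganizations, , drop = FALSE]

# 2. Split into one data frame per organization
pieces <- split(matched, matched$name)

# 3. Write each piece to its own file in a temporary directory
out_dir <- tempdir()
for (org in names(pieces)) {
  write.csv(pieces[[org]], file.path(out_dir, paste0(org, ".csv")),
            row.names = FALSE)
}
```

Note that the organization name is used directly as the file name, so names containing characters such as / would need to be sanitized first.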

Note: The address column contains some line breaks (\n). Consider removing them, since they could cause trouble in your .csv files and in later processing steps. There is also leading and trailing whitespace in the type_of_sponsor column that you may want to remove.
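One possible way to do that cleanup in base R is sketched below. The helper clean_text is hypothetical, the two-row df1 is a reduced stand-in using values from the question, and stripping the surrounding single quotes as well as the whitespace is an assumption about what you want.

```r
# Reduced stand-in for df1, using values from the question
df1 <- data.frame(
  type_of_sponsor = c("' Pharmaceutical industry-Global '",
                      "' Private hospital/clinic '"),
  address = c("' One Amgen Center Drive\n\n\nThousand Oaks, CA USA\n\n\n91320 '",
              "' USA '"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replaces line breaks and trims quotes/whitespace
clean_text <- function(x) {
  x <- gsub("[\r\n]+", ", ", x)      # collapse runs of line breaks into ", "
  x <- gsub("^[' ]+|[' ]+$", "", x)  # drop leading/trailing quotes and spaces
  x
}

df1$address <- clean_text(df1$address)
df1$type_of_sponsor <- clean_text(df1$type_of_sponsor)
```

Run this before the join so that the cleaned values end up in the .csv files.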


Data

I modified df2 to get two matches:

df2 <- structure(list(distinctOrganizations = c("Amgen Inc", "A and U tibbia college and hospital", 
"ALVAS EDUCATION FOUNDATION", "A KIREETI", "AAMIR ZUBAIR SHAIKH", 
"Aansu Susan Varghese")), row.names = c(NA, 6L), class = "data.frame")
Martin Gal answered Jan 02 '23 21:01