Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging Dataframe chunks in Pandas

I currently have a script that will combine multiple csv files into one, the script works fine except that we run out of ram really quickly when larger files start being used. This is an issue for one reason, the script runs on an AWS server and running out of RAM means a server crash. Currently the file size limit is around 250mb each, and that limits us to 2 files, however as the company I work is in Biotech and we're using Genetic Sequencing files, the files we use can range in size from 17mb up to around 700mb depending on the experiment. My idea has been to load one dataframe into memory whole and then chunk the others and combine iteratively, this didn't work so well.

My dataframes are similar to this (they can vary in size, but some columns remain the same; "Mod", "AA" and "Nuc")

+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   |
+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+

When combining the two frames I need them to merge on "Mod", "Nuc" and "AA" so that I have something similar to this

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

I already have code to change the names of the headers so I'm not worried about that, however when I use chunks I end up with something closer to

+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| Mod | Nuc | AA  | 1_1 | 1_2 | 1_3 | 1_4 | 1_5 | 2_1 | 2_2 | 2_3 | 3_1 | 3_2 | 3_3 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 000 | ABC | ABC | 10  | 5   | 9   | 16  | 8   | 5   | 29  | 0   | NA  | NA  | NA  |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 010 | CBA | CBA | 0   | 1   | 4   | 9   | 0   | NA  | NA  | NA  | 0   | 0   | 1   |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

basically it treats each chunk as if it were a new file and not from the same one.

I know why its doing that but I'm not sure on how to fix this, right now my code for chunking is really simple.

    file = "tableFile/123456.txt"
    initDF = pd.read_csv(file, sep="\t", header=0)
    file2 = "tableFile/7891011.txt"
    for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0):
        initDF = initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"])

as you can see its pretty bare bones, as I said I know why its doing what its doing but I'm not experienced with Pandas nor with dataframe joins to be able to fix it so any help would be much appreciated. I also couldn't find anything like this while I was searching stack and on google.

like image 593
Peter Waters Avatar asked Oct 16 '22 10:10

Peter Waters


1 Answers

The solution is to do it in chunks like you are but to concat the output into a new DataFrame like so:

file = "tableFile/123456.txt"
initDF = pd.read_csv(file, sep="\t", header=0)
file2 = "tableFile/7891011.txt"

amgPd = pd.DataFrame()              

for chunks in pd.read_csv(file2, sep="\t", chunksize=50000, header=0): 
    amgPd = pd.concat([amgPd, initDF.merge(chunks, how='right', on=['Mod', "Nuc", "AA"]]) 
like image 149
PhoenixCoder Avatar answered Oct 21 '22 14:10

PhoenixCoder