Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

concatenate pandas dataframe in a loop of files

I am trying to write a script that loops over files via a certain pattern/variable, then it concatenates the 8th column of the files while keeping the first 4 columns which are common to all files. The script works if I use the following command:

reader = csv.reader(open("1isoforms.fpkm_tracking.txt", 'rU'), delimiter='\t') #to read the header names so i can use them as index. all headers for the three files are the same
header_row = reader.next() # Gets the header
df1 =  pd.read_csv("1isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #1 with index as first 5 columns
df2 = pd.read_csv("2isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #2 with index as first 5 columns
df3 = pd.read_csv("3isoforms.fpkm_tracking.txt", index_col=header_row[0:4], sep="\t") #file #3 with index as first 5 columns

result = pd.concat([df1.ix[:,4], df2.ix[:,4]], keys=["Header1", "Header2", "Header3"], axis=1) #concatenates the 8th column of the files and changes the header
result.to_csv("OutputTest.xls", sep="\t")

While this works, it is NOT practical for me to enter file names one by one as I sometimes have 100's of files, so cant type in a df...function for each. Instead, I was trying to use a for loop to do this but i couldnt figure it out. here is what I have so far:

k=0
for geneFile in glob.glob("*_tracking*"):
    while k < 3:
         reader = csv.reader(open(geneFile, 'rU'), delimiter='\t')
         header_row = reader.next()
         key = str(k)
         key = pd.read_csv(geneFile, index_col=header_row[0:1], sep="\t")
   result = pd.concat([key[:,5]], axis=1)
   result.to_csv("test2.xls", sep="\t")

However, this is not working .

The issues I am facing are as follows:

  1. How can I be able to iterate over input files and generate different variable names for each which I can then have it used in the pd.concat function one after the other?

  2. How can I use a for loop to generate a string file name that is a combination of df and an integer

  3. How can I fix the above script get my desired item.

  4. A minor issue is regarding the way I am using the col_index function: is there a way to use the column # rather than column names? I know it works for index_col=0 or any single #. But I couldn't use integers for > 1 column of indexing.

Note that all files have the exact same structure, and the index columns are the same.

Your feedback is highly appreciated.

like image 834
BioProgram Avatar asked Oct 30 '22 12:10

BioProgram


1 Answers

Consider using merge with right_index and left_index arguments:

import pandas as pd

numberoffiles = 100

# FIRST IMPORT (CREATE RESULT DATA FRAME)
result = pd.read_csv("1isoforms.fpkm_tracking.txt", sep="\t",
                      index_col=[0,1,2,3], usecols=[0,1,2,3,7])

# ALL OTHER IMPORTS (MERGE TO RESULT DATA FRAME, 8TH COLUMN SUFFIXED ITERATIVELY)
for i in range(2,numberoffiles+1):    
    df = pd.read_csv("{}isoforms.fpkm_tracking.txt".format(i), sep="\t",
                     index_col=[0,1,2,3], usecols=[0,1,2,3,7])

    result = pd.merge(result, df, right_index=True, left_index=True, suffixes=[i-1, i])

result.to_excel("Output.xlsx")
result.to_csv("Output.csv")
like image 153
Parfait Avatar answered Nov 09 '22 05:11

Parfait