Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python program to generate a single species matrix file from multiple sample-wise species count files (using Pandas?)

Suppose I have 2 files that list the species count. Like this:

sample1_sc.tsv

Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1

sample2_sc.tsv

Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio

is there a function to concat both data files using dataframes such that the header contains the Species of all the sample files without repetition and the no. of rows are the no. samples, each row showing the read count of the species in that sample and 0 if the species is not in the sample?

For the above ex, I want the species matrix to look like this:

Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16

I am not as familiar with Pandas, so here's the code I have tried so far:

import pandas as pd
import numpy as np
import glob

path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []

for file in sc_files:
  df_sample = pd.read_csv(file, sep = '\t')
  df_sample = df_sample.set_index("SampleID")
  df_sample = df_sample.transpose()
  df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
  df_sc.append(df_sample)

df_matrix = pd.concat(df_sc, axis = 1).fillna(0)

And this is the output I'm getting:

SampleID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0

How do I make it so all the samples' count containing a specific species name (Bacillus subtilis in this case) appear in the same column?

I tried to remove the df_sample = df_sample[~df_sample.index.duplicated(keep='first')]

but whether I keep it or remove it, it gives me the same result?

like image 327
chan-98 Avatar asked Nov 28 '25 19:11

chan-98


1 Answers

I would use concat this way, (be careful of the axis!):

dfs = [df1, df2]

out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
         .fillna(0, downcast='infer').T
         .rename_axis(index='Sample ID', columns=None).reset_index()
       )

Output:

Sample ID  Bacillus subtilis  Escherichia coli  Vibrio cholerae serotype 1  Homo sapiens  Yersinia pestis  Danio rerio
Sample_1                 464               116                          62             0                0            0
Sample_2                 364                 0                           0           120               16           16
like image 74
mozway Avatar answered Nov 30 '25 07:11

mozway



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!