Suppose I have 2 files that list the species count. Like this:
sample1_sc.tsv
Sample_1 | Sample ID
464 | Bacillus subtilis
116 | Escherichia coli
62 | Vibrio cholerae serotype 1
sample2_sc.tsv
Sample_2 | Sample ID
364 | Bacillus subtilis
120 | Homo sapiens
16 | Yersinia pestis
16 | Danio rerio
is there a function to concat both data files using dataframes such that the header contains the Species of all the sample files without repetition and the no. of rows are the no. samples, each row showing the read count of the species in that sample and 0 if the species is not in the sample?
For the above ex, I want the species matrix to look like this:
Sample ID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464 | 116 | 62 | 0 | 0 | 0
Sample_2 | 364 | 0 | 0 | 120 | 16 | 16
I am not as familiar with Pandas, so here's the code I have tried so far:
import pandas as pd
import numpy as np
import glob
path = "/content/"
sc_files = glob.glob(path + "*.tsv")
df_sc = []
for file in sc_files:
df_sample = pd.read_csv(file, sep = '\t')
df_sample = df_sample.set_index("SampleID")
df_sample = df_sample.transpose()
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
df_sc.append(df_sample)
df_matrix = pd.concat(df_sc, axis = 1).fillna(0)
And this is the output I'm getting:
SampleID | Bacillus subtilis | Escherichia coli | Vibrio cholerae serotype 1 | Bacillus subtilis | Homo sapiens | Yersinia pestis | Danio rerio
Sample_1 | 464.0 | 116.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0
Sample_2 | 0.0 | 0.0 | 0.0 | 364.0 | 120.0 | 16.0 | 16.0
How do I make it so all the samples' count containing a specific species name (Bacillus subtilis in this case) appear in the same column?
I tried to remove the
df_sample = df_sample[~df_sample.index.duplicated(keep='first')]
but whether I keep it or remove it, it gives me the same result?
I would use concat this way, (be careful of the axis!):
dfs = [df1, df2]
out = (pd.concat([d.set_index('Sample ID') for d in dfs], axis=1)
.fillna(0, downcast='infer').T
.rename_axis(index='Sample ID', columns=None).reset_index()
)
Output:
Sample ID Bacillus subtilis Escherichia coli Vibrio cholerae serotype 1 Homo sapiens Yersinia pestis Danio rerio
Sample_1 464 116 62 0 0 0
Sample_2 364 0 0 120 16 16
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With