Pandas: Cell frequency count by index

My dataframe is a long list of strings made up of 4 letters, 'A', 'T', 'G', 'C'. I need to count the frequency of each letter at each position (index) in the strings.

import pandas as pd

df = pd.DataFrame({'cases': ['ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTGGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAACGTGGTCTAGA','GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA']})
                                               cases
0  ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGG...
1  ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGG...
2  GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGG...
3  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
4  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
5  ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGG...
6  ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGG...
7  GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGG...
8  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
9  ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...

The result would be a new df of shape 113x4 (one row per position, one column per letter). I cannot figure out a pandas way to do this. Below is my non-pandas solution:

def freq_lists(dna_list):
    """Count how often each base occurs at each position across the strings."""
    n = len(dna_list[0])  # assumes every string has the same length
    A = [0]*n
    T = [0]*n
    G = [0]*n
    C = [0]*n
    for dna in dna_list:
        for index, base in enumerate(dna):
            if base == 'A':
                A[index] += 1
            elif base == 'C':
                C[index] += 1
            elif base == 'G':
                G[index] += 1
            elif base == 'T':
                T[index] += 1
    return {'A': A, 'C': C, 'G': G, 'T': T}

fdf = pd.DataFrame(freq_lists(df['cases'].to_list()))
     A  C  G  T
0    3  0  1  0
1    0  4  0  0
2    0  4  0  0
3    0  0  0  4
4    0  0  0  4
..  .. .. .. ..
108  0  4  0  0
109  0  0  0  4
110  3  0  1  0
111  0  0  4  0
112  4  0  0  0

To clarify, the first row is obtained by counting the first character of each string in the cases column. Those characters are AAGA -> A: 3, C: 0, G: 1, T: 0
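As a quick check of that first row, the first character of every string can be counted directly. This is only a minimal sketch using `collections.Counter`; the list `cases` is a shortened stand-in (an assumption for illustration) holding just the first eight characters of each of the four strings above.

```python
from collections import Counter

# Shortened stand-in: the first eight characters of the four strings in df['cases']
cases = ['ACCTTGTA', 'ACCTTGTA', 'GCCTTGTA', 'ACCTTGTA']

# Take the character at position 0 of every string and count the letters
first_column = ''.join(seq[0] for seq in cases)
counts = Counter(first_column)

# Report all four letters, including those with a zero count
first_row = {base: counts.get(base, 0) for base in 'ACGT'}
print(first_column, first_row)
```

The same counting, repeated for every position, should give the remaining rows of the table.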

Kenan asked Aug 20 '21 15:08

1 Answer

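Let us do explode with crosstab: split each string into a list of characters, explode it so every character gets its own row, then cross-tabulate the position of each character within its string against the letter.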

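# one row per character; the original row label is repeated for each character
s = df.cases.map(list).explode()
# cumcount() numbers the characters within each original row, i.e. their position
out = pd.crosstab(s.groupby(level=0).cumcount(), s)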
Out[583]: 
cases  A  C  G  T
row_0            
0      3  0  1  0
1      0  4  0  0
2      0  4  0  0
3      0  0  0  4
4      0  0  0  4
   .. .. .. ..
108    0  4  0  0
109    0  0  0  4
110    3  0  1  0
111    0  0  4  0
112    4  0  0  0
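To see what each step produces, here is a small sketch of the same idea on toy data. The frame `toy` and the names `pos`, `position` and `base` are assumptions made for illustration. Two choices here are not part of the answer above: plain arrays are passed to `crosstab` so the result does not depend on how pandas aligns the repeated row labels, and the final `reindex` is added because `crosstab` only creates columns for letters that actually occur.

```python
import pandas as pd

# Hypothetical toy data: three short sequences of equal length, with no 'T'
toy = pd.DataFrame({'cases': ['ACG', 'ACG', 'GCA']})

# One row per character; the original row label is repeated for each character
s = toy['cases'].map(list).explode()

# Number the characters within each original row: this is the position in the string
pos = s.groupby(level=0).cumcount()

# Count how often each letter occurs at each position
out = pd.crosstab(pos.to_numpy(), s.to_numpy(),
                  rownames=['position'], colnames=['base'])

# Force all four letters to appear as columns, filling absent ones with 0
out = out.reindex(columns=list('ACGT'), fill_value=0)
print(out)
```

For this toy input the `T` column is all zeros, which is the case the `reindex` step is meant to cover.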
BENY answered Oct 14 '22 04:10
