Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I make more efficient code for a search for multiple strings in column in pandas

I am a newly self taught (minus 1 class on the very basics) programmer working for a bio lab. I have a script that goes though RNAseq data from two different cell types and runs a ttest if in another dataset. It worked for this application but the code feels very brutish and I know I will be writing similar scripts a lot.

How can I better write the following code to make it more efficient?

Goal of program:

  1. compare a list of genes to a rnaseq library of two cell types, if the library contains that gene, run a ttest of cell type 1 vs cell type 2
  2. output results.

:

import pandas as pd
from scipy.stats import ttest_ind
rnatest = {'Gene symbol':["GeneA","GeneB"],"rnaseq1A":[1,1.5],"rnaseq1B":[1.3,1.2],"rnaseq2A":[2.3,2.7],"rnaseq2B":[2,2.6]} 
df = pd.DataFrame(rnatest)
GOIlist = ["GeneA","GeneB"]
GOI = []
mu = [] 
pval = []
for index, row in df.iterrows():
  if row['Gene symbol'] in GOIlist:
    t, p = ttest_ind([row["rnaseq1A"],row["rnaseq1B"]],[row["rnaseq2A"],row["rnaseq2B"]])
    GOI.append(row['Gene symbol'])
    mu.append(t)
    pval.append(p)
df2 = {'Gene symbol':GOI,"tVAL":mu, "pVAL":pval}
df2 = pd.DataFrame(df2)
print(df2)  
like image 406
David William Turnell Avatar asked Jan 22 '20 20:01

David William Turnell


People also ask

How do you filter a Dataframe in multiple conditions?

Using Loc to Filter With Multiple Conditions The loc function in pandas can be used to access groups of rows or columns by label. Add each condition you want to be included in the filtered result and concatenate them with the & operator. You'll see our code sample will return a pd. dataframe of our filtered rows.

How do I select multiple columns with Loc?

Using df[] & loc[] to Select Multiple Columns by Name loc[] you can select multiple columns by names or labels. To select the columns by names, the syntax is df.

How do I extract multiple rows from a Dataframe in Python?

We will select multiple rows in pandas using multiple conditions, logical operators and using loc() function. Selecting rows with logical operators i.e. AND and OR can be achieved easily with a combination of >, <, <=, >= and == to extract rows with multiple filters.


1 Answers

The advantage of using pandas is that you can do columnwise operations. These are generally more efficient then iterating over the DataFrame with a for loop.

I slightly modified your df to show you the effect of filtering out the rows that we need.

>>> import pandas as pd
>>> from scipy.stats import ttest_ind
>>> GOIlist = ["GeneA","GeneB"]
>>> rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"],"rnaseq1A":[1,1.5,2],"rnaseq1B":[1.3,1.2,1.1],"rnaseq2A":[2.3,2.7,3.1],"rnaseq2B":[2,2.6,3.2]} 
>>> df = pd.DataFrame(rnatest)
>>> print(df)

    Gene symbol     rnaseq1A    rnaseq1B    rnaseq2A    rnaseq2B
0   GeneA           1.0         1.3         2.3         2.0
1   GeneB           1.5         1.2         2.7         2.6
2   GeneC           2.0         1.1         3.1         3.2

Now how I would rewrite your code:

  1. Use set_index to make the Gene symbol row an index, this speeds up the lookup time (especially if you have large DataFrames)
  2. Use loc to filter out the rows that have a Gene symbol that is in GOIlist
  3. Create two new columns pVal and tVal to which you assign the output of ttest_ind. Note that we don't have to iterate over the rows anymore.
  4. Optionally, drop the rnaseq* columns if you don't want to see them in your output.

In code:

>>> df3 = df.set_index(['Gene symbol'])
>>> df3 = df3.loc[GOIlist]
>>> df3['tVal'], df3['pVal'] = ttest_ind([df3["rnaseq1A"], df3["rnaseq1B"]], [df3["rnaseq2A"], df3["rnaseq2B"]])
>>> df3 = df3.drop(['rnaseq1A', 'rnaseq1B', 'rnaseq2A', 'rnaseq2B'], axis=1)
>>> print(df3)
            tVal        pVal
Gene symbol         
GeneA       -4.714045   0.042174
GeneB       -8.221922   0.014473

So, how much more efficient is this code now?

If I artificially increase the size of our DataFrame 10.000 times (so in total 30.000 rows instead of 3)

n = 10_000
rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"]*n, "rnaseq1A":[1,1.5,2]*n, "rnaseq1B":[1.3,1.2,1.1]*n, "rnaseq2A":[2.3,2.7,3.1]*n, "rnaseq2B":[2,2.6,3.2]*n} 
df = pd.DataFrame(rnatest)

then I can use timeit to measure the execution time of the code. For your original approach I get the result:

13.7 s ± 555 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

while my approach finishes in

45.2 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

so that is a more than 300 times speedup!

like image 159
BioGeek Avatar answered Oct 05 '22 13:10

BioGeek