I am a newly self-taught programmer (minus one class on the very basics) working for a bio lab. I have a script that goes through RNAseq data from two different cell types and runs a t-test on each gene that also appears in another dataset (a list of genes of interest). It worked for this application, but the code feels very brutish, and I know I will be writing similar scripts a lot.
How can I better write the following code to make it more efficient?
Goal of program: for each gene that appears in a list of genes of interest, run a t-test comparing its RNAseq values between the two cell types.
import pandas as pd
from scipy.stats import ttest_ind

rnatest = {'Gene symbol': ["GeneA", "GeneB"],
           "rnaseq1A": [1, 1.5], "rnaseq1B": [1.3, 1.2],
           "rnaseq2A": [2.3, 2.7], "rnaseq2B": [2, 2.6]}
df = pd.DataFrame(rnatest)

GOIlist = ["GeneA", "GeneB"]  # genes of interest

GOI = []
mu = []
pval = []
for index, row in df.iterrows():
    if row['Gene symbol'] in GOIlist:
        # compare the two replicates of cell type 1 against those of cell type 2
        t, p = ttest_ind([row["rnaseq1A"], row["rnaseq1B"]],
                         [row["rnaseq2A"], row["rnaseq2B"]])
        GOI.append(row['Gene symbol'])
        mu.append(t)
        pval.append(p)

df2 = pd.DataFrame({'Gene symbol': GOI, "tVAL": mu, "pVAL": pval})
print(df2)
The advantage of using pandas is that you can do column-wise operations. These are generally more efficient than iterating over the DataFrame with a for loop.
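For example, arithmetic on whole columns happens in one vectorized step (a toy illustration only; the rnaseq1_mean column is just for demonstration and is not used below):
>>> df["rnaseq1_mean"] = (df["rnaseq1A"] + df["rnaseq1B"]) / 2  # operates on every row at once, no loop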
I slightly modified your df to show you the effect of filtering out the rows that we need.
>>> import pandas as pd
>>> from scipy.stats import ttest_ind
>>> GOIlist = ["GeneA","GeneB"]
>>> rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"],"rnaseq1A":[1,1.5,2],"rnaseq1B":[1.3,1.2,1.1],"rnaseq2A":[2.3,2.7,3.1],"rnaseq2B":[2,2.6,3.2]}
>>> df = pd.DataFrame(rnatest)
>>> print(df)
  Gene symbol  rnaseq1A  rnaseq1B  rnaseq2A  rnaseq2B
0       GeneA       1.0       1.3       2.3       2.0
1       GeneB       1.5       1.2       2.7       2.6
2       GeneC       2.0       1.1       3.1       3.2
Now, here is how I would rewrite your code:

1. Use set_index to make the Gene symbol column an index; this speeds up the lookup time (especially if you have large DataFrames).
2. Use loc to filter out the rows that have a Gene symbol that is in GOIlist.
3. Create two new columns, tVal and pVal, to which you assign the output of ttest_ind. Note that we don't have to iterate over the rows anymore.
4. Drop the rnaseq* columns if you don't want to see them in your output.

In code:
>>> df3 = df.set_index(['Gene symbol'])
>>> df3 = df3.loc[GOIlist]
>>> df3['tVal'], df3['pVal'] = ttest_ind([df3["rnaseq1A"], df3["rnaseq1B"]], [df3["rnaseq2A"], df3["rnaseq2B"]])
>>> df3 = df3.drop(['rnaseq1A', 'rnaseq1B', 'rnaseq2A', 'rnaseq2B'], axis=1)
>>> print(df3)
                 tVal      pVal
Gene symbol
GeneA       -4.714045  0.042174
GeneB       -8.221922  0.014473
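Since you mention you will be writing similar scripts a lot, the same steps could be wrapped in a reusable function. Here is a minimal sketch, assuming the same table layout; the name ttest_by_gene and its parameters are my own invention, not an established API:

def ttest_by_gene(df, goi, cols1, cols2):
    """Per-gene t-test comparing the columns in cols1 against those in cols2."""
    sub = df.set_index('Gene symbol').loc[goi]   # filter to the genes of interest
    sub['tVal'], sub['pVal'] = ttest_ind([sub[c] for c in cols1],
                                         [sub[c] for c in cols2])
    return sub[['tVal', 'pVal']]

result = ttest_by_gene(df, GOIlist,
                       ["rnaseq1A", "rnaseq1B"],
                       ["rnaseq2A", "rnaseq2B"])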
So, how much more efficient is this code now?
If I artificially increase the size of our DataFrame 10,000 times (so 30,000 rows in total instead of 3):
n = 10_000
rnatest = {'Gene symbol':["GeneA","GeneB", "GeneC"]*n, "rnaseq1A":[1,1.5,2]*n, "rnaseq1B":[1.3,1.2,1.1]*n, "rnaseq2A":[2.3,2.7,3.1]*n, "rnaseq2B":[2,2.6,3.2]*n}
df = pd.DataFrame(rnatest)
then I can use timeit to measure the execution time of the code. For your original approach I get the result:
13.7 s ± 555 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
while my approach finishes in
45.2 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
so that is more than a 300-fold speedup!
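For reference, here is a minimal sketch of how such a timing could be reproduced with the standard timeit module (the fast_version wrapper is my own name; the "mean ± std. dev." output above is the format produced by IPython's %timeit magic):

import timeit

def fast_version():
    sub = df.set_index(['Gene symbol']).loc[GOIlist]
    sub['tVal'], sub['pVal'] = ttest_ind([sub["rnaseq1A"], sub["rnaseq1B"]],
                                         [sub["rnaseq2A"], sub["rnaseq2B"]])
    return sub

print(timeit.timeit(fast_version, number=10) / 10, "seconds per run")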