I have the following data frame:
import pandas as pd
df = pd.DataFrame({'gene': ["foo",
                            "bar // lal",
                            "qux",
                            "woz"],
                   'cell1': [5, 9, 1, 7],
                   'cell2': [12, 90, 13, 87]})
df = df[["gene", "cell1", "cell2"]]
df
That looks like this:
Out[6]:
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
What I want to do is split the 'gene' column so that the result looks like this:
gene  cell1  cell2
 foo      5     12
 bar      9     90
 lal      9     90
 qux      1     13
 woz      7     87
My current approach is this:
import pandas as pd
import timeit
def create():
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7],
                       'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    # one column per split element, stacked into a single Series
    s = df["gene"].str.split(' // ').apply(pd.Series, 1).stack()
    s.index = s.index.droplevel(-1)  # line up with df's index
    s.name = "Genes"
    del df["gene"]
    df.join(s)

if __name__ == '__main__':
    print(timeit.timeit("create()", setup="from __main__ import create", number=100))
    # 0.608163118362
This is very slow; in reality I have around 40K rows to process.
What would be a fast implementation of this?
Series and DataFrame define an .explode() method that explodes list-likes into separate rows. See the docs section on Exploding a list-like column. Since your strings are delimited by ' // ', split on that delimiter to get a list of elements, then call explode on that column.
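As a minimal sketch (DataFrame.explode requires pandas 0.25 or later), applied to the frame from the question:

import pandas as pd

df = pd.DataFrame({'gene': ["foo", "bar // lal", "qux", "woz"],
                   'cell1': [5, 9, 1, 7],
                   'cell2': [12, 90, 13, 87]})

# Split each string on ' // ' into a list, then explode the lists
# into one row per element; the other columns are repeated.
out = df.assign(gene=df['gene'].str.split(' // ')).explode('gene')
print(out)

which should give one row per gene while repeating cell1 and cell2. Note that explode keeps the original index by default (row 1 appears twice); on recent pandas you can pass ignore_index=True to renumber.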
TBH I think we need a fast built-in way of normalizing elements like this, although since I've been out of the loop for a bit, for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:
def create(n):
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7],
                       'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df
def orig(df):
    s = df["gene"].str.split(' // ').apply(pd.Series, 1).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)
def faster(df):
    # expand=True builds the frame of split elements directly,
    # avoiding the slow row-wise apply(pd.Series)
    s = df["gene"].str.split(' // ', expand=True).stack()
    # the first index level maps each split element back to its source row
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()  # repeat each row once per split element
    df2["gene"] = s.values
    return df2
which gives me
>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87
for comparable speeds at low sizes, and
>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop
a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is because orig is destructive.