I have the following data frame:
import pandas as pd
df = pd.DataFrame({'gene': ["foo",
                            "bar // lal",
                            "qux",
                            "woz"],
                   'cell1': [5, 9, 1, 7],
                   'cell2': [12, 90, 13, 87]})
df = df[["gene", "cell1", "cell2"]]
df
That looks like this:
Out[6]:
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
What I want to do is split the 'gene' column so that the result looks like this:
gene  cell1  cell2
 foo      5     12
 bar      9     90
 lal      9     90
 qux      1     13
 woz      7     87
My current approach is this:
import pandas as pd
import timeit
def create():
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7],
                       'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    # one column per split element, stacked into a single Series
    s = df["gene"].str.split(' // ').apply(pd.Series, 1).stack()
    s.index = s.index.droplevel(-1)  # line up with df's index
    s.name = "Genes"
    del df["gene"]
    df.join(s)

if __name__ == '__main__':
    print(timeit.timeit("create()", setup="from __main__ import create", number=100))
    # 0.608163118362
This is very slow; in reality I have around 40K rows to process.
What would be a fast implementation of this?
Series and DataFrame define an .explode() method that explodes list-likes into separate rows. See the docs section on Exploding a list-like column. Since your strings are delimited by ' // ', split on that delimiter to get a list of elements, then call explode on that column.
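As a minimal sketch (DataFrame.explode requires pandas 0.25 or later), applied to the frame from the question:

import pandas as pd

df = pd.DataFrame({'gene': ["foo", "bar // lal", "qux", "woz"],
                   'cell1': [5, 9, 1, 7],
                   'cell2': [12, 90, 13, 87]})

# Split each string on ' // ' into a list, then explode the lists
# into one row per element; the other columns are repeated.
out = df.assign(gene=df['gene'].str.split(' // ')).explode('gene')
print(out)

which should give one row per gene while repeating cell1 and cell2. Note that explode keeps the original index by default (row 1 appears twice); on recent pandas you can pass ignore_index=True to renumber.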
TBH I think we need a fast built-in way of normalizing elements like this, although since I've been out of the loop for a bit, for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:
def create(n):
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7],
                       'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df
def orig(df):
    s = df["gene"].str.split(' // ').apply(pd.Series, 1).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)
def faster(df):
    # expand=True builds the frame of split elements directly,
    # avoiding the slow row-wise apply(pd.Series)
    s = df["gene"].str.split(' // ', expand=True).stack()
    # the first index level maps each split element back to its source row
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()  # repeat each row once per split element
    df2["gene"] = s.values
    return df2
which gives me
>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87
for comparable speeds at low sizes, and
>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop
a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is because orig is destructive.