Python iterating through pandas more efficiently without for loop

I am creating a column to add a tag to some strings and have working code here:

import pandas as pd
import numpy as np
import re

data = pd.DataFrame({'Lang': ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
data['Type'] = ""

pat = [r"^P\w", r"^S\w"]

for i in range(len(data.Lang)):
    if re.search(pat[0], data.Lang.iloc[i]):
        data.loc[i, 'Type'] = "B"
    if re.search(pat[1], data.Lang.iloc[i]):
        data.loc[i, 'Type'] = "A"

print(data)

Is there a way to get rid of that for loop? If this were plain numpy I would reach for a vectorized function (something like arange); is there a similar vectorized approach here?

asked Feb 11 '26 by user3084006

2 Answers

This will be faster than the apply solution (and the looping solution).

FYI: this is in 0.13; in 0.12 you would need to create the Type column first.

In [36]: data.loc[data.Lang.str.match(pat[0]),'Type'] = 'B'

In [37]: data.loc[data.Lang.str.match(pat[1]),'Type'] = 'A'

In [38]: data
Out[38]: 
     Lang Type
0  Python    B
1  Cython  NaN
2   Scipy    A
3   Numpy  NaN
4  Pandas    B

[5 rows x 2 columns]

In [39]: data.fillna('')
Out[39]: 
     Lang Type
0  Python    B
1  Cython     
2   Scipy    A
3   Numpy     
4  Pandas    B

[5 rows x 2 columns]
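On current pandas (where `.ix` is gone and `str.match` returns a boolean Series), the same mask-and-assign idea can be sketched like this; a minimal reconstruction, not part of the original answer, pre-creating the column so non-matching rows stay empty instead of NaN:

```python
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
data["Type"] = ""  # pre-create so non-matching rows stay ""

# str.match builds a boolean mask; .loc assigns to the matching rows in one step
data.loc[data["Lang"].str.match(r"^P\w"), "Type"] = "B"
data.loc[data["Lang"].str.match(r"^S\w"), "Type"] = "A"

print(data)
```

Because the column already exists and holds `""`, no `fillna` step is needed afterwards.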

Here are some timings:

In [34]: bigdata = pd.concat([data]*2000,ignore_index=True)

In [35]: def f3(df):
    df = df.copy()
    df['Type'] = ''
    for i in range(len(df.Lang)):
        if re.search(pat[0], df.Lang.iloc[i]):
            df.loc[i, 'Type'] = 'B'
        if re.search(pat[1], df.Lang.iloc[i]):
            df.loc[i, 'Type'] = 'A'
    return df

In [36]: def f2(df):
    df = df.copy()
    df.loc[df.Lang.str.match(pat[0]), 'Type'] = 'B'
    df.loc[df.Lang.str.match(pat[1]), 'Type'] = 'A'
    return df.fillna('')

In [37]: def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)
    return df

Your original solution:

In [41]: %timeit f3(bigdata)
1 loops, best of 3: 2.21 s per loop

Direct indexing

In [42]: %timeit f2(bigdata)
100 loops, best of 3: 17.3 ms per loop

Apply

In [43]: %timeit f1(bigdata)
10 loops, best of 3: 21.8 ms per loop

Here's another, more general method that is a bit faster and probably more useful, since you can then combine the patterns in, say, a groupby if you want.

In [107]: pats
Out[107]: {'A': '^P\\w', 'B': '^S\\w'}

In [108]: concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
Out[108]: 
      Lang    A    B
0   Python    A  NaN
1   Cython  NaN  NaN
2    Scipy  NaN    B
3    Numpy  NaN  NaN
4   Pandas    A  NaN
5   Python    A  NaN
6   Cython  NaN  NaN
       ...  ...  ...
45  Python    A  NaN
46  Cython  NaN  NaN
47   Scipy  NaN    B
48   Numpy  NaN  NaN
49  Pandas    A  NaN
50  Python    A  NaN
51  Cython  NaN  NaN
52   Scipy  NaN    B
53   Numpy  NaN  NaN
54  Pandas    A  NaN
55  Python    A  NaN
56  Cython  NaN  NaN
57   Scipy  NaN    B
58   Numpy  NaN  NaN
59  Pandas    A  NaN
       ...  ...  ...

[10000 rows x 3 columns]

In [106]: %timeit  concat([df,DataFrame(dict([ (c,Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)) for c,p in pats.items() ]))],axis=1)
100 loops, best of 3: 15.5 ms per loop

This tacks on a column for each of the patterns, putting the letter in the matching positions (and NaN elsewhere).

Create a Series of that letter:

Series(c,index=df.index)

Select the matches out of it:

Series(c,index=df.index)[df.Lang.str.match(p)]

Reindexing puts NaN where the value is not in the index:

Series(c,index=df.index)[df.Lang.str.match(p)].reindex(df.index)
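The three steps above can be put together in a small end-to-end sketch; this is a reconstruction with explicit pd. prefixes, using the same pattern dict as the answer:

```python
import pandas as pd

df = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})
pats = {"A": r"^P\w", "B": r"^S\w"}

cols = {}
for c, p in pats.items():
    s = pd.Series(c, index=df.index)   # 1) a constant Series of the letter
    s = s[df["Lang"].str.match(p)]     # 2) keep only the matching rows
    cols[c] = s.reindex(df.index)      # 3) reindex: NaN where there was no match
result = pd.concat([df, pd.DataFrame(cols)], axis=1)

print(result)
```

Each letter becomes its own column, so the patterns stay independent and can later be combined or grouped on.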
answered Feb 13 '26 by Jeff


You can do both classifications with one lambda:

f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''

then use apply to fill your "Type" column:

data['Type'] = data.Lang.apply(f)

output:

     Lang Type
0  Python    A
1  Cython
2   Scipy    B
3   Numpy
4  Pandas    A

Edit: this didn't compare well in the benchmarks above. If you want to speed things up, pre-compile the regexes instead of letting re recompile them on every call:

def f1(df):
    df = df.copy()
    f = lambda s: re.match(pat[0], s) and 'A' or re.match(pat[1], s) and 'B' or ''
    df['Type'] = df['Lang'].apply(f)
    return df

def f1_1(df):
    df = df.copy()
    re1, re2 = re.compile(pat[0]), re.compile(pat[1])
    f = lambda s: re1.match(s) and 'A' or re2.match(s) and 'B' or ''
    df.Type = df.Lang.apply(f)
    return df

bigdata = pd.concat([data]*2000,ignore_index=True)

original Apply:

In [18]:  %timeit f1(bigdata)
10 loops, best of 3: 23.1 ms per loop

revised Apply:

In [19]: %timeit f1_1(bigdata)
100 loops, best of 3: 6.65 ms per loop
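On newer pandas/numpy this one-pass classification can also be written without apply at all. A sketch using numpy.select, which was not part of the original benchmarks, keeping this answer's A-for-P / B-for-S labelling:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Lang": ["Python", "Cython", "Scipy", "Numpy", "Pandas"]})

# One boolean mask per pattern; np.select picks the first matching label,
# falling back to "" when no condition is true
conds = [data["Lang"].str.match(r"^P\w"), data["Lang"].str.match(r"^S\w")]
data["Type"] = np.select(conds, ["A", "B"], default="")

print(data)
```

This stays vectorized end to end, so there is no per-row Python function call at all.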
answered Feb 13 '26 by Phil Cooper