Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster processing of Dataframe in Pandas

Tags:

python

pandas

I am trying to process very large files (10,000+ observsstions) where zip codes are not easily formatted. I need to convert them all to just the first 5 digits, and here is my current code:

def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame

frame is the dataframe, and zipcol is the name of the column containing the zip codes. Although this works, it takes a very long time to process. Is there a quicker way?

like image 206
whateveryousayiam Avatar asked May 19 '15 21:05

whateveryousayiam


People also ask

Is apply faster than Iterrows?

The results show that apply massively outperforms iterrows . As mentioned previously, this is because apply is optimized for looping through dataframe rows much quicker than iterrows does. While slower than apply , itertuples is quicker than iterrows , so if looping is required, try implementing itertuples instead.

What is faster than Iterrows pandas?

itertuples() method. The main difference between this method and iterrows is that this method is faster than the iterrows method as well as it also preserve the data type of a column compared to the iterrows method which don't as it returns a Series for each row but dtypes are preserved across columns.


1 Answers

You can use the .str accessor on string columns to access some specific string methods. And on this, you can also slice:

frame[zipcol] = frame[zipcol].str[:5]

Based on a small example, this is around 50 times faster as looping over the rows:

In [29]: s = pd.Series(['testtest']*10000)

In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop

In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop

whith

In [27]: def str_loop(s):
   .....:     for i in range(len(s)):
   .....:         s[i] = s[i][:5]
   .....:
like image 144
joris Avatar answered Oct 13 '22 00:10

joris