I am trying to process very large files (10,000+ observations) where the zip codes are not consistently formatted. I need to truncate them all to just the first 5 digits, and here is my current code:
def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame
frame is the dataframe, and zipcol is the name of the column containing the zip codes. Although this works, it takes a very long time to process. Is there a quicker way?
The results show that apply massively outperforms iterrows. This is because apply is optimized to loop over DataFrame rows much more quickly than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if explicit looping is required, try itertuples instead.
The main difference between the itertuples() method and iterrows is that itertuples is faster and it preserves the dtypes of the row values. iterrows returns a Series for each row, so it does not preserve dtypes across the row (dtypes are only preserved across the columns of the DataFrame).
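To make the difference concrete, here is a minimal sketch of both looping styles applied to the same zip truncation task; the sample DataFrame, the 'zip' column name, and the truncate_* helper names are illustrative rather than taken from the original post:

import pandas as pd

df = pd.DataFrame({'zip': ['12345-6789'] * 10000})  # illustrative sample data

# iterrows yields (index, Series) pairs: convenient, but each row is boxed into a Series
def truncate_iterrows(frame, zipcol):
    out = []
    for _, row in frame.iterrows():
        out.append(row[zipcol][:5])
    return out

# itertuples yields lightweight namedtuples: noticeably faster and dtype-preserving
def truncate_itertuples(frame, zipcol):
    pos = frame.columns.get_loc(zipcol)  # positional index of the zip column
    out = []
    for row in frame.itertuples(index=False):
        out.append(row[pos][:5])
    return out

Both helpers return a plain list of truncated codes, so any timing difference comes purely from the iteration method.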
You can use the .str accessor on string columns to access vectorized string methods, and with it you can also slice:
frame[zipcol] = frame[zipcol].str[:5]
Based on a small example, this is around 50 times faster than looping over the rows:
In [29]: s = pd.Series(['testtest']*10000)
In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop
In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop
with
In [27]: def str_loop(s):
.....: for i in range(len(s)):
.....: s[i] = s[i][:5]
.....:
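Putting it together, the whole makezip function from the question collapses to a single vectorized assignment. A minimal sketch, assuming the zip column already holds strings (missing values simply stay NaN under .str slicing):

# vectorized replacement for the original makezip; assumes zipcol holds strings
def makezip(frame, zipcol):
    frame[zipcol] = frame[zipcol].str[:5]
    return frame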