I am trying to process very large files (10,000+ observations) where the zip codes are not consistently formatted. I need to truncate them all to just the first 5 digits, and here is my current code:
def makezip(frame, zipcol):
    i = 0
    while i < len(frame):
        frame[zipcol][i] = frame[zipcol][i][:5]
        i += 1
    return frame
frame is the dataframe, and zipcol is the name of the column containing the zip codes. Although this works, it takes a very long time to process. Is there a quicker way?
The results show that apply massively outperforms iterrows. This is because apply is optimized to loop over DataFrame rows much more quickly than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if explicit looping is required, try itertuples instead.
The main difference between the itertuples() method and iterrows is that itertuples is faster and it preserves the dtypes of the row values. iterrows returns a Series for each row, so it does not preserve dtypes across the row (dtypes are only preserved across the columns of the DataFrame).
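To make the difference concrete, here is a minimal sketch of both looping styles applied to the same zip truncation task; the sample DataFrame, the 'zip' column name, and the truncate_* helper names are illustrative rather than taken from the original post:

import pandas as pd

df = pd.DataFrame({'zip': ['12345-6789'] * 10000})  # illustrative sample data

# iterrows yields (index, Series) pairs: convenient, but each row is boxed into a Series
def truncate_iterrows(frame, zipcol):
    out = []
    for _, row in frame.iterrows():
        out.append(row[zipcol][:5])
    return out

# itertuples yields lightweight namedtuples: noticeably faster and dtype-preserving
def truncate_itertuples(frame, zipcol):
    pos = frame.columns.get_loc(zipcol)  # positional index of the zip column
    out = []
    for row in frame.itertuples(index=False):
        out.append(row[pos][:5])
    return out

Both helpers return a plain list of truncated codes, so any timing difference comes purely from the iteration method.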
You can use the .str accessor on string columns to access vectorized string methods, and with it you can also slice:
frame[zipcol] = frame[zipcol].str[:5]
Based on a small example, this is around 50 times faster than looping over the rows:
In [29]: s = pd.Series(['testtest']*10000)
In [30]: %timeit s.str[:5]
100 loops, best of 3: 3.06 ms per loop
In [31]: %timeit str_loop(s)
10 loops, best of 3: 164 ms per loop
with
In [27]: def str_loop(s):
.....: for i in range(len(s)):
.....: s[i] = s[i][:5]
.....:
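Putting it together, the whole makezip function from the question collapses to a single vectorized assignment. A minimal sketch, assuming the zip column already holds strings (missing values simply stay NaN under .str slicing):

# vectorized replacement for the original makezip; assumes zipcol holds strings
def makezip(frame, zipcol):
    frame[zipcol] = frame[zipcol].str[:5]
    return frame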