Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas slow on data frame replace

I have an Excel file (.xlsx) with about 800 rows and 128 columns with pretty dense data in the grid. There are about 9500 cells that I am trying to replace the cell values of using Pandas data frame:

xlsx = pandas.ExcelFile(filename)
frame = xlsx.parse(xlsx.sheet_names[0])
media_frame = frame[media_headers] # just get the cols that need replacing

from_filenames = get_from_filenames() # returns ~9500 filenames to replace in DF
to_filenames = get_to_filenames()

media_frame = media_frame.replace(from_filenames, to_filenames)
frame.update(media_frame)
frame.to_excel(filename)

The replace() takes 60 seconds. Any way to speed this up? This is not huge data or task, I was expecting pandas to move much faster. FYI I tried doing the same processing with same file in CSV, but the time savings was minimal (about 50 seconds on the replace())

like image 949
Neil Avatar asked Oct 04 '16 05:10

Neil


1 Answers

strategy
create pd.Series representing a map from filenames to filenames.
stack our dataframe, map, then unstack

setup

import pandas as pd
import numpy as np
from string import letters

media_frame = pd.DataFrame(
    pd.DataFrame(
        np.random.choice(list(letters), 9500 * 800 * 3) \
          .reshape(3, -1)).sum().values.reshape(9500, -1))

u = np.unique(media_frame.values)
from_filenames = pd.Series(u)
to_filenames = from_filenames.str[1:] + from_filenames.str[0]

m = pd.Series(to_filenames.values, from_filenames.values)

solution

media_frame.stack().map(m).unstack()

timing

5 x 5 dataframe

enter image description here

100 x 100

enter image description here

9500 x 800

enter image description here

9500 x 800
map using series vs dict
d = dict(zip(from_filenames, to_filenames))

enter image description here

like image 132
piRSquared Avatar answered Oct 11 '22 10:10

piRSquared