I have an Excel file (.xlsx) with about 800 rows and 128 columns of fairly dense data. There are roughly 9,500 cells whose values I am trying to replace using a pandas DataFrame:
import pandas

xlsx = pandas.ExcelFile(filename)
frame = xlsx.parse(xlsx.sheet_names[0])
media_frame = frame[media_headers] # just get the cols that need replacing
from_filenames = get_from_filenames() # returns ~9500 filenames to replace in DF
to_filenames = get_to_filenames()
media_frame = media_frame.replace(from_filenames, to_filenames)
frame.update(media_frame)
frame.to_excel(filename)
The replace() takes 60 seconds. Is there any way to speed this up? This is not huge data or a huge task, and I was expecting pandas to be much faster. FYI, I tried the same processing with the same file as CSV, but the time savings were minimal (the replace() still took about 50 seconds).
strategy
Create a pd.Series representing a map from filenames to filenames, stack our DataFrame, map, then unstack.
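To make the mechanics concrete before the benchmark, here is a tiny hand-made sketch of the pattern (the small frame and values are illustrative, not from the original answer):

import pandas as pd

df = pd.DataFrame({'a': ['x', 'y'], 'b': ['y', 'z']})
m = pd.Series({'x': 'x2', 'y': 'y2', 'z': 'z2'})  # old value -> new value

# stack to a Series, translate every value through m, pivot back
out = df.stack().map(m).unstack()
#     a   b
# 0  x2  y2
# 1  y2  z2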
setup
import pandas as pd
import numpy as np
from string import ascii_letters

# build a 9500 x 800 DataFrame of random 3-letter strings
media_frame = pd.DataFrame(
    pd.DataFrame(
        np.random.choice(list(ascii_letters), 9500 * 800 * 3)
          .reshape(3, -1)
    ).sum().values.reshape(9500, -1))

# unique values and the replacement pairs
u = np.unique(media_frame.values)
from_filenames = pd.Series(u)
to_filenames = from_filenames.str[1:] + from_filenames.str[0]

# mapping series: index = old value, data = new value
m = pd.Series(to_filenames.values, from_filenames.values)
solution
media_frame.stack().map(m).unstack()
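Wired back into the question's script, a minimal sketch might look like this (it assumes frame, media_headers, from_filenames and to_filenames are defined as in the question, with the two filename collections being plain Python lists):

# build the mapping series from the question's lists
m = pd.Series(to_filenames, index=from_filenames)

media_frame = frame[media_headers]
mapped = media_frame.stack().map(m).unstack()

# frame.update() skips NaN, so cells whose value was not in
# from_filenames (and thus mapped to NaN) are left untouched
frame.update(mapped)
frame.to_excel(filename)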
Timings were taken on a 5 x 5 DataFrame, a 100 x 100 DataFrame, and the full 9500 x 800 DataFrame (the timing output itself is not preserved here). For the 9500 x 800 case, map was also compared using a series vs a dict:
d = dict(zip(from_filenames, to_filenames))
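For that comparison, the dict is passed to the same call in place of the series; a sketch:

media_frame.stack().map(d).unstack()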