Does anyone know why setting an item directly on a pandas Series is so slow? Am I doing something wrong, or is that just how it is?
I ran a few tests to find the fastest way to set values on a pandas Series object. Here are the results, ordered from fastest to slowest:
%%timeit
a = np.empty(1000, dtype='float')
for i in range(len(a)):
    a[i] = 1.0
s = pd.Series(data=a)
1000 loops, best of 3: 630 µs per loop
%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)
1000 loops, best of 3: 1.05 ms per loop
%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)
100 loops, best of 3: 18.5 ms per loop
%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0
10 loops, best of 3: 30.2 ms per loop
%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0
10 loops, best of 3: 36.2 ms per loop
%%timeit
a = np.empty(1000, dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0
1 loops, best of 3: 280 ms per loop
From the docs:
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for.
So I get the following, which should be comparable:
In [13]:
%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iat[i] = 1.0
10 loops, best of 3: 23.3 ms per loop
In [14]:
%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.iloc[i] = 1.0
10 loops, best of 3: 159 ms per loop
For the other tests:
In [15]:
%%timeit
l = []
for i in range(1000):
    l.append(1.0)
s = pd.Series(data=l)
1000 loops, best of 3: 525 µs per loop
In [16]:
%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s.set_value(i, 1.0)
100 loops, best of 3: 10.1 ms per loop
In [17]:
%%timeit
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    s[i] = 1.0
100 loops, best of 3: 17.5 ms per loop
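To rerun this kind of comparison outside IPython, a minimal sketch with the standard `timeit` module might look like the following. The function names and the `results` dict are hypothetical, made up for this illustration, and `set_value` is left out because it was deprecated and later removed from pandas. Absolute numbers will differ by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

N = 1000  # same length as in the tests above


def fill_array_then_wrap():
    # Fill the raw NumPy array first, then build the Series once.
    a = np.empty(N, dtype="float")
    for i in range(len(a)):
        a[i] = 1.0
    return pd.Series(data=a)


def fill_with_iat():
    # Scalar positional setter on the Series.
    s = pd.Series(data=np.empty(N, dtype="float"))
    for i in range(len(s)):
        s.iat[i] = 1.0
    return s


def fill_with_iloc():
    # General positional indexer on the Series.
    s = pd.Series(data=np.empty(N, dtype="float"))
    for i in range(len(s)):
        s.iloc[i] = 1.0
    return s


candidates = {
    "array then Series": fill_array_then_wrap,
    "Series.iat": fill_with_iat,
    "Series.iloc": fill_with_iloc,
}

results = {}
for name, func in candidates.items():
    # Best of 3 repeats, 5 calls each, reported as time per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    results[name] = best
    print(f"{name:>18}: {best * 1e3:.3f} ms per loop")
```

Taking the minimum over the repeats mirrors the "best of 3" figure that `%%timeit` reports in the cells above.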
I figured out how to get past the indexing overhead when setting values on a series object directly:
a = np.empty(1000,dtype='float')
s = pd.Series(data=a)
for i in range(len(a)):
    a[i] = 1.0
When a Series is initialized from a NumPy array, the data is not copied (at least in the pandas version tested here). If you keep a reference to the original array, you can set values on it directly and the Series will see them.
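Since this trick relies on the Series and the array sharing a buffer, it may be worth checking that explicitly. The sketch below is an assumption-laden illustration, not a guarantee about any particular pandas release: whether the constructor copies can depend on the pandas version and its Copy-on-Write settings, so it passes `copy=False` and verifies the result with `np.shares_memory`. The variable name `shares` is hypothetical.

```python
import numpy as np
import pandas as pd

a = np.empty(1000, dtype="float")

# Ask pandas explicitly not to copy the input array.
s = pd.Series(data=a, copy=False)

# Check whether the Series really is backed by the same memory as `a`.
shares = np.shares_memory(a, s.to_numpy())
print("Series shares memory with the array:", shares)

# Write through the plain NumPy array, skipping pandas indexing entirely.
for i in range(len(a)):
    a[i] = 1.0

if not shares:
    # Fallback: the data was copied, so rebuild the Series from the filled array.
    s = pd.Series(data=a)

print(s.head())
```

If the check reports that memory is not shared, filling the array first and constructing the Series afterwards (the fastest variant in the timings above) gives the same result without depending on this behavior.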