I can't find the reasoning behind the behaviour of .loc. I know it is label-based, so if I iterate over the Index object, the following minimal example should work. But it doesn't. I googled, of course, but I need additional explanation from someone who already has a grip on indexing.
import datetime
import pandas as pd
dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame(pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'), columns=['Date'])
df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])
for idx in df.index:
    print(df.loc[idx, 'Weekday'])
DataFrame.iterrows() iterates over DataFrame rows. It yields (index, Series) pairs, where index is the row's label and the Series holds the row's data. To get a value from the Series, use the column name, for example row["Fee"].
You can iterate over the rows of a DataFrame with either of the built-in methods iterrows() or itertuples().
itertuples() returns a tuple for each row in the DataFrame. The first element of the tuple is the row's index value, and the remaining elements are the row's values.
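The two iteration methods described above can be compared side by side. This is a minimal sketch on a hypothetical two-row frame (the frame and the IsoDay column are made up for illustration and are not part of the question):

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the two iteration styles.
frame = pd.DataFrame({"Weekday": ["WED", "THU"], "IsoDay": [3, 4]})

# iterrows() yields (index label, row as a Series); values are read by column name.
rows_seen = [(label, row["Weekday"]) for label, row in frame.iterrows()]

# itertuples() yields one namedtuple per row; its first field, Index, is the index label.
tuples_seen = [(tup.Index, tup.Weekday) for tup in frame.itertuples()]

print(rows_seen)
print(tuples_seen)
```

Note that iterrows() builds a Series for every row, so each row is squeezed into a single dtype, while itertuples() keeps the values as they are stored in their columns.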
The .loc attribute is used to access a group of rows and columns by label(s) or a boolean array in the given Series object. For example, Series.loc can select values from a Series by their labels.
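A short sketch of label-based and boolean selection with Series.loc. The Series and its labels "a", "b", "c" are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Hypothetical Series with string labels.
weekdays = pd.Series(["WED", "THU", "FRI"], index=["a", "b", "c"])

one_value = weekdays.loc["b"]                  # single label -> scalar
by_labels = weekdays.loc[["a", "c"]]           # list of labels -> Series
by_mask = weekdays.loc[weekdays != "THU"]      # boolean array -> Series

print(one_value)
print(by_labels)
print(by_mask)
```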
The problem is not in df.loc; df.loc[idx, 'Weekday'] is just returning a value from a Series.
The surprising behavior is due to the way pd.Series tries to cast datetime-like values to Timestamps.
df.loc[0, 'Weekday'] forms the Series
pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))
When pd.Series(...) is called, it tries to cast the data to an appropriate dtype.
If you trace through the code, you'll find that it eventually arrives at these lines in pandas.core.common._possibly_infer_to_datetimelike:
sample = v[:min(3,len(v))]
inferred_type = lib.infer_dtype(sample)
which is inspecting the first few elements of the data and trying to infer the dtype.
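The kind of inference performed on the sample can be explored with the public helper pandas.api.types.infer_dtype. Treating it as a stand-in for the internal lib.infer_dtype call quoted above is an assumption, so take this as a rough sketch rather than a trace of the actual code path:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Three small samples: all dates, all strings, and the mixed row from the question.
samples = {
    "only_dates": [pd.Timestamp("2014-01-01"), pd.Timestamp("2014-01-02")],
    "only_strings": ["WED", "THU"],
    "mixed_row": [pd.Timestamp("2014-01-01"), "WED"],
}

# infer_dtype returns a short string describing what the values look like.
inferred = {name: infer_dtype(values, skipna=False) for name, values in samples.items()}

for name, label in inferred.items():
    print(name, "->", label)
```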
When one of the values is a pd.Timestamp, Pandas checks to see if all the data can be cast as Timestamps. Indeed, 'Wed' can be cast to pd.Timestamp:
In [138]: pd.Timestamp('Wed')
Out[138]: Timestamp('2014-12-17 00:00:00')
This is the root of the problem, which results in pd.Series returning two Timestamps instead of a Timestamp and a string:
In [139]: pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))
Out[139]:
0 2014-01-01
1 2014-12-17
dtype: datetime64[ns]
and thus this returns
In [140]: df.loc[0, 'Weekday']
Out[140]: Timestamp('2014-12-17 00:00:00')
instead of 'WED'.
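Whether this coercion happens depends on the installed pandas version. The following sketch checks it directly; the expectation that newer releases keep the mixed row as object dtype is an assumption to verify locally, so no output is shown:

```python
import numpy as np
import pandas as pd

# The same mixed row that df.loc[0] has to represent: a Timestamp and a string.
mixed = np.array([pd.Timestamp("2014-01-01"), "WED"], dtype=object)
row = pd.Series(mixed, index=["Date", "Weekday"])

# If the string was coerced, the dtype is no longer object and the label 'WED' is lost.
was_coerced = row.dtype != object

print(pd.__version__, row.dtype, repr(row["Weekday"]))
```

If was_coerced is True, the workarounds below apply; if it is False, df.loc[idx, 'Weekday'] should already return the string.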
Alternative: select the Series df['Weekday'] first:
There are many workarounds; EdChum shows that adding a non-datelike (integer) value to the sample can prevent pd.Series from casting all the values to Timestamps.
Alternatively, you could access df['Weekday'] before using .loc:
for idx in df.index:
    print(df['Weekday'].loc[idx])
Alternative: df.loc[[idx], 'Weekday']:
Another alternative is
for idx in df.index:
    print(df.loc[[idx], 'Weekday'].item())
df.loc[[idx], 'Weekday'] first selects the DataFrame df.loc[[idx]]. For example, when idx equals 0,
In [10]: df.loc[[0]]
Out[10]:
Date Weekday
0 2014-01-01 WED
whereas df.loc[0] returns the Series:
In [11]: df.loc[0]
Out[11]:
Date 2014-01-01
Weekday 2014-12-17
Name: 0, dtype: datetime64[ns]
A Series tries to cast its values to a single useful dtype. DataFrames can have a different dtype for each column, so the Timestamp in the Date column does not affect the dtype of the value in the Weekday column.
So the problem was avoided by using an index selector which returns a DataFrame.
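The difference between the two selectors can be seen by inspecting the types and dtypes they return. A minimal sketch on a shortened, hypothetical version of the frame (three rows instead of fifteen):

```python
import pandas as pd

# Shortened stand-in for the frame in the question.
df = pd.DataFrame({"Date": pd.date_range("2014-01-01", periods=3, freq="D")})
df["Weekday"] = ["WED", "THU", "FRI"]

as_frame = df.loc[[0]]   # list selector -> one-row DataFrame
as_series = df.loc[0]    # scalar selector -> the row as a Series

print(as_frame.dtypes)   # one dtype per column
print(as_series.dtype)   # a single dtype for the whole row

# Selecting through the one-row DataFrame keeps the Weekday column's own dtype.
value = df.loc[[0], "Weekday"].item()
print(value)
```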
Alternative: use integers for Weekday
Yet another alternative is to store the isoweekday integer in Weekday, and convert to strings only at the end when you print:
import datetime
import pandas as pd
dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame(pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'), columns=['Date'])
df['Weekday'] = df['Date'].dt.weekday+1 # add 1 for isoweekday
for idx in df.index:
    print(dict_weekday[df.loc[idx, 'Weekday']])
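If the labels are needed for the whole column rather than one row at a time, the integer column can be translated in a single step with Series.map. This is a sketch of a variation on the loop above, not something the original answer proposes:

```python
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}

df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})
df['Weekday'] = df['Date'].dt.weekday + 1   # Monday=0 in pandas, so add 1 for isoweekday

# Translate every integer to its label at once; df['Weekday'] itself stays numeric.
labels = df['Weekday'].map(dict_weekday)

print(labels.tolist())
```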
Alternative: use df.ix:
df.loc is a _LocIndexer, whereas df.ix is a _IXIndexer. They have different __getitem__ methods. If you step through the code (for example, using pdb) you'll find that df.ix calls df.get_value:
def __getitem__(self, key):
    if type(key) is tuple:
        try:
            values = self.obj.get_value(*key)
and the DataFrame method df.get_value succeeds in returning 'WED':
In [14]: df.get_value(0, 'Weekday')
Out[14]: 'WED'
This is why df.ix is another alternative that works here.
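Both df.ix and DataFrame.get_value were removed in pandas 1.0, so this route is only available on older versions. On newer versions the scalar accessor df.at is the closest label-based equivalent. The sketch below assumes, like get_value, that it looks the value up in the column directly rather than through a row Series:

```python
import datetime
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}

df = pd.DataFrame(
    pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'),
    columns=['Date'])
df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])

# df.at takes exactly one row label and one column label and returns a scalar.
for idx in df.index:
    print(df.at[idx, 'Weekday'])
```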