
pandas: iterating over DataFrame index with loc

Tags: python, indexing, pandas

I can't seem to find the reasoning behind the behaviour of .loc. I know it is label-based, so if I iterate over the DataFrame's Index object, the following minimal example should work. But it doesn't. I googled, of course, but I need additional explanation from someone who already has a grip on indexing.

import datetime
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame(pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'),
                  columns=['Date'])
df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])

for idx in df.index:
    print(df.loc[idx, 'Weekday'])


asked Dec 16 '14 09:12

user3176500

1 Answer

The problem is not in df.loc itself. To evaluate df.loc[idx, 'Weekday'], pandas first builds the row as a Series, and the surprising behavior is due to the way pd.Series tries to cast datetime-like values to Timestamps.

df.loc[0, 'Weekday']

forms the Series

pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))

When pd.Series(...) is called, it tries to cast the data to an appropriate dtype.

If you trace through the code, you'll find that it eventually arrives at these lines in pandas.core.common._possibly_infer_to_datetimelike:

sample = v[:min(3,len(v))]
inferred_type = lib.infer_dtype(sample)

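which inspects the first few elements of the data and tries to infer the dtype. When one of the values is a pd.Timestamp, pandas checks whether all the data can be cast as Timestamps. Indeed, 'Wed' (the parser is case-insensitive, so 'WED' too) can be cast to pd.Timestamp; a bare weekday name is resolved relative to the day the code is run, so the exact date will vary: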

In [138]: pd.Timestamp('Wed')
Out[138]: Timestamp('2014-12-17 00:00:00')

This is the root of the problem, which results in pd.Series returning two Timestamps instead of a Timestamp and a string:

In [139]: pd.Series(np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object))
Out[139]: 
0   2014-01-01
1   2014-12-17
dtype: datetime64[ns]

and thus this returns

In [140]: df.loc[0, 'Weekday']
Out[140]: Timestamp('2014-12-17 00:00:00')

instead of 'WED'.
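Whether this coercion happens depends on the installed pandas version. The output above comes from the 2014-era release used in this answer, and other releases may treat a mixed object array differently. The following is a minimal sketch for checking what your own installation does. The variable names are illustrative, and pd.api.types.infer_dtype is used here as the public counterpart of the internal lib.infer_dtype call quoted above, which is an assumption about how closely the two correspond.

```python
import numpy as np
import pandas as pd

# The same mixed object array that the row lookup builds internally.
values = np.array([pd.Timestamp('2014-01-01 00:00:00'), 'WED'], dtype=object)

# What pandas infers from the raw values, before any casting.
inferred = pd.api.types.infer_dtype(values, skipna=False)

# Build the Series and check whether pandas cast everything to datetimes.
s = pd.Series(values)
was_coerced = s.dtype.kind == 'M'  # 'M' is the NumPy kind code for datetime64

print("inferred type:", inferred)
print("resulting dtype:", s.dtype)
print("string was coerced to a Timestamp:", was_coerced)
```

If the last line prints True, your version behaves as described in this answer and one of the workarounds below is needed.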

Alternative: select the Series df['Weekday'] first:

There are many workarounds; EdChum shows that adding a non-datelike (integer) value to the sample can prevent pd.Series from casting all the values to Timestamps.

Alternatively, you could access df['Weekday'] before using .loc:

for idx in df.index:
    print(df['Weekday'].loc[idx])
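A self-contained sketch of this column-first approach, rebuilding the DataFrame from the question. Selecting the column first means the Series being indexed holds only strings, so no Timestamp is present to trigger datetime inference. The list name weekdays and the Series.items() loop are illustrative additions, not part of the original answer.

```python
import datetime
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame({'Date': pd.date_range(datetime.date(2014, 1, 1),
                                         datetime.date(2014, 1, 15), freq='D')})
df['Weekday'] = df['Date'].apply(lambda x: dict_weekday[x.isoweekday()])

# Column first, then label: the intermediate Series contains only strings.
weekdays = [df['Weekday'].loc[idx] for idx in df.index]

# Series.items() yields (label, value) pairs without a per-row .loc lookup.
for idx, weekday in df['Weekday'].items():
    print(idx, weekday)
```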

Alternative: df.loc[[idx], 'Weekday']:

Another alternative is

for idx in df.index:
    print(df.loc[[idx], 'Weekday'].item())

df.loc[[idx], 'Weekday'] first selects the DataFrame df.loc[[idx]]. For example, when idx equals 0,

In [10]: df.loc[[0]]
Out[10]: 
        Date Weekday
0 2014-01-01     WED

whereas df.loc[0] returns the Series:

In [11]: df.loc[0]
Out[11]: 
Date      2014-01-01
Weekday   2014-12-17
Name: 0, dtype: datetime64[ns]

A Series tries to cast its values to a single useful dtype, whereas a DataFrame can have a different dtype for each column. So when the row is kept as a DataFrame, the Timestamp in the Date column does not affect the dtype of the value in the Weekday column.

The problem is therefore avoided by using an index selector that returns a DataFrame.
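The difference between the two selectors can be inspected directly. This is a small sketch on a three-row stand-in for the question's DataFrame (the shortened data and the variable names are assumptions made for brevity). It prints the dtypes rather than asserting a particular result for the row Series, since that depends on the pandas version.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

df = pd.DataFrame({
    'Date': pd.date_range('2014-01-01', periods=3, freq='D'),
    'Weekday': ['WED', 'THU', 'FRI'],
})

row_as_series = df.loc[0]    # scalar label: one dtype shared by the whole row
row_as_frame = df.loc[[0]]   # list of labels: each column keeps its own dtype

print(row_as_series.dtype)
print(row_as_frame.dtypes)

# The one-row DataFrame still has a datetime column and a separate string column.
date_is_datetime = is_datetime64_any_dtype(row_as_frame['Date'])
weekday_is_datetime = is_datetime64_any_dtype(row_as_frame['Weekday'])

# .item() extracts the single value from the one-element Weekday column.
weekday = row_as_frame['Weekday'].item()
print(weekday)
```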

Alternative: use integers for Weekday

Yet another alternative is to store the isoweekday integer in Weekday, and convert to strings only at the end when you print:

import datetime
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame(pd.date_range(datetime.date(2014, 1, 1), datetime.date(2014, 1, 15), freq='D'),
                  columns=['Date'])
df['Weekday'] = df['Date'].dt.weekday+1   # add 1 for isoweekday

for idx in df.index:
    print(dict_weekday[df.loc[idx, 'Weekday']])
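If the loop is only there to translate the integers, the conversion can also be done in one step with Series.map. This is a sketch of that variation, not something the original answer proposes. The name labels is illustrative.

```python
import pandas as pd

dict_weekday = {1: 'MON', 2: 'TUE', 3: 'WED', 4: 'THU', 5: 'FRI', 6: 'SAT', 7: 'SUN'}
df = pd.DataFrame({'Date': pd.date_range('2014-01-01', '2014-01-15', freq='D')})

# .dt.weekday counts Monday as 0, so add 1 to match isoweekday().
df['Weekday'] = df['Date'].dt.weekday + 1

# Translate every integer to its label at once instead of row by row.
labels = df['Weekday'].map(dict_weekday)
print(labels.tolist())
```

Because the stored column is numeric, a row Series built from it cannot contain a string that parses as a date.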

Alternative: use df.ix:

df.loc is a _LocIndexer, whereas df.ix is a _IXIndexer. They have different __getitem__ methods. If you step through the code (for example, using pdb) you'll find that df.ix calls df.get_value:

def __getitem__(self, key):
    if type(key) is tuple:
        try:
            values = self.obj.get_value(*key)

and the DataFrame method df.get_value succeeds in returning 'WED':

In [14]: df.get_value(0, 'Weekday')
Out[14]: 'WED'

This is why df.ix is another alternative that works here.
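Both df.ix and DataFrame.get_value were deprecated in later pandas releases and removed in pandas 1.0, so this workaround only applies to old installations. On current versions the documented scalar accessor df.at serves a similar purpose: it looks up a single cell by row and column label. A minimal sketch on a three-row stand-in for the question's DataFrame (the shortened data is an assumption made for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2014-01-01', periods=3, freq='D'),
    'Weekday': ['WED', 'THU', 'FRI'],
})

# .at takes exactly one row label and one column label and returns the cell.
first = df.at[0, 'Weekday']

# The same loop as in the question, using .at instead of .loc.
weekdays = [df.at[idx, 'Weekday'] for idx in df.index]
print(first, weekdays)
```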

answered Oct 18 '22 19:10

unutbu
