Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unable to slice pandas dataframe (with date as key) using date as string

I'm generating an empty dataframe with a series of dates as the index. Data will be added to the dataframe at a later point.

cbd=pd.date_range(start=pd.datetime(2017,01,02),end=pd.datetime(2017,01,30),period=1)

df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

df.head()
           Test1 Test2
2017-01-02   NaN   NaN
2017-01-03   NaN   NaN
2017-01-04   NaN   NaN
2017-01-05   NaN   NaN
2017-01-06   NaN   NaN

A few slicing methods don't seem to work. The following returns a KeyError:

df['2017-01-02']

However any of the following work:

df['2017-01-02':'2017-01-02']
df.loc['2017-01-02']

What am I missing here? Why doesn't the first slice return a result?

like image 943
mpny1 Avatar asked Jan 29 '17 15:01

mpny1


People also ask

How does Pandas handle date in DataFrame?

Pandas has a built-in function called to_datetime()that converts date and time in string format to a DateTime object. As you can see, the 'date' column in the DataFrame is currently of a string-type object. Thus, to_datetime() converts the column to a series of the appropriate datetime64 dtype.


3 Answers

Dual behavior of [] in df[]

  • When you don't use : inside [], then the value(s) inside it will be considered as column(s).
  • And when you use : inside [], then the value(s) inside it will be considered as row(s).

Why the dual nature?

Because most of the time people want to slice the rows instead of slicing the columns.

So they decided that x and y in df[x:y] should correspond to rows,

and x in d[x] or x, y in df[[x,y]] should correspond to column(s).

Example:

df = pd.DataFrame(data = [[1,2,3], [1,2,3], [1,2,3]],
                                 index = ['A','B','C'], columns = ['A','B','C'])
print df

Output:

   A  B  C
A  1  2  3
B  1  2  3
C  1  2  3

Now when you do df['B'], it can mean 2 things:

  • Take the 2nd index B and give you the 2nd row 1 2 3

                     OR
    
  • Take the 2nd column B and give you the 2nd column 2 2 2.

So in order to resolve this conflict and keep it unambiguous df['B'] will always mean that you want the column 'B', if there is no such column then it will throw an Error.

Why does df['2017-01-02'] fails?

It will search for a column '2017-01-02', Because there is no such column, it throws an error.

Why does df.loc['2017-01-02'] works then?

Because .loc[] has syntax of df.loc[row,column] and you can leave out the column if you will, as in your case, it simply means df.loc[row]

like image 61
Mohammad Yusuf Avatar answered Oct 20 '22 13:10

Mohammad Yusuf


There is difference, because use different approaches:

For select one row is necessary loc:

df['2017-01-02']

Docs - partial string indexing:

Warning

The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one):

dft['2013-1-15 12:30:00']

To select a single row, use .loc

In [74]: dft.loc['2013-1-15 12:30:00']
Out[74]: 
A    0.193284
Name: 2013-01-15 12:30:00, dtype: float64

df['2017-01-02':'2017-01-02']

This is pure partial string indexing:

This type of slicing will work on a DataFrame with a DateTimeIndex as well. Since the partial string selection is a form of label slicing, the endpoints will be included. This would include matching times on an included date.

like image 29
jezrael Avatar answered Oct 20 '22 15:10

jezrael


First I have updated your test data (just for info) as it returns an 'invalid token' error. Please see changes here:

cbd=pd.date_range(start='2017-01-02',end='2017-01-30',period=1)
df = pd.DataFrame(data=None,columns=['Test1','Test2'],index=cbd)

Now looking at the first row:

In[1]:

df.head(1)

Out[1]:
          Test1 Test2
2017-01-02  NaN NaN

And then trying the initial slicing approach yields this error:

In[2]:    

df['2017-01-02']

Out[2]:

KeyError: '2017-01-02'

Now try this using the column name:

In[3]:    

df.columns

Out[3]:

Index(['Test1', 'Test2'], dtype='object')

In[4]:

We try 'Test1':

df['Test1']

And get the NaN output from this column.

Out[4]:

2017-01-02    NaN
2017-01-03    NaN
2017-01-04    NaN
2017-01-05    NaN

So the format you are using is designed to be used on the column name unless you use this format df['2017-01-02':'2017-01-02'].

The Pandas docs state "The following selection will raise a KeyError; otherwise this selection methodology would be inconsistent with other selection methods in pandas (as this is not a slice, nor does it resolve to one)".

So as you correctly identified, DataFrame.loc is a label based indexer which yields the output you are looking for:

 In[5]:
df.loc['2017-01-02']

 Out[5]:

Test1    NaN
Test2    NaN
Name: 2017-01-02 00:00:00, dtype: object
like image 1
seemo Avatar answered Oct 20 '22 15:10

seemo