Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get last 5 months of data of a Pandas DataFrame?

I have a Pandas dataframe that was generated by fetching data from a KDB database, using QPython.

First, the Date column is returned as a strange dtype: dtype('<M8[ns]')

df = conn.sync("select Date, Open, High, Low, Close from stocktable", pandas=True)
df["Date"].dtype
# dtype('<M8[ns]')

However, when I inspect the contents of the column, the bottom row displays the dtype as datetime.

0      2017-04-17
1      2017-04-13
2      2017-04-12
3      2017-04-11
4      2017-04-10
5      2017-04-07
6      2017-04-06
7      2017-04-05
8      2017-04-04
9      2017-04-03
10     2017-03-31
11     2017-03-30
          ...    

3180   2004-08-27
3181   2004-08-26
3182   2004-08-25
3183   2004-08-24
3184   2004-08-23
3185   2004-08-20
3186   2004-08-19
Name: Date, dtype: datetime64[ns]

Also, method last() is not working correctly. I ask for the last 5 months worth of data, but all of the data is returned.

# Expected to only return last 5 months of data, but returns it all.
df.set_index("Date").last("5M")

How do I get the last rows of this DataFrame?

like image 389
Leonel Avatar asked Apr 28 '17 11:04

Leonel


People also ask

How can you look at the last 5 records in a DataFrame?

Method 1: Using tail() method DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end). By default n = 5, it return the last 5 rows if the value of n is not passed to the method.

How do I select a specific date range in pandas?

In order to select rows between two dates in pandas DataFrame, first, create a boolean mask using mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date) to represent the start and end of the date range. Then you select the DataFrame that lies within the range using the DataFrame. loc[] method.

How do you drop the last 5 rows in pandas?

Drop Last Row of Pandas DataFrame Using iloc[] iloc[] you can drop the rows from DataFrame and use -1 to drop the last row. For example use df. iloc[:-1,:] to select all rows except the last one and then assign it back to the original variable which ideally drops the last row from DataFrame.


3 Answers

Solved it. The issue was that the data returned by KDB was sorted in DESC order, which confused method last().

The solution is to either add a sort clause to the query (in the Q language, it's with a backtick followed by the keyword xasc)

df = conn.sync("`Date xasc select Date, Open, High, Low, Close from stocktable", pandas=True) \
     .last("5M")

Or else, to sort the data in the Pandas dataframe itself.

df_sorted = stocktable.dataframe() \
    .sort_values(by="Date",ascending=True) \
    .set_index("Date")
    .last("5M")
like image 57
Leonel Avatar answered Sep 28 '22 08:09

Leonel


It works properly for me.

Demo:

In [71]: from pandas_datareader import data as web

In [72]: df = web.DataReader('AAPL', 'yahoo', '2010-04-01')

In [73]: df
Out[73]:
                  Open        High         Low       Close     Volume   Adj Close
Date
2010-04-01  237.410000  238.730003  232.750000  235.969994  150786300   30.572166
2010-04-05  234.980011  238.509998  234.769993  238.489998  171126900   30.898657
2010-04-06  238.200005  240.239998  237.000004  239.540009  111754300   31.034696
2010-04-07  239.549995  241.920010  238.659988  240.600006  157125500   31.172029
2010-04-08  240.440002  241.540001  238.040001  239.950005  143247300   31.087815
2010-04-09  241.430012  241.889996  240.460003  241.789993   83545700   31.326203
2010-04-12  242.199989  243.069996  241.809994  242.290005   83256600   31.390984
2010-04-13  241.860008  242.800003  241.110004  242.430008   76552700   31.409123
2010-04-14  245.280006  245.810005  244.069992  245.690002  101019100   31.831486
2010-04-15  245.779991  249.029999  245.509998  248.920010   94196200   32.249965
...                ...         ...         ...         ...        ...         ...
2017-04-13  141.910004  142.380005  141.050003  141.050003   17652900  141.050003
2017-04-17  141.479996  141.880005  140.869995  141.830002   16424000  141.830002
2017-04-18  141.410004  142.039993  141.110001  141.199997   14660800  141.199997
2017-04-19  141.880005  142.000000  140.449997  140.679993   17271300  140.679993
2017-04-20  141.220001  142.919998  141.160004  142.440002   23251100  142.440002
2017-04-21  142.440002  142.679993  141.850006  142.270004   17245200  142.270004
2017-04-24  143.500000  143.949997  143.179993  143.639999   17099200  143.639999
2017-04-25  143.910004  144.899994  143.869995  144.529999   18290300  144.529999
2017-04-26  144.470001  144.600006  143.380005  143.679993   19769400  143.679993
2017-04-27  143.919998  144.160004  143.309998  143.789993   14106100  143.789993

[1781 rows x 6 columns]

In [74]: df.last('5M')
Out[74]:
                  Open        High         Low       Close    Volume   Adj Close
Date
2016-12-01  110.370003  110.940002  109.029999  109.489998  37086900  109.017344
2016-12-02  109.169998  110.089996  108.849998  109.900002  26528000  109.425578
2016-12-05  110.000000  110.029999  108.250000  109.110001  34324500  108.638987
2016-12-06  109.500000  110.360001  109.190002  109.949997  26195500  109.475358
2016-12-07  109.260002  111.190002  109.160004  111.029999  29998700  110.550697
2016-12-08  110.860001  112.430000  110.599998  112.120003  27068300  111.635996
2016-12-09  112.309998  114.699997  112.309998  113.949997  34402600  113.458090
2016-12-12  113.290001  115.000000  112.489998  113.300003  26374400  112.810902
2016-12-13  113.839996  115.919998  113.750000  115.190002  43733800  114.692743
2016-12-14  115.040001  116.199997  114.980003  115.190002  34031800  114.692743
...                ...         ...         ...         ...       ...         ...
2017-04-13  141.910004  142.380005  141.050003  141.050003  17652900  141.050003
2017-04-17  141.479996  141.880005  140.869995  141.830002  16424000  141.830002
2017-04-18  141.410004  142.039993  141.110001  141.199997  14660800  141.199997
2017-04-19  141.880005  142.000000  140.449997  140.679993  17271300  140.679993
2017-04-20  141.220001  142.919998  141.160004  142.440002  23251100  142.440002
2017-04-21  142.440002  142.679993  141.850006  142.270004  17245200  142.270004
2017-04-24  143.500000  143.949997  143.179993  143.639999  17099200  143.639999
2017-04-25  143.910004  144.899994  143.869995  144.529999  18290300  144.529999
2017-04-26  144.470001  144.600006  143.380005  143.679993  19769400  143.679993
2017-04-27  143.919998  144.160004  143.309998  143.789993  14106100  143.789993

[101 rows x 6 columns]
like image 27
MaxU - stop WAR against UA Avatar answered Sep 28 '22 09:09

MaxU - stop WAR against UA


For me it works nice:

rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})  
print (df)
        Date  a
0 2017-04-03  0
1 2017-04-23  1
2 2017-05-13  2
3 2017-06-02  3
4 2017-06-22  4
5 2017-07-12  5
6 2017-08-01  6
7 2017-08-21  7
8 2017-09-10  8
9 2017-09-30  9

df = df.set_index('Date').last('5M')
print (df)
            a
Date         
2017-05-13  2
2017-06-02  3
2017-06-22  4
2017-07-12  5
2017-08-01  6
2017-08-21  7
2017-09-10  8
2017-09-30  9

It works nice with duplicates also, only is necessary sort DateTime column:

rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})  
df = pd.concat([df,df], ignore_index=True).sort_values('Date')
print (df)
         Date  a
0  2017-04-03  0
10 2017-04-03  0
1  2017-04-23  1
11 2017-04-23  1
2  2017-05-13  2
12 2017-05-13  2
3  2017-06-02  3
13 2017-06-02  3
4  2017-06-22  4
14 2017-06-22  4
5  2017-07-12  5
15 2017-07-12  5
6  2017-08-01  6
16 2017-08-01  6
17 2017-08-21  7
7  2017-08-21  7
18 2017-09-10  8
8  2017-09-10  8
9  2017-09-30  9
19 2017-09-30  9

df = df.set_index('Date').last('5M')
print (df)
            a
Date         
2017-05-13  2
2017-05-13  2
2017-06-02  3
2017-06-02  3
2017-06-22  4
2017-06-22  4
2017-07-12  5
2017-07-12  5
2017-08-01  6
2017-08-01  6
2017-08-21  7
2017-08-21  7
2017-09-10  8
2017-09-10  8
2017-09-30  9
2017-09-30  9
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})  
df = pd.concat([df,df], ignore_index=True)
print (df)
         Date  a
0  2017-04-03  0
1  2017-04-23  1
2  2017-05-13  2
3  2017-06-02  3
4  2017-06-22  4
5  2017-07-12  5
6  2017-08-01  6
7  2017-08-21  7
8  2017-09-10  8
9  2017-09-30  9
10 2017-04-03  0
11 2017-04-23  1
12 2017-05-13  2
13 2017-06-02  3
14 2017-06-22  4
15 2017-07-12  5
16 2017-08-01  6
17 2017-08-21  7
18 2017-09-10  8
19 2017-09-30  9

df = df.set_index('Date').last('5M')
print (df)
            a
Date         
2017-05-13  2
2017-06-02  3
2017-06-22  4
2017-07-12  5
2017-08-01  6
2017-08-21  7
2017-09-10  8
2017-09-30  9
like image 26
jezrael Avatar answered Sep 28 '22 09:09

jezrael