I have a Pandas dataframe that was generated by fetching data from a KDB database, using QPython.
First, the Date column is returned as a strange dtype: dtype('<M8[ns]')
df = conn.sync("select Date, Open, High, Low, Close from stocktable", pandas=True)
df["Date"].dtype
# dtype('<M8[ns]')
However, when I inspect the contents of the column, the bottom row displays the dtype as datetime.
0 2017-04-17
1 2017-04-13
2 2017-04-12
3 2017-04-11
4 2017-04-10
5 2017-04-07
6 2017-04-06
7 2017-04-05
8 2017-04-04
9 2017-04-03
10 2017-03-31
11 2017-03-30
...
3180 2004-08-27
3181 2004-08-26
3182 2004-08-25
3183 2004-08-24
3184 2004-08-23
3185 2004-08-20
3186 2004-08-19
Name: Date, dtype: datetime64[ns]
Also, method last()
is not working correctly. I ask for the last 5 months worth of data, but all of the data is returned.
# Expected to only return last 5 months of data, but returns it all.
df.set_index("Date").last("5M")
How do I get the last rows of this DataFrame?
Method 1: Using tail() method DataFrame. tail(n) to get the last n rows of the DataFrame. It takes one optional argument n (number of rows you want to get from the end). By default n = 5, it return the last 5 rows if the value of n is not passed to the method.
In order to select rows between two dates in pandas DataFrame, first, create a boolean mask using mask = (df['InsertedDates'] > start_date) & (df['InsertedDates'] <= end_date) to represent the start and end of the date range. Then you select the DataFrame that lies within the range using the DataFrame. loc[] method.
Drop Last Row of Pandas DataFrame Using iloc[] iloc[] you can drop the rows from DataFrame and use -1 to drop the last row. For example use df. iloc[:-1,:] to select all rows except the last one and then assign it back to the original variable which ideally drops the last row from DataFrame.
Solved it. The issue was that the data returned by KDB was sorted in DESC order, which confused method last()
.
The solution is to either add a sort clause to the query (in the Q language, it's with a backtick followed by the keyword xasc)
df = conn.sync("`Date xasc select Date, Open, High, Low, Close from stocktable", pandas=True) \
.last("5M")
Or else, to sort the data in the Pandas dataframe itself.
df_sorted = stocktable.dataframe() \
.sort_values(by="Date",ascending=True) \
.set_index("Date")
.last("5M")
It works properly for me.
Demo:
In [71]: from pandas_datareader import data as web
In [72]: df = web.DataReader('AAPL', 'yahoo', '2010-04-01')
In [73]: df
Out[73]:
Open High Low Close Volume Adj Close
Date
2010-04-01 237.410000 238.730003 232.750000 235.969994 150786300 30.572166
2010-04-05 234.980011 238.509998 234.769993 238.489998 171126900 30.898657
2010-04-06 238.200005 240.239998 237.000004 239.540009 111754300 31.034696
2010-04-07 239.549995 241.920010 238.659988 240.600006 157125500 31.172029
2010-04-08 240.440002 241.540001 238.040001 239.950005 143247300 31.087815
2010-04-09 241.430012 241.889996 240.460003 241.789993 83545700 31.326203
2010-04-12 242.199989 243.069996 241.809994 242.290005 83256600 31.390984
2010-04-13 241.860008 242.800003 241.110004 242.430008 76552700 31.409123
2010-04-14 245.280006 245.810005 244.069992 245.690002 101019100 31.831486
2010-04-15 245.779991 249.029999 245.509998 248.920010 94196200 32.249965
... ... ... ... ... ... ...
2017-04-13 141.910004 142.380005 141.050003 141.050003 17652900 141.050003
2017-04-17 141.479996 141.880005 140.869995 141.830002 16424000 141.830002
2017-04-18 141.410004 142.039993 141.110001 141.199997 14660800 141.199997
2017-04-19 141.880005 142.000000 140.449997 140.679993 17271300 140.679993
2017-04-20 141.220001 142.919998 141.160004 142.440002 23251100 142.440002
2017-04-21 142.440002 142.679993 141.850006 142.270004 17245200 142.270004
2017-04-24 143.500000 143.949997 143.179993 143.639999 17099200 143.639999
2017-04-25 143.910004 144.899994 143.869995 144.529999 18290300 144.529999
2017-04-26 144.470001 144.600006 143.380005 143.679993 19769400 143.679993
2017-04-27 143.919998 144.160004 143.309998 143.789993 14106100 143.789993
[1781 rows x 6 columns]
In [74]: df.last('5M')
Out[74]:
Open High Low Close Volume Adj Close
Date
2016-12-01 110.370003 110.940002 109.029999 109.489998 37086900 109.017344
2016-12-02 109.169998 110.089996 108.849998 109.900002 26528000 109.425578
2016-12-05 110.000000 110.029999 108.250000 109.110001 34324500 108.638987
2016-12-06 109.500000 110.360001 109.190002 109.949997 26195500 109.475358
2016-12-07 109.260002 111.190002 109.160004 111.029999 29998700 110.550697
2016-12-08 110.860001 112.430000 110.599998 112.120003 27068300 111.635996
2016-12-09 112.309998 114.699997 112.309998 113.949997 34402600 113.458090
2016-12-12 113.290001 115.000000 112.489998 113.300003 26374400 112.810902
2016-12-13 113.839996 115.919998 113.750000 115.190002 43733800 114.692743
2016-12-14 115.040001 116.199997 114.980003 115.190002 34031800 114.692743
... ... ... ... ... ... ...
2017-04-13 141.910004 142.380005 141.050003 141.050003 17652900 141.050003
2017-04-17 141.479996 141.880005 140.869995 141.830002 16424000 141.830002
2017-04-18 141.410004 142.039993 141.110001 141.199997 14660800 141.199997
2017-04-19 141.880005 142.000000 140.449997 140.679993 17271300 140.679993
2017-04-20 141.220001 142.919998 141.160004 142.440002 23251100 142.440002
2017-04-21 142.440002 142.679993 141.850006 142.270004 17245200 142.270004
2017-04-24 143.500000 143.949997 143.179993 143.639999 17099200 143.639999
2017-04-25 143.910004 144.899994 143.869995 144.529999 18290300 144.529999
2017-04-26 144.470001 144.600006 143.380005 143.679993 19769400 143.679993
2017-04-27 143.919998 144.160004 143.309998 143.789993 14106100 143.789993
[101 rows x 6 columns]
For me it works nice:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
print (df)
Date a
0 2017-04-03 0
1 2017-04-23 1
2 2017-05-13 2
3 2017-06-02 3
4 2017-06-22 4
5 2017-07-12 5
6 2017-08-01 6
7 2017-08-21 7
8 2017-09-10 8
9 2017-09-30 9
df = df.set_index('Date').last('5M')
print (df)
a
Date
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
It works nice with duplicates also, only is necessary sort DateTime
column:
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
df = pd.concat([df,df], ignore_index=True).sort_values('Date')
print (df)
Date a
0 2017-04-03 0
10 2017-04-03 0
1 2017-04-23 1
11 2017-04-23 1
2 2017-05-13 2
12 2017-05-13 2
3 2017-06-02 3
13 2017-06-02 3
4 2017-06-22 4
14 2017-06-22 4
5 2017-07-12 5
15 2017-07-12 5
6 2017-08-01 6
16 2017-08-01 6
17 2017-08-21 7
7 2017-08-21 7
18 2017-09-10 8
8 2017-09-10 8
9 2017-09-30 9
19 2017-09-30 9
df = df.set_index('Date').last('5M')
print (df)
a
Date
2017-05-13 2
2017-05-13 2
2017-06-02 3
2017-06-02 3
2017-06-22 4
2017-06-22 4
2017-07-12 5
2017-07-12 5
2017-08-01 6
2017-08-01 6
2017-08-21 7
2017-08-21 7
2017-09-10 8
2017-09-10 8
2017-09-30 9
2017-09-30 9
rng = pd.date_range('2017-04-03', periods=10, freq='20D')
df = pd.DataFrame({'Date': rng, 'a': range(10)})
df = pd.concat([df,df], ignore_index=True)
print (df)
Date a
0 2017-04-03 0
1 2017-04-23 1
2 2017-05-13 2
3 2017-06-02 3
4 2017-06-22 4
5 2017-07-12 5
6 2017-08-01 6
7 2017-08-21 7
8 2017-09-10 8
9 2017-09-30 9
10 2017-04-03 0
11 2017-04-23 1
12 2017-05-13 2
13 2017-06-02 3
14 2017-06-22 4
15 2017-07-12 5
16 2017-08-01 6
17 2017-08-21 7
18 2017-09-10 8
19 2017-09-30 9
df = df.set_index('Date').last('5M')
print (df)
a
Date
2017-05-13 2
2017-06-02 3
2017-06-22 4
2017-07-12 5
2017-08-01 6
2017-08-21 7
2017-09-10 8
2017-09-30 9
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With