I'm trying to create a PostgreSQL table of 30-minute data for the S&P 500 ETF (spy30new, for testing freshly inserted data) from a table of several stocks with 15-minute data (all15). all15 has an index on 'dt' (timestamp) and 'instr' (stock symbol). I would like spy30new to have an index on 'dt'.
import numpy as np
import pandas as pd
from datetime import datetime, date, time, timedelta
from dateutil import parser
from sqlalchemy import create_engine
# Query all15
engine = create_engine('postgresql://user:passwd@localhost:5432/stocks')
new15Df = (pd.read_sql_query("SELECT dt, o, h, l, c, v FROM all15 WHERE (instr = 'SPY') AND (date(dt) BETWEEN '2016-06-27' AND '2016-07-15');", engine)).sort_values('dt')
# Correct for Time Zone.
new15Df['dt'] = (new15Df['dt'].copy()).apply(lambda d: d + timedelta(hours=-4))
# spy0030Df contains the 15-minute data at 00 & 30 minute time points
# spy1545Df contains the 15-minute data at 15 & 45 minute time points
spy0030Df = (new15Df[new15Df['dt'].apply(lambda d: d.minute % 30) == 0]).reset_index(drop=True)
spy1545Df = (new15Df[new15Df['dt'].apply(lambda d: d.minute % 30) == 15]).reset_index(drop=True)
high = pd.concat([spy1545Df['h'], spy0030Df['h']], axis=1).max(axis=1)
low = pd.concat([spy1545Df['l'], spy0030Df['l']], axis=1).min(axis=1)
volume = spy1545Df['v'] + spy0030Df['v']
# spy30Df assembled and pushed to PostgreSQL as table spy30new
spy30Df = pd.concat([spy0030Df['dt'], spy1545Df['o'], high, low, spy0030Df['c'], volume], ignore_index = True, axis=1)
spy30Df.columns = ['dt', 'o', 'h', 'l', 'c', 'v']
spy30Df.set_index(['dt'], inplace=True)
spy30Df.to_sql('spy30new', engine, if_exists='append', index_label='dt')
This gives the error "ValueError: Cannot cast DatetimeIndex to dtype datetime64[us]"
What I've tried so far (I have successfully pushed CSV files to PostgreSQL using pandas, but here the source is a PostgreSQL database):
1. Not placing an index on 'dt'
spy30Df.set_index(['dt'], inplace=True)  # Remove this line
spy30Df.to_sql('spy30new', engine, if_exists='append')  # Drop the index_label option
2. Converting 'dt' from pandas Timestamp (pandas.tslib.Timestamp) to datetime.datetime using to_pydatetime()
(in case psycopg2 can work with Python datetime, but not pandas Timestamp)
u = spy0030Df['dt'].tolist()
timesAsPyDt = np.asarray([d.to_pydatetime() for d in u])  # list comprehension instead of map() so this also works on Python 3
spy30Df = pd.concat([spy1545Df['o'], high, low, spy0030Df['c'], volume], ignore_index = True, axis=1)
newArray = np.c_[timesAsPyDt, spy30Df.values]
colNames = ['dt', 'o', 'h', 'l', 'c', 'v']
newDf = pd.DataFrame(newArray, columns=colNames)
newDf.set_index(['dt'], inplace=True)
newDf.to_sql('spy30new', engine, if_exists='append', index_label='dt')
3. Using datetime.utcfromtimestamp()
timesAsDt = spy0030Df['dt'].apply(lambda d: datetime.utcfromtimestamp(d.value / 1e9))  # Timestamp.value is nanoseconds since the epoch
4. Using pd.to_datetime() on the entire series
timesAsDt = pd.to_datetime(spy0030Df['dt'])
Using pd.to_datetime() on each element worked. Option 4, which applies pd.to_datetime() to the entire series, does not. Perhaps the Postgres driver understands Python datetime but not the datetime64 used by pandas and NumPy. Option 4 produced correct-looking output, but sending the DataFrame to Postgres raised the ValueError from the title.
timesAsPyDt = (spy0030Df['dt']).apply(lambda d: pd.to_datetime(str(d)))
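For completeness, here is a minimal sketch of how that per-element conversion might slot into the original flow; the CREATE INDEX statement and the index name spy30new_dt_idx are my own additions for the desired index on 'dt' (to_sql may already create one when it first creates the table):
from sqlalchemy import text
# Per-element conversion to tz-naive timestamps before assembling the frame
times = spy0030Df['dt'].apply(lambda d: pd.to_datetime(str(d)))
spy30Df = pd.concat([times, spy1545Df['o'], high, low, spy0030Df['c'], volume], ignore_index=True, axis=1)
spy30Df.columns = ['dt', 'o', 'h', 'l', 'c', 'v']
spy30Df.set_index('dt', inplace=True)
spy30Df.to_sql('spy30new', engine, if_exists='append', index_label='dt')
# Add a database index on dt if one was not created automatically (index name is arbitrary)
with engine.begin() as conn:
    conn.execute(text("CREATE INDEX IF NOT EXISTS spy30new_dt_idx ON spy30new (dt)"))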
I had the same problem, and applying pd.to_datetime() on each element worked for me as well. But it is orders of magnitude slower than running pd.to_datetime() on the entire series. For a DataFrame with over 1 million rows:
(df['Time']).apply(lambda d: pd.to_datetime(str(d)))
takes approximately 70 seconds
and
pd.to_datetime(df['Time'])
takes approximately 0.01 seconds
The actual problem is that timezone information is being included. To remove it:
t = pd.to_datetime(df['Time'])
t = t.dt.tz_localize(None)  # use .dt to strip the timezone from the values (tz_localize alone targets the index)
This should be much faster!
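Applied to the DataFrame from the question, a minimal sketch of this fast path (assuming spy30Df ends up with a timezone-aware DatetimeIndex called 'dt') might look like:
# Strip timezone information from the index before writing, but only if it is tz-aware
if spy30Df.index.tz is not None:
    spy30Df.index = spy30Df.index.tz_localize(None)
spy30Df.to_sql('spy30new', engine, if_exists='append', index_label='dt')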
For context, this was my data frame (note the timezone-aware index):
Biomass Fossil Brown coal/Lignite Fossil Coal-derived gas Fossil Gas Fossil Hard coal Fossil Oil Geothermal Hydro Pumped Storage Hydro Run-of-river and poundage Hydro Water Reservoir Nuclear Other Other renewable Solar Waste Wind Offshore Wind Onshore
2018-02-02 00:00:00+01:00 4835.0 16275.0 446.0 1013.0 4071.0 155.0 5.0 7.0 1906.0 35.0 8924.0 3643.0 142.0 0.0 595.0 2517.0 19999.0
2018-02-02 00:15:00+01:00 4834.0 16272.0 446.0 1010.0 3983.0 155.0 5.0 7.0 1908.0 71.0 8996.0 3878.0 142.0 0.0 594.0 2364.0 19854.0
2018-02-02 00:30:00+01:00 4828.0 16393.0 446.0 1019.0 4015.0 155.0 5.0
I was trying to insert it into an SQL database and was getting the same error as in the question above. What I did was convert the index of the data frame into a column labeled 'index':
df.reset_index(level=0, inplace=True)
Then rename the 'index' column to 'DateTime':
df = df.rename(columns={'index': 'DateTime'})
Change the data type to datetime64[ns]:
df['DateTime'] = df['DateTime'].astype('datetime64[ns]')
Finally, store it in the SQL database using this code:
engine = create_engine('mysql+mysqlconnector://root:Password@localhost/generation_data', echo=True)
df.to_sql(con=engine, name='test', if_exists='replace')
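As a quick sanity check (using the same table name test and engine as in the snippet above), the freshly written rows can be read back with pandas:
# Read back a few rows to confirm the DateTime column round-trips cleanly
check = pd.read_sql_query("SELECT * FROM test LIMIT 5", engine)
print(check.dtypes)
print(check.head())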