I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code:
import dask.dataframe as dd
import pandas as pd

df['time'].map_partitions(pd.to_datetime, columns='time').compute()
But I get the following error message:
ValueError: Metadata inference failed, please provide `meta` keyword
What exactly should I put under meta? Should I provide a dictionary of ALL the columns in df, or only the 'time' column? And which type should I use? I have tried dtype and datetime64, but neither works so far.
Thank you, I appreciate your guidance.
Update
I will include here the new error messages:
1) Using Timestamp
df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()
TypeError: Cannot convert input to Timestamp
2) Using to_datetime and meta
meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime,meta=meta).compute()
TypeError: to_datetime() got an unexpected keyword argument 'meta'
3) Just using to_datetime: gets stuck at 2%
In [14]: df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[ ] | 2% Completed | 2min 20.3s
Also, I would like to be able to specify the date format, as I would in pandas:
pd.to_datetime(df['time'], format='%m%d%Y')
Update 2
After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.
df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[ ] | 2% Completed | 30min 45.7s
Update 3
It worked better this way:
def parse_dates(df):
    return pd.to_datetime(df['time'], format='%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)
I'm not sure whether it's the right approach or not
astype
You can use the astype method to convert the dtype of a series to a NumPy dtype:
df.time.astype('M8[us]')
There is probably a way to specify a pandas-style dtype as well (edits welcome).
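In recent dask and pandas versions the usual pandas dtype string is accepted directly, which answers the edit request above; a minimal sketch (the sample frame is made up):

import pandas as pd
import dask.dataframe as dd

# A tiny two-partition frame with ISO-formatted date strings.
pdf = pd.DataFrame({'time': ['2016-01-15', '2016-02-29']})
ddf = dd.from_pandas(pdf, npartitions=2)

# The familiar pandas dtype string works as the astype target.
ddf['time'] = ddf['time'].astype('datetime64[ns]')
print(ddf.dtypes)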
When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this, listed in the docstring for map_partitions.
You can supply an empty pandas object with the right dtype and name:
meta = pd.Series([], name='time', dtype='datetime64[ns]')
Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:
meta = ('time', 'datetime64[ns]')
Then everything should be fine
df.time.map_partitions(pd.to_datetime, meta=meta)
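Putting the pieces together, here is a minimal end-to-end sketch; the sample data and the '%m%d%Y' format are assumptions chosen to match the question:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['01152016', '02292016']})
df = dd.from_pandas(pdf, npartitions=2)

# (name, dtype) tuple describing the Series that map_partitions returns.
meta = ('time', 'datetime64[ns]')

# Extra keyword arguments such as format= are forwarded to pd.to_datetime.
result = df.time.map_partitions(pd.to_datetime, format='%m%d%Y', meta=meta)
print(result.compute())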
If you were calling map_partitions on df instead, then you would need to provide the dtypes for everything. That isn't the case in your example, though.
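For completeness, a sketch of the whole-frame case, where meta has to describe every output column; the column names and dtypes here are assumptions:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['2016-01-15', '2016-02-29'], 'value': [1, 2]})
df = dd.from_pandas(pdf, npartitions=2)

def convert(part):
    # Inside map_partitions, part is a plain pandas DataFrame.
    part = part.copy()
    part['time'] = pd.to_datetime(part['time'])
    return part

# An empty pandas DataFrame with the right names and dtypes serves as meta.
meta = pd.DataFrame({'time': pd.Series(dtype='datetime64[ns]'),
                     'value': pd.Series(dtype='int64')})
df = df.map_partitions(convert, meta=meta)
print(df.dtypes)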
Dask also comes with its own dd.to_datetime (alongside dd.to_timedelta), so this should work as well:
df['time'] = dd.to_datetime(df.time, unit='ns')
The values unit takes are the same as for pd.to_timedelta in pandas; see the pandas documentation for details.
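Since dd.to_datetime forwards its keyword arguments to pd.to_datetime, this also covers the format= requirement from the question; a short sketch with made-up data:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'time': ['01152016', '02292016']})
df = dd.from_pandas(pdf, npartitions=2)

# format= is passed straight through to pd.to_datetime.
df['time'] = dd.to_datetime(df['time'], format='%m%d%Y')
print(df.compute())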
I'm not sure if this is the right approach, but mapping the column worked for me:
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
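If you go this route, it may be worth passing meta explicitly so dask doesn't have to infer it; a sketch reusing df and pd from above (whether your dask version warns without meta may vary):

# Supplying meta avoids metadata inference on the mapped Series.
df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'),
                            meta=('time', 'datetime64[ns]'))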
This worked for me:
ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('Date', 'datetime64[ns]'))
If the datetime is in a non-ISO format, then map_partitions yields better results:
import dask
import pandas as pd
from dask.distributed import Client
client = Client()
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
       .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()
11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
       .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.map_partitions(
    pd.to_datetime, format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]'))
ddf.compute()
8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = (ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
       .apply(lambda x: x[1] + ' ' + x[0], meta=('object'))))
%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()
1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)