dask dataframe how to convert column to to_datetime

Tags:

python

pandas

dask

I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code:

import dask.dataframe as dd
df['time'].map_partitions(pd.to_datetime, columns='time').compute()

But I am getting the following error message

ValueError: Metadata inference failed, please provide `meta` keyword

What exactly should I put under meta? Should I pass a dictionary of ALL the columns in df, or only the 'time' column? And what type should I use? I have tried dtype and datetime64, but neither has worked so far.

Thank you and I appreciate your guidance,

Update

I will include here the new error messages:

1) Using Timestamp

df['trd_exctn_dt'].map_partitions(pd.Timestamp).compute()

TypeError: Cannot convert input to Timestamp

2) Using datetime and meta

meta = ('time', pd.Timestamp)
df['time'].map_partitions(pd.to_datetime,meta=meta).compute()
TypeError: to_datetime() got an unexpected keyword argument 'meta'

3) Just using to_datetime: gets stuck at 2%

df['trd_exctn_dt'].map_partitions(pd.to_datetime).compute()
[                                        ] | 2% Completed |  2min 20.3s

Also, I would like to be able to specify the date format, as I would in pandas:

pd.to_datetime(df['time'], format='%m%d%Y')

Update 2

After updating to Dask 0.11, I no longer have problems with the meta keyword. Still, I can't get it past 2% on a 2GB dataframe.

df['trd_exctn_dt'].map_partitions(pd.to_datetime, meta=meta).compute()
[                                        ] | 2% Completed |  30min 45.7s

Update 3

It worked better this way:

def parse_dates(df):
  return pd.to_datetime(df['time'], format = '%m/%d/%Y')

df.map_partitions(parse_dates, meta=meta)

I'm not sure whether it's the right approach or not

969

asked Sep 20 '16 00:09

dleal

5 Answers

Use `astype`

You can use the astype method to convert the dtype of a series to a NumPy dtype:

df.time.astype('M8[us]')

There is probably a way to specify a Pandas-style dtype as well (edits welcome).
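As a small sketch of the dtype spelling, run on plain pandas rather than Dask: 'M8[us]' is NumPy shorthand for 'datetime64[us]', so the longer string can be passed to astype as well. The sample values below are made up for illustration and are ISO-formatted on purpose, since astype does not take a format argument.

```python
import numpy as np
import pandas as pd

# 'M8[...]' and 'datetime64[...]' name the same NumPy dtype.
same_dtype = np.dtype("M8[us]") == np.dtype("datetime64[us]")

# Hypothetical ISO-formatted strings, standing in for one partition of df.time.
s = pd.Series(["2016-09-20 00:09:00", "2022-10-18 18:10:00"], name="time")

# The long spelling works with astype just like the 'M8' shorthand.
converted = s.astype("datetime64[ns]")
print(same_dtype, converted.dtype)
```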

Use map_partitions and meta

When using black-box methods like map_partitions, dask.dataframe needs to know the type and names of the output. There are a few ways to do this listed in the docstring for map_partitions.

You can supply an empty Pandas object with the right dtype and name

meta = pd.Series([], name='time', dtype='datetime64[ns]')

Or you can provide a tuple of (name, dtype) for a Series, or a dict for a DataFrame:

meta = ('time', 'datetime64[ns]')

Then everything should be fine

df.time.map_partitions(pd.to_datetime, meta=meta)

If you were calling map_partitions on df instead then you would need to provide the dtypes for everything. That isn't the case in your example though.

200

answered Oct 18 '22 18:10

MRocklin

Dask also comes with to_datetime, so this should work as well.

df['time']=dd.to_datetime(df.time,unit='ns')

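The values that unit accepts are the same as for pd.to_datetime in pandas, and they are listed in the pandas documentation. Note that unit applies when the column holds numeric epoch values; for strings, pass format instead.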

answered Oct 18 '22 18:10

Arundathi

I'm not sure if this is the right approach, but mapping the column worked for me:

df['time'] = df['time'].map(lambda x: pd.to_datetime(x, errors='coerce'))
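Because each Dask partition is a pandas Series, the effect of errors='coerce' can be checked on plain pandas first. The sketch below uses made-up values and calls pd.to_datetime once on the whole Series, which is generally faster than calling it once per element through map.

```python
import pandas as pd

# Hypothetical partition with one value that cannot be parsed.
part = pd.Series(["09/20/2016", "not a date", "10/18/2016"], name="time")

# errors='coerce' turns unparseable entries into NaT instead of raising.
coerced = pd.to_datetime(part, format="%m/%d/%Y", errors="coerce")

print(coerced.isna().tolist())  # [False, True, False]
```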

answered Oct 18 '22 17:10

tmsss

This worked for me

ddf["Date"] = ddf["Date"].map_partitions(pd.to_datetime, format='%d/%m/%Y', meta=('Date', 'datetime64[ns]'))

answered Oct 18 '22 17:10

citynorman

If the datetime is in a non-ISO format, then map_partitions yields better results:

import dask
import pandas as pd
from dask.distributed import Client
client = Client()

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                 .apply(lambda x: x[1] + ' ' + x[0], meta=('object')))

%%timeit
ddf.datetime = ddf.datetime.astype('M8[s]')
ddf.compute()

11.3 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                 .apply(lambda x: x[1] + ' ' + x[0], meta=('object')))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.map_partitions(
    pd.to_datetime, format='%H:%M:%S %Y-%m-%d', meta=('datetime64[s]'))
ddf.compute()

8.78 s ± 599 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ddf = dask.datasets.timeseries()
ddf = ddf.assign(datetime=ddf.index.astype(object))
ddf = ddf.assign(datetime_nonISO=ddf['datetime'].astype(str).str.split(' ')
                 .apply(lambda x: x[1] + ' ' + x[0], meta=('object')))

%%timeit
ddf.datetime_nonISO = ddf.datetime_nonISO.astype('M8[s]')
ddf.compute()

1min 8s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

answered Oct 18 '22 17:10

skibee
