I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta I get an error: AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do that for future reference.
Dummy dataframe and the column frequencies
import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
dask_df = dd.from_pandas(df, npartitions=1)
In [39]: dask_df.head()
Out[39]:
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18
In [60]: (dask_df.groupby('Column B')
          .apply(lambda group: len(group))).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8')) for series result
  warnings.warn(msg)
Out[60]:
Column B
Alice    2
David    2
Sam      3
dtype: int64
Trying to define meta produces an AttributeError:

(dask_df.groupby('Column B')
 .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()
The same happens for this:

(dask_df.groupby('Column B')
 .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()
It's also the same if I use the dtype int instead of "int", or for that matter 'f8' or np.float64, so it doesn't seem to be the dtype that is causing the problem.
The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata). What is meta, and how am I supposed to define it?
Using Python 3.6, dask 0.14.3 and pandas 0.20.2.
meta is the prescription of the names/types of the output from the computation. This is required because apply() is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta, then dask actually computes part of the data to see what the types should be, which is fine, but you should know it is happening. You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.
The output of your computation is actually a series, so the simplest thing that works is:

(dask_df.groupby('Column B')
 .apply(len, meta=('Column B', 'int'))).compute()
but more accurate would be:

(dask_df.groupby('Column B')
 .apply(len, meta=pd.Series(dtype='int', name='Column B'))).compute()