How to specify metadata for dask.dataframe

The docs provide good examples of how metadata can be provided. However, I am still unsure how to pick the right dtypes for my dataframe.

  • Could I do something like meta={'x': int, 'y': float, 'z': float} instead of meta={'x': 'i8', 'y': 'f8', 'z': 'f8'}?
  • Could somebody point me to a list of possible values like 'i8'? What dtypes exist?
  • How can I specify a column that contains arbitrary objects? How can I specify a column that contains only instances of one class?
Arco Bast asked Sep 01 '16 07:09



2 Answers

The basic data types available are those provided by NumPy. Have a look at the NumPy documentation for a list.

Datetime types (e.g. datetime64) are not in that basic list; additional information on them can be found in the pandas and NumPy documentation.

The meta-argument for dask dataframes usually expects an empty pandas dataframe holding definitions for columns, indices and dtypes.

One way to construct such a DataFrame is:

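import numpy as np
import pandas as pd

# Empty frame: only the column names and dtypes matter for meta
meta = pd.DataFrame(columns=['a', 'b', 'c'])
meta['a'] = meta['a'].astype(np.int64)
# datetime64 needs an explicit unit in recent pandas versions
meta['b'] = meta['b'].astype('datetime64[ns]')
# column 'c' keeps the default object dtype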

The pandas DataFrame constructor also accepts a dtype argument, but I am not sure how to use it to give each column its own dtype. As the example shows, you can provide not only the name of a datatype but also the actual NumPy dtype.
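One possible way to set a different dtype for each column in a single step is to call astype with a dict on the empty frame. This is a minimal sketch rather than the only approach; the column names 'x', 'y' and 'z' are taken from the question, and the mix of dtype spellings is only for illustration.

```python
import numpy as np
import pandas as pd

# Build an empty frame, then cast each column with a {column: dtype} mapping.
# Dtype names ('i8'), NumPy types (np.float64) and Python types (float)
# can all be used as values.
meta = pd.DataFrame(columns=['x', 'y', 'z']).astype(
    {'x': 'i8', 'y': np.float64, 'z': float}
)

print(meta.dtypes)
```

The resulting frame has no rows, so it only carries the column names and dtypes.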

Regarding your last question, the datatype you are looking for is "object". For example:

import pandas as pd

class Foo:
    def __init__(self, foo):
        self.bar = foo

df = pd.DataFrame(data=[Foo(1), Foo(2)], columns=['a'], dtype='object')
df.a
# 0    <__main__.Foo object at 0x00000000058AC550>
# 1    <__main__.Foo object at 0x00000000058AC358>
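For use as meta, the same dtype can be declared on an empty frame. The sketch below reuses the Foo class from the example above. It also illustrates that the object dtype does not record which class the stored values belong to, so a column holding only Foo instances has the same dtype as any other object column.

```python
import pandas as pd

class Foo:
    def __init__(self, foo):
        self.bar = foo

# Empty meta frame whose single column 'a' is declared with object dtype.
meta = pd.DataFrame({'a': pd.Series(dtype='object')})

# A populated column of Foo instances has the same dtype; pandas does not
# track the class of the objects it stores.
df = pd.DataFrame({'a': [Foo(1), Foo(2)]})

print(meta.dtypes['a'] == df.dtypes['a'])
```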
sim answered Sep 28 '22 03:09


Both Dask.dataframe and Pandas use NumPy dtypes. In particular, you can use anything that you can pass to np.dtype. This includes the following:

  1. NumPy dtype objects, like np.float64
  2. Python type objects, like float
  3. NumPy dtype strings, like 'f8'

Here is a more extensive list taken from the NumPy docs: http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#specifying-and-constructing-data-types
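A quick way to check that the three spellings in the list above are interchangeable is to pass each of them to np.dtype and compare the results. This is a small sketch. One caveat worth knowing: the Python type int resolves to the platform's default integer, which is not 64-bit everywhere (for example on Windows with NumPy versions before 2.0), so 'i8' or np.int64 is the safer choice when exactly 64 bits are wanted.

```python
import numpy as np

# Three spellings of the same 64-bit float dtype.
specs = [np.float64, float, 'f8']
dtypes = [np.dtype(s) for s in specs]
all_equal = all(d == dtypes[0] for d in dtypes)
print(dtypes, all_equal)

# Python int maps to the platform default integer, which may or may not be
# 64-bit; 'i8' and np.int64 always name the same 64-bit integer dtype.
int_dtype = np.dtype(int)
print(int_dtype, np.dtype('i8') == np.dtype(np.int64))
```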

MRocklin answered Sep 28 '22 04:09