 

Renaming columns in dask dataframe

I have two questions about dask. First: the dask documentation states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Why am I getting the errors below?

df = pd.DataFrame(dictionary)
df


# I am not sure how to choose values for divisions, meta, and name, or what they actually do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf


cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

The pandas dataframe displays the expected data, but when I call ddf.compute() I get an empty dataframe.


My second question: I am confused about how to assign divisions, meta, and name. Do these help or hurt when I use dask to parallelize on a single machine versus a cluster?

Matt Elgazar asked Dec 17 '18 07:12



2 Answers

Regarding the renaming, this is how I usually change feature names when using dask. Perhaps it will work for you too:

new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))

As for determining the number of partitions, the documentation gives a good example that uses time series data to decide how to divide the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions.

Sam Comber answered Oct 12 '22 23:10


I could not get this line to work (because I was passing dictionary as a basic Python dictionary, which is not the right input):

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So, I had to create some dummy data and use that in my approach to creating a dask dataframe.

Generate dummy data in a dictionary

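import pandas as pd
import dask.bag as db

# Dummy data: each key is a column label, each value a two-row column
d = {0: [388]*2,
     1: [387]*2,
     2: [386]*2,
     3: [385]*2,
     5: [384]*2,
     '2012-06-13': [389]*2,
     '2012-06-14': [389]*2}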

Create a Dask dataframe from the dictionary via a dask bag

  • This means you must first use pandas to convert the dictionary to a pandas DataFrame, then call .to_dict(..., orient='records') to get the sequence (a list of row-wise dictionaries) needed to create a dask bag

So, here is how I created the required sequence

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert dask bag to dask dataframe

df = dask_bag.to_dataframe()

Rename columns in dask dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dask dataframe (the output is no longer () this time)

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389

Notes:

  1. Also from the .rename documentation: inplace is not supported.
  2. I think your renaming dictionary contained strings '0', '1', etc. for the column names that were integers. It could be the case for your data (as is the case with the dummy data here) that the dictionary should just have been integers 0, 1, etc.
  3. Per the dask docs, this approach uses a one-to-one renaming dictionary; column names not included in the dictionary are left unchanged
    • This means you don't need to pass in column names that you do not want to rename
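Notes 2 and 3 can be illustrated without dask at all. The helper below, `preview_rename`, is hypothetical and only emulates how a dictionary-based rename treats labels: a label is replaced when it appears as a key and is left alone otherwise. It is a sketch of that behaviour, not dask's implementation.

```python
def preview_rename(columns, mapping):
    """Hypothetical helper: emulate a dict-based column rename.

    Labels found as keys in `mapping` are replaced; all others are kept.
    """
    return [mapping.get(col, col) for col in columns]


columns = [0, 1, 2, "2012-06-13"]

# String keys never match integer labels, so nothing is renamed.
string_keys = {"0": "Datetime", "1": "col1"}
print(preview_rename(columns, string_keys))

# Integer keys match; labels missing from the mapping stay unchanged.
int_keys = {0: "Datetime", 1: "col1"}
print(preview_rename(columns, int_keys))
```

Checking the type of each label in `df.columns` before building the mapping is a quick way to catch this kind of mismatch.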
edesz answered Oct 12 '22 22:10