Renaming columns in dask dataframe

Tags: python, pandas, dask

I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below?

df = pd.DataFrame(dictionary)
df


# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf


cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

The pandas dataframe is showing a real dataframe, but when I call ddf.compute() I get an empty dataframe.


My second question is that I am slightly confused about how to assign divisions, meta, and name. How is this useful/hurtful if I use dask to parallelize on a single machine vs a cluster?


asked Dec 17 '18 07:12

Matt Elgazar

2 Answers

Regarding the renaming, this is how I usually change column names when I'm using dask. Perhaps it will work for you too:

new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))

As for determining the number of partitions, the documentation gives a pretty good example, using time series data, of how to decide where to divide the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions.

answered Oct 12 '22 23:10

Sam Comber

I could not get this line to work, because I was passing dictionary as a basic Python dictionary, which is not the right input:

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So, I had to create some dummy data and use that in my approach to creating a dask dataframe.

Generate dummy data in a dictionary

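import pandas as pd
import dask.bag as db

# Dummy data: each key is a column label, each value is a 2-row column
d = {0: [388]*2,
     1: [387]*2,
     2: [386]*2,
     3: [385]*2,
     5: [384]*2,
     '2012-06-13': [389]*2,
     '2012-06-14': [389]*2}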

Create a dask dataframe from a dask bag (rather than directly from the dictionary)

This means you must first use pandas to convert the dictionary to a pandas DataFrame, and then use .to_dict(orient='records') to get the sequence (a list of row-wise dictionaries) that you need to create a dask bag.

So, here is how I created the required sequence:

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert dask bag to dask dataframe

df = dask_bag.to_dataframe()

Rename columns in dask dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dask dataframe (this time the output is not ())

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389

Notes:

- From the .rename documentation: inplace is not supported, so the result has to be assigned back to a variable.
- I think your renaming dictionary used the strings '0', '1', etc. as keys for column names that were actually integers. It could be that for your data (as with the dummy data here) the keys should have been the integers 0, 1, etc.
- Per the dask docs, the renaming dictionary is a 1-to-1 mapping, and column names not included in it are left unchanged. This means you only need to pass the columns you actually want to rename.

answered Oct 12 '22 22:10

edesz

