I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below?
df = pd.DataFrame(dictionary)
df
# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')
ddf
cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}
ddf.rename(columns=cols, inplace=True)
TypeError: rename() got an unexpected keyword argument 'inplace'
OK, so I removed inplace=True and tried this:
ddf = ddf.rename(columns=cols)
ValueError: dictionary update sequence element #0 has length 6; 2 is required
The pandas dataframe is showing a real dataframe, but when I call ddf.compute() I get an empty dataframe.
My second question is that I am slightly confused about how to assign divisions, meta, and name. How is this useful/hurtful if I use dask to parallelize on a single machine vs a cluster?
FWIW, creating a dictionary to remap each column name (even the ones I don't want to change) and then using ddf = ddf.rename(columns=cols) worked just fine for me.
One way to rename columns in a pandas DataFrame is the rename() method. It is useful when only some columns need renaming, because you supply a mapping only for the columns you want to change.
To rename a single column, pass a single key-value pair in the columns dict parameter. Keys that do not match an existing column are ignored, so the result is the same if the dictionary also contains a non-matching mapping.
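As a small illustration of the single-column case, here is a sketch on a toy pandas frame. The data and the "missing" key are made up for the example and are not from the question.

```python
import pandas as pd

# Hypothetical toy frame; column names only loosely mirror the question.
df = pd.DataFrame({"Key": [1, 2], "0": [10, 20], "1": [30, 40]})

# Rename one column. "missing" is not a column, so that entry is ignored
# (pandas rename defaults to errors="ignore").
renamed = df.rename(columns={"Key": "key", "missing": "unused"})

print(list(renamed.columns))
print(list(df.columns))  # the original frame is left untouched
```

Because rename returns a new object, the result has to be assigned to a name, which is the same pattern dask requires.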
Regarding the renaming, this is how I usually go about changing feature names when I'm using dask, perhaps this will work for you too:
new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))
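One thing to watch with the zip approach: zip stops at the shorter input, so a new_columns list of the wrong length silently renames only some columns. A minimal sketch of a guard, where build_rename_map is a hypothetical helper name and the old column names are assumed for illustration:

```python
def build_rename_map(old_names, new_names):
    """Pair old and new column names by position (hypothetical helper)."""
    old_names = list(old_names)
    new_names = list(new_names)
    # zip() would silently truncate, so fail loudly on a length mismatch.
    if len(old_names) != len(new_names):
        raise ValueError(
            f"expected {len(old_names)} new names, got {len(new_names)}"
        )
    return dict(zip(old_names, new_names))


# Assumed existing column names, standing in for df.columns.
old = ["Key", 0, 1, 2, 3, 4, 5]
new = ["key", "Datetime", "col1", "col2", "col3", "col4", "col5"]
mapping = build_rename_map(old, new)
print(mapping)
```

The resulting mapping can then be passed as df.rename(columns=mapping).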
As for determining the number of partitions, the documentation gives a pretty good example using time series data for deciding how to divide the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions.
I could not get this line to work (because I was passing dictionary as a basic Python dictionary, which is not the right input):
ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary, index=list(range(2))), name='ddf')
print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right
So, I had to create some dummy data and use that in my approach to creating a dask dataframe.
Generate dummy data in a dictionary
import pandas as pd
import dask.bag as db
d = {0: [388]*2,
1: [387]*2,
2: [386]*2,
3: [385]*2,
5: [384]*2,
'2012-06-13': [389]*2,
'2012-06-14': [389]*2,}
Create Dask dataframe from dictionary
A dask bag can be created from a sequence. So first convert the dictionary to a pandas DataFrame, and then use .to_dict(..., orient='records') to get the sequence (list of row-wise dictionaries) you need to create a dask bag.
So, here is how I created the required sequence:
d = pd.DataFrame(d, index=list(range(2))).to_dict('records')
print(d)
[{0: 388,
1: 387,
2: 386,
3: 385,
5: 384,
'2012-06-13': 389,
'2012-06-14': 389},
{0: 388,
1: 387,
2: 386,
3: 385,
5: 384,
'2012-06-13': 389,
'2012-06-14': 389}]
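To make the reshaping step concrete, here is roughly what the column-wise to row-wise conversion does, sketched in plain Python. columns_to_records is a hypothetical helper written for illustration; it is not how pandas implements to_dict.

```python
def columns_to_records(columns):
    """Turn {column: [values, ...]} into a list of {column: value} rows."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    # Every column must have the same number of rows.
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*...) walks the columns in parallel, yielding one row at a time.
    rows = zip(*(columns[name] for name in names))
    return [dict(zip(names, row)) for row in rows]


# A shortened version of the dummy data above.
d = {0: [388] * 2, 1: [387] * 2, "2012-06-13": [389] * 2}
records = columns_to_records(d)
print(records)
```

Each element of the resulting list is one row, which is the shape db.from_sequence expects.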
Now I use the list of dictionaries to create a dask bag
dask_bag = db.from_sequence(d, npartitions=2)
print(dask_bag)
dask.bag<from_se..., npartitions=2>
Convert dask bag to dask dataframe
df = dask_bag.to_dataframe()
Rename columns in dask dataframe
cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)
print(df)
Dask DataFrame Structure:
Datetime col1 col2 col3 col5 2012-06-13 2012-06-14
npartitions=2
int64 int64 int64 int64 int64 int64 int64
... ... ... ... ... ... ...
... ... ... ... ... ... ...
Dask Name: rename, 6 tasks
Compute the dask dataframe (the output will not be () this time!)
print(df.compute())
Datetime col1 col2 col3 col5 2012-06-13 2012-06-14
0 388 387 386 385 384 389 389
0 388 387 386 385 384 389 389
Notes:
Per the .rename documentation, inplace is not supported.
The renaming dictionary in the question uses the strings '0', '1', etc. for column names that were integers. It could be the case for your data (as is the case with the dummy data here) that the dictionary keys should just have been the integers 0, 1, etc.
Following the dask docs, I used this approach based on a 1-1 renaming dictionary; column names not included in the renaming dict will be left unchanged.
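The string-versus-integer point is easy to check before renaming. A small sketch with pandas on made-up data, where unmatched is just an illustrative variable name:

```python
import pandas as pd

# Hypothetical frame whose column names are the integers 0 and 1.
df = pd.DataFrame({0: [388], 1: [387]})

# String keys do not match integer column names.
cols = {"0": "Datetime", "1": "col1"}

# List the mapping keys that match no column; these would be ignored.
existing = set(df.columns)
unmatched = [key for key in cols if key not in existing]
print(unmatched)

# With string keys nothing is renamed; with integer keys it works.
unchanged = df.rename(columns=cols)
renamed = df.rename(columns={0: "Datetime", 1: "col1"})
print(list(unchanged.columns), list(renamed.columns))
```

A non-empty unmatched list is a sign that the mapping keys and the real column names differ in type or spelling.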