I'm using Dask to read a 10M+ row CSV and perform some calculations. So far it's proving to be 10x faster than Pandas.
I have a piece of code, below, that works fine with pandas but throws a TypeError with Dask, and I am unsure how to overcome it. It seems like an array is being handed back to the DataFrame column by the select function when using Dask, but not when using pandas. I don't want to switch the whole thing back to pandas and lose the 10x performance benefit.
This question is the result of some help from others on Stack Overflow; however, I think it has deviated far enough from the initial question to be altogether different. Code below.
PANDAS: works. Time taken, excluding AndHeathSolRadFact: 40 seconds
import pandas as pd
import numpy as np
from timeit import default_timer as timer
start = timer()
df = pd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
# note the year-day-month format used in the source data
df['DateTime'] = pd.to_datetime(df['Date'], format='%Y-%d-%m %H:%M')
df['Month'] = df['DateTime'].dt.month
df['Grass_FMC'] = (97.7+4.06*df['RH'])/(df['Temperature']+6)-0.00854*df['RH']+3000/df['Curing']-30
df["AndHeathSolRadFact"] = np.select(
[
(df['Month'].between(8,12)),
(df['Month'].between(1,2) & df['CloudCover']>30)
], #list of conditions
[1, 1], #list of results
default=0) #default if no match
print(df.head())
#print(df.tail())
end = timer()
print(end - start)
DASK: broken. Time taken, excluding AndHeathSolRadFact: 4 seconds
# Dask DataFrames implement the pandas API
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M')
ddf['Month'] = ddf['DateTime'].dt.month
ddf['Grass_FMC'] = (97.7+4.06*ddf['RH'])/(ddf['Temperature']+6)-0.00854*ddf['RH']+3000/ddf['Curing']-30
ddf["AndHeathSolRadFact"] = np.select(
[
(ddf['Month'].between(8,12)),
(ddf['Month'].between(1,2) & ddf['CloudCover']>30)
], #list of conditions
[1, 1], #list of results
default=0) #default if no match
print(ddf.head())
#print(ddf.tail())
end = timer()
print(end - start)
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-86c08f38bce6> in <module>
29 ], #list of conditions
30 [1, 1], #list of results
---> 31 default=0) #default if no match
32
33
~\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
3276 df = self.assign(**{k: value for k in key})
3277 else:
-> 3278 df = self.assign(**{key: value})
3279
3280 self.dask = df.dask
~\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
3510 raise TypeError(
3511 "Column assignment doesn't support type "
-> 3512 "{0}".format(typename(type(v)))
3513 )
3514 if callable(v):
TypeError: Column assignment doesn't support type numpy.ndarray
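The last line of the traceback points at the cause: np.select coerces each Dask series to a concrete NumPy array and returns a plain ndarray, which Dask's column assignment does not accept. A minimal demonstration of the return type (my own illustration, not from the question):

import numpy as np

# np.select always returns a plain ndarray, regardless of input type
out = np.select([np.array([True, False])], [1], default=0)
print(type(out))  # <class 'numpy.ndarray'>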
Sample Weathergrids CSV
Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
I just had a similar problem, and I was able to get it to work by converting the ndarray into a Dask array. I also had to make sure the chunks of that Dask array matched the partitions of the Dask DataFrame.
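A minimal sketch of that approach, assuming the ddf from the question (the np.select call, da.from_array, and the partition-size computation are my reconstruction, not code from the original answer):

import numpy as np
import dask.array as da

# np.select coerces the Dask series to concrete arrays and returns a
# plain ndarray, so we wrap it in a Dask array before assigning it
result = np.select(
    [
        ddf['Month'].between(8, 12),
        (ddf['Month'].between(1, 2)) & (ddf['CloudCover'] > 30)
    ],
    [1, 1],
    default=0)

# one chunk per DataFrame partition, each sized to match that partition
partition_sizes = tuple(ddf.map_partitions(len).compute())
ddf['AndHeathSolRadFact'] = da.from_array(result, chunks=(partition_sizes,))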
Assigning a pandas Series to a Dask DataFrame column works:
dask_df['col'] = pd.Series(values)  # values: a list or NumPy array
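Applied to the question, a sketch reusing the result ndarray from the np.select call above; note this relies on the Series index lining up with the Dask DataFrame's index:

ddf['AndHeathSolRadFact'] = pd.Series(result)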
For some reason that's not entirely clear to me yet, the above-mentioned solutions didn't work for me. I ended up defining a function that does the column assignment on each underlying pandas DataFrame and then mapping that function over all of my Dask partitions.
def map_randoms(df):
    # inside map_partitions, `df` is a plain pandas DataFrame, so
    # ordinary NumPy-based column assignment works
    df['col_rand'] = np.random.randint(0, 2, size=len(df))
    return df

ddf = ddf.map_partitions(map_randoms)
ddf.persist()
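Applied to the question's column, the same pattern looks like this (a sketch; the function name is mine):

def add_and_heath_sol_rad_fact(df):
    # `df` is a pandas DataFrame here, so np.select behaves exactly as
    # in the pandas version and plain column assignment is allowed
    df['AndHeathSolRadFact'] = np.select(
        [
            df['Month'].between(8, 12),
            (df['Month'].between(1, 2)) & (df['CloudCover'] > 30)
        ],
        [1, 1],
        default=0)
    return df

ddf = ddf.map_partitions(add_and_heath_sol_rad_fact)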