I have 34 million row and only have a column. I want to split string into 4 column.
Here is my sample dataset (df):
Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648:
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a4 rd:1 tc:0 aa:0 qr:0 ra:0 QUERY 'no error'
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: question: tracking.intl.miui.com:A:IN
3 Apr 4 20:30:33 dns user: query from 9.5.10.243: #4746190 tracking.intl.miui.com. A
I want to split it into four column using this code:
df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
df1.head()
Here is the result that i expected
Month Date Time Log
0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- go...
1 Apr 4 20:30:33 100.51.100.254 dns,packet user: id:78a...
2 Apr 4 20:30:33 100.51.100.254 dns,packet user: questi...
3 Apr 4 20:30:33 dns transjakarta: query from 9.5.10.243: #474...
4 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- se...
but the respond is like this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-c9b2023fbf3e> in <module>
----> 1 df1 = df['Log'].str.split(n=3, expand=True)
2 df1.columns=['Month','Date','Time','Log']
3 df1.head()
TypeError: split() got an unexpected keyword argument 'expand'
Is there any solution to split the string using dask?
Dask dataframe does support the str.split method's expand= keyword if you provide an n=
keyword as well to tell it how many splits to expect.
It looks like dask dataframes's str.split
method doesn't implement the expand= keyword. You might raise an issue if one does not already exist.
As a short term workaround, you could make a Pandas function, and then use the map_partitions method to scale that across your dask dataframe
def f(df: pandas.DataFrame) -> pandas.DataFrame:
""" This is your code from above, as a function """
df1 = df['Log'].str.split(n=3, expand=True)
df1.columns=['Month','Date','Time','Log']
return df
ddf = ddf.map_partitions(f) # apply to all pandas dataframes within dask dataframe
Because Dask dataframes are just collections of Pandas dataframes it's relatively easy to build things yourself when Dask dataframe doesn't support them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With