I have a large dataset (4GB) like this:
userID date timeofday seq
0 1000014754 20211028 20 133669542676:1:148;133658378700:1:16;133650937891:1:85
1 1000019906 20211028 6 508420199:0:0;133669581685:1:19
2 1000019906 20211028 22 133665269544:0:0
From this, I would like to split "seq" by ";" first and create a new dataset with renames. It looks like this:
userID date timeofday seq1 seq2 seq3 ... seqN
0 1000014754 20211028 20 133669542676:1:148 133658378700:1:16 133650937891:1:85
1 1000019906 20211028 6 508420199:0:0 133669581685:1:19 None None
2 1000019906 20211028 22 133665269544:0:0 None None None
Then I want to split the seq1,seq2,...,seqN by ":", and create a new dataset with renames. It looks like this:
userID date timeofday name1 click1 time1 name2 click2 time2 ....nameN clickN timeN
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85 None None None
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 None None None None None None
2 1000019906 20211028 22 133665269544 0 0 None None None None None None None None None
I know pandas.split can split the columns, but I don't know how to split it effficiently. Thank you!
A clean solution is to use a regex and extractall, then reshape using unstack, rename the columns and join to the original dataframe.
Assuming df the dataframe name
df2 = (df['seq'].str.extractall(r'(?P<name>[^:]+):(?P<click>[^:]+):(?P<time>[^;]+);?')
.unstack('match')
.sort_index(level=1, axis=1, sort_remaining=False)
)
df2.columns = df2.columns.map(lambda x: f'{x[0]}{x[1]+1}')
df2 = df.drop(columns='seq').join(df2)
output:
userID date timeofday name1 click1 time1 name2 click2 time2 name3 click3 time3
0 1000014754 20211028 20 133669542676 1 148 133658378700 1 16 133650937891 1 85
1 1000019906 20211028 6 508420199 0 0 133669581685 1 19 NaN NaN NaN
2 1000019906 20211028 22 133665269544 0 0 NaN NaN NaN NaN NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With