I have a pandas dataframe which looks like this :
Un_ID P_ID segment
0 Q8TDU6 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108)
1 P63092 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108)
2 Q8TDU6 7cfm 1( 22- 41), 2( 51- 72), 3( 86- 108)
I want to split the third column'segment' into three columns i.e TM,starting,ending
Un_ID P_ID segment TM starting ending
0 Q8TDU6 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM1 16 41
1 P63092 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM1 16 41
2 Q8TDU6 7cfm 1( 22- 41), 2( 51- 72), 3( 86- 108) TM1 22 41
0 Q8TDU6 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM2 51 73
1 P63092 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM2 51 73
2 Q8TDU6 7cfm 1( 22- 41), 2( 51- 72), 3( 86- 108) TM2 51 72
0 Q8TDU6 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM3 86 108
1 P63092 7bw0 1( 16- 41), 2( 51- 73), 3( 86- 108) TM3 86 108
2 Q8TDU6 7cfm 1( 22- 41), 2( 51- 72), 3( 86- 108) TM3 86 108
I tried following code
df[['TM','starting','ending']] = df.segment.apply(lambda x: pd.Series(str(x).split(","))
But,I am not sure how to manipulate the above code to get the dataframe as I want..
Try:
import re
r = re.compile(r"(\d+)\(\s*(\d+)-\s*(\d+)\)")
df["segment"] = df["segment"].apply(lambda x: r.findall(x))
df = df.explode("segment")
df[["TM", "starting", "ending"]] = df.pop("segment").apply(pd.Series)
df = df.sort_values(by="TM")
df["TM"] = "TM" + df["TM"].astype(str)
print(df)
Prints:
Un_ID P_ID TM starting ending
0 Q8TDU6 7bw0 TM1 16 41
1 P63092 7bw0 TM1 16 41
2 Q8TDU6 7cfm TM1 22 41
0 Q8TDU6 7bw0 TM2 51 73
1 P63092 7bw0 TM2 51 73
2 Q8TDU6 7cfm TM2 51 72
0 Q8TDU6 7bw0 TM3 86 108
1 P63092 7bw0 TM3 86 108
2 Q8TDU6 7cfm TM3 86 108
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With