 

Creating new columns in a data set based on values of a column using regex

This is my data frame:

index     duration 
1           7 year   
2           2day
3           4 week
4           8 month

I need to separate the numbers from the time units and put them in two new columns. The desired output looks like this:

index     duration         number     time
1           7 year          7         year
2           2day            2         day
3           4 week          4        week
4           8 month         8         month

This is my code:

df ['numer'] = df.duration.replace(r'\d.*' , r'\d', regex=True, inplace = True)
df [ 'time']= df.duration.replace (r'\.w.+',r'\w.+', regex=True, inplace = True )

But it does not work. Any suggestions?

I also need to create another column based on the values of the time column, so that the new data set looks like this:

index     duration         number     time      time_day
1           7 year          7         year       365
2           2day            2         day         1
3           4 week          4         week        7
4           8 month         8         month       30

df['time_day']= df.time.replace(r'(year|month|week|day)', r'(365|30|7|1)', regex=True, inplace=True)

Any suggestions?

Mary asked Jun 28 '17 13:06



1 Answer

We can use Series.str.extract here:

In [67]: df[['number','time']] = df.duration.str.extract(r'(\d+)\s*(.*)', expand=True)

In [68]: df
Out[68]:
   index duration number    time
0      1   7 year      7    year
1      2     2day      2     day
2      3   4 week      4    week
3      4  8 month      8   month

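RegEx explained: (\d+) captures the leading digits, \s* skips any whitespace between the number and the unit, and (.*) captures the rest of the string as the unit. regex101.com is IMO one of the best online RegEx parsers, testers and explainers.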
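To see what the two capture groups pick up without involving pandas, here is a minimal sketch using Python's standard `re` module on the sample values from the question. The names `PATTERN`, `samples` and `parsed` are only illustrative; `re.search` is used because `str.extract` also takes the first match in each string.

```python
import re

# Same pattern as in the str.extract call above:
# group 1 = leading digits, group 2 = everything after optional whitespace.
PATTERN = re.compile(r"(\d+)\s*(.*)")

# Sample values copied from the question's "duration" column.
samples = ["7 year", "2day", "4 week", "8 month"]

# Collect (number, unit) tuples for each sample string.
parsed = [PATTERN.search(s).groups() for s in samples]

for original, (number, unit) in zip(samples, parsed):
    print(f"{original!r}: number={number!r}, time={unit!r}")
```

Note that both groups are returned as strings, which is why the number column needs a separate conversion step.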

You may also want to convert the number column to an integer dtype:

In [69]: df['number'] = df['number'].astype(int)

In [70]: df.dtypes
Out[70]:
index        int64
duration    object
number       int32
time        object
dtype: object
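One caveat worth sketching: if some rows do not match the pattern, `str.extract` leaves NaN in those rows, and a plain `astype(int)` would then raise. The example below is an assumption-laden sketch, not part of the original answer: the `"unknown"` row is made up for illustration, and it relies on `pd.to_numeric` with `errors="coerce"` plus the nullable `"Int64"` dtype, which needs a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical data: the last row has no leading number, so it will not match.
df = pd.DataFrame({"duration": ["7 year", "2day", "unknown"]})

# Non-matching rows come back as NaN in both extracted columns.
extracted = df["duration"].str.extract(r"(\d+)\s*(.*)", expand=True)

# Convert digits to numbers, keep missing values as <NA> instead of raising.
numbers = pd.to_numeric(extracted[0], errors="coerce").astype("Int64")

print(numbers)
```

If every row is guaranteed to match, the simpler `astype(int)` shown above is enough.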

UPDATE: to build the time_day column, replace each unit with its length in days:

In [167]: df['time_day'] = df['time'].replace(['year','month','week','day'], [365, 30, 7, 1], regex=True)

In [168]: df
Out[168]:
   index duration number    time  time_day
0      1   7 year      7    year       365
1      2     2day      2     day         1
2      3   4 week      4    week         7
3      4  8 month      8   month        30
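As an alternative to `replace` with `regex=True`, a plain dictionary lookup with `Series.map` may be easier to read when the units are exact strings. This is a self-contained sketch under that assumption: `days_per_unit` and the `total_days` column are hypothetical names added for illustration, and the day counts (365, 30, 7, 1) are the approximations used in the question.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "index": [1, 2, 3, 4],
    "duration": ["7 year", "2day", "4 week", "8 month"],
})

# Split the number from the unit, as in the answer above.
df[["number", "time"]] = df["duration"].str.extract(r"(\d+)\s*(.*)", expand=True)
df["number"] = df["number"].astype(int)

# Hypothetical lookup table: approximate length of each unit in days.
days_per_unit = {"year": 365, "month": 30, "week": 7, "day": 1}

# strip() guards against stray trailing spaces; unknown units would map to NaN.
df["time_day"] = df["time"].str.strip().map(days_per_unit)

# Illustrative extra column: total duration expressed in days.
df["total_days"] = df["number"] * df["time_day"]

print(df)
```

Unlike the regex-based `replace`, `map` only matches whole values, so a unit that is missing from the dictionary shows up as NaN rather than being left unchanged.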
MaxU - stop WAR against UA answered Nov 14 '22 21:11
