I have a data set as below:
date_time srch_co srch_ci
0 2014-11-03 16:02:28 2014-12-19 2014-12-15
1 2013-03-13 19:25:01 2013-03-14 2013-03-13
2 2014-10-13 13:20:25 2015-04-10 2015-04-03
3 2013-11-05 10:40:34 2013-11-08 2013-11-07
4 2014-06-10 13:34:56 2014-08-08 2014-08-03
5 2014-12-16 14:34:39 2014-12-17 2014-12-16
And this is the information of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
date_time 100000 non-null datetime64[ns]
srch_co 99878 non-null object
srch_ci 99878 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.3+ MB
What I would like to do is create 2 new columns using the following function:
def duration(row):
delta = (row['srch_co'] - row['srch_ci'])/np.timedelta64(1, 'D')
if delta <= 0:
return np.nan
else:
return delta
sample['duration'] = sample.apply(duration, axis=1)
def days_in_advance(row):
delta = (row['srch_ci'] - row['date_time'])/np.timedelta64(1, 'D')
if delta < 0:
return np.nan
else:
return delta
sample['days_in_advance'] = sample.apply(days_in_advance, axis=1)
However, it seems like the date calculation I want to run constantly hits errors. I've searched and found several solutions and tried, but either they would create error or turn the date into inaccurate value.
The methods I've tried to use are such as:
#1)
def to_integer(dt_time):
return 10000*dt_time.year + 100*dt_time.month + dt_time.day
#2)
datetime.strptime(str(row[2]), '%Y%m%d%H%M%S')
#3)
pd.to_numeric(sample['date_time'], errors='coerce')
#4)
sample['srch_ci_int'] = sample['srch_ci'].astype(str).astype(int)
I just want to create new column that would calculate the difference of each columns:
sample["duration"] = sample["srch_co"] - sample["srch_ci"]
sample["days_in_advance"] = sample["srch_co"] - sample["date_time"]
Any tips appreciated.
You need convert columns srch_co and srch_ci to_datetime first and then use mask for replace values less as 0 to NaN (default value of mask function):
sample["srch_co"] = pd.to_datetime(sample["srch_co"])
sample["srch_ci"] = pd.to_datetime(sample["srch_ci"])
sample["duration"] = (sample["srch_co"] - sample["srch_ci"])/np.timedelta64(1, 'D')
sample["days_in_advance"] = (sample["srch_co"] - sample["date_time"])/np.timedelta64(1, 'D')
cols = ['duration','days_in_advance']
sample[cols] = sample[cols].mask(sample[cols] < 0)
#first value of srch_ci column was changed for NaN output
print (sample)
date_time srch_co srch_ci duration days_in_advance
0 2014-11-03 16:02:28 2014-12-19 2015-12-15 NaN 45.331620
1 2013-03-13 19:25:01 2013-03-14 2013-03-13 1.0 0.190961
2 2014-10-13 13:20:25 2015-04-10 2015-04-03 7.0 178.444155
3 2013-11-05 10:40:34 2013-11-08 2013-11-07 1.0 2.555162
4 2014-06-10 13:34:56 2014-08-08 2014-08-03 5.0 58.434074
5 2014-12-16 14:34:39 2014-12-17 2014-12-16 1.0 0.392604
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With