I am having trouble with some dates from zipped xlsx files. These files are loaded into a sqlite database and then exported as .csv. Each file is about 40,000 rows per day. The issue I run into is that pd.to_datetime does not seem to work on these objects (I think the dates coming from Excel format are causing the issue; pure .csv files work fine with this command). This is fine actually, since I do not need them to be in datetime format.
What I am trying to achieve is creating a column called ShortDate which is %m/%d/%Y. How can I do this on a datetime object (format is mm/dd/yyyy hh:mm:ss from Excel)? I will then create a new column called RosterID which combines the EmployeeID field and the ShortDate field together into a unique ID.
I am very new to pandas and I am currently only using it to process .csv files (rename and select certain columns, create unique IDs to use in filters in Tableau, etc).
import pandas as pd
rep = pd.read_csv(r'C:\Users\Desktop\test.csv.gz', dtype = 'str', compression = 'gzip', usecols = ['etc','etc2'])
print('Read successfully.')
rep['Total']=1
rep['UniqueID']= rep['EmployeeID'] + rep['InteractionID']
rep['ShortDate'] = ??? #what do I do here to get what I am looking for?
rep['RosterID']= rep['EmployeeID'] + rep['ShortDate'] # this is my goal
print('Modified successfully.')
Here is some of the raw data from the .csv. Column names would be
InteractionID, Created Date, EmployeeID, Repeat Date
07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03
02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29
44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19
Note that if the column is not already a pandas datetime, you first need to convert it using pd.to_datetime() before you can use the dt.date attribute. To remove the time from a pandas datetime, use the .dt.date attribute on a Series, or call the .date() function on a single Timestamp.
The pandas library provides a datetime type with nanosecond precision called Timestamp to work with date and time values. Timestamp is built on NumPy's datetime64 data type, so a column of timestamps is stored as a compact array and supports vectorized operations, which is generally much faster than working with individual Python datetime objects (which have microsecond resolution).
pandas.Series.dt.year returns the year of the date time. pandas.Series.dt.month returns the month of the date time. pandas.Series.dt.day returns the day of the date time. pandas.Series.dt.hour returns the hour of the date time.
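As a small illustration of the accessors listed above, here is a sketch that uses the Repeat Date values from the question. The explicit format string is an assumption based on the mm/dd/yyyy hh:mm:ss layout described there, and the names repeat and parts are only for illustration.

```python
import pandas as pd

# Sample values copied from the "Repeat Date" column in the question.
repeat = pd.to_datetime(
    pd.Series(["04/01/2014 14:50:03", "04/04/2014 17:47:29", "04/07/2014 22:06:19"]),
    format="%m/%d/%Y %H:%M:%S",  # assumed layout: mm/dd/yyyy hh:mm:ss
)

# Each .dt attribute returns a Series of integers, one per row.
parts = pd.DataFrame({
    "year": repeat.dt.year,
    "month": repeat.dt.month,
    "day": repeat.dt.day,
    "hour": repeat.dt.hour,
})
print(parts)
```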
How can I write the code to remove the time from the datetime? Assuming all your datetime strings are in a similar format, just convert them to datetime using to_datetime and then use the dt.date attribute to get just the date portion.
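A minimal sketch of that conversion, using the Created Date values from the question. Passing an explicit format is my own assumption about the data; it can make parsing stricter and more predictable than letting pandas guess. The column names just_date and midnight are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Created Date": ["04/01/2014 14:05:10", "04/01/2014 13:44:05", "04/01/2014 17:28:36"],
})

# Parse the strings once; the format is assumed to be mm/dd/yyyy hh:mm:ss.
created = pd.to_datetime(df["Created Date"], format="%m/%d/%Y %H:%M:%S")

# .dt.date gives plain datetime.date objects (the column becomes object dtype).
df["just_date"] = created.dt.date

# .dt.normalize() keeps the datetime64 dtype but resets the time to midnight.
df["midnight"] = created.dt.normalize()
print(df)
```

Which of the two to use depends on what happens next: dt.date is convenient for display, while dt.normalize keeps the column usable with other datetime operations.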
Create a new column, then just apply simple datetime functions using lambda and apply.
In [14]: df['Short Date']= pd.to_datetime(df['Created Date'])
In [15]: df
Out[15]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06
Short Date
0 2014-04-01 14:05:00
1 2014-04-01 13:44:00
2 2014-04-01 17:28:00
In [16]: df['Short Date'] = df['Short Date'].apply(lambda x:x.date().strftime('%m%d%y'))
In [17]: df
Out[17]:
InteractionID Created Date EmployeeID Repeat Date Short Date
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Then just concatenate the two columns. Convert the Short Date column to strings to avoid errors on concatenation of strings and integers.
In [32]: df['Roster ID'] = df['EmployeeID'] + df['Short Date'].map(str)
In [33]: df
Out[33]:
InteractionID Created Date EmployeeID Repeat Date Short Date \
0 7927 4/1/2014 14:05 912a 4/1/2014 14:50 040114
1 2158 4/1/2014 13:44 172r 4/4/2014 17:47 040114
2 44279 4/1/2014 17:28 217y 4/7/2014 22:06 040114
Roster ID
0 912a040114
1 172r040114
2 217y040114
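Note that the '%m%d%y' format used above gives values like 040114, while the question asked for %m/%d/%Y. Here is a sketch of the same apply-based approach with only the format string changed; the column names ShortDate and RosterID follow the question, and the explicit parsing format is an assumption about the data.

```python
import pandas as pd

df = pd.DataFrame({
    "EmployeeID": ["912a", "172r", "217y"],
    "Created Date": ["04/01/2014 14:05:10", "04/01/2014 13:44:05", "04/01/2014 17:28:36"],
})

# Parse first (assumed layout: mm/dd/yyyy hh:mm:ss), then format each Timestamp.
created = pd.to_datetime(df["Created Date"], format="%m/%d/%Y %H:%M:%S")
df["ShortDate"] = created.apply(lambda ts: ts.strftime("%m/%d/%Y"))

# Both columns are strings now, so + concatenates them row by row.
df["RosterID"] = df["EmployeeID"] + df["ShortDate"]
print(df[["EmployeeID", "ShortDate", "RosterID"]])
```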
You can apply a post-processing step that first converts the string to a datetime and then applies a lambda to keep just the date portion:
In [29]:
df['Created Date'] = pd.to_datetime(df['Created Date']).apply(lambda x: x.date())
df['Repeat Date'] = pd.to_datetime(df['Repeat Date']).apply(lambda x: x.date())
df
Out[29]:
InteractionID Created Date EmployeeID Repeat Date
0 7927 2014-04-01 912a 2014-04-01
1 2158 2014-04-01 172r 2014-04-04
2 44279 2014-04-01 217y 2014-04-07
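If pd.to_datetime fails on some of the Excel-sourced values, as the question describes, one way to find the offending rows might be errors='coerce', which turns unparseable values into NaT instead of raising. This is only a diagnostic sketch: the malformed value below is made up for illustration, and I am assuming the rest of the data follows mm/dd/yyyy hh:mm:ss.

```python
import pandas as pd

# Hypothetical sample: the second value is deliberately malformed.
raw = pd.Series(["04/01/2014 14:05:10", "not a date", "04/07/2014 22:06:19"])

# errors="coerce" replaces anything that does not match the format with NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")

# Select the original strings that could not be parsed, for inspection.
bad_rows = raw[parsed.isna()]
print(bad_rows)
```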
EDIT
After looking at this again, you can access just the date component using dt.date if your version of pandas is 0.15.0 or later:
In [18]:
df['just_date'] = df['Repeat Date'].dt.date
df
Out[18]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date
0 2014-04-01
1 2014-04-04
2 2014-04-07
Additionally, you can now use dt.strftime rather than apply to achieve the result you want:
In [28]:
df['short_date'] = df['Repeat Date'].dt.strftime('%m%d%Y')
df
Out[28]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date
0 2014-04-01 04012014
1 2014-04-04 04042014
2 2014-04-07 04072014
So generating the Roster IDs is now a trivial exercise of concatenating the EmployeeID column with the new short_date column:
In [30]:
df['Roster ID'] = df['EmployeeID'] + df['short_date']
df
Out[30]:
InteractionID Created Date EmployeeID Repeat Date \
0 7927 2014-04-01 14:05:10 912a 2014-04-01 14:50:03
1 2158 2014-04-01 13:44:05 172r 2014-04-04 17:47:29
2 44279 2014-04-01 17:28:36 217y 2014-04-07 22:06:19
just_date short_date Roster ID
0 2014-04-01 04012014 912a04012014
1 2014-04-04 04042014 172r04042014
2 2014-04-07 04072014 217y04072014
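Since the question says the columns are read with dtype='str' and a real datetime is not needed, a possible alternative is to skip parsing altogether and take the date part of the string. This is a sketch that assumes every value looks like mm/dd/yyyy hh:mm:ss with a single space before the time, and that ShortDate should come from Created Date (the question does not say which date column to use).

```python
import pandas as pd

# Same shape as the question's data once it has been read in as strings.
rep = pd.DataFrame({
    "EmployeeID": ["912a", "172r", "217y"],
    "Created Date": ["04/01/2014 14:05:10", "04/01/2014 13:44:05", "04/01/2014 17:28:36"],
})

# Keep everything before the first space, i.e. the mm/dd/yyyy part.
rep["ShortDate"] = rep["Created Date"].str.split(" ").str[0]
rep["RosterID"] = rep["EmployeeID"] + rep["ShortDate"]
print(rep)
```

The trade-off is that nothing validates the dates here, so a malformed value would pass straight through into RosterID.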