Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dealing with Messy Dates

Tags:

r

I hope you didn't think I was asking for relationship advice.

Infrequently, I have to offer survey respondents the ability to specify when an event occurred. What results is a horribly messy string that I honestly just don't know what to do with. Beyond recoding by hand.

Here it is a short sample, of thousands:

 c("May2/ 12 noon", "9:45 am", "11:00 AM AST", "April 27 / 12:00 AST",  "11:40 AM AST", "April 25 2011", "April 12th 2011 / 8:44", "April 12 2011 / 8:36am",  "April 12 2011 / 8:30am", "April 12th 2011 / 8:18", "April 12 2011 / 8:12am",  "April 11th 2011 / 5:57pm", "April 11th 2011 / 5:49pm", "April 11th 2011 / 5:42pm",  "April 11th 2011 / 5:36pm", "April 11th 2011 / 5:27", "April 5 @ 11:26am",  "8:50", "April 4th 12:45pm", "April 4th around 10am", "April 4th around 10am",  "Mar 18, 2011 9:33am", "Mar 18, 2011 9:27am", "df", "fg", "12:16",  "9:50", "Feb 8, 2011 / 12:20pm", "8:34 am  2/4/11", "Jan 31, 2011 2:50pm",  "Jan 31, 2011 2:45pm", "Jan 31, 2011 2:38pm", "Jan 31, 2011 2:26pm",  "11h09", "11:00 am", "1h02 pm", "10h03", "2h10", "Jan 13, 2011 9:50am Van",  "Jan 12, 2011", "Jan 12, 2011 3:59pm", "Jan 12     14:19PM",  "Jan 12, 2011 1:35pm", "Jan 12,2011 1:28pm", "1h36", "9h15",  "9h09", "8h51", "8h45", "8h35", "1h12 pm", "12h59", "11h52 am",  "10h45", "15h55", "Dec 31, 10 11:11am", "Dec 31,10 10:15am",  "Dec 30, 2010 12:32pm", "Dec 30, 2010 12:18pm", "9:16 am", "11h16 am",  "11h12", "9h29 am", "11h38", "Dec 16, 2010", "December 16, 2010",  "December 16, 2010", "Dec 15,2010", "DEC 14 2010", "Dec 14 11:38",  "Dec 14 11:35", "Dec 14 11:25", "December 13, 2010", "Dec 10, 1:38 pm",  "Dec 10, 1:26 pm", "Dec 10, 1:20 pm", "Dec 10, 1:12 pm", "December 9 2010",  "11h10 am", "10h59 am", "10:50 am", "Tues Dec 7th, 9:45 Van time",  "Dec 3, 2010 12:30pm", "Dec 3, 2010 12:20pm", "Dec 3, 2010 12:10 pm",  "November 30, 2010 4.02pm", "November 30, 2010", "november 29 120pm",  "November 29 2010 11:27", "10:12am November 29, 2010", "Nov 26/10 1:18pm",  "10:56 am", "Nov 24", "nov 24/ 4:20 PM AST", "Nov 24/4:00 PM AST",  "NOVEMBER 24/10  2:10 pm", "November 24/10  11:00 a.m.", "12:05 MST",  "3.55PM", "Nov. 17/10 12:45 pm", "Nov. 16/10  12:00 noon", "Nov. 16/10 11;50 a.m.",  "nov 16/10  11:30 a.m.", "November 12, 2010 @ 12:23pm", "november 11 2010  2:20pm",  "November 11 2010  2:15pm", "November 11 2:00pm", "Nov. 10/10:22am",  "nov. 8/10...3:19 pm", "Nov 8/10  1;50 p.m.", "November 8/10...12 noon",  "November 8/10..10: am", "Nov 5, 2010  1:10 pm", "11:32 am CST",  "Nov 4  11:10", "nov 3 10am", "9:30 am", "11/02/2010 1:50PM",  "Oct 29/10 2:50PM", "Oct 28 @ 11:20am", "27Oct10 10:40am", "10/26/2010 11:18",  "Oct 26/10 11am", "Oct 26/10 10:30 am", "Oct 26 10:50", "10/25/2010 13:50",  "10/22/2010  10:15", "Oct 22/10 10AM", "Oct 21, 2010 3:00 pm",  "Oct 21, 2010 2:59", "10/21/2010 11:50", "10/21/2010 11:45",  "10/21/2010 11:40", "10/21/2010 11:30", "11:30", "Oct 20 approx 1pm",  "Oct 20/10 4:50PM", "13:48", "13:45", "Oct 20, 2010 11:45 am",  "October 19th 3:05pm", "Oct 18,2010 2:15pm", "Oct 18/10 3:10PM",  "10:30 am", "Oct 15/10 11:50am", "oct 14 @ 11:05am", "Oct 14/ 11:06",  "4:40 oct 13 atlantic", "oct 13 4:05 pm atlantic", "oct 13 1:45 atlantic time",  "Oct 13 / 10:37", "OCT 12 3:33", "Oct 12,2010 1:10pm", "Oct 12 / 11:45",  "Oct 12 / 9:45", "Oct 8. 2010/ 2:00", "Oct 8/10- 1145am", "2 Sept 2010 3.52pm",  "2 Sept 2010 10.21am", "1 Sept 2010 2.05pm", "1 Sept 2010", "31 Aug 2010 - 11.52am",  "31 aug 10:40am", "31 aug 2010 - 10am") 

Generally, these events occur near to the date which the respondent fills out the survey, but not always. The survey date is recorded automatically and in a consistent format and is easily to translate into POSIX using as.Date so, elements that only contain the time can be ignored and merged with the date that they filled out the survey.

Your thoughts are much appreciated.

Note1: Some of you may say, you should have done X, Y, or Z in terms of validating the responses. To you, I say - hell yes - next time. I didn't design it! I just have to deal with it.

A few facts that can assist in a workaround:

  • The times will always be business day hours, 9am-6pm (hence am/pm doesn't matter)
  • The years don't matter as I can pull them from another field (it will always only ever be 2011/2010, which is thankfully outside of the possible timeframe in any notation)
  • I don't care about timezones, as I have their geographic location

What I've done so far:

mos <- strsplit(' jan feb mar apr may jun jul aug sep oct nov dec january february march april may june july august september october november december ', '\n')[[1]][-1]  days <- strsplit(' mon tue wed thu fri sat sun monday tuesday wednesday thursday friday saturday sunday ', '\n')[[1]][-1] ## Messy Date Wrangling x <- ## that hot ghetto mess above # minimize x <- tolower(x) # remove unnecessary crap x <- sub("2011"," ",x) x <- sub("2010"," ",x) x <- sub("am"," ",x) x <- sub("pm"," ",x) x <- sub("[p][.][m]"," ",x) x <- sub("[a][.][m]"," ",x) x <- sub("[.]{3}"," ",x) x <- str_trim(x, side="both") # divide x <- strsplit(x,c(" ")) # conquer?  lapply(x, function(x) pmatch(x,mos)) lapply(x, function(x) pmatch(x,days)) 
like image 492
Brandon Bertelsen Avatar asked May 31 '11 22:05

Brandon Bertelsen


People also ask

What is the best way to deal with messy data?

The tidyverse has a collection of packages to deal with messy data (see dplyr and tidyr in particular) AND a philosophy that helps you in doing so. People use the phrase data cleaning to mean a wide range of things.

How do you deal with a messy partner?

“If you’re with someone who is chronically messy and derives creative juice from living in chaos, their lifestyle may be integral to their identity,” says Masini. “If this is the case, stop trying to change the other person and focus on yourself.

How does a messy spouse affect your day-to-day life?

It’s so easy to get wrapped up into how a messy spouse affects your day-to-day life without stopping to think about the potential reasons why he or she may not be living up to your expectations. For example, maybe she works night shifts and needs to spend the majority of her days off catching up on rest and social obligations.

How do you know if you’re in a messy relationship?

Think in terms of what you absolutely cannot tolerate and certain things that you can either live with or seek out help for (i.e. using a laundry service). “You two are sharing a space and the cycle will continue if you expect the ‘messy level’ of your home to be on your terms only,” says Derichs.


2 Answers

My sympathy that your date didn't turn out as pretty as expected. ;-)

I have constructed a (still partial) solution along the lines suggested by @Rguy.

(Please note that this code still has a bug: It does't always return the correct time. For some reason, it doesn't always do a greedy match on the digits before the colon, thus sometimes returning 1:00 when the time is 11:00.)

First, construct a helper function that wraps around gsub and grep. This function takes a character vector as one of its arguments and collapses this into a single string separated by |. The effect of this is to allow you to easily pass multiple patterns to be matched by a regex:

find.pattern <- function(x, pattern_list){   pattern <- paste(pattern_list, collapse="|")   ret <- gsub(paste("^.*(", pattern, ").*", sep=""), "\\1", x, ignore.case=TRUE)   ret[ret==x] <- NA    ret2 <- grepl(paste("^(", pattern, ")$", sep=""), x, ignore.case=TRUE)   ret[ret2] <- x[ret2]    ret } 

Next, use some built-in variable names to construct a vector of months and abbreviations:

all.month <- c(month.name, month.abb) 

Finally, construct a data frame with different extracts:

ret <- data.frame(     data = dat,      date1 = find.pattern(dat, "\\d+/\\d+/\\d+"),     date2 = find.pattern(dat,        paste(all.month, "\\s*\\d+[(th)|,]*\\s{0,3}[(2010)|(2011)]*", collapse="|", sep="")),     year = find.pattern(dat, c(2010, 2011)),     month = find.pattern(dat, month.abb), #Use base R variable called month.abb for month names     hour = find.pattern(dat, c("\\d+[\\.:h]\\d+", "12 noon")),     ampm = find.pattern(dat, c("am", "pm")) ) 

The results:

head(ret, 50)                       data  date1        date2 year month  hour ampm 20   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am 21   April 4th around 10am   <NA>   April 4th  <NA>   Apr  <NA>   am 22     Mar 18, 2011 9:33am   <NA> Mar 18, 2011 2011   Mar  9:33   am 23     Mar 18, 2011 9:27am   <NA> Mar 18, 2011 2011   Mar  9:27   am 24                      df   <NA>         <NA> <NA>  <NA>  <NA> <NA> 25                      fg   <NA>         <NA> <NA>  <NA>  <NA> <NA> 26                   12:16   <NA>         <NA> <NA>  <NA> 12:16 <NA> 27                    9:50   <NA>         <NA> <NA>  <NA>  9:50 <NA> 28   Feb 8, 2011 / 12:20pm   <NA>  Feb 8, 2011 2011   Feb  2:20   pm 29         8:34 am  2/4/11 2/4/11         <NA> <NA>  <NA>  8:34   am 30     Jan 31, 2011 2:50pm   <NA> Jan 31, 2011 2011   Jan  2:50   pm 31     Jan 31, 2011 2:45pm   <NA> Jan 31, 2011 2011   Jan  2:45   pm 32     Jan 31, 2011 2:38pm   <NA> Jan 31, 2011 2011   Jan  2:38   pm 33     Jan 31, 2011 2:26pm   <NA> Jan 31, 2011 2011   Jan  2:26   pm 34                   11h09   <NA>         <NA> <NA>  <NA> 11h09 <NA> 35                11:00 am   <NA>         <NA> <NA>  <NA>  1:00   am 36                 1h02 pm   <NA>         <NA> <NA>  <NA>  1h02   pm 37                   10h03   <NA>         <NA> <NA>  <NA> 10h03 <NA> 38                    2h10   <NA>         <NA> <NA>  <NA>  2h10 <NA> 39 Jan 13, 2011 9:50am Van   <NA> Jan 13, 2011 2011   Jan  9:50   am 40            Jan 12, 2011   <NA> Jan 12, 2011 2011   Jan  <NA> <NA> 
like image 116
Andrie Avatar answered Sep 28 '22 03:09

Andrie


This may be one of the few cases where another tool other than R is the best to use. I know that there are some modules for Perl that have already been developed to parse messy looking dates, on module DateTime::Format::Natural::Lang::EN can parse strings like: "1st tuesday last november". I seem to remember another module that could understand things like "the second tuesday after the first Monday in February".

There is also a tool at http://www.datasciencetoolkit.org/ that grabs what looks like dates in text and converts them to a standard format.

like image 30
Greg Snow Avatar answered Sep 28 '22 04:09

Greg Snow