
How to count number of spaces just after the date information?

Tags: r

I have unstructured data that look like this:

data <- c("24-March-2017      product 1              color 1",
"March-2017-24              product 2                 color 2",
"2017-24-March  product 3              color 3")

I would like to count the number of spaces between the date and the first character of the product column on each line. As the sample data shows, the date format can vary. I will use this information to put the data into a structured format.

What is the best way to do this in R? I believe gsub could be used here, but I'm not sure how to apply it so that it counts only the spaces right after the date on each line.

Curious asked Apr 03 '17 22:04


2 Answers

One approach is to use regexpr, which returns information about the first match of a regular expression. Here you are looking for the first run of whitespace, so the result tells you (1) the position in each string where that run starts, and (2) in the match.length attribute, how many whitespace characters it contains:

regexpr("\\s+", data)
# [1] 14 14 14
# attr(,"match.length")
# [1]  6 14  2
# attr(,"useBytes")
# [1] TRUE

You can then use attr to extract the match.length attribute:

attr(regexpr("\\s+", data), "match.length")
# [1]  6 14  2
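
Since the goal is to restructure the data, it may help to keep the start position and the length together. The following is a minimal sketch, not part of the original answer: the names m, gap_start, gap_length and product_start are made up for illustration, and the sample data is rebuilt with strrep so that the space counts after the date are explicit.

```r
# Rebuild the sample data with explicit space counts (6, 14 and 2 after the date).
data <- c(
  paste0("24-March-2017", strrep(" ", 6),  "product 1", strrep(" ", 14), "color 1"),
  paste0("March-2017-24", strrep(" ", 14), "product 2", strrep(" ", 17), "color 2"),
  paste0("2017-24-March", strrep(" ", 2),  "product 3", strrep(" ", 14), "color 3")
)

m <- regexpr("\\s+", data)               # first run of whitespace in each string
gap_start     <- as.integer(m)           # position where that run starts
gap_length    <- attr(m, "match.length") # number of whitespace characters in it
product_start <- gap_start + gap_length  # first character after the run

data.frame(gap_start, gap_length, product_start)
```

Here product_start is simply the start of the run plus its length, so substr(data, product_start, product_start) should return the first letter of the product column.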

EDIT

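As pointed out by @xehpuk, \\s+ matches a run of one or more whitespace characters, so it will also match a single space. If your date column contained spaces, the first match would fall inside the date, which would be a problem. In that case you'd need to use \\s{2,} instead, which only matches runs of two or more.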
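
To illustrate the difference, here is a small sketch on hypothetical input (data2 is made up for this example and is not from the question) where the dates themselves contain single spaces. It assumes the gap between the date and the product is always at least two characters wide; if a line used a single space as the separator, \\s{2,} would skip past it too.

```r
# Hypothetical input: dates written with single spaces inside them.
data2 <- c(
  paste0("24 March 2017", strrep(" ", 6), "product 1", strrep(" ", 4), "color 1"),
  paste0("March 2017 24", strrep(" ", 2), "product 2", strrep(" ", 4), "color 2")
)

# "\\s+" stops at the first single space inside the date.
len_any <- attr(regexpr("\\s+", data2), "match.length")
len_any
# [1] 1 1

# "\\s{2,}" ignores single spaces and matches the first run of two or more.
len_gap <- attr(regexpr("\\s{2,}", data2), "match.length")
len_gap
# [1] 6 2
```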

sinQueso answered Sep 19 '22 17:09


You can use sub to keep only that run of whitespace, then count its characters with nchar.

nchar(sub("\\S+(\\s+).*", "\\1", data))
# [1]  6 14  2

Or this one is kinda fun. sub removes only the first run of whitespace, so the difference in length is the size of that run:

nchar(data) - nchar(sub("\\s+", "", data))
# [1]  6 14  2

Rich Scriven answered Sep 19 '22 17:09