I have unstructured data that look like this:
data <- c("24-March-2017      product 1 color 1",
"March-2017-24              product 2 color 2",
"2017-24-March  product 3 color 3")
I would like to count the number of spaces between the date and the first character of the product column for each line. As the sample data show, the date format can vary. This count will be used to put the data into a structured format.
What is the best way to do this in R? I believe gsub can be used here, but I am not sure how to apply it to count only the spaces that follow the date in each line.
One approach is to use regexpr, which returns information about the first match of a given regular expression. In your case, you are looking for the first run of whitespace. The call below tells you (1) where in each string the first whitespace run starts, and (2) in the match.length attribute, how many whitespace characters it contains:
regexpr("\\s+", data)
# [1] 14 14 14
# attr(,"match.length")
# [1] 6 14 2
# attr(,"useBytes")
# [1] TRUE
You can then use attr to extract the match.length attribute:
attr(regexpr("\\s+", data), "match.length")
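Since the count is meant to help put the data into a structured format, the match position and length can also be used to cut each line apart. The following is only a minimal sketch, not a definitive parser: it assumes there is no whitespace inside the date, it rebuilds the sample data with strrep (available from R 3.3.0) so the space counts are explicit, and the names m, start, len and parsed are made up for illustration.

```r
# Sample data rebuilt with explicit space counts (6, 14 and 2).
data <- c(paste0("24-March-2017", strrep(" ", 6), "product 1 color 1"),
          paste0("March-2017-24", strrep(" ", 14), "product 2 color 2"),
          paste0("2017-24-March", strrep(" ", 2), "product 3 color 3"))

m     <- regexpr("\\s+", data)      # first whitespace run in each string
start <- as.integer(m)              # position where the run starts
len   <- attr(m, "match.length")    # number of whitespace characters in the run

# Assumption: the date is everything before the run, the rest everything after it.
parsed <- data.frame(
  date     = substr(data, 1, start - 1),
  n_spaces = len,
  rest     = substr(data, start + len, nchar(data)),
  stringsAsFactors = FALSE
)
parsed
```

The rest column still holds product and color together; splitting it further depends on how those fields are delimited in the real data.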
EDIT
As pointed out by @xehpuk, \\s+ matches a run of one or more whitespace characters. If your date column contained spaces, that could be problematic. Instead you would need to use \\s{2,}, which matches only runs of two or more.
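To illustrate the difference, here is a small sketch with a hypothetical input in which the first date is written with single spaces ("24 March 2017"). The vector data_spaced and the result names n_plus and n_two are assumptions made up for this example, and it presumes the separator is always at least two spaces wide.

```r
# Hypothetical input: the first date contains single spaces.
data_spaced <- c(paste0("24 March 2017", strrep(" ", 6), "product 1 color 1"),
                 paste0("March-2017-24", strrep(" ", 14), "product 2 color 2"))

# "\\s+" stops at the first single space inside the date.
n_plus <- attr(regexpr("\\s+", data_spaced), "match.length")
n_plus
# [1]  1 14

# "\\s{2,}" skips single spaces and matches the first run of two or more.
n_two <- attr(regexpr("\\s{2,}", data_spaced), "match.length")
n_two
# [1]  6 14
```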
You can use sub to keep only the whitespace run that follows the first block of non-space characters, then count its characters with nchar.
nchar(sub("\\S+(\\s+).*", "\\1", data))
# [1] 6 14 2
Or this one is kinda fun: remove the first run of whitespace and compare the string lengths before and after.
nchar(data) - nchar(sub("\\s+", "", data))
# [1] 6 14 2