I have a character vector that holds the output of some PDF scraping via pdftotext (a command-line tool).
Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
Clearly there's some character that isn't surviving the dput round trip, as in the question below:
How to properly dput internationalized text?
I can't copy/paste the entire vector.... How do I search-and-destroy this non-whitespace whitespace?
Edit
Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
There is a single space between the words "Clinic" and "Information" printed on the screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.
Upgrading my comment to an answer:
Your string contains a non-breaking space (U+00A0) which got translated to a normal space when you pasted it. Matching all the strange space-like characters in Unicode is easy with a perl-style regular expression:
grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)
The Perl regexp syntax is \p{categoryName}; the extra backslash is part of the syntax of a string containing a backslash, and "Zs" is the "Separator" Unicode category, "space" subcategory. A simpler method for just the U+00A0 character would be
grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
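If the goal is to find and then remove the odd character rather than just match around it, something along these lines may help. This is only a sketch: the vector x is a made-up stand-in for the real data, with the non-breaking space written as the escape \u00a0, and it assumes R's PCRE build has Unicode property support (the same assumption the \p{Zs} pattern above makes).

```r
# Hypothetical sample; "\u00a0" is a non-breaking space written as an escape
x <- c("351\u00a0South\u00a0Washburn", "Clinic\u00a0Information:")

# Inspect the Unicode code points of one element; 160 (hex A0) is U+00A0
utf8ToInt(x[1])

# Replace every Unicode space separator (category Zs) with an ordinary space
fixed <- gsub("\\p{Zs}", " ", x, perl = TRUE)

# The plain-space patterns from the question can now be tried on the result
grepl("[0-9]+ [A-Za-z ]+", fixed)
grepl("Clinic Information:", fixed[2], fixed = TRUE)
```

Looking at the code points first is handy because it tells you exactly which character you are dealing with before you replace anything.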
I think you're after trailing and leading white space. If so maybe this function will work:
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Also keep an eye out for tabs and such and this may be useful:
clean <- function(text) {
gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}
so use the clean and then the Trim as in:
Trim(clean(test))
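Whether \\s matches a non-breaking space depends on the regex engine and platform, so a variant that names the Unicode space category explicitly may be safer. The function name squish and the sample input are made up for illustration; this is a sketch that assumes perl = TRUE with Unicode property support is available.

```r
# Hypothetical helper: collapse runs of ASCII whitespace and Unicode space
# separators (category Zs, which includes U+00A0) into one ordinary space,
# then drop a single leading or trailing space if one is left over
squish <- function(x) {
  x <- gsub("[\\s\\p{Zs}]+", " ", x, perl = TRUE)
  gsub("^ | $", "", x)
}

# Made-up input with a trailing non-breaking space and a mixed run
cleaned <- squish(c("Store\u00a0", "Clinic\u00a0\tInformation:"))
cleaned
```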
Also be on the lookout for the en dash (–) and the em dash (—).
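The phone number in the sample output looks like it uses a Unicode hyphen (U+2010) rather than the ASCII one, so the same trick may be worth applying to dashes. A minimal sketch, with made-up input strings and again assuming PCRE with Unicode property support:

```r
# Hypothetical samples using U+2010 (hyphen), U+2013 (en dash), U+2014 (em dash)
dashed <- c("920\u2010232\u20100718", "9\u201317", "a\u2014b")

# \p{Pd} is the Unicode "dash punctuation" category; map each member to "-"
plain <- gsub("\\p{Pd}", "-", dashed, perl = TRUE)
plain
```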