Where is this whitespace hiding?

Tags: regex, r

I have a character vector which is the output of some PDF scraping via pdftotext (a command-line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:

> test
[1] "Address:"              "Clinic Information:"   "Store "                "351 South Washburn"    "Aurora Quick Care"    
[6] "Info"                  "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718"   "Pewaukee"  

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
+                  "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
+                  "Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown"

Clearly there's some character that isn't preserved by dput, as in the question below:

How to properly dput internationalized text?

I can't copy/paste the entire vector. How do I search-and-destroy this non-whitespace whitespace?

Edit

Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:

> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE

There is a single space between the words "Clinic" and "Information" printed on the screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.

Ari B. Friedman asked Jul 28 '12 16:07




2 Answers

Upgrading my comment to an answer:

Your string contains a non-breaking space (U+00A0), which got translated to a normal space when you pasted it. Matching all the strange space-like characters in Unicode is easy with a Perl-style regular expression:

grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

The Perl regexp syntax is \p{categoryName}. The extra backslash is there because a backslash inside an R string literal must itself be escaped, and "Zs" is the Unicode "Separator" category, "space" subcategory. A simpler method for just the U+00A0 character would be

grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
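
To confirm which character is actually hiding in the string, it can help to look at the code points directly. The following is a minimal sketch, not part of the original answer: `x` is a stand-in for `test[2]` built with an explicit `\u00a0` (an assumption, since the real vector cannot be copy/pasted intact), and it assumes the strings are valid UTF-8.

```r
# Stand-in for test[2]: "Clinic" + U+00A0 + "Information:" (assumed for illustration)
x <- "Clinic\u00a0Information:"

# Code points of each character; 160 (0xA0) is the non-breaking space
cp <- utf8ToInt(x)
print(cp[7])

# A literal ASCII space in the pattern does not match U+00A0
print(grepl("Clinic Information:", x, fixed = TRUE))

# Replace every non-breaking space with a regular space, then match again
fixed_x <- gsub("\u00a0", " ", x, fixed = TRUE)
print(grepl("Clinic Information:", fixed_x, fixed = TRUE))
```

If `utf8ToInt()` returns `NA` on the real data, the strings are probably not UTF-8; `charToRaw(test[2])` shows the underlying bytes regardless of encoding.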
Alan Curry answered Sep 28 '22 01:09



I think you're after trailing and leading white space. If so maybe this function will work:

Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also keep an eye out for tabs and such; this may be useful:

clean <- function(text) {
    gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

So use clean and then Trim, as in:

Trim(clean(test))

Also be on the lookout for the en dash (–) and the em dash (—).
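
Depending on the regex engine and locale, `\\s` may not match a non-breaking space, so `Trim` and `clean` alone might leave it in place. One possible way to cover Unicode spaces and dashes together is sketched below. `normalize_text` is a hypothetical helper, not part of the original answer, and the sample vector with `\u00a0` and `\u2010` is made up for illustration. It assumes UTF-8 input and a PCRE build with Unicode property support (`\p{Zs}` for space separators, `\p{Pd}` for dash punctuation).

```r
# Hypothetical helper: map Unicode space separators and dashes to ASCII
normalize_text <- function(text) {
  # Any run of Unicode space separators (incl. U+00A0) becomes one space
  text <- gsub("\\p{Zs}+", " ", text, perl = TRUE)
  # Any Unicode dash (hyphen U+2010, en dash, em dash, ...) becomes "-"
  text <- gsub("\\p{Pd}", "-", text, perl = TRUE)
  # Trim leading and trailing spaces
  gsub("^ +| +$", "", text)
}

# Illustrative input with explicit escapes for the hidden characters
x <- c("Clinic\u00a0Information:", "Phone: 920\u2010232\u20100718", " Store ")
cleaned <- normalize_text(x)
print(cleaned)
```

After this step, plain patterns such as `grepl("Clinic Information:", cleaned)` should match with an ordinary space.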

Tyler Rinker answered Sep 27 '22 23:09
