Where is this whitespace hiding?

Tags:

regex

r

I have a character vector that is the output of some PDF scraping via pdftotext (command line tool).

Everything is (blissfully) nicely lined up. However, the vector is riddled with a type of whitespace that eludes my regular expressions:

> test
[1] "Address:"              "Clinic Information:"   "Store "                "351 South Washburn"    "Aurora Quick Care"    
[6] "Info"                  "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718"   "Pewaukee"  

> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
"Pewaukee")

> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn", 
+                  "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718", 
+                  "Pewaukee")

> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8"   "unknown"

Clearly there's some character that isn't surviving the round trip through dput, as in the question below:

How to properly dput internationalized text?

I can't copy/paste the entire vector.... How do I search-and-destroy this non-whitespace whitespace?

Edit

Clearly I wasn't even close to clear because answers are all over the place. Here's an even simpler test case:

> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE

There is a single space between the word "Clinic" and "Information" printed on the screen and in the dput output, but whatever is in the string is not a standard space. My goal is to eliminate this so I can properly grep that element out.


asked Jul 28 '12 16:07

Ari B. Friedman

2 Answers

Upgrading my comment to an answer:

Your string contains a non-breaking space (U+00A0), which got translated to a normal space when you pasted it. Matching all the strange space-like characters in Unicode is easy with a Perl-style regular expression:

grepl("[0-9]+\\p{Zs}[A-Za-z ]+", test, perl=TRUE)

The Perl regexp syntax is \p{categoryName}. The extra backslash is there because a backslash has to be escaped inside an R string literal, and "Zs" is the "space" subcategory of the Unicode "Separator" category. A simpler method for just the U+00A0 character would be:

grepl("[0-9]+[ \\xa0][A-Za-z ]+", test)
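To check what is actually sitting in the gap, it can help to look at the code points directly. The following is only a sketch: `x` is a hypothetical stand-in for `test[4]`, built with a `\u00a0` escape so the example does not depend on copy/paste preserving the character, and `fixed` is an illustrative name.

```r
# Hypothetical stand-in for test[4]: the gap is U+00A0, not an ASCII space
x <- "351\u00a0South Washburn"

# Inspect the code points: 160 (0xA0) is the no-break space, 32 is a plain space
utf8ToInt(x)

# The plain-space pattern misses it; the Unicode-aware pattern finds it
grepl("[0-9]+ [A-Za-z ]+", x)                      # FALSE
grepl("[0-9]+\\p{Zs}[A-Za-z ]+", x, perl = TRUE)   # TRUE

# "Search and destroy": turn every Unicode space separator into a plain space
fixed <- gsub("\\p{Zs}", " ", x, perl = TRUE)
grepl("[0-9]+ [A-Za-z ]+", fixed)                  # TRUE
```

Replacing the separators once up front means the original plain-space patterns can be reused unchanged afterwards.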

answered Sep 28 '22 01:09

Alan Curry

I think you're after trailing and leading whitespace. If so, maybe this function will work:

Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Also keep an eye out for tabs and the like; this may be useful:

clean <- function(text) {
    gsub("\\s+", " ", gsub("\r|\n|\t", " ", text))
}

So use clean and then Trim, as in:

Trim(clean(test))
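Whether `\\s` catches a no-break space can depend on the regex engine and platform, so the two helpers above may leave U+00A0 untouched. A minimal sketch of Unicode-aware variants follows, assuming `perl = TRUE` with Unicode property support is available. The names `clean_unicode` and `trim_unicode` and the sample vector `x` are hypothetical, not part of the original answer.

```r
# Hypothetical Unicode-aware variants of clean() and Trim()
clean_unicode <- function(text) {
  # Collapse any run of ASCII whitespace or Unicode space separators to one space
  gsub("[\\s\\p{Zs}]+", " ", text, perl = TRUE)
}

trim_unicode <- function(x) {
  # Strip leading and trailing whitespace, including Unicode space separators
  gsub("^[\\s\\p{Zs}]+|[\\s\\p{Zs}]+$", "", x, perl = TRUE)
}

# Made-up sample data with no-break spaces written as \u00a0 escapes
x <- c("Clinic\u00a0Information:", "Store\u00a0", "\t351\u00a0\u00a0South Washburn ")
trim_unicode(clean_unicode(x))
```

After this pass, a literal pattern such as "Clinic Information:" typed with an ordinary space should match the cleaned element.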

Also be on the lookout for the en dash (–) and the em dash (—).
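The phone number in the question appears to use a non-ASCII hyphen as well, which would explain why that element alone is marked "UTF-8". One way to normalize such characters, sketched here with a hypothetical `phone` string, is the Unicode "dash punctuation" category `\p{Pd}`:

```r
# Hypothetical example: a phone number written with U+2010 hyphens
phone <- "Phone: 920\u2010232\u20100718"

# \p{Pd} covers the ASCII hyphen-minus, U+2010 hyphen, en dash, em dash, etc.
normalized <- gsub("\\p{Pd}", "-", phone, perl = TRUE)

# An ordinary ASCII pattern now matches
grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", normalized)   # TRUE
```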

answered Sep 27 '22 23:09

Tyler Rinker
