I usually use trimws
to get rid of any blank spaces in text.
Now I have a df that was built from scraped data.
Two of its columns hold money amounts but are chr vectors because they were scraped from a web page, as mentioned before. To one column I can apply trimws
with no problem, but not to the other one.
str(lacuracao_tvs$precio_actual[1])
chr " 1199.00"
Why?
new_precio_actual <- trimws(lacuracao_tvs$precio_actual[1])
dput(new_precio_actual)
" 1199.00"
trimws works on precio_antes but not on precio_actual:
> str(lacuracao_tvs)
'data.frame': 100 obs. of 4 variables:
$ ecommerce : chr "la-curacao" "la-curacao" "la-curacao" "la-curacao" ...
$ producto : chr "TV LED AOC Ultra HD Smart 50\" LE50U7970" "TV Samsung Ultra HD 4K Smart 58\" UN-58RU7100G" "TV LG Ultra HD 4K Smart AI 55\" 55UK6200" "TV AOC Ultra HD 4K Smart 55\" 55U6285" ...
$ precio_antes : chr "1899.00" "1899.00" "1899.00" "1899.00" ...
$ precio_actual: chr " 1199.00" " 1199.00" " 1199.00" " 1199.00" ...
SessionInfo:
Sys.info()
sysname release version nodename
"Windows" "10 x64" "build 17763" "DESKTOP-MNDUKBD"
machine login user effective_user
"x86-64" "OGONZALES" "OGONZALES" "OGONZALES"
> sessionInfo(package = NULL)
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.7.8 rvest_0.3.2 xml2_1.2.0 RSelenium_1.7.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 rstudioapi_0.9.0 bindr_0.1.1 magrittr_1.5
[5] rappdirs_0.3.1 tidyselect_0.2.5 R6_2.3.0 rlang_0.3.1
[9] stringr_1.3.1 httr_1.4.0 caTools_1.17.1.1 tools_3.5.2
[13] binman_0.1.1 selectr_0.4-1 semver_0.2.0 subprocess_0.8.3
[17] yaml_2.2.0 openssl_1.1 assertthat_0.2.0 tibble_2.0.1
[21] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5 bitops_1.0-6
[25] curl_3.3 glue_1.3.0 wdman_0.2.4 stringi_1.2.4
[29] compiler_3.5.2 pillar_1.3.1 XML_3.98-1.20 jsonlite_1.6
[33] pkgconfig_2.0.2
UPDATE 1:
utf8ToInt(lacuracao_tvs$precio_actual[1])
[1] 160 49 49 57 57 46 48 48
The character with code point 160 (U+00A0) is called a "non-breaking space." It is not an ASCII character, since ASCII ends at 127. You can read about it on Wikipedia:
https://en.wikipedia.org/wiki/Non-breaking_space
The trimws() function does not include it in the list of characters that it removes:
x <- intToUtf8(c(160,49,49,57,57,46,48,48))
x
#[1] " 1199.00"
trimws(x)
#[1] " 1199.00"
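If you are not sure what the leading character actually is, inspecting the code points makes it visible. This is only a minimal sketch in base R; the variable names (code_points, has_leading_nbsp, plain_space_cp) are made up for illustration.

```r
# Rebuild the scraped value from its code points
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# One integer per character; the first one is the suspicious "space"
code_points <- utf8ToInt(x)
code_points

# 160 is U+00A0 (no-break space); an ordinary space is 32
has_leading_nbsp <- code_points[1] == 160L
plain_space_cp <- utf8ToInt(" ")
```

Here has_leading_nbsp is TRUE and plain_space_cp is 32, so the two characters look alike on screen but are different code points.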
One way to get rid of it is to use the str_trim() function from the stringr package:
library(stringr)
y <- str_trim(x)
trimws(y)
#[1] "1199.00"
Another way is to apply the iconv() function first:
y <- iconv(x, from = 'UTF-8', to = 'ASCII//TRANSLIT')
trimws(y)
#[1] "1199.00"
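If you would rather avoid both an extra package and transliteration, a base R alternative (not part of the approaches above, just a sketch) is to swap the no-break space for an ordinary space and then trim as usual. The names cleaned and price are hypothetical.

```r
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# Replace every no-break space (U+00A0) with a regular space,
# then let trimws() strip it with its default character class
cleaned <- trimws(gsub("\u00A0", " ", x, fixed = TRUE))

# The cleaned string can now be converted to a number
price <- as.numeric(cleaned)
```

This should also work on R versions older than 3.6.0, since it relies only on the default behaviour of trimws(). Note that it replaces no-break spaces anywhere in the string, not just at the ends.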
UPDATE: Why trimws() does not remove the "invisible" character described above while stringr::str_trim() does.
Here is what the trimws() help says:
For portability, ‘whitespace’ is taken as the character class [ \t\r\n] (space, horizontal tab, line feed, carriage return)
The stringr::str_trim() help topic does not itself specify what counts as "white space", but if you look at the help for stri_trim_both, which str_trim() calls, you will see: stri_trim_both(str, pattern = "\\P{Wspace}")
In other words, it trims everything before the first and after the last character that lacks the Unicode White_Space property, which covers a wider range of characters, including the no-break space.
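You can check the point about the default character class directly. The sketch below tests the class quoted from the trimws() help against the scraped value and against a string with an ordinary leading space; default_class and the two result names are made up for illustration.

```r
x <- intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48))

# The default 'whitespace' class quoted from the trimws() help
default_class <- "[ \t\r\n]"

# Does the string start with a character from that class?
matches_default <- grepl(paste0("^", default_class), x, perl = TRUE)

# Same test on a value that starts with an ordinary space
matches_plain_space <- grepl(paste0("^", default_class), " 1199.00", perl = TRUE)
```

matches_default is FALSE and matches_plain_space is TRUE, which is why trimws() leaves the no-break space untouched.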
UPDATE 2
As @H1 noted, R version 3.6.0 added an argument that lets you specify what to consider a whitespace character:
Internally, 'sub(re, "", *, perl = TRUE)', i.e., PCRE library regular expressions are used. For portability, the default 'whitespace' is the character class '[ \t\r\n]' (space, horizontal tab, carriage return, newline). Alternatively, '[\h\v]' is a good (PCRE) generalization to match all Unicode horizontal and vertical white space characters, see also <URL: https://www.pcre.org>.
So if you are using R 3.6.0 or later, you can simply do:
> trimws(x,whitespace = "[\\h\\v]")
#[1] "1199.00"
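The same call works on a whole column. Below is a small sketch that assumes R 3.6.0 or later; the vector precio_actual stands in for lacuracao_tvs$precio_actual and is rebuilt here from code points, and precio_num is a hypothetical name.

```r
# Stand-in for the scraped column: three values with a leading no-break space
precio_actual <- rep(intToUtf8(c(160, 49, 49, 57, 57, 46, 48, 48)), 3)

# trimws() is vectorised, so the whole column is cleaned in one call
# (the 'whitespace' argument needs R >= 3.6.0)
precio_limpio <- trimws(precio_actual, whitespace = "[\\h\\v]")

# Convert the cleaned strings to numbers
precio_num <- as.numeric(precio_limpio)
```

After this, precio_num is a numeric vector and can be assigned back to the data frame column.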
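From R version 3.6.0, trimws() has a whitespace argument that lets you define what is considered whitespace, which in this case needs to include the no-break space. A character class is used here because trimws() appends + to the pattern, and with a plain alternation the + would apply only to the last alternative.
trimws(x, whitespace = "[\u00A0\\s]")
#[1] "1199.00"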