I have a large dataset with participants from all over the world. Some of these participants entered data using dots/periods/commas to indicate the thousand separators, but R reads them as commas which totally skews my data... e.g. 1234 become 1,234.
I want to remove all dots/periods/commas. My data is entirely composed of full numbers so there shouldn't be any decimals anywhere.
I tried using stringr, but can't quite figure out. Here is a (I hope) reproducible example with a small sample of my data:
structure(
list(
chnb = c(10L, 35L, 55L),
B1_1_77 = c(117.586,
4022, 4.921),
C1_1_88 = c(NA, 2206, 1.111),
C1_1_99 = c(6.172,
1884, 0),
C1_3_99 = c(5.62, 129, 0)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame")
)
I tried this:
prob1 <- prob %>% str_replace_all('\\.', '')
which gives me this:
> prob
[1] "c(10, 35, 55)" "c(117586, 4022, 4921)" "c(NA, 2206, 1111)"
[4] "c(6172, 1884, 0)" "c(562, 129, 0)"
It did indeed remove the dots but it gives me a simple list, and totally lost my data structure. An online search suggested I did this:
prob1 <- prob %>% mutate_all(list(str_replace(., '\\.', '')))
but I get an error message:
Error:
.fn
must be a length 1 string Callrlang::last_error()
to see a backtrace In addition: Warning message: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : argument is not an atomic vector; coercing
Am I approaching the whole thing wrong? Any help would be greatly appreciated. I hope my question is clear enough, my apologies if it isn't (I'm new to this).
You'll want to convert to character, then replace, then convert back to numeric:
library(tidyverse)
dat %>%
mutate_all(~as.numeric(str_remove_all(as.character(.x), '\\.')))
# A tibble: 3 x 5
chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 117586 NA 6172 562
2 35 4022 2206 1884 129
3 55 4921 1111 0 0
Credit to @camille for the stringr::str_remove_all
suggestion.
It's also occurred to me that R may be rounding when you don't intend it to in the instance of a trailing zero. Take the first entry of C1_3_99
in your example, 5.62. This may need to be 5,620 (if the period was a thousands separator), rather than 562, as my first solution gives. You can handle this by using a formatter and thoughtful division:
dat %>%
mutate_all(~as.numeric(str_remove_all(format(round(.x, 3), nsmall = 3), '\\.')) /
if_else(str_detect(.x, "\\."), 1, 1000))
# A tibble: 3 x 5
chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 117586 NA 6172 5620
2 35 4022 2206 1884 129
3 55 4921 1111 0 0
Warning message:
In (function (..., .x = ..1, .y = ..2, . = ..1) :
NAs introduced by coercion
The formatter ensures there are 3 digits after the decimal, but will add three 0s for numbers without a decimal (formatting code yanked from here), so you divide by 1000 if there is no decimal present. More elegant solutions here are welcome.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With