Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove periods/dots in entire data frame

Tags:

r

dplyr

stringr

I have a large dataset with participants from all over the world. Some of these participants entered data using dots/periods/commas to indicate the thousand separators, but R reads them as commas which totally skews my data... e.g. 1234 become 1,234.

I want to remove all dots/periods/commas. My data is entirely composed of full numbers so there shouldn't be any decimals anywhere.

I tried using stringr, but can't quite figure out. Here is a (I hope) reproducible example with a small sample of my data:

structure(
  list(
    chnb = c(10L, 35L, 55L),
    B1_1_77 = c(117.586,
                4022, 4.921),
    C1_1_88 = c(NA, 2206, 1.111),
    C1_1_99 = c(6.172,
                1884, 0),
    C1_3_99 = c(5.62, 129, 0)
  ),
  row.names = c(NA,-3L),
  class = c("tbl_df",
            "tbl", "data.frame")
)

I tried this:

prob1 <- prob %>% str_replace_all('\\.', '')

which gives me this:

> prob
[1] "c(10, 35, 55)"         "c(117586, 4022, 4921)" "c(NA, 2206, 1111)"    
[4] "c(6172, 1884, 0)"      "c(562, 129, 0)"  

It did indeed remove the dots but it gives me a simple list, and totally lost my data structure. An online search suggested I did this:

prob1 <- prob %>% mutate_all(list(str_replace(., '\\.', '')))

but I get an error message:

Error: .fn must be a length 1 string Call rlang::last_error() to see a backtrace In addition: Warning message: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), : argument is not an atomic vector; coercing

Am I approaching the whole thing wrong? Any help would be greatly appreciated. I hope my question is clear enough, my apologies if it isn't (I'm new to this).

like image 262
Andrea Avatar asked Sep 04 '25 17:09

Andrea


1 Answers

You'll want to convert to character, then replace, then convert back to numeric:

library(tidyverse)
dat %>%
  mutate_all(~as.numeric(str_remove_all(as.character(.x), '\\.')))

# A tibble: 3 x 5
   chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    10  117586      NA    6172     562
2    35    4022    2206    1884     129
3    55    4921    1111       0       0

Credit to @camille for the stringr::str_remove_all suggestion.

It's also occurred to me that R may be rounding when you don't intend it to in the instance of a trailing zero. Take the first entry of C1_3_99 in your example, 5.62. This may need to be 5,620 (if the period was a thousands separator), rather than 562, as my first solution gives. You can handle this by using a formatter and thoughtful division:

dat %>%
  mutate_all(~as.numeric(str_remove_all(format(round(.x, 3), nsmall = 3), '\\.')) / 
               if_else(str_detect(.x, "\\."), 1, 1000))

# A tibble: 3 x 5
   chnb B1_1_77 C1_1_88 C1_1_99 C1_3_99
  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    10  117586      NA    6172    5620
2    35    4022    2206    1884     129
3    55    4921    1111       0       0
Warning message:
In (function (..., .x = ..1, .y = ..2, . = ..1)  :
  NAs introduced by coercion

The formatter ensures there are 3 digits after the decimal, but will add three 0s for numbers without a decimal (formatting code yanked from here), so you divide by 1000 if there is no decimal present. More elegant solutions here are welcome.

like image 135
zack Avatar answered Sep 06 '25 14:09

zack