Context
I need to clean financial data with mixed formats. The data was entered manually by different departments, some of them using "." as the decimal separator and "," as the grouping separator (e.g. US notation: $1,000,000.00), while others use "," as the decimal separator and "." as the grouping separator (e.g. the notation used in certain European countries: $1.000.000,00).
Input:
Here's a fictional example set:
df <- data.frame(Y2019= c("17.530.000,03","28000000.05", "256.000,23", "23,000",
"256.355.855","2565467,566","225,453.126")
)
Y2019
1 17.530.000,03
2 28000000.05
3 256.000,23
4 23,000
5 256.355.855
6 2565467,566
7 225,453.126
Desired result:
Y2019
1 17530000.03
2 28000000.05
3 256000.23
4 23000.00
5 256355855.00
6 2565467.566
7 225453.126
My attempt:
I got pretty close by treating the first occurrence (starting from the right) of "," or "." as the decimal separator and replacing the other occurrences accordingly. However, some entries have no decimals (e.g. entries 4 and 5) or have a variable number of decimals, which makes this strategy less useful.
Any input is greatly appreciated!
Edit: As per request, I salvaged some of the code of the original attempt. I am sure it could be written a lot cleaner.
library(dplyr)
library(stringr)

df %>%
  mutate(Y2019r = ifelse(str_length(Y2019) - data.frame(str_locate(pattern = ",", Y2019))[, 1] == 2, gsub("\\.", "", Y2019), NA)) %>%
  mutate(Y2019r = ifelse((is.na(Y2019r) & str_length(Y2019) - data.frame(str_locate(pattern = "\\.", Y2019))[, 1] == 2), gsub("\\.", ",", Y2019), Y2019r)) %>%
  mutate(Y2019r = gsub(",", ".", Y2019r))
Y2019 Y2019r
1 17.530.000,03 17530000.03
2 28000000.05 28000000.05
3 256.000,23 256000.23
4 23,000 <NA>
5 256.355.855 <NA>
6 2565467,566 <NA>
7 225,453.126 <NA>
Here's a functional approach that builds up the logic needed to parse the strings you might come across. It comes from thinking about how we parse these strings ourselves when we read them, and trying to emulate that.
I think the key is realising that all we really need to know is whether the value after the last delimiter is decimal or not. If we could somehow label the strings as having a decimal portion it would be easy to parse the strings then.
The following method involves splitting the character strings at the points and commas and trying to label them as having a terminal decimal or not. The split strings will be held as a list of string vectors, with each vector being composed of the "chunks" of digits between the delimiters.
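As a small illustration of the splitting step described above, the sketch below splits a few strings at points and commas. The sample strings and the names `money_strings` and `digit_groups` are chosen here for illustration only.

```r
# Split each string at every "." or "," to get its chunks of digits
money_strings <- c("17.530.000,03", "23,000", "28000000")
digit_groups <- strsplit(money_strings, "[.]|,")

# digit_groups is a list with one character vector per input string:
#   "17" "530" "000" "03"   (three delimiters, four chunks)
#   "23" "000"              (one delimiter, two chunks)
#   "28000000"              (no delimiter, one chunk)
str(digit_groups)
```

Note that the split throws away which delimiter was used, so that information has to be captured separately before deciding how to treat the last chunk.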
First we will write two helper functions to create the final numbers from the string vectors once we have correctly labeled them as having a terminal decimal portion or not:
last_element_is_decimal <- function(x)
{
  as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}
last_element_is_whole <- function(x)
{
  as.numeric(paste0(x, collapse = ""))
}
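To see what the two helpers do, here is a quick sketch that applies both to the same vector of chunks. The helpers are repeated so the snippet runs on its own; the `chunks`, `as_decimal` and `as_whole` names are just for illustration.

```r
# Helpers repeated from above so this snippet is self-contained
last_element_is_decimal <- function(x)
{
  as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}
last_element_is_whole <- function(x)
{
  as.numeric(paste0(x, collapse = ""))
}

# Chunks of "17.530.000,03" after splitting at the delimiters
chunks <- c("17", "530", "000", "03")

# Treat the last chunk as the decimal part: the string "17530000.03" is converted
as_decimal <- last_element_is_decimal(chunks)

# Treat every chunk as part of a whole number: the string "1753000003" is converted
as_whole <- last_element_is_whole(chunks)

format(c(as_decimal, as_whole), nsmall = 2)
```

The two results differ by a factor of 100 here, which is why labeling the last chunk correctly matters so much.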
It will be easy to decide what to do in the event of no delimiters, since we assume these are just whole numbers. Similarly, it is easy to see that any numbers containing both a comma and a stop (in either order) must have a terminal decimal component.
However, it is less obvious what to do when there is only a single type of delimiter; in these cases we have to use the length of the digit chunks to decide. If any chunk is longer than three digits, then a thousands separator isn't in use, and the presence of a delimiter indicates we have a decimal component. If the terminal chunk contains fewer than three digits then we must have a decimal. In all other cases, we assume a whole number.
This says the same thing in code:
decide_last_element <- function(x)
{
  if(max(nchar(x)) > 3)
    return(last_element_is_decimal(x))
  if(nchar(x[length(x)]) < 3)
    return(last_element_is_decimal(x))
  return(last_element_is_whole(x))
}
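A few hand-picked single-delimiter cases may make the rules above more concrete. This is only a sketch: the functions are repeated so it runs on its own, and the example chunk vectors (including the "12,5" case) are made up for illustration.

```r
# Functions repeated from above so this snippet is self-contained
last_element_is_decimal <- function(x)
{
  as.numeric(paste0(paste(x[-length(x)], collapse = ""), ".", x[length(x)]))
}
last_element_is_whole <- function(x)
{
  as.numeric(paste0(x, collapse = ""))
}
decide_last_element <- function(x)
{
  if(max(nchar(x)) > 3)
    return(last_element_is_decimal(x))
  if(nchar(x[length(x)]) < 3)
    return(last_element_is_decimal(x))
  return(last_element_is_whole(x))
}

# "2565467,566": a chunk longer than three digits, so the last chunk is decimal
long_chunk <- decide_last_element(c("2565467", "566"))   # equals 2565467.566

# "12,5": last chunk shorter than three digits, so it is decimal
short_tail <- decide_last_element(c("12", "5"))          # equals 12.5

# "23,000" and "256.355.855": every chunk fits the thousands pattern, so whole
grouped_two <- decide_last_element(c("23", "000"))             # equals 23000
grouped_three <- decide_last_element(c("256", "355", "855"))   # equals 256355855
```

The `"23,000"` case is genuinely ambiguous (it could be 23 with three decimals); the rule simply picks the thousands-separator reading.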
Now we can write our main function. It takes our strings as input and classifies each string as having two types of delimiter, one type of delimiter, or no delimiter. Then we can apply the functions above to each group accordingly.
parse_money <- function(money_strings)
{
  # Classify each string by which delimiters it contains
  any_comma <- grepl(",", money_strings)
  any_point <- grepl("[.]", money_strings)
  both <- any_comma & any_point
  neither <- !any_comma & !any_point
  single <- (any_comma & !any_point) | (any_point & !any_comma)
  # Split into chunks of digits between the delimiters
  digit_groups <- strsplit(money_strings, "[.]|,")
  values <- rep(0, length(money_strings))
  values[neither] <- as.numeric(money_strings[neither])
  # vapply always returns a numeric vector, even when a group is empty
  values[both] <- vapply(digit_groups[both], last_element_is_decimal, numeric(1))
  values[single] <- vapply(digit_groups[single], decide_last_element, numeric(1))
  return(format(round(values, 2), nsmall = 2))
}
So now we can just do
parse_money(df$Y2019)
#> [1] " 17530000.03" " 28000000.05" "   256000.23" "    23000.00" "256355855.00"
#> [6] "  2565467.57" "   225453.13"
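Note I have output the values as strings so that rounding inaccuracies in the console output aren't ascribed to mistakes in the code. The values are also rounded to two decimal places by round(values, 2), which is why entries 6 and 7 show .57 and .13 rather than the three decimals in the desired result; returning values directly instead would give the unrounded numbers.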
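If numeric output with all decimals is preferred, the same rules can be folded into one function that returns numbers rather than formatted strings. The following is only a sketch of that idea, not part of the original answer; `parse_money_numeric`, `join_decimal`, `join_whole` and `parse_one` are hypothetical names introduced here.

```r
# Hypothetical numeric variant: same classification rules, no rounding or formatting
parse_money_numeric <- function(money_strings)
{
  has_comma <- grepl(",", money_strings, fixed = TRUE)
  has_point <- grepl(".", money_strings, fixed = TRUE)
  digit_groups <- strsplit(money_strings, "[.]|,")

  join_decimal <- function(x)
  {
    n <- length(x)
    as.numeric(paste0(paste(x[-n], collapse = ""), ".", x[n]))
  }
  join_whole <- function(x) as.numeric(paste(x, collapse = ""))

  parse_one <- function(chunks, both, neither)
  {
    if (neither) return(join_whole(chunks))   # no delimiter: whole number
    if (both) return(join_decimal(chunks))    # both delimiters: last chunk is decimal
    # Single delimiter type: decide from the chunk lengths
    if (max(nchar(chunks)) > 3 || nchar(chunks[length(chunks)]) < 3)
      return(join_decimal(chunks))
    join_whole(chunks)
  }

  unname(mapply(parse_one, digit_groups,
                has_comma & has_point, !has_comma & !has_point))
}

df <- data.frame(Y2019 = c("17.530.000,03", "28000000.05", "256.000,23", "23,000",
                           "256.355.855", "2565467,566", "225,453.126"))
df$Y2019_num <- parse_money_numeric(df$Y2019)

# Print with enough significant digits to see the decimals
print(df, digits = 12)
```

Keeping the column numeric leaves the display precision to `print` or `format` at the point of use.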