I have (html-)texts and I want to change the ö
things to real characters like ä, ü, ö, and so on because otherwise the xml-package does not accept it.
So I wrote a little function which cycles through a replacement table (link1, link2) and does replace special character by special character by sp... the function looks like this (only looonger):
html.charconv <- function(text){
replacer <- matrix(c(
"Á", "Á",
"á", "á",
"Â", "Â",
"â", "â",
"´", "´"
)
,ncol=2,byrow=T)
for(i in 1:length(replacer[,1])){
text <- str_replace_all(text,replacer[i,2],replacer[i,1])
}
text
}
How might I speed this up? I thought about vectorization but did not come with any helping solution because for each cycle the result of the last cycle is its starting point.
You can get a significant speedup by constructing your function a bit different, and forget about the text tools. Basically you :
You can do that with following function :
html.fastconv <- function(x,old,new){
xs <- strsplit(x,"&|;")
old <- gsub("&|;","",old)
xs <- lapply(xs,function(i){
id <- match(i,old,0L)
i[id!=0] <- new[id]
return(i)
})
sapply(xs,paste,collapse="")
}
This works as :
> sometext <- c("Ádd somá leÂtterâ acute problems et´ cetera",
+ "Ádd somá leÂtterâ acute p ..." ... [TRUNCATED]
> newchar <- c("Á","á","Â","â","´")
> oldchar <- c("Á","á","Â","â","´")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"
For the record, some benchmarking :
require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
columns=c("test","elapsed","relative"),
replications=1000)
test elapsed relative
2 html.charconv(sometext) 0.79 5.643
1 html.fastconv(sometext, oldchar, newchar) 0.14 1.000
Just for fun, here is a version based on Rcpp
.
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
CharacterVector rcpp_conv(
CharacterVector text, CharacterVector old , CharacterVector new_){
int n = text.size() ;
int nr = old.size() ;
std::string buffer, current_old, current_new ;
size_t pos, current_size ;
CharacterVector res(n) ;
for( int i=0; i<n; i++){
buffer = text[i] ;
for( int j=0; j<nr; j++){
current_old = old[j] ;
current_size = current_old.size() ;
current_new = new_[j] ;
pos = 0 ;
pos = buffer.find( current_old ) ;
while( pos != std::string::npos ){
buffer.replace(
pos, current_size,
current_new
) ;
pos = buffer.find( current_old ) ;
}
}
res[i] = buffer ;
}
return res ;
}
For which I get quite a further performance gain:
> microbenchmark(
+ html.fastconv( sometext,oldchar,newchar),
+ html.fastconvJC(sometext, oldchar, newchar),
+ rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
expr min lq median uq
1 html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060 25.8110 28.134
3 rcpp_conv(sometext, oldchar, newchar) 4.047 5.1555 6.2340 9.275
max
1 256.061
2 40.647
3 25.763
Here is an implementation based on the Rcpp::String
feature, available from Rcpp >= 0.10.2
:
class StringConv{
public:
typedef String result_type ;
StringConv( CharacterVector old_, CharacterVector new__):
nr(old_.size()), old(old_), new_(new__){}
String operator()(String text) const {
for( int i=0; i<nr; i++){
text.replace_all( old[i], new_[i] ) ;
}
return text ;
}
private:
int nr ;
CharacterVector old ;
CharacterVector new_ ;
} ;
// [[Rcpp::export]]
CharacterVector test_sapply_string(
CharacterVector text, CharacterVector old , CharacterVector new_
){
CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
return res ;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With