Sort map with (Spanish) accented words in Rcpp

Tags: c++, sorting, std, r, rcpp

While I can successfully sort Spanish words with accented vowels by specifying a UTF-8 locale within std::sort,

// [[Rcpp::export]]
std::vector<std::string> sort_words(std::vector<std::string> x) {
  std::sort(x.begin(), x.end(), std::locale("en_US.UTF-8"));
  return x;
}

/*** R
words <- c("casa", "árbol", "zona", "árbol", "casa", "libro")
sort_words(words)
*/

returns (as expected):
[1] "árbol" "árbol" "casa"  "casa"  "libro" "zona"

I can't figure out how to do the same with a map:

// slightly modified version of tableC on http://adv-r.had.co.nz/Rcpp.html
// [[Rcpp::export]]
std::map<String, int> table_words(CharacterVector x) {
  std::setlocale(LC_ALL, "en_US.UTF-8");
  // std::setlocale(LC_COLLATE, "en_US.UTF-8"); // also tried this instead of previous line
  std::map<String, int> counts;
  int n = x.size();
  for (int i = 0; i < n; i++) {
    counts[x[i]]++;
  }
  return counts;
}

/*** R
words <- c("casa", "árbol", "zona", "árbol", "casa", "libro")
table_words(words)
*/

returns:
casa    libro   zona    árbol
    2       1       1       2

but I want:
árbol   casa    libro   zona    
    2       2       1       1

Any ideas on how to have table_words put the accented "árbol" before "casa", with Rcpp or even back out in R, with base::sort?

Also, std::sort(..., std::locale("en_US.UTF-8")) only works on my Linux machine (gcc version 4.8.2, Ubuntu 4.8.2-19ubuntu1). It does not work on my Mac running OS X 10.10.3 with Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn). Any clues on what my Mac compiler is missing that my Linux compiler has?

Here's my script and my sessionInfo, for both machines:

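// [[Rcpp::plugins(cpp11)]]
#include <algorithm>  // std::sort
#include <clocale>    // std::setlocale
#include <locale>     // std::locale
#include <map>
#include <string>
#include <vector>
#include <Rcpp.h>
using namespace Rcpp;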

// [[Rcpp::export]]
std::vector<std::string> sort_words(std::vector<std::string> x) {
  std::sort(x.begin(), x.end(), std::locale("en_US.UTF-8"));
  return x;
}

// [[Rcpp::export]]
std::map<String, int> table_words(CharacterVector x) {
  // std::setlocale(LC_ALL, "en_US.UTF-8"); // tried this instead of next line
  std::setlocale(LC_COLLATE, "en_US.UTF-8");
  std::map<String, int> counts;
  int n = x.size();
  for (int i = 0; i < n; i++) {
    counts[x[i]]++;
  }
  return counts;
}

/*** R
words <- c("casa", "árbol", "zona", "árbol", "casa", "libro")
sort_words(words)
table_words(words)
sort(table_words(words), decreasing = T)
output_from_Rcpp <- table_words(words)
sort(names(output_from_Rcpp))
*/

> words <- c("casa", "árbol", "zona", "árbol", "casa", "libro")

> sort_words(words)
[1] "árbol" "árbol" "casa"  "casa"  "libro" "zona" 

> table_words(words)
 casa libro  zona árbol 
    2     1     1     2 

> sort(table_words(words), decreasing = T)
 casa árbol libro  zona 
    2     2     1     1 

> output_from_Rcpp <- table_words(words)

> sort(names(output_from_Rcpp))
[1] "árbol" "casa"  "libro" "zona" 

sessionInfo on linux machine:
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.2.0 Rcpp_0.11.6

sessionInfo on Mac:
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] textcat_1.0-3 readr_0.1.1   rvest_0.2.0  

loaded via a namespace (and not attached):
 [1] httr_1.0.0    selectr_0.2-3 R6_2.1.0      magrittr_1.5  tools_3.2.1   curl_0.9.1    Rcpp_0.11.6   slam_0.1-32   stringi_0.5-5
[10] tau_0.0-18    stringr_1.0.0 XML_3.98-1.3 
asked Jul 19 '15 04:07 by Earl Brown


1 Answer

It does not make sense to apply std::sort to a std::map, because a map is always sorted, by definition. The ordering is part of the concrete type instantiated from the template: std::map has a third, "hidden" type parameter for the comparison function used to order keys, which defaults to std::less for the key type. std::less simply calls operator< on the keys and, as your output shows, that comparison ignores the locale set with std::setlocale. See http://en.cppreference.com/w/cpp/container/map.
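To illustrate why std::setlocale does not help here, the following is a minimal sketch using std::string keys (an assumption: the question uses Rcpp::String, whose operator< appears to behave the same way, judging by the output shown there). The function name default_key_order is made up for this example.

```cpp
#include <clocale>
#include <map>
#include <string>
#include <vector>

// Returns the keys of a default-ordered map in iteration order.
// std::map<std::string, int> orders keys with std::less<std::string>,
// which compares the strings byte by byte and never consults a locale.
std::vector<std::string> default_key_order(const std::vector<std::string>& words) {
    // This call does not influence the map below. If the locale is not
    // installed, setlocale just returns a null pointer.
    std::setlocale(LC_COLLATE, "en_US.UTF-8");

    std::map<std::string, int> counts;
    for (const std::string& w : words) {
        ++counts[w];
    }

    std::vector<std::string> keys;
    for (const auto& entry : counts) {
        keys.push_back(entry.first);
    }
    return keys;
}
```

Because the first byte of "á" in UTF-8 is larger than any ASCII letter, "árbol" ends up after "zona" with this default ordering, whatever the C locale is set to.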

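In your case, you can use std::locale as the comparison type, because std::locale has an operator() that compares two strings according to the locale's collation rules. Pass the locale to the map's constructor. Locale names are platform-specific: std::locale("en-US") works with Visual C++, whereas Linux and OS X typically expect a name such as "en_US.UTF-8", so use whatever fits your system.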

Here is an example. It uses C++11, but you can easily use the same solution in C++03.

#include <map>
#include <iostream>
#include <string>
#include <locale>
#include <exception>

using Map = std::map<std::string, int, std::locale>;

int main()
{
    try
    {
        Map map(std::locale("en-US"));
        map["casa"] = 1;
        map["árbol"] = 2;
        map["zona"] = 3;
        map["árbol"] = 4;
        map["casa"] = 5;
        map["libro"] = 6;

        for (auto const& map_entry : map)
        {
            std::cout << map_entry.first << " -> " << map_entry.second << "\n";
        }
    }
    catch (std::exception const& exc)
    {
        std::cerr << exc.what() << "\n";
    }
}

Output:

árbol -> 4
casa -> 5
libro -> 6
zona -> 3
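The same idea can be applied to the word-counting task from the question. This is only a sketch in plain C++: count_words and LocaleMap are hypothetical names, and wiring the result back into R through Rcpp is left out because it has not been verified here.

```cpp
#include <locale>
#include <map>
#include <string>
#include <vector>

// A map whose keys are ordered by a locale's collation rules.
using LocaleMap = std::map<std::string, int, std::locale>;

// Counts how often each word occurs. Iteration order of the result
// follows the collation rules of `loc`.
LocaleMap count_words(const std::vector<std::string>& words, const std::locale& loc) {
    LocaleMap counts(loc);  // the locale object itself is the comparator
    for (const std::string& w : words) {
        ++counts[w];
    }
    return counts;
}
```

A call might look like count_words(words, std::locale("en_US.UTF-8")) on a system where that locale is installed. The std::locale constructor throws std::runtime_error if the name is not recognized.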

Be aware, though, that std::locale is highly implementation-dependent: which locale names exist, and how they collate, varies between platforms and standard libraries. You may be better off with Boost.Locale.
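Since a locale name that works on one machine may be missing on another, it can help to guard the construction. The following is a small sketch, not a recommendation from the original answer; locale_or_classic is a hypothetical helper name.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Tries to construct the named locale. If the platform does not provide
// it, the constructor throws std::runtime_error and we fall back to the
// classic "C" locale, which is always available.
std::locale locale_or_classic(const std::string& name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Note that the fallback changes the resulting order silently, so in real code it may be preferable to report the failure to the caller instead.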

Another problem is that this solution may look confusing, because a std::locale is not exactly something many programmers would associate with a comparison function. It's almost a bit too clever.

Hence a possibly more readable alternative:

#include <map>
#include <iostream>
#include <string>
#include <locale>
#include <exception>

struct ComparisonUsingLocale
{
    std::locale locale{ "en-US" };

    bool operator()(std::string const& lhs, std::string const& rhs) const
    {
        return locale(lhs, rhs);
    }
};

using Map = std::map<std::string, int, ComparisonUsingLocale>;

int main()
{
    try
    {
        Map map;
        map["casa"] = 1;
        map["árbol"] = 2;
        map["zona"] = 3;
        map["árbol"] = 4;
        map["casa"] = 5;
        map["libro"] = 6;

        for (auto const& map_entry : map)
        {
            std::cout << map_entry.first << " -> " << map_entry.second << "\n";
        }
    }
    catch (std::exception const& exc)
    {
        std::cerr << exc.what() << "\n";
    }
}
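For readers wondering what the locale does when it is called like a function: std::locale::operator() compares two strings through the locale's std::collate facet. The sketch below spells that out and takes the locale as a constructor argument instead of hard-coding a name. CollateLess and CollatedMap are hypothetical names chosen for this example.

```cpp
#include <locale>
#include <map>
#include <string>

// Comparator that orders strings using the collate facet of a locale.
struct CollateLess {
    std::locale loc;

    // Defaults to the classic "C" locale, which exists on every platform.
    explicit CollateLess(const std::locale& l = std::locale::classic()) : loc(l) {}

    bool operator()(const std::string& lhs, const std::string& rhs) const {
        // The facet reference stays valid while `loc` is alive.
        const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
        return coll.compare(lhs.data(), lhs.data() + lhs.size(),
                            rhs.data(), rhs.data() + rhs.size()) < 0;
    }
};

using CollatedMap = std::map<std::string, int, CollateLess>;
```

A map with a specific locale could then be created by passing a comparator instance to the constructor, for example by building a CollateLess from std::locale("en_US.UTF-8") where that name is available.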
answered Nov 07 '22 16:11 by Christian Hackl