
R-friendly Greek characters

I noticed that I can use some Greek letters as names, while others are either illegal or just aliases for letters from the Latin alphabet.

Basically I can use β or µ (though β is changed to ß when printing, and ß and β act as aliases):

list(β = 1)
# $ß
# [1] 1
list(μ = 1)
# $µ
# [1] 1

α, Γ, δ, ε, Θ, π, Σ, σ, τ, Φ, φ and Ω are allowed but act as aliases for Latin letters.

list(α = 1)
# $a
# [1] 1

αa <- 42
aa
# [1] 42

GG <- 33
ΓΓ 
# [1] 33

Other letters I've tested just don't "work":

ι <- 1
# Error: unexpected input in "\"
Λ <- 1
# Error: unexpected input in "\"
λ <- 1
# Error: unexpected input in "\"

I was surprised about λ as it's defined by the package wrapr's define_lambda, so I assume this depends on the system.

I know that similar or identical-looking characters can have different encodings, and that some of them don't survive copy/paste between apps. The code in this question does return the described output when pasted back into RStudio.
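To make the look-alike issue concrete, here is a minimal sketch that prints the Unicode code points of some of the characters discussed above. It uses \u escapes so that it does not depend on how the text was pasted; the vector names are only labels chosen for this illustration. That µ and ß fit in a single-byte Western code page while μ, β and λ do not is offered as one plausible reading of the behavior above, not as something the question confirms.

```r
# Look-alike characters can be different code points.
# \u escapes avoid any dependence on copy/paste or file encoding.
chars <- c(
  micro_sign   = "\u00b5",  # µ, as printed by list(μ = 1) above
  greek_mu     = "\u03bc",  # μ
  sharp_s      = "\u00df",  # ß
  greek_beta   = "\u03b2",  # β
  greek_lambda = "\u03bb"   # λ, i.e. intToUtf8(955)
)

code_points <- vapply(chars, utf8ToInt, integer(1))
print(code_points)

# Code points up to 255 exist in Latin-1; the Greek letters lie outside it.
fits_latin1 <- code_points <= 255L
print(fits_latin1)
```

Comparing the code points of two pasted characters with utf8ToInt() is a quick way to tell whether they are really the same letter.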

?make.names says:

A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number

So part of the question is: what's a letter, and what's going on here?
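For ASCII input the quoted rule is easy to check. The sketch below treats a name as syntactically valid when make.names() returns it unchanged; is_syntactic is a hypothetical helper written for this illustration, not a base R function. What counts as a "letter" beyond ASCII is exactly the part this does not settle.

```r
# Hypothetical helper: a name counts as syntactically valid when
# make.names() leaves it untouched.
is_syntactic <- function(x) x == make.names(x)

candidates <- c("x", ".x", "1x", ".1x", "a b", "a_b", "if")

fixed <- make.names(candidates)
print(fixed)
# "x" ".x" "X1x" "X.1x" "a.b" "a_b" "if."

valid <- is_syntactic(candidates)
print(setNames(valid, candidates))
```

Names starting with a digit, or with a dot followed by a digit, get an "X" prefix; other invalid characters become dots; reserved words get a trailing dot.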

More specifically:

  • Are there Greek characters that will safely work on all R installations? In particular, are µ and β (or ß) safe to use in a package?
  • Why isn't λ ( intToUtf8(955) ) usable on my system, while it seems to be commonly used by wrapr's users?
  • Are there other non-Latin letters, Greek or not, that I could safely use in my code? (For instance, Norwegian ø looks cool and seems to work on my system.)

This was all prompted by the fact that I'm looking for a one- or two-character function name that wouldn't conflict with an existing or commonly used name, and would look a bit funky. The name . is already used a lot, and I already use .. as well.

From sessionInfo():

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252  
Moody_Mudskipper asked Apr 05 '19 09:04



1 Answer

I'm not an expert by any means, but let's try to analyze the problem. In the end, your R code needs to be understood by the R interpreter, so the source code of make.names() may be helpful:

names <- as.character(names)
names2 <- .Internal(make.names(names, allow_))
if (unique) {
  o <- order(names != names2)
  names2[o] <- make.unique(names2[o])
}
names2

Now, .Internal() calls into the R interpreter itself (written in C), so we need to go a little deeper. The C code responsible for handling the make.names() request can be found here: https://github.com/wch/r-source/blob/0dccb93e114b00b2fcbe75e8721f11a8f2ffdff4/src/main/character.c

A short snippet:

SEXP attribute_hidden do_makenames(SEXP call, SEXP op, SEXP args, SEXP env)
{
    SEXP arg, ans;
    R_xlen_t i, n;
    int l, allow_;
    char *p, *tmp = NULL, *cbuf;
    const char *This;
    Rboolean need_prefix;
    const void *vmax;

    checkArity(op ,args);
    arg = CAR(args);
    if (!isString(arg))
        error(_("non-character names"));
    n = XLENGTH(arg);
    allow_ = asLogical(CADR(args));
    if (allow_ == NA_LOGICAL)
        error(_("invalid '%s' value"), "allow_");
    PROTECT(ans = allocVector(STRSXP, n));
    vmax = vmaxget();
    for (i = 0 ; i < n ; i++) {
        This = translateChar(STRING_ELT(arg, i));
        l = (int) strlen(This);
        /* need to prefix names not beginning with alpha or ., as
           well as . followed by a number */
        need_prefix = FALSE;
        if (mbcslocale && This[0]) {
            int nc = l, used;
            wchar_t wc;
            mbstate_t mb_st;
            const char *pp = This;
            mbs_init(&mb_st);
            used = (int) Mbrtowc(&wc, pp, MB_CUR_MAX, &mb_st);
            pp += used; nc -= used;
            if (wc == L'.') {
                if (nc > 0) {
                    Mbrtowc(&wc, pp, MB_CUR_MAX, &mb_st);
                    if (iswdigit(wc))  need_prefix = TRUE;
                }
            } else if (!iswalpha(wc)) need_prefix = TRUE;
        } else {
            if (This[0] == '.') {
                if (l >= 1 && isdigit(0xff & (int) This[1])) need_prefix = TRUE;
            } else if (!isalpha(0xff & (int) This[0])) need_prefix = TRUE;
        }
        if (need_prefix) {
            tmp = Calloc(l+2, char);
            strcpy(tmp, "X");
            strcat(tmp, translateChar(STRING_ELT(arg, i)));
        } else {
            tmp = Calloc(l+1, char);
            strcpy(tmp, translateChar(STRING_ELT(arg, i)));
        }
        if (mbcslocale) {
            /* This cannot lengthen the string, so safe to overwrite it. */
            int nc = (int) mbstowcs(NULL, tmp, 0);
            if (nc >= 0) {
                wchar_t *wstr = Calloc(nc+1, wchar_t);
                mbstowcs(wstr, tmp, nc+1);
                for (wchar_t * wc = wstr; *wc; wc++) {
                    if (*wc == L'.' || (allow_ && *wc == L'_'))
                        /* leave alone */;
                    else if (!iswalnum((int)*wc)) *wc = L'.';
                }
                wcstombs(tmp, wstr, strlen(tmp)+1);
                Free(wstr);
            } else error(_("invalid multibyte string %d"), i+1);
        } else {
            for (p = tmp; *p; p++) {
                if (*p == '.' || (allow_ && *p == '_')) /* leave alone */;
                else if (!isalnum(0xff & (int)*p)) *p = '.';
                /* else leave alone */
            }
        }
//      l = (int) strlen(tmp);        /* needed? */
        SET_STRING_ELT(ans, i, mkChar(tmp));
        /* do we have a reserved word?  If so the name is invalid */
        if (!isValidName(tmp)) {
            /* FIXME: could use R_Realloc instead */
            cbuf = CallocCharBuf(strlen(tmp) + 1);
            strcpy(cbuf, tmp);
            strcat(cbuf, ".");
            SET_STRING_ELT(ans, i, mkChar(cbuf));
            Free(cbuf);
        }
        Free(tmp);
        vmaxset(vmax);
    }
    UNPROTECT(1);
    return ans;
}

As we can see, the code relies on implementation-dependent data types such as wchar_t (http://icu-project.org/docs/papers/unicode_wchar_t.html) and classifies characters with iswalpha() and iswalnum(), whose results depend on the C library and the active locale. It also first converts each name to the native encoding with translateChar(). This means that the behavior of make.names() depends on how the R interpreter itself was built and on the system it runs on. Because C implementations are not standardized in this respect, no assumption can be made about how non-ASCII characters behave: the operating system, hardware, locale, etc. can all change the result.
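One way to see this locale dependence from within R is to run the same check under two LC_CTYPE settings. This is only a sketch: is_syntactic and with_ctype are hypothetical helpers written for this illustration, and the results for the non-ASCII candidates are expected to vary between systems, so no output is shown.

```r
# Hypothetical helper: TRUE when make.names() leaves the name unchanged.
is_syntactic <- function(x) x == make.names(x)

# Hypothetical helper: run the check under a given LC_CTYPE setting,
# restoring the previous setting afterwards.
with_ctype <- function(locale, x) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", locale)
  is_syntactic(x)
}

# "x" is an ASCII control case; the rest are µ, ß, β, λ and ø as escapes.
candidates <- c("x", "\u00b5", "\u00df", "\u03b2", "\u03bb", "\u00f8")

result <- data.frame(
  code_point = vapply(candidates, utf8ToInt, integer(1), USE.NAMES = FALSE),
  current    = is_syntactic(candidates),      # session's own locale
  c_locale   = with_ctype("C", candidates)    # plain "C" locale
)
print(result)
```

Any row where the two logical columns disagree is a character whose validity as a name depends on the locale rather than on R itself.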

In conclusion, I would stick to ASCII characters if you want to be safe, especially when sharing your code between different operating systems.
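If you go the ASCII-only route, a quick scan can flag offending lines before the code is shared. The sketch below uses iconv(), which returns NA for input it cannot convert, together with tools::showNonASCII(); the code_lines vector is made-up sample input standing in for the lines of a source file.

```r
# Made-up sample input: two lines of R code, the second one using
# the micro sign as a variable name.
code_lines <- c("f <- function(x) x + 1", "\u00b5 <- 1e-6")

# iconv() yields NA when a string cannot be represented in ASCII.
non_ascii <- is.na(iconv(enc2utf8(code_lines), from = "UTF-8", to = "ASCII"))
print(which(non_ascii))

# Print the offending lines with the non-ASCII bytes made visible.
tools::showNonASCII(code_lines)
```

For a real file, readLines() can supply the lines, and tools::showNonASCIIfile() does the same scan given a file path.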

pgigeruzh answered Nov 15 '22 20:11