Using German characters (ü, ö, ä, etc.) in text analysis (R)

Question

I'm doing some text mining in R. The text I want to analyse is in German.

The problem is that German characters aren't displayed correctly either in the text itself or in the results.

I'm working on Mac OS.

I've found similar threads on here and tried out proposed solutions:

Sys.setlocale("LC_ALL", "de_DE.UTF-8")

seems to change the language (i.e. doesn't give an error message) but the characters are still displayed incorrectly, e.g. Erste-Hilfe-Ma\xa7nahmen instead of Erste-Hilfe-Maßnahmen.

text <- readLines("Erste Hilfe.txt", encoding="de_DE.UTF-8")

Result: Erste-Hilfe-Ma\xa7nahmen

text <- readLines("Erste Hilfe.txt", encoding="ISO/IEC 8859-15")

Result: Erste-Hilfe-Ma\xa7nahmen

Do you have any other solutions?

JBGruber · Accepted Answer

It depends a bit on what you want to do with the files, but generally stri_read_lines() from the stringi package handles umlauts well, even when encoding is left on "auto":

library(stringi)

lines <- stri_read_lines("Erste Hilfe.txt", encoding = "auto")

If you display the lines vector and it still has problems, you can try to detect the encoding:

lines_raw <- stri_read_raw("Erste Hilfe.txt")

stri_enc_detect(lines_raw)

Output will look something like this:

      Encoding Language Confidence
1        UTF-8                1.00
2 windows-1252       de       0.55
3         Big5       zh       0.44
4 windows-1254       tr       0.25
5 windows-1250       hu       0.14
6     UTF-16BE                0.10
7     UTF-16LE                0.10
8      GB18030       zh       0.10
9   IBM424_rtl       he       0.01

In this case, I read in a UTF-8 text file with many umlauts, and stringi had no problem guessing the encoding correctly. If the confidence is not as high for your file, you might want to try a few encodings explicitly.
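To try a few encodings systematically, a minimal sketch could decode the same raw bytes under each candidate and compare the results. The temp file and the candidate list are assumptions for illustration. "macintosh" (Mac OS Roman) is included because the \xa7 in the question is consistent with that encoding, where byte 0xA7 is ß; that is an inference from the output shown, not something confirmed in the thread.

```r
library(stringi)

# Self-contained test data: the bytes of "Erste-Hilfe-Maßnahmen" in Mac OS Roman.
# \u00df is the escape for ß, so this script does not depend on its own encoding.
expected <- "Erste-Hilfe-Ma\u00dfnahmen"
path <- tempfile(fileext = ".txt")
writeBin(stri_encode(expected, from = "UTF-8", to = "macintosh", to_raw = TRUE)[[1]], path)

# Candidate encodings to try (an assumed list, extend it as needed).
candidates <- c("UTF-8", "windows-1252", "ISO-8859-15", "macintosh")

# Decode the same bytes under every candidate; warnings about invalid
# byte sequences are expected for the wrong candidates.
raw_bytes <- stri_read_raw(path)
decoded <- vapply(
  candidates,
  function(enc) suppressWarnings(stri_encode(raw_bytes, from = enc, to = "UTF-8")),
  character(1)
)

print(decoded)

# The candidate(s) that reproduce the text you expect to see
matching <- names(decoded)[decoded == expected]
print(matching)
```

Once a candidate looks right, you can pass it explicitly, e.g. stri_read_lines(path, encoding = "macintosh").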

I can also display it in the Console without any problem (despite having set the locale to en_GB.UTF-8), but in some cases the Console itself might be the problem. To check whether the encoding was really destroyed when reading in the file, or whether the Console just can't display it, you can write the lines back to a file and inspect that file:

stri_write_lines(lines, "Erste Hilfe_new.txt")
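To check the written file without relying on the Console at all, one option is to look at its bytes. The following is a sketch with made-up sample lines and a temp file (both assumptions); in valid UTF-8, ß is stored as the two bytes c3 9f.

```r
library(stringi)

# Sample lines (hypothetical); \u escapes keep the script encoding-independent.
lines <- c("Erste-Hilfe-Ma\u00dfnahmen", "\u00e4\u00f6\u00fc")

out <- tempfile(fileext = ".txt")
stri_write_lines(lines, out)        # stri_write_lines() writes UTF-8 by default

bytes <- stri_read_raw(out)         # raw bytes, no decoding involved
is_utf8 <- stri_enc_isutf8(bytes)   # TRUE if the bytes form valid UTF-8

# Look for the UTF-8 byte pair of ß (c3 9f) anywhere in the file.
idx <- which(bytes == as.raw(0xc3))
has_eszett <- any(bytes[idx + 1] == as.raw(0x9f))

print(is_utf8)
print(has_eszett)
```

If both checks pass for your own file but the Console still shows escapes, the data is probably fine and the display is the issue.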

You could also create a character vector with umlauts and see if it gets displayed correctly:

"äöü"
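If the display looks wrong, it can help to separate what is stored in memory from what the Console renders. A small base R sketch (the variable names are just for illustration):

```r
# "äöü" written with \u escapes, so the result does not depend on how
# this script itself was saved.
x <- "\u00e4\u00f6\u00fc"

enc   <- Encoding(x)    # declared encoding mark of the string
bytes <- charToRaw(x)   # bytes actually stored in memory
valid <- validUTF8(x)   # TRUE if those bytes are valid UTF-8

print(enc)
print(bytes)            # UTF-8 stores ä, ö, ü as c3 a4, c3 b6, c3 bc
print(valid)

# Session settings that affect how strings are displayed
print(l10n_info()[["UTF-8"]])      # TRUE if the session locale is UTF-8
print(Sys.getlocale("LC_CTYPE"))
```

If the bytes are correct but the printed string is not, the problem is on the display or locale side rather than in the data.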

Usually, RStudio above version 0.93 shouldn't have a problem with that though. Hope this helps.

Edit:

In the comments, it turned out that the source of the text is still available on the internet, which I had not considered. Encoding problems often happen because some editors force a certain encoding when saving a file. If the source is available online, though, you can read the text directly into R using the rvest package:

library(rvest)
lines <- read_html("https://www.zeit.de/wissen/2018-10/erste-hilfe-kinder-rotes-kreuz-kurs-ersthelfer-notfall/komplettansicht") %>% 
  html_nodes(".article__item") %>% 
  html_text()

> grep("Maßnahmen", lines, value = TRUE)[1]
[1] "In vielen europäischen Ländern, etwa in Belgien und Dänemark, steht Erste Hilfe spätestens in der Sekundarstufe im Schullehrplan. Auch Großbritannien arbeitet an einem Gesetzesentwurf, der vorsieht, dass Grundschulkindern grundlegende Erste-Hilfe-Maßnahmen beigebracht werden. Die Schülerinnen und Schüler weiterführender Schulen sollen in Zukunft die Reanimation üben, also Beatmung und Herzdruckmassage.
"

Please refer to the rvest documentation to see how to determine the correct input for html_nodes(). I usually use the Chrome extension SelectorGadget.

Tags:

text

r

character-encoding

BigMadAndy
