
How can I find out the internal code representation of a WINDOWS-1252 character?

Tags:

r

I am processing SPSS data from a questionnaire that must have originated in MS Word. Word automatically changes hyphens into long dashes, and these end up as characters that don't display properly, i.e. "-" turns into "ú".

My question: What is the equivalent to utf8ToInt() in the WINDOWS-1252 character set?

utf8ToInt("A")
[1] 65

When I do this with my own data, I get an error:

x <- str_sub(levels(sd$j1)[1], 7, 7)
print(x)
[1] "ú"

utf8ToInt(x)
Error in utf8ToInt(x) : invalid UTF-8 string

However, the contents of x are perfectly usable in grep and gsub expressions.

> Sys.getlocale()
[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"

Asked by Andrie, Mar 05 '11 16:03



2 Answers

If you load the SPSS sav file via read.spss from the foreign package, you can import the data frame with the correct encoding by specifying it in the reencode argument:

read.spss("foo.sav", reencode="CP1252")
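
If the data have already been imported, the same re-encoding can be applied afterwards with iconv. The following is only a sketch: labels_raw is a hypothetical character vector standing in for the affected labels (for example the levels of sd$j1), and it assumes the bytes really are Windows-1252 and that the platform's iconv knows the name "CP1252".

```r
# Hypothetical labels as raw Windows-1252 bytes:
# \x96 is the en dash Word substitutes for a hyphen, \xe9 is e-acute
labels_raw <- c("1\x962", "caf\xe9")

# Declare the source encoding explicitly and convert to UTF-8
labels_utf8 <- iconv(labels_raw, from = "CP1252", to = "UTF-8")

# The converted strings are valid UTF-8, so utf8ToInt() no longer fails
utf8ToInt(labels_utf8[1])  # 49 8211 50  (8211 is U+2013, the en dash)
utf8ToInt(labels_utf8[2])  # 99 97 102 233
```

For a factor, the same call could be applied to its levels, e.g. levels(f) <- iconv(levels(f), "CP1252", "UTF-8").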

Answered by daroczig, Oct 29 '22 11:10


After some head-scratching, lots of reading help files and trial-and-error, I created two little functions that do what I need. encToInt converts its input to UTF-8 and returns the integer vector of Unicode code points for the converted string; intToEnc does the reverse.

# Convert character to integer vector
# Optional encoding specifies encoding of x, defaults to current locale
encToInt <- function(x, encoding=localeToCharset()){
    utf8ToInt(iconv(x, encoding, "UTF-8"))
}

# Convert integer vector to character vector
# Optional encoding specifies the encoding of the result, defaults to current locale
intToEnc <- function(x, encoding=localeToCharset()){
    iconv(intToUtf8(x), "UTF-8", encoding)
}

Some examples:

x <- "\xfa"
encToInt(x)
[1] 250

intToEnc(250)
[1] "ú"
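
One caveat worth illustrating: going through UTF-8 yields Unicode code points, which match the Windows-1252 byte values for most of the range 160-255 (such as "ú" at 250) but not for the range 128-159, where Word's dashes and curly quotes live. A rough sketch of the difference, using base R's charToRaw to read the stored byte directly (this assumes the string still holds its original single-byte Windows-1252 data):

```r
x <- "\x96"  # en dash as a single Windows-1252 byte
y <- "\xfa"  # u-acute as a single Windows-1252 byte

# Byte value as stored, i.e. the Windows-1252 code
as.integer(charToRaw(x))  # 150
as.integer(charToRaw(y))  # 250

# Unicode code point after converting to UTF-8
utf8ToInt(iconv(x, "CP1252", "UTF-8"))  # 8211 (U+2013)
utf8ToInt(iconv(y, "CP1252", "UTF-8"))  # 250
```

So if the Windows-1252 code itself is what is wanted, as.integer(charToRaw(x)) on the unconverted string is the more direct route.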

Answered by Andrie, Oct 29 '22 10:10