I'd like to write a Clojure function that takes a string in one encoding and converts it to another, the way the iconv library does.
For example, let's look at the character "è". In ISO-8859-1 (http://www.ascii-code.com/), that's e8 as hex. In UTF-8 (http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char), it's c3 a8.
So let's say we have iso.txt, which contains our letter and EOL:
$ hexdump iso.txt
0000000 e8 0a
0000002
Now we can convert it to UTF-8 like this:
$ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
0000000 c3 a8 0a
0000003
How should I write something equivalent in Clojure? I'm happy to use any external libraries, but I don't know where to find them. Looking around, I couldn't figure out how to use libiconv itself on the JVM, but there's probably an alternative?
Edit
After reading Alex's link in the comment, this is so simple and so cool:
user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
"è"
user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
"è"
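Building on those two REPL calls, the iconv step can be sketched as "decode the bytes with the source charset, then re-encode with the target charset". The names convert-encoding and hex below are hypothetical helpers made up for illustration; only the java.lang.String constructor and .getBytes are standard JVM calls.

```clojure
;; Sketch only: convert-encoding and hex are illustrative names,
;; not functions from Clojure or any library.
(defn convert-encoding
  "Decodes the byte array bs using from-enc and returns a new
  byte array holding the same text encoded as to-enc."
  [^bytes bs ^String from-enc ^String to-enc]
  (.getBytes (String. bs from-enc) to-enc))

(defn hex
  "Renders each byte as two hex digits, similar to hexdump output."
  [bs]
  (mapv #(format "%02x" (bit-and % 0xff)) bs))

;; The contents of iso.txt: "è" in ISO-8859-1 followed by a newline.
(def iso-bytes (byte-array [(unchecked-byte 0xe8) (byte 0x0a)]))

(hex (convert-encoding iso-bytes "ISO-8859-1" "UTF-8"))
;; => ["c3" "a8" "0a"]
```

This works on byte arrays rather than strings because a JVM String has no encoding of its own; the encoding only matters when converting to or from bytes.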
If you want a simple whole-file conversion to UTF-8, slurp allows for specifying the file encoding with the :encoding option, and spit will output UTF-8 by default. This method will read the entire file into memory, so large files might require a different approach.
$ printf "\xe8\n" > iso.txt
$ hexdump iso.txt
0000000 e8 0a
0000002
(spit "/Users/path/iso2.txt"
(slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))
$ hexdump iso2.txt
0000000 c3 a8 0a
0000003
Note: slurp will assume UTF-8 if you do not specify an encoding.
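For files too large to slurp, one possible approach is to stream characters from a reader to a writer, each opened with its own :encoding. This is a minimal sketch, assuming clojure.java.io is available (it ships with Clojure); convert-file is a hypothetical name chosen for illustration.

```clojure
(require '[clojure.java.io :as io])

;; Sketch only: convert-file is an illustrative name.
(defn convert-file
  "Copies in-path to out-path, decoding with in-enc and encoding
  with out-enc. io/copy moves the data through a buffer, so the
  whole file is never held in memory at once."
  [in-path in-enc out-path out-enc]
  (with-open [r (io/reader in-path :encoding in-enc)
              w (io/writer out-path :encoding out-enc)]
    (io/copy r w)))
```

With the files from the example above, the call would look like (convert-file "/Users/path/iso.txt" "ISO-8859-1" "/Users/path/iso2.txt" "UTF-8").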