Convert ISO-8859-1 to UTF-8 using Groovy

I need to convert an ISO-8859-1 file to UTF-8 encoding without losing any content information...

I have a file which looks like this:

<?xml version="1.0" encoding="ISO-8859-1" ?> 
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>

Now I want to encode it in UTF-8. I tried the following:

f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
ts=new String(f.getBytes("UTF-8"), "UTF-8")
g=new File('c:/temp/myutf8.xml').write(ts)

That didn't work due to String incompatibilities. Then I read something about byte stream readers/writers, StreamingMarkupBuilder, and others...

Then I tried:

f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
mb = new groovy.xml.StreamingMarkupBuilder()
mb.encoding = "UTF-8"

new OutputStreamWriter(new FileOutputStream('c:/temp/myutf8.xml'),'utf-8') << mb.bind {
    mkp.xmlDeclaration()
    out << f
}

This was not at all what I wanted.

I just want to read the content of an XML file with an ISO-8859-1 reader and then write it to a new (or the old) file... Why is this so complicated? :-/

The result should simply be the following, with the file actually encoded in UTF-8:

<?xml version="1.0" encoding="UTF-8" ?> 
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>

Thanks for any answers. Cheers

Booyeoo asked Sep 02 '11 09:09


People also ask

Is ISO 8859 the same as UTF-8?

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
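To make the single-byte versus multibyte difference concrete, here is a minimal Groovy sketch. The variable names are made up for illustration; it only uses the standard `String.getBytes(charsetName)` call.

```groovy
// Compare the byte representations of the same characters in both encodings.
def ascii = 'A'
def umlaut = '\u00DC'   // "Ü", U+00DC, one of the first 256 Unicode characters

// ASCII characters map to the same single byte in ISO-8859-1 and UTF-8.
def sameAscii = Arrays.equals(ascii.getBytes('ISO-8859-1'), ascii.getBytes('UTF-8'))

// A non-ASCII Latin-1 character needs one byte in ISO-8859-1 but two in UTF-8.
byte[] latin1Bytes = umlaut.getBytes('ISO-8859-1')
byte[] utf8Bytes = umlaut.getBytes('UTF-8')

println "ASCII identical: ${sameAscii}"
println "ISO-8859-1 length: ${latin1Bytes.length}, UTF-8 length: ${utf8Bytes.length}"
```

This is why a file containing umlauts grows slightly when it is converted from ISO-8859-1 to UTF-8, while a pure ASCII file stays byte-for-byte the same.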

How do I convert UTF-8 to ISO 8859-1?

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
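As a rough illustration of the lower-level Charset APIs, the following Groovy sketch uses `java.nio.charset.CharsetEncoder` with `CodingErrorAction.REPORT` so that an un-encodable character raises an exception instead of being silently replaced with `?`. The helper name `encodeLatin1Strict` is hypothetical and not part of any library.

```groovy
import java.nio.CharBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Hypothetical helper: encode text as ISO-8859-1 and fail on characters
// that the target charset cannot represent.
byte[] encodeLatin1Strict(String text) {
    def encoder = Charset.forName('ISO-8859-1').newEncoder()
    encoder.onMalformedInput(CodingErrorAction.REPORT)
    encoder.onUnmappableCharacter(CodingErrorAction.REPORT)
    def buffer = encoder.encode(CharBuffer.wrap(text))
    def bytes = new byte[buffer.remaining()]
    buffer.get(bytes)
    return bytes
}

// "Über" fits into ISO-8859-1, one byte per character.
def ok = encodeLatin1Strict('\u00DCber')

// The euro sign (U+20AC) is not part of ISO-8859-1, so encoding is expected to fail.
def failed = false
try {
    encodeLatin1Strict('\u20AC 5')
} catch (CharacterCodingException e) {
    failed = true
}
println "encoded ${ok.length} bytes, strict failure reported: ${failed}"
```

Swapping `CodingErrorAction.REPORT` for `CodingErrorAction.REPLACE` together with `replaceWith(...)` would give the "different replacement character" behaviour instead.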

What is encoding ISO 8859?

ISO/IEC 8859-1 encodes what it refers to as "Latin alphabet no. 1", consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa.


2 Answers

Making it a little more Groovy, and not requiring the whole file to fit in memory, you can use the readers and writers to stream the file. This was my solution when I had files too big for plain old Unix iconv(1).

new FileOutputStream('out.txt').withWriter('UTF-8') { writer ->
    new FileInputStream('in.txt').withReader('ISO-8859-1') { reader ->
        writer << reader
    }
}
  • http://www.hjsoft.com/blog/link/A_Useful_Example_in_Java_Ruby_and_Groovy
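For anyone who wants to check the streaming approach end to end, here is a small self-contained sketch. It assumes throwaway files in a temporary directory (the directory prefix and variable names are made up) and then compares file sizes and content after the conversion.

```groovy
import java.nio.file.Files

// Set up a small ISO-8859-1 input file in a temporary directory (illustrative names).
def workDir = Files.createTempDirectory('encoding-demo').toFile()
def input = new File(workDir, 'in.txt')
def output = new File(workDir, 'out.txt')
def text = '\u00DC\u00F6\u00E4\u00FC\u00DF Test!!!'   // "Üöäüß Test!!!"
input.write(text, 'ISO-8859-1')

// Stream from the ISO-8859-1 reader into the UTF-8 writer, as in the answer above.
new FileOutputStream(output).withWriter('UTF-8') { writer ->
    new FileInputStream(input).withReader('ISO-8859-1') { reader ->
        writer << reader
    }
}

// The text is unchanged; each of the five non-ASCII characters now takes two bytes.
println "${input.length()} bytes in, ${output.length()} bytes out"
println "content preserved: ${output.getText('UTF-8') == text}"
```

Both `withWriter` and `withReader` close their streams when the closure returns, so no explicit cleanup is needed.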
John Flinchbaugh answered Oct 23 '22 01:10


def f=new File('c:/data/myiso88591.xml').getText('ISO-8859-1')
new File('c:/data/myutf8.xml').write(f,'utf-8')

(I just gave it a try, it works :-)

It's the same as in Java: the libraries do the conversion for you. As deceze said, when you specify an encoding on reading, the text is converted to an internal format (UTF-16, AFAIK). When you specify another encoding on writing the string, it is converted to that encoding.
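The point about the internal format can be sketched in a few lines. This is only an illustration with made-up variable names: it shows that a decoded String carries no encoding of its own, so the UTF-8 round trip from the question's first attempt changes nothing, and only the charset chosen when writing decides the output bytes.

```groovy
def original = '\u00DCber'                          // "Über"
byte[] latin1 = original.getBytes('ISO-8859-1')     // the bytes an ISO-8859-1 file would hold

// Decoding with the correct charset yields an ordinary String.
def decoded = new String(latin1, 'ISO-8859-1')

// Round-tripping through UTF-8, as in the question's first attempt, is a no-op.
def roundTripped = new String(decoded.getBytes('UTF-8'), 'UTF-8')
assert roundTripped == decoded

// Only the charset used for output produces different bytes.
byte[] utf8 = decoded.getBytes('UTF-8')
println "ISO-8859-1: ${latin1.length} bytes, UTF-8: ${utf8.length} bytes"
```

In the question's first attempt the plain `write(ts)` call had no charset argument, so the platform default encoding was used for the output file; passing `'utf-8'` to `write` is what makes the difference.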

But if you work with XML, you shouldn't have to worry about the encoding anyway, because the XML parser will take care of it. It reads the first characters (<?xml) and determines the basic encoding from them. After that, it can read the encoding information from your XML header and use it.
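One caveat: a plain text conversion leaves the declaration saying encoding="ISO-8859-1" even though the bytes are now UTF-8, which is not the result the question asks for. A minimal sketch of one way to handle that, assuming the declaration is well-formed and sits at the start of the file, is to rewrite the encoding pseudo-attribute before writing. The helper name `declareUtf8` and the temporary file names are hypothetical.

```groovy
// Hypothetical helper: point the XML declaration's encoding attribute at UTF-8.
String declareUtf8(String xml) {
    xml.replaceFirst(/(<\?xml[^>]*encoding=["'])[^"']*(["'])/, '$1UTF-8$2')
}

// Sample document matching the one in the question, written as ISO-8859-1.
def source = '<?xml version="1.0" encoding="ISO-8859-1" ?>\n' +
        '<HelloEncodingWorld>\u00DC\u00F6\u00E4\u00FC\u00DF\u00DF\u00DF Test!!!</HelloEncodingWorld>'
def isoFile = File.createTempFile('myiso88591', '.xml')
def utfFile = File.createTempFile('myutf8', '.xml')
isoFile.write(source, 'ISO-8859-1')

// Read as ISO-8859-1, fix the declaration, write as UTF-8.
def xml = isoFile.getText('ISO-8859-1')
utfFile.write(declareUtf8(xml), 'UTF-8')

println utfFile.getText('UTF-8')
```

A regular expression is a shortcut here; for anything beyond simple files, parsing and re-serializing the document with an XML library is the more robust route.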

rdmueller answered Oct 23 '22 02:10