
How to read a text file with mixed encodings in Scala or Java?

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid; I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example,

    Source.fromFile(filename)("UTF-8").foreach(print)

throws:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:153)
        at java.io.BufferedReader.read(BufferedReader.java:174)
        at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
        at scala.io.Codec.wrap(Codec.scala:64)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
        at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
        at scala.io.Source.hasNext(Source.scala:238)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.io.Source.foreach(Source.scala:181)
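The failure does not need a file to reproduce: by default a `CharsetDecoder` reports malformed input, which is what scala.io.Source runs into above. A minimal Java sketch (the class name `StrictDecodeDemo` is mine, for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) {
        // 0x80 is a lone continuation byte, never valid as the
        // first byte of a UTF-8 sequence.
        byte[] bad = { 0x68, (byte) 0x80, 0x69 }; // "h", invalid byte, "i"
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT); // REPORT is the default
        try {
            decoder.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            // MalformedInputException extends CharacterCodingException
            System.out.println(e.getClass().getSimpleName()); // prints MalformedInputException
        }
    }
}
```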

I am perfectly happy to throw all the invalid characters away, or to replace them with some dummy character. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

SOLUTION:

    import java.nio.charset.CodingErrorAction
    import scala.io.Codec
    import scala.io.Source

    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    Source.fromFile(filename).foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to How to detect illegal UTF-8 byte sequences to replace them in java inputstream?, which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.

Daniel Mahler asked Nov 29 '12


1 Answer

This is how I managed to do it with Java:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    FileInputStream input;
    String result = null;
    try {
        input = new FileInputStream(new File("invalid.txt"));
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line);
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(result);
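The same pattern can be wrapped in a small helper using try-with-resources, so the streams are closed even on error. The `readLenient` name and the temp-file demo below are mine; note that this version reads character-by-character and so preserves newlines, whereas the line-by-line loop above drops them:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LenientRead {
    // Hypothetical helper: read a whole file as UTF-8, silently
    // dropping any invalid byte sequences.
    static String readLenient(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try (Reader reader = new InputStreamReader(Files.newInputStream(path), decoder);
             BufferedReader buffered = new BufferedReader(reader)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = buffered.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // "hellö" in UTF-8 with one invalid byte (0x80) mixed in.
        byte[] bytes = { 0x68, (byte) 0x80, 0x65, 0x6C, 0x6C,
                         (byte) 0xC3, (byte) 0xB6 };
        Path tmp = Files.createTempFile("invalid", ".txt");
        Files.write(tmp, bytes);
        System.out.println(readLenient(tmp)); // prints hellö
        Files.delete(tmp);
    }
}
```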

The invalid file is created with bytes:

    0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

That is hellö wörld in UTF-8, with 4 invalid bytes mixed in.

With .REPLACE you see the standard Unicode replacement character (U+FFFD) being used:

//"h�ellö� wö�rld�" 

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld" 
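Both behaviours can also be reproduced in memory, without a file, by decoding the byte array directly (the class and helper names here are mine, for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MixedBytesDemo {
    // Decode UTF-8 bytes, handling bad input according to `action`.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // The same bytes as above: "hellö wörld" plus 4 invalid bytes.
        byte[] bytes = {
            0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6,
            (byte) 0xFE, 0x20, 0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C,
            0x72, 0x6C, 0x64, (byte) 0x94
        };
        System.out.println(decode(bytes, CodingErrorAction.REPLACE)); // h�ellö� wö�rld�
        System.out.println(decode(bytes, CodingErrorAction.IGNORE));  // hellö wörld
    }
}
```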

Without specifying .onMalformedInput, you get:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
Esailija answered Sep 19 '22