
How to read a text file with mixed encodings in Scala or Java?

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid; I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example,

    Source.fromFile(filename)("UTF-8").foreach(print)

throws:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:153)
        at java.io.BufferedReader.read(BufferedReader.java:174)
        at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
        at scala.io.Codec.wrap(Codec.scala:64)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
        at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
        at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
        at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
        at scala.io.Source.hasNext(Source.scala:238)
        at scala.collection.Iterator$class.foreach(Iterator.scala:772)
        at scala.io.Source.foreach(Source.scala:181)
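The failure does not need a file to reproduce: by default a `CharsetDecoder` reports malformed input, which is what scala.io.Source runs into above. A minimal Java sketch (the class name `StrictDecodeDemo` is mine, for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    public static void main(String[] args) {
        // 0x80 is a lone continuation byte, never valid as the
        // first byte of a UTF-8 sequence.
        byte[] bad = { 0x68, (byte) 0x80, 0x69 }; // "h", invalid byte, "i"
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT); // REPORT is the default
        try {
            decoder.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            // MalformedInputException extends CharacterCodingException
            System.out.println(e.getClass().getSimpleName()); // prints MalformedInputException
        }
    }
}
```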

I am perfectly happy to throw all the invalid characters away, or to replace them with some dummy character. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

SOLUTION:

    import java.nio.charset.CodingErrorAction
    import scala.io.Codec
    import scala.io.Source

    implicit val codec = Codec("UTF-8")
    codec.onMalformedInput(CodingErrorAction.REPLACE)
    codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

    Source.fromFile(filename).foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to How to detect illegal UTF-8 byte sequences to replace them in java inputstream?, which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.

Daniel Mahler asked Nov 29 '12


1 Answer

This is how I managed to do it with Java:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    FileInputStream input;
    String result = null;
    try {
        input = new FileInputStream(new File("invalid.txt"));
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.IGNORE);
        InputStreamReader reader = new InputStreamReader(input, decoder);
        BufferedReader bufferedReader = new BufferedReader(reader);
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line);
            line = bufferedReader.readLine();
        }
        bufferedReader.close();
        result = sb.toString();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(result);
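The same pattern can be wrapped in a small helper using try-with-resources, so the streams are closed even on error. The `readLenient` name and the temp-file demo below are mine; note that this version reads character-by-character and so preserves newlines, whereas the line-by-line loop above drops them:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class LenientRead {
    // Hypothetical helper: read a whole file as UTF-8, silently
    // dropping any invalid byte sequences.
    static String readLenient(Path path) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try (Reader reader = new InputStreamReader(Files.newInputStream(path), decoder);
             BufferedReader buffered = new BufferedReader(reader)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = buffered.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // "hellö" in UTF-8 with one invalid byte (0x80) mixed in.
        byte[] bytes = { 0x68, (byte) 0x80, 0x65, 0x6C, 0x6C,
                         (byte) 0xC3, (byte) 0xB6 };
        Path tmp = Files.createTempFile("invalid", ".txt");
        Files.write(tmp, bytes);
        System.out.println(readLenient(tmp)); // prints hellö
        Files.delete(tmp);
    }
}
```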

The invalid file is created with bytes:

    0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

That is hellö wörld in UTF-8, with 4 invalid bytes mixed in.

With .REPLACE you see the standard Unicode replacement character (U+FFFD) being used:

//"h�ellö� wö�rld�" 

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld" 
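Both behaviours can also be reproduced in memory, without a file, by decoding the byte array directly (the class and helper names here are mine, for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MixedBytesDemo {
    // Decode UTF-8 bytes, handling bad input according to `action`.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // The same bytes as above: "hellö wörld" plus 4 invalid bytes.
        byte[] bytes = {
            0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6,
            (byte) 0xFE, 0x20, 0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C,
            0x72, 0x6C, 0x64, (byte) 0x94
        };
        System.out.println(decode(bytes, CodingErrorAction.REPLACE)); // h�ellö� wö�rld�
        System.out.println(decode(bytes, CodingErrorAction.IGNORE));  // hellö wörld
    }
}
```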

Without specifying .onMalformedInput, you get:

    java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(Unknown Source)
        at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
        at sun.nio.cs.StreamDecoder.read(Unknown Source)
        at java.io.InputStreamReader.read(Unknown Source)
        at java.io.BufferedReader.fill(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
        at java.io.BufferedReader.readLine(Unknown Source)
Esailija answered Sep 19 '22