Files.lines to skip broken lines in Java 8

I am reading a very large (500 MB) file with Files.lines(...). It reads part of the file, but at some point it fails with java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1

I think the file has lines in different charsets. Is there a way to skip these broken lines? I know that the returned stream is backed by a Reader, and I know how to skip with a Reader, but I don't know how to get at the Reader behind the stream to configure it as I like.

    List<String> lines = new ArrayList<>();
    try (Stream<String> stream = Files.lines(Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI()), Charset.forName("UTF-8"))) {
        stream
            .filter(s -> s.substring(0, 2).equalsIgnoreCase("aa"))
            .forEach(lines::add);
    } catch (final IOException e) {
        // catch
    }
Francesco asked Sep 26 '14 17:09



2 Answers

You can’t filter out lines with invalid characters after decoding, because the preconfigured decoder already aborts the decoding with an exception. You have to configure a CharsetDecoder manually, telling it either to ignore invalid input or to replace it with a special character.

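import java.io.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Silently drop byte sequences that are not valid UTF-8
CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                   .onMalformedInput(CodingErrorAction.IGNORE);
Path path = Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI());
List<String> lines;
// -1 requests the implementation's default minimum buffer capacity
try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
     BufferedReader br = new BufferedReader(r)) {
        lines = br.lines()
                  .filter(s -> s.regionMatches(true, 0, "aa", 0, 2))
                  .collect(Collectors.toList());
}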

This simply ignores charset decoding errors, dropping the offending bytes but keeping the rest of the line. To skip entire lines containing errors, you can let the decoder insert a replacement character (defaults to '\ufffd') for errors and filter out lines containing that character:

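import java.io.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Mark each invalid byte sequence with the decoder's replacement string
CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPLACE);
Path path = Paths.get(getClass().getClassLoader().getResource("bigtest.txt").toURI());
List<String> lines;
try (Reader r = Channels.newReader(FileChannel.open(path), dec, -1);
     BufferedReader br = new BufferedReader(r)) {
        lines = br.lines()
                  // drop every line in which the decoder had to insert a replacement
                  .filter(s -> !s.contains(dec.replacement()))
                  .filter(s -> s.regionMatches(true, 0, "aa", 0, 2))
                  .collect(Collectors.toList());
}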
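
To see the difference between the two actions without a 500 MB file, here is a minimal, self-contained sketch that runs both decoders over a few bytes held in memory. The class name DecoderActionsDemo, its helper methods and the sample data are made up for illustration; the second sample line contains the byte 0xFF, which never occurs in well-formed UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

class DecoderActionsDemo {

    // Hypothetical sample data: three lines, the second one containing 0xFF.
    static byte[] sample() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] first = "aa first\n".getBytes(StandardCharsets.UTF_8);
        byte[] broken = {'a', 'a', ' ', (byte) 0xFF, 'x', '\n'};
        byte[] third = "AA third\n".getBytes(StandardCharsets.UTF_8);
        out.write(first, 0, first.length);
        out.write(broken, 0, broken.length);
        out.write(third, 0, third.length);
        return out.toByteArray();
    }

    // Decodes the bytes as UTF-8 lines, applying the given action to malformed input.
    static List<String> decodeLines(byte[] data, CodingErrorAction action) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action);
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), dec))) {
            return br.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // REPLACE plus a filter on the replacement string drops whole lines.
    static List<String> skipBrokenLines(byte[] data) {
        String replacement = StandardCharsets.UTF_8.newDecoder().replacement();
        return decodeLines(data, CodingErrorAction.REPLACE).stream()
                .filter(s -> !s.contains(replacement))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        byte[] data = sample();
        // IGNORE keeps the broken line, minus the bad byte
        System.out.println(decodeLines(data, CodingErrorAction.IGNORE));
        // REPLACE keeps the broken line, with a marker where the bad byte was
        System.out.println(decodeLines(data, CodingErrorAction.REPLACE));
        // REPLACE plus filter removes the broken line entirely
        System.out.println(skipBrokenLines(data));
    }
}
```

One caveat with the filter: a line that legitimately contains U+FFFD in the file is indistinguishable from a line where the decoder inserted it, so it gets dropped as well.
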
Holger answered Nov 17 '22 15:11



In this situation, a solution built on the Streams API is going to be complex and more bug-prone. I suggest just using a plain loop that reads from a BufferedReader and catches the MalformedInputException. This also lets you tell it apart from other IO exceptions:

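import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;
import java.util.*;

List<String> lines = new ArrayList<>();

// Files.newBufferedReader reports malformed input as an IOException
try (BufferedReader r = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
     try {
          String line;
          while ((line = r.readLine()) != null) {
               // regionMatches is safe for lines shorter than two characters
               if (line.regionMatches(true, 0, "aa", 0, 2)) {
                    lines.add(line);
               }
          }
     } catch (MalformedInputException mie) {
           // ignore or do something; the loop has already ended at this point
     }
}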
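
Since the catch block sits outside the while loop, reading ends at the first malformed line. If the goal is to drop only the broken lines and carry on, one possible variant (my own sketch, not something the JDK offers as a ready-made method) is to split the raw bytes at '\n' and decode each line separately with a strict decoder. Splitting on the byte 0x0A is safe for UTF-8 because that byte never appears inside a multi-byte sequence. The class SkipBrokenLinesSketch and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SkipBrokenLinesSketch {

    // Reads the stream byte by byte, so pass a BufferedInputStream for real files.
    static List<String> readValidLines(InputStream in) throws IOException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<String> lines = new ArrayList<>();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            if (b == '\n') {
                decodeInto(lines, buf, dec);
            } else {
                buf.write(b);
            }
        }
        if (buf.size() > 0) {
            decodeInto(lines, buf, dec); // last line without a trailing newline
        }
        return lines;
    }

    // Decodes one line strictly; a line that fails to decode is skipped.
    private static void decodeInto(List<String> lines, ByteArrayOutputStream buf,
                                   CharsetDecoder dec) {
        try {
            // decode(ByteBuffer) resets the decoder before each use
            String line = dec.decode(ByteBuffer.wrap(buf.toByteArray())).toString();
            if (line.endsWith("\r")) {
                line = line.substring(0, line.length() - 1); // tolerate CRLF endings
            }
            lines.add(line);
        } catch (CharacterCodingException e) {
            // malformed line: ignore it and continue with the next one
        } finally {
            buf.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: the middle line contains the invalid byte 0xFF
        byte[] data = {'a', 'a', ' ', 'o', 'k', '\n',
                       'a', 'a', (byte) 0xFF, '\n',
                       'A', 'A', ' ', 'o', 'k'};
        System.out.println(readValidLines(new ByteArrayInputStream(data)));
    }
}
```

The "aa" prefix check from the loop above can then be applied to the returned list as before.
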
The Coordinator answered Nov 17 '22 15:11
