
java.util.Scanner to read files with different character encoding

I use Java to read a list of files. Some of them have a different encoding: ANSI instead of UTF-8. java.util.Scanner is unable to read these files, and I get an empty output string. I tried another approach:

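                FileInputStream fis = new FileInputStream(my_file);
                InputStreamReader isr = new InputStreamReader(fis);
                BufferedReader br = new BufferedReader(isr);
                // Note: getEncoding() only reports the charset this reader decodes with
                // (the platform default here), not the encoding the file was written in.
                String encoding = isr.getEncoding();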

I am not sure how to change the character encoding for the ANSI files. UTF-8 and ANSI files are mixed in the same folder. I tried to use Apache Tika for this. After I get the encoding of a file, I pass it to Scanner, but I still get empty output.

Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();
asked Oct 16 '22 11:10 by plaidshirt


1 Answer

There is a library called juniversalchardet that can help you guess the right encoding. It was updated recently and is currently hosted on GitHub:

https://github.com/albfernandez/juniversalchardet

However, no tool can detect encodings reliably in every case, because several things are unknown:

  1. Is this file text at all, or binary data such as a PNG?
  2. Is it stored in a k-bit encoding, and for which k (for example 7, 8, or 16)?
  3. Which of the k-bit encodings was used?

Some guesswork can be done by counting the number of control characters that rarely appear in text. When the decoded file contains many of them, you have probably chosen the wrong encoding, so try the next candidate.
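The counting idea can be sketched with the JDK alone. This is only an illustration of the heuristic, not how juniversalchardet works: the class and method names are made up for this example, and treating the replacement character U+FFFD as suspicious is an extra assumption of mine (it is what `new String(bytes, charset)` emits for undecodable bytes).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, named for this example only.
public class ControlCharHeuristic {

    /** Decodes the bytes with the given charset and counts suspicious characters. */
    static int suspiciousCount(byte[] data, Charset charset) {
        String text = new String(data, charset); // malformed bytes become U+FFFD
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean commonWhitespace = c == '\n' || c == '\r' || c == '\t';
            // Control characters other than line breaks and tabs are rare in text.
            // Assumption: U+FFFD is counted too, since it marks undecodable input.
            if ((Character.isISOControl(c) && !commonWhitespace) || c == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    /** Returns the candidate with the fewest suspicious characters; ties go to the earlier one. */
    static Charset guess(byte[] data, List<Charset> candidates) {
        Charset best = candidates.get(0);
        int bestScore = Integer.MAX_VALUE;
        for (Charset cs : candidates) {
            int score = suspiciousCount(data, cs);
            if (score < bestScore) {
                best = cs;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = "caf\u00e9 cr\u00e8me".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "\u20ac 5".getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(latin1, candidates));
        System.out.println(guess(utf8, candidates));
    }
}
```

Because ties go to the earlier candidate, list the encodings in order of preference. The heuristic is weak on its own: many UTF-8 texts decode as ISO-8859-1 without producing a single control character, which is why a dedicated detector is the better choice.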

juniversalchardet combines several more reliable techniques to determine encodings (including Chinese ones). It also provides a convenient way to open a reader on a file with the detected encoding already selected:

(Snippet taken from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding and adapted)

import org.mozilla.universalchardet.ReaderFactory;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main (String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }

        File file = new File(args[0]);
        // ReaderFactory detects the encoding and returns a BufferedReader;
        // a plain java.io.Reader would not offer readLine().
        try (BufferedReader reader = ReaderFactory.createBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line to the console
            }
        }
    }

}

Edit: Added ScannerFactory

/*
(C) Copyright 2016-2017 Alberto Fernández <[email protected]>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;

/**
 * Create a scanner from a file with correct encoding
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }
    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset defaultCharset to use if can't be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must not be null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
    }
    /**
     * Create a scanner from a file with correct encoding. If charset cannot be determined,
     * it uses the system default charset.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}
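If a Scanner still produces no output, it may help to ask it why it stopped. As far as I can tell, the `Scanner(File, String)` constructor decodes strictly, so bytes that are invalid in the chosen charset end the input silently instead of being replaced; the swallowed exception is then available through `ioException()`. The following JDK-only sketch illustrates this under that assumption. The class name and the `firstLine` helper are invented for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Scanner;

// Hypothetical diagnostic helper, named for this example only.
public class ScannerEncodingCheck {

    /** Returns the first line, or null if the Scanner could not produce one. */
    static String firstLine(File file, String charsetName) throws IOException {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            if (scanner.hasNextLine()) {
                return scanner.nextLine();
            }
            // Scanner swallows I/O errors; ioException() exposes the last one.
            IOException cause = scanner.ioException();
            if (cause != null) {
                System.err.println("Scanner stopped early: " + cause);
            }
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small ISO-8859-1 sample file to read back.
        File file = File.createTempFile("latin1-sample", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(firstLine(file, "UTF-8"));      // wrong charset
        System.out.println(firstLine(file, "ISO-8859-1")); // matching charset
    }
}
```

An empty result together with a non-null `ioException()` suggests that the charset passed to the Scanner does not match the file, so the detection step is the part to look at.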
answered Oct 21 '22 01:10 by Friwi