 

Java can't see file on file system that contains illegal characters

I am experimenting with an edge case we're seeing in production. We have a business model where clients generate text files and then FTP them to our servers. We ingest those files and process them on our Java backend (running on CentOS machines). Most (95%+) of our clients know to generate these files in UTF-8, which is what we want. However, we have a few stubborn clients (but large accounts) that generate these files on Windows machines with the CP1252 character set. No problem, though: we've configured our 3rd-party libs (which do most of the "processing" work for us) to handle input in any character set through some magical voodoo.

Occasionally, we see a file come over that has illegal UTF-8 characters (CP1252) in its name. When our software tries to read these files in from the FTP server the normal method of file reading chokes and throws a FileNotFoundException:

File f = getFileFromFTPServer();
BufferedReader reader = new BufferedReader(new FileReader(f));

String line = reader.readLine();
// ...etc.

The exceptions look something like this:

java.io.FileNotFoundException: /path/to/file/some-text-blah?blah.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:120)
    at java.io.FileReader.<init>(FileReader.java:55)
    at com.myorg.backend.app.InputFileProcessor.run(InputFileProcessor.java:60)
    at java.lang.Thread.run(Thread.java:662)

So what I think is happening is that because the file name itself contains illegal chars, we never even get to read it in the first place. If we could, then regardless of the file's contents, our software should be able to handle processing it correctly. So this is really an issue with reading file names with illegal UTF-8 chars in them.

As a test case, I created a very simple Java "app" to deploy on one of our servers and test some things out (source code is provided below). I then logged into a Windows machine and created a test file and named it test£.txt. Notice the character after "test" in the file name. This is Alt-0163. I FTPed this to our server, and when I ran ls -ltr on its parent directory, I was surprised to see it listed as test?.txt.

Before I go any further, here is the Java "app" I wrote for testing/reproducing this issue:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Driver {
    public static void main(String[] args) {
        Driver d = new Driver();
        d.run(args[0]);     // I know this is bad, but it's fine for our purposes here
    }

    private void run(String fileName) {
        InputStreamReader isr = null;
        BufferedReader buffReader = null;
        FileInputStream fis = null;
        String firstLineOfFile = "default";

        System.out.println("Processing " + fileName);

        try {
            System.out.println("Attempting UTF-8...");

            fis = new FileInputStream(fileName);
            isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
            buffReader = new BufferedReader(isr);

            firstLineOfFile = buffReader.readLine();

            System.out.println("UTF-8 worked and first line of file is : " + firstLineOfFile);
        }
        catch (IOException io1) {
            // UTF-8 failed; try CP1252.
            try {
                System.out.println("UTF-8 failed. Attempting Windows-1252... (" + io1.getMessage() + ")");

                fis = new FileInputStream(fileName);
                // I've also tried the variations "WINDOWS-1252", "Windows-1252", "CP1252", "Cp1252", "cp1252"
                isr = new InputStreamReader(fis, Charset.forName("windows-1252"));
                buffReader = new BufferedReader(isr);

                firstLineOfFile = buffReader.readLine();

                System.out.println("Windows-1252 worked and first line of file is : " + firstLineOfFile);
            }
            catch (IOException io2) {
                // Both UTF-8 and CP1252 failed...
                System.out.println("Both UTF-8 and Windows-1252 failed. Could not read file. (" + io2.getMessage() + ")");
            }
        }
    }
}

When I run this from the terminal (java -cp . com/Driver t*), I get the following output:

Processing test�.txt
Attempting UTF-8...
UTF-8 failed. Attempting Windows-1252...(test�.txt (No such file or directory))
Both UTF-8 and Windows-1252 failed. Could not read file.(test�.txt (No such file or directory))

test�.txt?!?! I did some research and found that the "�" is the Unicode replacement character \uFFFD. So I guess what's happening is that the CentOS FTP server doesn't know how to handle Alt-0163 (£) and so it replaces it with \uFFFD (�). But I don't understand why ls -ltr displays a file called test?.txt...

In any event, it appears that the solution is to add some logic that searches for the existence of this character in the file name, and if found, renames the file to something else (like perhaps do a String-wise replaceAll("\uFFFD", "_") or something like that) that the system can read and process.

The problem is that Java doesn't even see this file on the file system. CentOS knows the file is there (test?.txt), but when that file gets passed into Java, Java interprets it as test�.txt and for some reason No such file or directory...

How can I get Java to see this file so that I can perform a File::renameTo(String) on it? Sorry for the backstory here but I feel it is relevant since every detail counts in this scenario. Thanks in advance!
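One way to sketch the rename idea: instead of constructing the file name by hand, list the parent directory and rename any entry whose name contains U+FFFD. This is an assumption, not a verified fix — `Renamer` and `sanitize` are hypothetical names, and depending on the JVM's locale settings `renameTo` may still fail, because the replacement character cannot always be encoded back into the byte that is actually on disk (running the JVM under a UTF-8 locale such as `LANG=en_US.UTF-8` changes the behavior):

```java
import java.io.File;

public class Renamer {
    // Hypothetical helper: scan a directory and rename any file whose
    // name contains the Unicode replacement character U+FFFD.
    public static void sanitize(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return;            // not a directory, or I/O error
        for (File f : entries) {
            if (f.getName().indexOf('\uFFFD') >= 0) {
                File clean = new File(dir, f.getName().replace('\uFFFD', '_'));
                // May still fail if the JVM cannot encode the original
                // name back into the exact bytes stored on disk.
                boolean ok = f.renameTo(clean);
                System.out.println(f + " -> " + clean + " : " + ok);
            }
        }
    }

    public static void main(String[] args) {
        sanitize(new File(args[0]));
    }
}
```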

asked Aug 24 '12 by IAmYourFaja

2 Answers

Welcome to the wonderful world of text encodings. You have several levels of problems and you need to sort each of them out individually.

First, what is the file name on disk? Does it contain valid UTF-8 escape sequences or is it something else?

The problem here is that you need the exact byte sequence of the name on disk or the file system simply won't be able to find the file. On top of that, the JVM converts the bytes it can't decode in the file name to the Unicode replacement character \uFFFD, so no matter what you try, you won't be able to open the file (since there is no file with \uFFFD in its name on the disk).

How can that be? The mapping isn't two-way. When the JVM reads the file name from disk, it replaces the byte it can't decode with \uFFFD and hands you test\uFFFD.txt. When you then ask the OS to open test\uFFFD.txt, it can't find the file because no file with that byte sequence exists — on disk the name still contains the original byte. From the Java side, there is no way to recover the real name from the replacement character.

Solutions? You can open a shell on the server and rename the file with a pattern: mv test*.txt test.txt. Since the pattern matches only a single file, that will work — the shell passes the raw bytes through without trying to decode them.
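The non-reversible decode step described above can be demonstrated directly (a small sketch; the byte values assume the file was named test£.txt in CP1252, where £ is 0xA3 — an invalid byte in UTF-8):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        // "test£.txt" as CP1252 bytes: £ is 0xA3, which is not a valid
        // byte sequence in UTF-8.
        byte[] onDisk = {'t', 'e', 's', 't', (byte) 0xA3, '.', 't', 'x', 't'};

        // Decoding with UTF-8 silently substitutes U+FFFD for the bad byte.
        String decoded = new String(onDisk, StandardCharsets.UTF_8);
        System.out.println(decoded);              // test\uFFFD.txt

        // Re-encoding does NOT give back 0xA3: U+FFFD encodes as EF BF BD,
        // so the byte sequence no longer matches the name on disk.
        byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(onDisk, reEncoded)); // false
    }
}
```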

Next step: FTP. FTP is a protocol for humans - it's not suitable for automatic data exchange. Get rid of FTP. I don't know how much that will cost you but it's always worth it. Use SFTP, scp or FTAPI.

One source of the problems could be that FTP transfers file names as ASCII. Umlauts and other non-ASCII characters aren't really allowed in the FTP protocol ... or rather, FTP doesn't expect any. If you're lucky, your FTP client will refuse to transfer the file, but most just push the bytes through. When non-ASCII characters do appear, FTP will do ... something. Whatever that might be. Typical effects are file names whose Unicode characters get encoded twice as UTF-8, or replaced with ? (\u003f).

Or the Java FTP client could use new String(bytes) to create a String from the FTP file name, which would mangle the poor bytes with your system's default encoding - not pretty.

Solutions:

  1. Use an FTP server which rejects files with illegal characters in their names, or which replaces these characters with something that doesn't confuse the file system / OS.
  2. Use a file system which properly handles files with strange names. That usually means getting rid of Windows on the server.
  3. Make sure users can only upload into a single directory, and that this directory can only contain a single file. That way, you can use a small shell script and a wildcard pattern to rename it to something that you can read.
answered Sep 27 '22 by Aaron Digulla

It's a bug in the old-school java.io File API, maybe just on a Mac? Anyway, the newer java.nio API works much better. I have several files containing Unicode characters that failed to load using the java.io classes. After converting all my code to use java.nio.Path, EVERYTHING started working. And I replaced Apache Commons IO's FileUtils (which has the same problem) with java.nio.Files.

Be sure to read and write the content of file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
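A minimal sketch of that approach (the directory path and the *.txt glob are placeholders; note that Files.readAllLines throws MalformedInputException if the content isn't valid in the given charset, so a CP1252 fallback like the one in the question would still be needed for content):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class NioReader {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);
        // DirectoryStream hands back Path objects built from the names
        // as stored on disk, so you never have to type the problematic
        // name yourself.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : stream) {
                List<String> lines = Files.readAllLines(p, StandardCharsets.UTF_8);
                System.out.println(p.getFileName() + ": " + lines.get(0));
            }
        }
    }
}
```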

answered Sep 27 '22 by pomo