
How to convert strings in any language and character set to valid filenames in Java?

I need to generate file names from user inputted names. These names could be in any language. For example:

  • "John Smith"
  • "高岡和子"
  • "محمد سعيد بن عبد العزيز الفلسطيني"

These are user-entered values, so I have no guarantee that the names don't contain characters that are invalid in file names.

Users will be downloading these files from their browser, so I need to ensure the file names are valid on all operating systems in all configurations.

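I am currently doing this for English-speaking countries by replacing whitespace with underscores and then removing all remaining non-alphanumeric characters with a simple regex:

string = string.replaceAll("\\s+", "_");          // collapse whitespace first, so the separators survive
string = string.replaceAll("[^a-zA-Z0-9_]", "");  // then drop everything except ASCII letters, digits and "_"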

Some example conversions:

  • "John Smith" -> "John_Smith.ext"
  • "John O'Henry" -> "John_OHenry.ext"
  • "John van Smith III" -> "John_van_Smith_III.ext"

Obviously this does not work internationally.

I've considered finding/generating a blacklist of all characters that are invalid on all file systems and stripping those from the names. I've been unable to find a comprehensive list.

I'd prefer to use existing code in a common library if possible. I imagine this is an already solved problem, however I can't find a solution that works internationally.

The filename is for the user downloading the file, not for me. I'm not going to be storing these files. These files are dynamically generated by the server upon request from data in a database. The filenames are for the convenience of the person downloading the file.

asked Apr 14 '12 03:04 by leros




1 Answer

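The regex [^a-zA-Z0-9] matches every character other than the ASCII letters and digits, so it strips all non-ASCII characters, that is, every code point above 127. Names written in non-Latin scripts are therefore removed entirely.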
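To make the effect concrete, here is a minimal sketch that compares the ASCII-only whitelist with a Unicode-aware one. The class and method names (WhitelistDemo, asciiOnly, unicodeAware) are hypothetical, and the choice of the \p{L}, \p{M} and \p{N} categories (letters, combining marks, numbers) is an assumption about what should count as a "name" character, not something taken from the question.

```java
public class WhitelistDemo {

    // ASCII-only whitelist: whitespace becomes "_", everything else outside [a-zA-Z0-9_] is dropped.
    static String asciiOnly(String name) {
        return name.trim().replaceAll("\\s+", "_").replaceAll("[^a-zA-Z0-9_]", "");
    }

    // Unicode-aware whitelist (assumed categories): keeps letters, combining marks and numbers of any script.
    static String unicodeAware(String name) {
        return name.trim().replaceAll("\\s+", "_").replaceAll("[^\\p{L}\\p{M}\\p{N}_]", "");
    }

    public static void main(String[] args) {
        String[] names = { "John O'Henry", "高岡和子" };
        for (String n : names) {
            System.out.println(n + " -> ascii: \"" + asciiOnly(n) + "\", unicode: \"" + unicodeAware(n) + "\"");
        }
    }
}
```

With the ASCII-only version the Japanese name collapses to an empty string, while the Unicode-aware version leaves it intact.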

Assuming that you want to filter user input for valid file-names by replacing invalid file-name characters such as ? \ / : | < > * with underscore (_):

public class ReplaceI18N {

    public static void main(String[] args) {
        String[] names = {
                "John Smith",
                "高岡和子",
                "محمد سعيد بن عبد العزيز الفلسطيني",
                "|J:o<h>n?Sm\\it/h*",
                "高?岡和\\子*",
                "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\"
                };

        for (String s : names) {
            // Java strings are already Unicode (UTF-16), so no re-encoding step is needed;
            // round-tripping through getBytes() can corrupt text when the platform charset is not UTF-8.
            // Replace each invalid character ? \ / : | < > * with a space...
            String u = s.replaceAll("[?\\\\/:|<>*]", " ");
            // ...then collapse every run of whitespace into a single underscore.
            u = u.replaceAll("\\s+", "_");
            System.out.println(s + " = " + u);
        }
    }
}

The output:

John Smith = John_Smith
高岡和子 = 高岡和子
محمد سعيد بن عبد العزيز الفلسطيني = محمد_سعيد_بن_عبد_العزيز_الفلسطيني
|J:o<h>n?Sm\it/h* = _J_o_h_n_Sm_it_h_
高?岡和\子* = 高_岡和_子_
محمد /سعيد بن عبد ?العزيز :الفلسطيني\ = محمد_سعيد_بن_عبد_العزيز_الفلسطيني_

The resulting filenames, including those with non-ASCII characters, display correctly on any web page served as UTF-8, provided a font covering those characters is available.

In addition, each is a valid filename on any OS file system that supports Unicode (tested on Windows XP and Windows 7).
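The character list used above is not exhaustive. As a hedged sketch, assuming Windows naming rules are the strictest target, the following variant also rejects the double quote and ASCII control characters, strips trailing dots (Windows does not allow names ending in a dot or space), prefixes reserved device names such as CON or NUL, and falls back to a default when nothing usable remains. The class name SafeFileName, the method sanitize and the fallback parameter are hypothetical; length limits are not handled.

```java
import java.util.regex.Pattern;

public class SafeFileName {

    // Characters Windows forbids in file names, plus ASCII control characters.
    private static final Pattern INVALID = Pattern.compile("[\\\\/:*?\"<>|\\p{Cntrl}]");

    // Reserved DOS device names, with or without an extension, in any letter case.
    private static final Pattern RESERVED =
            Pattern.compile("(?i)(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\\..*)?");

    static String sanitize(String name, String fallback) {
        // Replace forbidden characters with spaces, then collapse whitespace runs into "_".
        String s = INVALID.matcher(name).replaceAll(" ");
        s = s.trim().replaceAll("\\s+", "_");
        // Drop trailing dots and underscores so the name never ends in a dot.
        s = s.replaceAll("[._]+$", "");
        if (s.isEmpty()) {
            return fallback;          // nothing usable was left
        }
        if (RESERVED.matcher(s).matches()) {
            s = "_" + s;              // avoid colliding with a device name
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("|J:o<h>n?Sm\\it/h*", "file"));
        System.out.println(sanitize("高?岡和\\子*", "file"));
        System.out.println(sanitize("con", "file"));
        System.out.println(sanitize("???", "file"));
    }
}
```

Unlike the original example, this version trims leading and trailing separators, so the outputs do not start or end with an underscore.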


However, if you want to pass a filename as part of a URL, encode it with URLEncoder (specifying UTF-8) and decode it on the receiving side with URLDecoder.
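A minimal sketch of that round trip, assuming Java 10 or later for the overloads that accept a Charset (on older versions the overloads taking the charset name "UTF-8" can be used instead). The class name UrlNameDemo and its helper methods are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlNameDemo {

    // Percent-encodes the name as UTF-8 bytes (form encoding: a space would become "+").
    static String encode(String fileName) {
        return URLEncoder.encode(fileName, StandardCharsets.UTF_8);
    }

    // Reverses encode(), interpreting the percent-escapes as UTF-8.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "高岡和子";
        String encoded = encode(name);
        System.out.println(encoded);                       // ASCII-only, percent-encoded form
        System.out.println(decode(encoded).equals(name));  // the round trip restores the original
    }
}
```

Names that have already been reduced to ASCII letters, digits and underscores pass through URLEncoder unchanged.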

answered Nov 04 '22 20:11 by ecle