
Sanitizing strings with filenames and extension in Java

Tags: java, string, regex

I have these four types of file names:

  1. Filename with double extension
  2. Filename with no extension
  3. Filename with dot at the end, and no extension
  4. Filename with a proper name.

Like this:

String doubleexsension = "doubleexsension.pdf.pdf";
String noextension = "noextension";
String nameWithDot = "nameWithDot.";
String properName = "properName.pdf";

String extension = "pdf";

My aim is to sanitize all of these types and output only a proper filename.filetype. I wrote a small script to illustrate the problem:

ArrayList<String> app = new ArrayList<String>();
app.add(doubleexsension);
app.add(properName);
app.add(noextension);
app.add(nameWithDot);

System.out.println("------------");

for(String i : app) {

    // Ends with .
    if (i.endsWith(".")) {
        String m = i + extension;
        System.out.println(m);
        continue; // move on to the next name instead of ending the loop
    }

    // Double extension
    String p = i.replaceAll("(\\.\\w+)\\1+$", "$1");
    System.out.println(p);
}

This outputs:

------------
doubleexsension.pdf
properName.pdf
noextension
nameWithDot.pdf

I don't know how to handle the noextension case. How can I do it? When there's no extension, it should take the extension value and append it to the end of the string.

My desired output would be:

------------
doubleexsension.pdf
properName.pdf
noextension.pdf
nameWithDot.pdf

Thanks in advance.

Avión asked Nov 10 '16 13:11


3 Answers

You may add alternatives to the regex to match all kinds of scenarios:

(?:(\.\w+)\1*|\.|([^.]))$

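Replace the match with $2.pdf. Group 2 is empty unless the last alternative matched, so the final character is put back only in the no-extension case. See the regex demo.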

EDIT: In case the extensions that can be duplicated are known, you may use the whitelisting approach via an alternation group:

(?:(\.(?:pdf|gif|jpe?g))\1*|\.|([^.]))$

See another regex demo.

Details:

  • (?: - start of a non-capturing group; the $ end-of-string anchor after it applies to all the alternatives below (each of them must match at the end of the string)
    • (\.\w+)\1* - an extension (a dot followed by one or more word chars, captured into Group 1), optionally followed by any number of repetitions of itself. With the whitelisting approach, only the listed extensions are considered: (?:pdf|gif|jpe?g) matches only pdf, gif, jpg and jpeg, plus any alternatives you add
    • | - or
    • \. - a dot
    • | - or
    • ([^.]) - any char that is not a dot captured into Group 2
  • ) - end of the outer grouping
  • $ - end of string.
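
The whitelisting variant described above could be sketched in Java roughly as follows. The class and method names (WhitelistExtensionFixer, fix) are hypothetical and only serve the illustration; the pattern and the $2.pdf replacement are the ones given above.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative, not from the answer.
class WhitelistExtensionFixer {

    // Only pdf, gif, jpg and jpeg are treated as (possibly repeated) extensions.
    private static final Pattern TAIL =
            Pattern.compile("(?:(\\.(?:pdf|gif|jpe?g))\\1*|\\.|([^.]))$");

    static String fix(String name) {
        // Group 2 is empty unless the name ended in a non-dot character
        // that was not part of a whitelisted extension.
        return TAIL.matcher(name).replaceAll("$2.pdf");
    }

    public static void main(String[] args) {
        System.out.println(fix("doubleexsension.pdf.pdf"));
        System.out.println(fix("notes.txt"));
    }
}
```

Because the replacement is fixed, a whitelisted extension other than pdf is also rewritten to .pdf, while an extension outside the whitelist (such as .txt) is treated as part of the name and gets .pdf appended.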

See Java demo:

List<String> strs = Arrays.asList("doubleexsension.pdf.pdf","noextension","nameWithDot.","properName.pdf");
for (String str : strs)
    System.out.println(str.replaceAll("(?:(\\.\\w+)\\1*|\\.|([^.]))$", "$2.pdf"));
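
If the extension should not be hard-coded, the same replacement can be wrapped in a small helper. This is only a sketch: RegexExtensionFixer and fix are made-up names, and Matcher.quoteReplacement is used so that a $ or backslash in the extension is not read as a group reference.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class RegexExtensionFixer {

    // Same pattern as in the demo: a (possibly repeated) extension, a lone
    // trailing dot, or the last non-dot character (Group 2), all at the end.
    private static final Pattern TAIL =
            Pattern.compile("(?:(\\.\\w+)\\1*|\\.|([^.]))$");

    static String fix(String name, String extension) {
        // "$2" stays unescaped so it still refers to Group 2;
        // the extension itself is escaped for use in a replacement string.
        String replacement = "$2" + Matcher.quoteReplacement("." + extension);
        return TAIL.matcher(name).replaceAll(replacement);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "doubleexsension.pdf.pdf", "noextension", "nameWithDot.", "properName.pdf");
        for (String name : names) {
            System.out.println(fix(name, "pdf"));
        }
    }
}
```

Note that \w+ matches any extension, so with this pattern a name such as notes.txt becomes notes.pdf rather than notes.txt.pdf.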
Wiktor Stribiżew answered Oct 13 '22 21:10


I would avoid the complexity (and reduced readability) of regular expressions:

String m = i;

if (m.endsWith(".")) {
    m = m + extension;
}
if (m.endsWith("." + extension + "." + extension)) {
    m = m.substring(0, m.length() - extension.length() - 1);
}
if (!m.endsWith("." + extension)) {
    m = m + "." + extension;
}
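
For reference, here is the same sequence of checks as a self-contained method that can be run against the four sample names. The class name SuffixExtensionFixer and the method name fix are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the three checks above.
class SuffixExtensionFixer {

    static String fix(String name, String extension) {
        String suffix = "." + extension;
        String m = name;

        // "name." -> "name.pdf"
        if (m.endsWith(".")) {
            m = m + extension;
        }
        // "name.pdf.pdf" -> "name.pdf" (removes a single duplicate)
        if (m.endsWith(suffix + suffix)) {
            m = m.substring(0, m.length() - suffix.length());
        }
        // "name" -> "name.pdf"
        if (!m.endsWith(suffix)) {
            m = m + suffix;
        }
        return m;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "doubleexsension.pdf.pdf", "properName.pdf", "noextension", "nameWithDot.");
        for (String name : names) {
            System.out.println(fix(name, "pdf"));
        }
    }
}
```

The second check strips only one repetition, so a name ending in three copies of the extension would still end in two; turning that if into a while loop would be one way to handle such input.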
VGR answered Oct 13 '22 19:10


Easy: if the name contains no dot at all, append the extension:

if (-1 == i.indexOf('.'))
    System.out.println(i + "." + extension);
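
This check only covers the no-extension case, so it has to be combined with the other branches from the question. A minimal sketch of how that might look, assuming a hypothetical class LoopWithDotCheck and method sanitizeAll that collect the results instead of printing them:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical combination of the dot check with the question's two branches.
class LoopWithDotCheck {

    static List<String> sanitizeAll(List<String> names, String extension) {
        List<String> result = new ArrayList<>();
        for (String i : names) {
            if (i.indexOf('.') == -1) {
                // No dot anywhere: append "." and the extension
                result.add(i + "." + extension);
            } else if (i.endsWith(".")) {
                // Trailing dot: append only the extension
                result.add(i + extension);
            } else {
                // Collapse a repeated extension such as ".pdf.pdf" into one
                result.add(i.replaceAll("(\\.\\w+)\\1+$", "$1"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "doubleexsension.pdf.pdf", "properName.pdf", "noextension", "nameWithDot.");
        for (String name : sanitizeAll(names, "pdf")) {
            System.out.println(name);
        }
    }
}
```

Since the first branch looks for any dot, a name that contains a dot but has a different extension is left unchanged by this sketch.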
borowis answered Oct 13 '22 20:10