ICU: Transliterate and then remove all non-alphanumeric characters

Can it be done with ICU without falling back to regex?

Currently I normalize filenames like this:

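protected function normalizeFilename($filename)
{
    // The string below is a compound transliterator ID, so create() is used
    // rather than createFromRules(), which expects ICU rule syntax.
    $transliterator = Transliterator::create(
        'Any-Latin; Latin-ASCII; [:Punctuation:] Remove;'
    );
    $filename = $transliterator->transliterate($filename);
    // Drop whatever is still outside the allowed character set.
    $filename = preg_replace('/[^A-Za-z0-9_]/', '', $filename);
    return $filename;
}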

Can I get rid of the regular expression here and do everything with ICU calls?

Vladislav Rastrusny asked Sep 18 '14 09:09


1 Answer

Use the correct tool for the job

I don't see anything wrong with what you're doing now.

ICU transliteration is first and foremost language oriented. It tries to preserve meaning.

Regular expressions, on the other hand, can manipulate characters in detail, giving you the assurance that the file name is restricted to the selected characters.

The combination is perfect, in this case.

I have, of course, looked for a solution to your question. But to be honest, I couldn't find anything that would work on all possible inputs.

For instance, not every character we would consider a punctuation mark is removed by [:Punctuation:] Remove;. Try the Russian name: Корнильев, Кирилл. After applying your ID it becomes: Kornilʹev Kirill. The comma is gone, but the ʹ that stands in for the soft sign ь is still there. ICU clearly does not treat it as a punctuation mark, yet you don't want it in your file name.

So I would advise using the correct tool for the job:

  1. Use ICU to get the best ASCII equivalent. Using only Latin-ASCII; as the ID will do. Nice and simple.
  2. Then use a regular expression, just like you did, to make sure you're left with only the characters you need.
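
The two steps above could be sketched roughly like this. It is only an illustration and assumes PHP's intl extension is available; the free-standing function and its $id parameter are hypothetical and not taken from the question. The default ID is Latin-ASCII as suggested in step 1, and the second call passes the Any-Latin; Latin-ASCII ID from the question so that non-Latin input is converted as well.

```php
<?php
// Sketch only: requires the intl extension. The free-standing function and
// the $id parameter are illustrative assumptions, not the asker's API.
function normalizeFilename(string $filename, string $id = 'Latin-ASCII'): string
{
    // Step 1: let ICU choose the closest ASCII equivalents.
    $transliterator = Transliterator::create($id);
    if ($transliterator === null) {
        throw new InvalidArgumentException("Unknown transliterator ID: $id");
    }
    $ascii = $transliterator->transliterate($filename);
    if ($ascii === false) {
        throw new RuntimeException('Transliteration failed');
    }

    // Step 2: the regular expression (taken from the question) guarantees
    // that only the allowed characters are left.
    return preg_replace('/[^A-Za-z0-9_]/', '', $ascii);
}

echo normalizeFilename('Crème Brûlée'), "\n";                                  // CremeBrulee
echo normalizeFilename('Корнильев, Кирилл', 'Any-Latin; Latin-ASCII'), "\n";   // KornilevKirill
```

Whatever ICU leaves behind, including the ʹ from the example above, is removed by the regular expression, so the result does not depend on how ICU classifies a given character.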

There is really nothing wrong with this.

PS: Personally I think the person, or persons, who wrote the ICU user guide should not be complimented on a job well done. What a mess.

KIKO Software answered Oct 11 '22 08:10