
Is there a regex to test if a string is for a locale? [closed]

Tags:

c#

regex

I don't know anything about regular expressions, but I think I have to use one for my problem. I have some filenames that look like:

MyResource
MyResource.en-GB
MyResource.en-US
MyResource.fr-FR
MyResource.de-DE

The idea is to test if my strings end with "[letter][letter]-[letter][letter]"

I know this is a very noob question, but I just have no idea how to do it, even though I know exactly what I want to do... :(

Guillaume Slashy asked Jan 06 '12 13:01



2 Answers

To cater for basic variants:

^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$

which consists of:

  1. Language code: ISO 639, 2 or 3 letters (or 4, reserved for future use).
  2. Optional script code: ISO 15924, 4 letters.
  3. Optional country or region code: 2 letters (ISO 3166-1) or 3 digits (UN M.49 area code).
  4. Parts are separated by underscores or dashes.

Valid examples are:

  • de
  • en-US
  • zh-Hant-TW
  • En-au
  • aZ_cYrl-aZ
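
Since the question is tagged c#, here is a minimal sketch of trying that pattern against the examples above with .NET's Regex class. The class and method names (LocaleTag, LooksLikeLocale) are made up for illustration and are not part of any library.

```csharp
using System.Text.RegularExpressions;

public static class LocaleTag
{
    // Permissive pattern from this answer: language, optional script,
    // optional region, separated by '_' or '-'.
    private static readonly Regex Permissive = new Regex(
        @"^[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?$");

    // True when the string has the shape of a basic locale tag.
    // This checks the format only, not whether the locale actually exists.
    public static bool LooksLikeLocale(string tag)
    {
        return Permissive.IsMatch(tag);
    }
}
```

Calling LocaleTag.LooksLikeLocale("zh-Hant-TW") should give true, while something like "english" or "en-USA" should give false.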

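For the OP's specific question, the leading ^ would become ^MyResource[.] (the trailing $ stays), so that the whole file name must be a valid resource file name ending in a locale. In languages that delimit patterns with slashes, such as PHP, the result is wrapped as /^MyResource[.] ... $/.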
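
For the file-name case, something along these lines might work in C#. This is only a sketch: ResourceFileName and IsLocalized are hypothetical names, and the base name defaults to the OP's "MyResource".

```csharp
using System.Text.RegularExpressions;

public static class ResourceFileName
{
    // The permissive locale pattern from this answer, without its ^ and $ anchors.
    private const string Locale =
        @"[A-Za-z]{2,4}([_-][A-Za-z]{4})?([_-]([A-Za-z]{2}|[0-9]{3}))?";

    // True when fileName is exactly baseName, a dot, and a locale-shaped suffix.
    // baseName is escaped so any regex metacharacters in it are taken literally.
    public static bool IsLocalized(string fileName, string baseName = "MyResource")
    {
        string pattern = "^" + Regex.Escape(baseName) + "[.]" + Locale + "$";
        return Regex.IsMatch(fileName, pattern);
    }
}
```

With this, "MyResource.en-GB" passes and the bare "MyResource" does not, because the dot and the locale part are mandatory here.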

Note that some programming languages' functions may only accept particular forms, such as only underscores and an uppercase country code. PHP's intl functions accept either case and either separator. PayPal accepts only the language, or the la_CY form, where la is the language and CY is the country/region. The PHP locale_canonicalize function can be used to standardise to this format.

IETF RFC 5646, which governs internet usage of these tags, recommends a capitalisation and separation format like az-Cyrl-AZ, as used in the first three examples above, though processors should accept any mix of case, and many also accept either separator, as per the last two examples. When displaying locales, using - as the separator allows finer-grained line-wrapping; the non-wrapping _ can leave lines mostly empty, especially in table cells.

The regex for the recommended basic format is:

^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$
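
If input arrives in mixed case or with underscores, one option is to accept it with the permissive pattern and rewrite it into the recommended form. Here is a rough C# sketch of that idea; LocaleTagFormatter, TryNormalize and IsRecommendedForm are illustrative names, and the named groups are my addition to the permissive pattern.

```csharp
using System.Text.RegularExpressions;

public static class LocaleTagFormatter
{
    // Permissive pattern with named groups added for each part.
    private static readonly Regex Permissive = new Regex(
        @"^(?<lang>[A-Za-z]{2,4})([_-](?<script>[A-Za-z]{4}))?([_-](?<region>[A-Za-z]{2}|[0-9]{3}))?$");

    // Strict pattern for the recommended basic format, as given above.
    private static readonly Regex Recommended = new Regex(
        @"^[a-z]{2,4}(-[A-Z][a-z]{3})?(-([A-Z]{2}|[0-9]{3}))?$");

    public static bool IsRecommendedForm(string tag)
    {
        return Recommended.IsMatch(tag);
    }

    // Rewrites e.g. "aZ_cYrl-aZ" as language-Script-REGION.
    // Returns false (and an empty string) when the input does not match.
    public static bool TryNormalize(string tag, out string normalized)
    {
        normalized = "";
        Match m = Permissive.Match(tag);
        if (!m.Success) return false;

        string result = m.Groups["lang"].Value.ToLowerInvariant();
        if (m.Groups["script"].Success)
        {
            string s = m.Groups["script"].Value;
            result += "-" + char.ToUpperInvariant(s[0]) + s.Substring(1).ToLowerInvariant();
        }
        if (m.Groups["region"].Success)
        {
            result += "-" + m.Groups["region"].Value.ToUpperInvariant();
        }
        normalized = result;
        return true;
    }
}
```

Anything TryNormalize produces should then pass IsRecommendedForm, so the strict pattern can serve as a check on stored values.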

This regexp covers only the basic format. There are variants for extras, like local region. RFC 5646 allows for such variants, along with private extensions and backwards-compatibility forms. It all depends upon the granularity required. The Unicode CLDR database, which is used by PHP's intl functions and other programs, may include such variants from version to version, though they can also disappear at a later time.

If using a CLDR-based function set, like PHP's intl extension, you can check if a locale exists in the intl database using a function like:

<?php
 // Requires the intl extension.
 function is_locale($locale=''){
  // STANDARDISE INPUT
  $locale=locale_canonicalize($locale);
  
  // LOAD ARRAY WITH LOCALES
  $locales=resourcebundle_locales('');
  
  // RETURN WHETHER FOUND (strict comparison, always a boolean)
  return in_array($locale,$locales,true);
 }
?>

It takes about half a millisecond to load and search the data, so it won't be too much of a performance hit.

Of course, it will only find locales in the database of the CLDR version supplied with the PHP version used, though that database is updated with each subsequent PHP release.

Note that some locales are not for countries, but regions, and these are typically numeric, like 001 for 'World', 150 for 'Europe' and 419 for 'Latin America'. So there are now en-001, en-150, ar-001, and es-419, which can be used for generic language purposes. For example, en-001 was designed to decouple dependence upon en-US as an ersatz English, especially since its date formats and spellings are radically different from the 100 other regional en variants. The en-150 locale is the same as en-001 except for numbering separators and other Europe-specific formats.

In general, a regexp is a good front-end sanity check to filter out illegal characters, and especially to reserve the format for possible future additions. It also helps to prevent malicious character combinations being sent to the lookup facility, especially if text-based lookup command mechanisms, like SQL or XPath, are used.

Patanjali answered Oct 22 '22 15:10


That would be testing your input against:

\.[a-z]{2}-[A-Z]{2}$

This is really very literal: "match a dot (\., the dot being a special character in regexes), followed by exactly two of any characters from a to z ([a-z]{2} -- [...] is a character class), followed by a dash (-), followed by two of any characters from A to Z ([A-Z]{2}), followed by the end of input ($)".

http://www.dotnetperls.com/regex-match <-- how to apply this regex in C# against an input. It means the code would look like (UNTESTED):

// needs: using System.Text.RegularExpressions;
// Regex.IsMatch returns a boolean directly
if (Regex.IsMatch(input, @"\.[a-z]{2}-[A-Z]{2}$")) {
    // there is a match
}
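
If you also need the locale itself and not just a yes/no, a named group around the locale part is one way to get it. This is a sketch only; LocaleSuffix and TryGetLocale are names I made up, and the group name "locale" is my addition to the pattern.

```csharp
using System.Text.RegularExpressions;

public static class LocaleSuffix
{
    // Same pattern as above, with a named group around the locale part.
    private static readonly Regex Suffix =
        new Regex(@"\.(?<locale>[a-z]{2}-[A-Z]{2})$");

    // Returns true and sets locale (e.g. "en-GB") when fileName ends with ".xx-XX";
    // otherwise returns false and sets locale to an empty string.
    public static bool TryGetLocale(string fileName, out string locale)
    {
        Match m = Suffix.Match(fileName);
        locale = m.Success ? m.Groups["locale"].Value : "";
        return m.Success;
    }
}
```

For the file names in the question, "MyResource.en-GB" gives "en-GB" and plain "MyResource" gives no match. One thing to be aware of: in .NET, $ also matches just before a final newline, so use \z instead if the end of input has to be exact.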

http://regex.info <-- buy that and read it, it is the BEST resource for regular expressions in the universe

http://regular-expressions.info <-- the second best resource

fge answered Oct 22 '22 15:10