
Regex collating symbols

Tags: regex, bash

I am trying to understand how 'collating symbols' are matched, but I have not been able to work it out. My understanding is that a collating symbol matches an exact sequence rather than any one of the listed characters, that is:

echo "ciiiao" | grep '[oa]'     --> output 'ciiiao'
echo "ciiiao" | grep '[[.oa.]]' --> no output
echo "ciiiao" | grep '[[.ia.]]' --> output 'ciiiao'

However, the third command does not work. Am I wrong, or am I misinterpreting something?

I have read this regexp tutorial.

MFrancone asked Jan 27 '16 16:01




1 Answer

Collating symbols are typically used when a digraph is treated like a single character in a language. They are an element of the POSIX regular expression specification, and are not widely supported.
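
Because a collating symbol names a unit that the locale itself defines, it is not a way to match an arbitrary sequence such as the "ia" in the question; depending on the implementation, a sequence the locale does not define is typically rejected with an error or simply fails to match. As a minimal sketch using only the question's sample string (the patterns are illustrative, not taken from the original post), a literal pattern or an ERE alternation is one way to express "this exact sequence":

```shell
# A plain pattern matches the literal two-character sequence "ia".
echo "ciiiao" | grep 'ia'

# A bracket expression matches ONE element, so [oa] means "o or a".
# -o prints each match on its own line.
echo "ciiiao" | grep -o '[oa]'

# Choosing between multi-character sequences needs ERE alternation.
echo "ciiiao" | grep -E '(ia|oa)'
```

The first and third commands print the whole line because grep prints matching lines; the second prints the individual matches.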

For example, the Welsh alphabet has a number of digraphs that are treated as single letters (marked with a * below):

a b c ch d dd e f ff g ng h i j l ll m n o p ph r rh s t th u w y
       *    *      *    *          *          *    *      *

Assuming the locale file defines it (a collating symbol only works if it is defined in the current locale), the collating symbol [[.ng.]] is treated like a single character. Likewise, a single-character expression like . or [^a] will also match "ff" or "th". This also affects collation order, so the range [p-t] will include the digraphs "ph" and "rh" in addition to the expected single letters.
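
For comparison, the C locale defines no multi-character collating elements, so there the same range matches single characters only. The sketch below assumes nothing beyond a POSIX grep; the sample lines are made up for illustration, and the result under a Welsh locale would depend on whether that locale's definition actually includes the digraphs.

```shell
# -x requires the whole line to match, so each line must be exactly one
# element of the range [p-t]. LC_ALL=C forces the C locale for grep only.
printf '%s\n' p ph r rh t u | LC_ALL=C grep -x '[p-t]'
# Prints p, r and t: "ph" and "rh" are two characters in the C locale,
# and "u" sorts after "t".
```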

miken32 answered Oct 02 '22 16:10
