Regex to match Egyptian Hieroglyphics [closed]


TLDNR: \p{Egyptian_Hieroglyphs}

Javascript

Egyptian_Hieroglyphs belong to an "astral" plane, where a character needs more than 16 bits to encode. JavaScript, as of ES5, doesn't support astral planes (more on that), so you have to use surrogate pairs. The first code point of the block is

U+13000 = d80c dc00

the last one is

U+1342E = d80d dc2e

that gives

re = /(\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E])+/g

t = document.getElementById("pyramid").innerHTML
document.write("<h1>Found</h1>" + t.match(re))
<div id="pyramid">

  some     𓀀	really    𓀁	old    𓐬	stuff    𓐭	    𓐮
  
  </div>

This is what it looks like with Noto Sans Egyptian Hieroglyphs installed:

[screenshot of the rendered hieroglyphs]
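The surrogate values quoted above (U+13000 = d80c dc00 and U+1342E = d80d dc2e) follow from the standard UTF-16 encoding rule. Here is a minimal sketch of that arithmetic in Python; the helper name to_surrogates is made up for this illustration and is not part of any library.

```python
def to_surrogates(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


for cp in (0x13000, 0x1342E):
    high, low = to_surrogates(cp)
    print("U+%X = %04x %04x" % (cp, high, low))
# U+13000 = d80c dc00
# U+1342E = d80d dc2e
```

Because the block spans two high surrogates (D80C and D80D), the ES5 regex needs one alternative per high surrogate.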

Other languages

On platforms that support UCS-4 you can use the Egyptian code points U+13000 to U+1342F directly, but the syntax differs from system to system. For example, in Python (3.3 and up) the range is written [\U00013000-\U0001342E]:

>>> s = "some \U00013000 really \U00013001 old \U0001342C stuff \U0001342D \U0001342E"
>>> s
'some 𓀀 really 𓀁 old 𓐬 stuff 𓐭 𓐮'
>>> import re
>>> re.findall('[\U00013000-\U0001342E]', s)
['𓀀', '𓀁', '𓐬', '𓐭', '𓐮']

Finally, if your regex engine supports Unicode properties, you can (and should) use these instead of hardcoded ranges. For example, in PHP/PCRE:

$str = " some 𓀀 really 𓀁 old 𓐬 stuff 𓐭  𓐮";

preg_match_all('~\p{Egyptian_Hieroglyphs}~u', $str, $m);
print_r($m);

prints

[0] => Array
    (
        [0] => 𓀀
        [1] => 𓀁
        [2] => 𓐬
        [3] => 𓐭
        [4] => 𓐮
    )

Unicode encodes Egyptian hieroglyphs in the range U+13000 to U+1342F (beyond the Basic Multilingual Plane).

In this case, there are 2 ways to write the regex:

  1. By specifying a character range from U+13000 to U+1342F.

    While specifying a character range for characters in the BMP is as easy as [a-z], doing so for characters in astral planes might not be as simple, depending on the language support.

  2. By specifying the Unicode block for Egyptian hieroglyphs.

    Since we are matching any character in the Egyptian hieroglyphs block, this is the preferred way to write the regex where support is available.

Java

(Currently, I don't have any idea how other implementations of the Java Class Library deal with astral plane characters in the Pattern class.)

Sun/Oracle implementation

I'm not sure if it makes sense to talk about matching characters in astral planes in Java 1.4, since support for characters beyond BMP was only added in Java 5 by retrofitting the existing String implementation (which uses UCS-2 for its internal String representation) with code point-aware methods.

Since Java continues to allow lone surrogates (ones that cannot form a pair with another surrogate) in a String, the result is a mess: surrogates are not real characters, and lone surrogates are invalid in UTF-16.

The Pattern class saw a major overhaul from Java 1.4.x to Java 5, when it was rewritten to support matching Unicode characters in astral planes: the pattern string is converted to an array of code points before it is parsed, and the input string is traversed with the code point-aware methods of the String class.
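To make the code unit versus code point distinction concrete, here is a small sketch in Python rather than Java. It assumes Python 3.3 or later, where str is indexed by code point; the helper is_valid_utf16 is a made-up name for this illustration.

```python
text = "\U00013000\U00013001"   # two hieroglyphs, both outside the BMP

# Python 3.3+ counts code points, which is what Java's code point-aware
# methods report as well.
code_points = len(text)

# Each UTF-16 code unit is 2 bytes, so this is the number of code units
# (the unit a UCS-2 style string API would count).
code_units = len(text.encode("utf-16-le")) // 2

print(code_points, code_units)   # 2 4


def is_valid_utf16(s):
    """Hypothetical helper: True if s can be encoded as well-formed UTF-16."""
    try:
        s.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


# A lone high surrogate can be stored in a string, but it is not valid UTF-16.
print(is_valid_utf16("\ud80c"))   # False
print(is_valid_utf16(text))       # True
```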

You can read more about the madness in Java regex in this answer by tchrist.

I have written a detailed explanation of how to match a range of characters that involves astral plane characters in this answer, so I am only going to include the code here. That answer also includes a few counter-examples of incorrect attempts to write a regex that matches astral plane characters.

Java 5 (and above)

"[\uD80C\uDC00-\uD80D\uDC2F]"

Java 7 (and above)

"[\\uD80C\\uDC00-\\uD80D\\uDC2F]"
"[\\x{13000}-\\x{1342F}]"

Since we are matching any code point that belongs to the Unicode block, it can also be written as:

"\\p{InEgyptian_Hieroglyphs}"
"\\p{InEgyptian Hieroglyphs}"
"\\p{InEgyptianHieroglyphs}"

"\\p{block=EgyptianHieroglyphs}"
"\\p{blk=Egyptian Hieroglyphs}"

Java has supported the \p syntax for Unicode blocks since 1.4, but support for the Egyptian Hieroglyphs block was only added in Java 7.

PCRE (used in PHP)

A PHP example is already given in georg's answer:

'~\p{Egyptian_Hieroglyphs}~u'

Note that the u flag is mandatory if you want to match by code points instead of by code units.

I'm not sure whether there is a better post on Stack Overflow, but I have written some explanation of the effect of the u flag (UTF mode) in this answer of mine.
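As a rough analogy (in Python, not PCRE itself), the difference between matching by code units and matching by code points can be seen by running the same pattern over UTF-8 bytes and over a str. This is only a sketch of the idea; the exact behavior of PHP without the u flag depends on PCRE.

```python
import re

glyph = "\U00013000"   # a single hieroglyph

# A str pattern works on code points: "." matches the whole character.
as_code_points = re.findall(".", glyph)

# A bytes pattern works on 8-bit code units: "." matches one byte at a time,
# so the 4-byte UTF-8 sequence is seen as 4 separate "characters".
as_code_units = re.findall(rb".", glyph.encode("utf-8"))

print(len(as_code_points))   # 1
print(len(as_code_units))    # 4
```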

One thing to note is that Egyptian_Hieroglyphs is only available from PCRE 8.02 onward (support was added no earlier than PCRE 7.90).

As an alternative, you can specify a character range with \x{h...hh} syntax:

'~[\x{13000}-\x{1342F}]~u'

Note the mandatory u flag.

The \x{h...hh} syntax is supported from at least PCRE 4.50.

JavaScript (ECMAScript)

ES5

The character range method (the only way to do this in vanilla JavaScript) is already covered in georg's answer. The regex is modified slightly to cover the whole block, including the reserved, unassigned code point.

/(?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])/

The solution above demonstrates the technique for matching a range of characters in an astral plane, and also the limitations of JavaScript RegExp.

JavaScript suffers from the same string representation problem as Java. While Java fixed the Pattern class in Java 5 so that it works with code points, JavaScript RegExp is still stuck in the days of UCS-2, forcing us to work with code units instead of code points in the regular expression.
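The ES5 pattern above can be derived mechanically from the code point range. The following Python sketch shows one way to do it; the function names are made up for this illustration, and it emits one alternative per high surrogate without trying to merge them, so a dedicated tool may well produce a more compact pattern.

```python
def to_surrogates(code_point):
    """Hypothetical helper: split an astral code point into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def es5_astral_range(start, end):
    """Build an ES5-style pattern matching astral code points start..end inclusive.

    Illustration only: assumes 0x10000 <= start <= end <= 0x10FFFF.
    """
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an ascending range of astral code points")
    hi_start, lo_start = to_surrogates(start)
    hi_end, lo_end = to_surrogates(end)
    parts = []
    for hi in range(hi_start, hi_end + 1):
        # Only the first and last high surrogates have a partial low range.
        lo_from = lo_start if hi == hi_start else 0xDC00
        lo_to = lo_end if hi == hi_end else 0xDFFF
        parts.append("\\u%04X[\\u%04X-\\u%04X]" % (hi, lo_from, lo_to))
    return "(?:" + "|".join(parts) + ")"


print(es5_astral_range(0x13000, 0x1342F))
# (?:\uD80C[\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2F])
```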

ES6

Finally, support for code point matching was added in ECMAScript 6. It is enabled with the u flag, so that existing code written for previous versions of ECMAScript does not break.

  • ES6 Specification - 21.2 RegExp (Regular Expression) Objects
  • Unicode-aware regular expressions in ECMAScript 6

Check the Support section of the second link above for the list of browsers providing experimental support for ES6 RegExp.

With the introduction of \u{h...hh} syntax in ES6, the character range can be rewritten in a manner similar to Java 7:

/[\u{13000}-\u{1342F}]/u

Or you can specify the characters directly in the RegExp literal, though the intent is not as clear as with [a-z]:

/[𓀀-𓐯]/u

Note the u modifier in both regexes above.

Still stuck with ES5? Don't worry, you can transpile an ES6 Unicode RegExp to an ES5 RegExp with regexpu.