
get all unicode variations of a latin character

E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (I'm not sure whether that example list is complete): basically all Unicode chars with names of the form "Latin Small Letter A with *".

Is there a generic way to get this?

I'm asking for Python, but if the answer is more generic, this is also fine, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. But I guess you need to have access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine some solution like this:

def get_variations(char):
    import unicodedata
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars
Albert asked Jul 23 '19 17:07



1 Answer

To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))
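A quick sanity check on that range, as a minimal sketch: the block U+0300 to U+036F should hold 112 code points, all of which `unicodedata` reports with general category `Mn` (nonspacing mark). The last line shows what composing a base letter with one of them looks like.

```python
import unicodedata

# Same range as above: U+0300 (768) through U+036F (879), inclusive
combining_chars = ''.join(map(chr, range(768, 880)))

# Collect the general categories found in the range
categories = {unicodedata.category(c) for c in combining_chars}
print(len(combining_chars), categories)

# Composing a base letter with one mark yields a single precomposed character
composed = unicodedata.normalize('NFC', 'a' + '\u0300')
print(composed)  # à
```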

Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)

This has the advantage of not trying to manually perform string lookups in the unicodedata DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).
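As a usage sketch, here is a self-contained variant of the function above wrapped in `functools.lru_cache`. The name `get_unicode_variations_cached` is hypothetical and not part of the answer's code. It also skips duplicates, because a few marks in the range (for example U+0340 and U+0341) are canonically equivalent to U+0300 and U+0301 and so compose to the same character.

```python
import functools
import unicodedata

# Combining Diacritical Marks block, U+0300..U+036F
COMBINING_CHARS = ''.join(map(chr, range(0x300, 0x370)))

@functools.lru_cache(maxsize=None)
def get_unicode_variations_cached(letter):
    """Hypothetical cached, de-duplicated variant of get_unicode_variations."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    variations = []
    for combiner in COMBINING_CHARS:
        composed = unicodedata.normalize('NFKC', letter + combiner)
        # Keep only single-character results, and each one only once
        if len(composed) == 1 and composed not in variations:
            variations.append(composed)
    return ''.join(variations)

print(get_unicode_variations_cached('a'))
# A second call with the same argument is served from the cache
print(get_unicode_variations_cached.cache_info())
```

Caching the result as a string works because `lru_cache` only needs the argument to be hashable, which a one-character `str` is.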

If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # sys.maxunicode is itself a valid code point, so the range must include it
    for testlet in map(chr, range(sys.maxunicode + 1)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)

This looks for any character that decomposes into a form that includes the target letter. It does mean that the first search takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character: the result for 'L' will include characters that merely contain an 'L' somewhere in their compatibility decomposition (such as ℡, U+2121 TELEPHONE SIGN, which decomposes to "TEL"), and those aren't really a "modified 'L'". But it's as exhaustive as you can get with normalization.
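If the decomposition-based results are too broad, one option is to filter by character name instead, which is closer to the "Latin Small Letter A with *" criterion in the question. The following is only a sketch: `get_named_variations` is a hypothetical name, and it assumes the variants you care about follow the `<base name> WITH ...` naming pattern, so anything named differently is missed.

```python
import sys
import unicodedata

def get_named_variations(letter):
    """Return all chars whose Unicode name is '<name of letter> WITH ...'."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    prefix = unicodedata.name(letter) + " WITH "
    return ''.join(
        c for c in map(chr, range(sys.maxunicode + 1))
        # Unnamed code points fall back to '' and never match the prefix
        if unicodedata.name(c, '').startswith(prefix)
    )

print(get_named_variations('a'))
```

Unlike the normalization-based searches, this also picks up characters with no decomposition at all, such as Ł (LATIN CAPITAL LETTER L WITH STROKE). It scans every code point, so caching with functools.lru_cache is worth considering here too.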

ShadowRanger answered Sep 25 '22 08:09