
Detect Character Set/Script of Arbitrary String

I'm cleaning up a database of "profiles" of entities (people, organizations, etc.), and one part of each profile is the name of the individual in their native script (e.g. Thai), encoded in UTF-8. The previous data structure didn't capture which script the name is in, so we now have more records with invalid values than we can review manually.

What I need to do at this point is, via script, determine what language/script any given name is in. With a sample data set of:

Name: "แผ่นดินต้น"
Script: NULL

Name: "አብርሃም"
Script: NULL

I need to end up with

Name: "แผ่นดินต้น"
Script: Thai

Name: "አብርሃም"
Script: Amharic

I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?

Oso asked Jul 26 '16 17:07



2 Answers

You can use charnames in Perl to figure out the name of a given character.

use strict;
use warnings;
use charnames ();    # viacode() is called by its full name below, so nothing is imported
use feature 'say';
use utf8;

say charnames::viacode(ord 'Բ');

__END__
ARMENIAN CAPITAL LETTER BEN

With that, you can break all your strings apart into characters and build a counting hash keyed by character group. Working out the group from the character name is a bit tricky, but it's a start. Once you've processed a string, the group with the highest count wins. That way, punctuation and numbers won't get in the way.
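The counting idea above can be sketched in Python with only the standard library. This is a rough illustration, not the Perl answer's code: the function name `dominant_script_guess` is made up here, and using the first word of the Unicode character name as the "group" is an assumption that works for names like THAI or ETHIOPIC but is not the real Unicode Script property (a stand-alone combining accent, for example, would be counted under COMBINING).

```python
import unicodedata
from collections import Counter


def dominant_script_guess(text):
    """Guess a script label from the first word of each character's Unicode name.

    Hypothetical helper for illustration; the label is a heuristic,
    not the official Unicode Script property.
    """
    counts = Counter()
    for char in text:
        # Count only letters (L*) and combining marks (M*), so digits,
        # punctuation, spaces and symbols cannot influence the result.
        if unicodedata.category(char)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(char, "")
        if name:
            # e.g. "THAI CHARACTER PHO PHUNG" -> "THAI"
            counts[name.split()[0]] += 1
    if not counts:
        return None  # nothing but digits, punctuation, etc.
    # The group with the highest count wins.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["แผ่นดินต้น", "አብርሃም", "Բ"]:
        print(sample, dominant_script_guess(sample))
```

For the two sample names in the question this yields THAI and ETHIOPIC; a string containing only digits or punctuation yields `None`.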

It's probably smarter to find something that already knows the names of the Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that could be repurposed to make the lookup easier.
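To show what a range-based lookup looks like, here is a minimal Python sketch. It is not the CPAN module mentioned above: the table `BLOCKS` is a small hand-picked excerpt of Unicode block ranges, and `block_of` is a hypothetical helper name. A real solution would load the complete table from the Unicode data files, and note that blocks are not the same thing as scripts (one script can span several blocks).

```python
from bisect import bisect_right

# Hand-picked excerpt of Unicode block ranges: (first, last, block name).
# Deliberately incomplete; for illustration only.
BLOCKS = sorted([
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0530, 0x058F, "Armenian"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x1200, 0x137F, "Ethiopic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
])
STARTS = [first for first, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for one character, or None if it is not in the table."""
    codepoint = ord(char)
    # Find the last range whose start is <= codepoint.
    index = bisect_right(STARTS, codepoint) - 1
    if index >= 0:
        first, last, name = BLOCKS[index]
        if codepoint <= last:
            return name
    return None


if __name__ == "__main__":
    for char in "ผአԲ":
        print(char, block_of(char))
```

Feeding each character of a name through `block_of` and counting the results gives the same "highest count wins" decision without parsing character names.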

simbabque answered Oct 06 '22 00:10


Using the unicodedata2 Python module described here and here, you can examine the Unicode script for each character, like so:

#!/usr/bin/env python2
#coding: utf-8

import unicodedata2
import collections

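def scripts(name):
    # Look up the Unicode script of every character in the name.
    per_char = [unicodedata2.script(char) for char in name]
    # Count the scripts and order them from most to least frequent.
    ranked = collections.Counter(per_char).most_common()
    # Join the script names, most frequent first.
    return ', '.join(script for script, _ in ranked)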


assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'

Robᵩ answered Oct 05 '22 23:10