I'm working on cleaning up a database of "profiles" of entities (people, organizations, etc.), and one part of each profile is the name of the individual in their native script (e.g. Thai), encoded in UTF-8. The previous data structure didn't capture the character set of the name, so we now have more records with invalid values than we can manually review.
What I need to do at this point is determine, via script, what language/script any given name is in. With a sample data set of:
Name: "แผ่นดินต้น"
Script: NULL
Name: "አብርሃም"
Script: NULL
I need to end up with:
Name: "แผ่นดินต้น"
Script: Thai
Name: "አብርሃም"
Script: Amharic
I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?
Some charset-detection libraries take a value and a list of candidate encodings and report which one the bytes are valid in. A possible call to such a charset() method would be: String detectedCharset = charset(value, new String[] { "ISO-8859-1", "UTF-8" });
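The charset() method shown above isn't tied to a particular library here; as a rough sketch of the same idea in Python, the third-party chardet package guesses an encoding from raw bytes (note that this guesses the byte encoding, not the script the question is about):
import chardet
raw = 'แผ่นดินต้น'.encode('utf-8')  # raw bytes in an unknown encoding
guess = chardet.detect(raw)         # returns a dict with 'encoding' and 'confidence'
print(guess['encoding'])            # likely 'utf-8' for this input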
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document. If the validator fails to detect the encoding, it can be selected manually on the validator result page via the 'Encoding' pulldown menu.
UTF-8 is a character encoding: it defines which binary values represent which character. E.g. in UTF-8, 'a' = 01100001.
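A minimal Python illustration of those byte values, using the question's Thai text as the multi-byte case:
assert 'a'.encode('utf-8') == b'\x61'          # 0x61 == 0b01100001
assert 'แ'.encode('utf-8') == b'\xe0\xb9\x81'  # non-ASCII characters take several bytes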
PHP's mb_detect_encoding() detects the character encoding of the string str and returns the detected encoding. encoding_list is a list of candidate character encodings; the order may be specified as an array or as a comma-separated string. If encoding_list is omitted, the configured detect_order is used.
You can use charnames in Perl to figure out the name of a given character.
use strict;
use warnings;
use charnames ();
use feature 'say';
use utf8;
say charnames::viacode(ord 'Բ');
__END__
ARMENIAN CAPITAL LETTER BEN
With that, you can break apart all your strings into characters, and then build a counting hash for each type of character group. Figuring out the groups is a bit tricky, but it's a start. Once you're done with a string, the group with the highest count should win. That way, punctuation or numbers won't get in the way; a rough sketch of the idea follows below.
Probably it's smarter to find something that already has the names of the Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that can be abused to make the lookup easier.
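For illustration, here is a rough sketch of that counting-hash idea in Python, using the standard library's unicodedata.name() as an analog of charnames::viacode. The dominant_group() helper is hypothetical, and grouping by the first word of the character name is an assumption that happens to work for prefixes like THAI or ETHIOPIC, not a general rule:
import collections
import unicodedata

def dominant_group(text):
    # Count the leading word of each letter's Unicode name, e.g.
    # 'ARMENIAN CAPITAL LETTER BEN' counts toward 'ARMENIAN'.
    groups = collections.Counter()
    for char in text:
        if not char.isalpha():             # skip punctuation and digits
            continue
        name = unicodedata.name(char, '')  # '' for unnamed code points
        if name:
            groups[name.split()[0]] += 1
    winner = groups.most_common(1)
    return winner[0][0] if winner else None

print(dominant_group(u'แผ่นดินต้น'))  # THAI
print(dominant_group(u'አብርሃም'))      # ETHIOPIC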
Using the unicodedata2 Python module, you can examine the Unicode script of each character, like so:
#!/usr/bin/env python2
#coding: utf-8
import unicodedata2
import collections
def scripts(name):
    scripts = [unicodedata2.script(char) for char in name]
    scripts = collections.Counter(scripts)
    scripts = scripts.most_common()
    scripts = ', '.join(script for script, _ in scripts)
    return scripts
assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'
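Note that this reports the Unicode script (Ethiopic) rather than the language (Amharic); mapping a script to a language is a separate step. As a hedged sketch of applying it to the sample data, assuming a hypothetical records list:
records = [
    {'name': u'แผ่นดินต้น', 'script': None},
    {'name': u'አብርሃም', 'script': None},
]
for record in records:
    # Fill the missing Script value with the detected dominant script.
    record['script'] = scripts(record['name'])
# records[0]['script'] == 'Thai'; records[1]['script'] == 'Ethiopic'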