I'm working on cleaning up a database of "profiles" of entities (people, organizations, etc.), and one part of each profile is the name of the individual in their native script (e.g. Thai), encoded in UTF-8. The previous data structure didn't capture the character set of the name, so we now have more records with invalid values than we can manually review.
What I need to do at this point is determine, via script, what language/script any given name is in. With a sample data set of:
Name: "แผ่นดินต้น"
Script: NULL
Name: "አብርሃም"
Script: NULL
I need to end up with:
Name: "แผ่นดินต้น"
Script: Thai
Name: "አብርሃም"
Script: Amharic
I do not need to translate the names, just determine what script they're in. Is there an established technique for figuring this sort of thing out?
Some charset-detection libraries take a value and a list of candidate encodings and report which one the bytes are valid in. A possible call to such a charset() method would be: String detectedCharset = charset(value, new String[] { "ISO-8859-1", "UTF-8" });
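The charset() method shown above isn't tied to a particular library here; as a rough sketch of the same idea in Python, the third-party chardet package guesses an encoding from raw bytes (note that this guesses the byte encoding, not the script the question is about):
import chardet
raw = 'แผ่นดินต้น'.encode('utf-8')  # raw bytes in an unknown encoding
guess = chardet.detect(raw)         # returns a dict with 'encoding' and 'confidence'
print(guess['encoding'])            # likely 'utf-8' for this input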
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and from information in the document. If the validator fails to detect the encoding, it can be selected manually on the validator result page via the 'Encoding' pulldown menu.
UTF-8 is a character encoding: it defines which binary values represent which character. E.g. in UTF-8, 'a' = 01100001.
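A minimal Python illustration of those byte values, using the question's Thai text as the multi-byte case:
assert 'a'.encode('utf-8') == b'\x61'          # 0x61 == 0b01100001
assert 'แ'.encode('utf-8') == b'\xe0\xb9\x81'  # non-ASCII characters take several bytes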
PHP's mb_detect_encoding() detects the character encoding of the string str and returns the detected encoding. encoding_list is a list of candidate character encodings; the order may be specified as an array or as a comma-separated string. If encoding_list is omitted, the configured detect_order is used.
You can use charnames in Perl to figure out the name of a given character.
use strict;
use warnings;
use charnames ();
use feature 'say';
use utf8;
say charnames::viacode(ord 'Բ');
__END__
ARMENIAN CAPITAL LETTER BEN
With that, you can break apart all your strings into characters, and then build a counting hash for each type of character group. Figuring out the groups is a bit tricky, but it's a start. Once you're done with a string, the group with the highest count should win. That way, punctuation or numbers won't get in the way; a rough sketch of the idea follows below.
Probably it's smarter to find something that already has the names of the Unicode ranges and makes them easy to look up. I know there is at least one module on CPAN that does that, but I cannot find it right now. Something like that can be abused to make the lookup easier.
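For illustration, here is a rough sketch of that counting-hash idea in Python, using the standard library's unicodedata.name() as an analog of charnames::viacode. The dominant_group() helper is hypothetical, and grouping by the first word of the character name is an assumption that happens to work for prefixes like THAI or ETHIOPIC, not a general rule:
import collections
import unicodedata

def dominant_group(text):
    # Count the leading word of each letter's Unicode name, e.g.
    # 'ARMENIAN CAPITAL LETTER BEN' counts toward 'ARMENIAN'.
    groups = collections.Counter()
    for char in text:
        if not char.isalpha():             # skip punctuation and digits
            continue
        name = unicodedata.name(char, '')  # '' for unnamed code points
        if name:
            groups[name.split()[0]] += 1
    winner = groups.most_common(1)
    return winner[0][0] if winner else None

print(dominant_group(u'แผ่นดินต้น'))  # THAI
print(dominant_group(u'አብርሃም'))      # ETHIOPIC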
Using the unicodedata2 Python module, you can examine the Unicode script of each character, like so:
#!/usr/bin/env python2
#coding: utf-8
import unicodedata2
import collections
def scripts(name):
    scripts = [unicodedata2.script(char) for char in name]
    scripts = collections.Counter(scripts)
    scripts = scripts.most_common()
    scripts = ', '.join(script for script, _ in scripts)
    return scripts
assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'
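Note that this reports the Unicode script (Ethiopic) rather than the language (Amharic); mapping a script to a language is a separate step. As a hedged sketch of applying it to the sample data, assuming a hypothetical records list:
records = [
    {'name': u'แผ่นดินต้น', 'script': None},
    {'name': u'አብርሃም', 'script': None},
]
for record in records:
    # Fill the missing Script value with the detected dominant script.
    record['script'] = scripts(record['name'])
# records[0]['script'] == 'Thai'; records[1]['script'] == 'Ethiopic'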