Validate Japanese Character in Active Record Callback

I have a Japanese project that needs to validate half-width and full-width Japanese characters: up to 14 characters are allowed for half-width input and up to 7 characters for full-width input.

Does anyone know how to implement that?

Right now, my model has:

class Customer
   validates_length_of :name, :maximum => 14
end

This is not a good choice.

I'm currently using Rails (RoR) 2.3.5. Both fullwidth and halfwidth characters can be used.

valrecx asked Mar 26 '13 07:03


2 Answers

First of all, the concept of fullwidth (全角) and halfwidth (半角) exists only for two types of characters in Japanese:

  • Roman characters (i.e. Latin)
  • Katakana characters

A similar concept exists for Korean Hangul, but not for Japanese Hiragana, nor for Kanji.

For Katakana, halfwidth characters have their own Unicode code points, and they are rendered at half the width of fullwidth characters, although they are otherwise identical in shape. Example:

Fullwidth "ka": カ
Halfwidth "ka": ｶ

Combined characters (i.e. with diacritics, like ガ) do not exist in halfwidth versions; they must be encoded as two separate characters, ｶ + ﾞ, which is probably why your task allows twice as many characters for halfwidth. (Note that these two-code-point combinations are treated as a single combined character and are usually rendered as one.)
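
To make the two-code-point encoding concrete, here is a minimal sketch. It assumes Ruby 2.2 or later (for String#unicode_normalize), which is newer than the Ruby typically paired with Rails 2.3.5; the variable names are only illustrative.

```ruby
# -*- coding: utf-8 -*-
# Assumption: Ruby 2.2+, where String#unicode_normalize is available.

halfwidth_ga = "\uFF76\uFF9E"  # halfwidth KA followed by the halfwidth voiced sound mark
fullwidth_ga = "\u30AC"        # the single precomposed fullwidth GA

puts halfwidth_ga.length       # number of code points in the halfwidth form
puts fullwidth_ga.length       # number of code points in the fullwidth form

# NFKC normalization folds the halfwidth pair into the fullwidth character.
puts halfwidth_ga.unicode_normalize(:nfkc) == fullwidth_ga
```

The halfwidth form occupies two code points while the fullwidth form occupies one, so a plain length check counts the same syllable differently depending on its width.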

For Roman (Latin) characters, the usual ASCII characters are called halfwidth, but the Japanese code range of Unicode (as well as traditional Japan-specific character sets) provides a separate code range for fullwidth versions. Example:

Fullwidth: Ｌ
Halfwidth: L

Fullwidth versions do not exist for non-ASCII Latin-derived characters (such as German umlauts), nor for accented versions. They do, however, exist for numerals and some punctuation characters.

Again, Hiragana and Kanji have no halfwidth versions.

To check whether a character is a fullwidth or halfwidth character, compare the code point to the relevant code range. The ranges are as follows:

Halfwidth Katakana: 0xff61 through 0xff9f
Fullwidth Katakana: 0x30a0 through 0x30ff
Halfwidth Roman: 0x21 through 0x7e (this is ASCII)
Fullwidth Roman: 0xff01 through 0xff60
Hiragana: 0x3041 through 0x309f
Kanji (i.e. the unified-ideographs range): 0x4e00 through 0x9fcc

Here is a simple Ruby program that performs the checks on a per-character basis:

# -*- coding: utf-8 -*-

def is_halfwidth_katakana(c)
  return (c.ord >= 0xff61 and c.ord <= 0xff9f)
end

def is_fullwidth_katakana(c)
  return (c.ord >= 0x30a0 and c.ord <= 0x30ff)
end

def is_halfwidth_roman(c)
  return (c.ord >= 0x21 and c.ord <= 0x7e)
end

def is_fullwidth_roman(c)
  return (c.ord >= 0xff01 and c.ord <= 0xff60)
end

def is_hiragana(c)
  return (c.ord >= 0x3041 and c.ord <= 0x309f)
end

def is_kanji(c)
  return (c.ord >= 0x4e00 and c.ord <= 0x9fcc)
end

text = "Hello World、こんにちは、半角ｶﾀｶﾅ、全角カタカナ、ｆｕｌｌｗｉｄｔｈ　０-９\n"

text.split("").each do |c|
  if is_halfwidth_katakana(c)
    type = "halfwidth katakana"
  elsif is_fullwidth_katakana(c)
    type = "fullwidth katakana"
  elsif is_halfwidth_roman(c)
    type = "halfwidth roman"
  elsif is_fullwidth_roman(c)
    type = "fullwidth roman"
  elsif is_hiragana(c)
    type = "hiragana"
  elsif is_kanji(c)
    type = "kanji"
  else
    # Spaces, newlines and punctuation such as "、" fall outside all ranges
    type = "other"
  end

  printf("%c (%x) %s\n",c,c.ord,type)
end
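
Building on the ranges above, one possible way to apply the 14 / 7 limit from the question is to weight each character by its width. This is only a sketch under assumptions: width_units and name_fits? are hypothetical helper names, the ASCII space (0x20) is counted as halfwidth, and every character outside the two halfwidth ranges is treated as fullwidth.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: a character in either halfwidth range counts as
# one unit; everything else is assumed to be fullwidth and counts as two.
def width_units(text)
  text.each_char.inject(0) do |sum, c|
    cp = c.ord
    halfwidth = (0x20..0x7e).include?(cp) || (0xff61..0xff9f).include?(cp)
    sum + (halfwidth ? 1 : 2)
  end
end

# Hypothetical check: 14 units allow 14 halfwidth or 7 fullwidth characters.
def name_fits?(name, max_units = 14)
  width_units(name) <= max_units
end

puts name_fits?("ABCDEFGHIJKLMN")    # 14 halfwidth characters
puts name_fits?("カタカナカタカナ")  # 8 fullwidth characters
```

In a Rails model, a check like this could be called from a custom validation method instead of validates_length_of. Unlike a rule that switches between two limits, it also handles names that mix both widths.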

Further notes

  1. The code ranges above are the official Unicode ranges for each character type (see Unicode Fullwidth forms and Unicode Hiragana). These include certain fullwidth / halfwidth versions of characters that are old / traditional forms or special punctuation characters. If you only want characters that are commonly used in web forms (e.g. for people to enter their names), you might want to narrow the ranges a bit.

  2. Recommendation: If this is for a web form where people can enter their names, you might want to do a little more than just check for half-width or full-width. It is extremely common on Japanese websites and registration forms, esp. with banks, to require that people enter their name in pure halfwidth (typically for Latin) or pure fullwidth (typically for Katakana). Unfortunately, this makes entering data very inconvenient. When the Japanese input method is enabled, Latin characters often come out in fullwidth versions, and the web form will then reject the data because it isn't pure halfwidth. Rather than rejecting it, it should automatically convert it to whatever form it needs. You can easily implement this by translating from one code range to the other (simply by adding the relevant constant), and make people's lives much easier.
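
The conversion suggested in note 2 can be sketched as follows for Roman characters. This is a minimal illustration rather than a complete converter: fullwidth_roman_to_halfwidth is a hypothetical name, only 0xff01 through 0xff5e is shifted (0xff5f and 0xff60 have no ASCII counterpart), mapping the ideographic space 0x3000 to an ASCII space is an added assumption, and katakana is left untouched because its two ranges are not related by a single constant.

```ruby
# -*- coding: utf-8 -*-

# Distance between the fullwidth Roman range and ASCII (0xff01 - 0x21).
FULLWIDTH_OFFSET = 0xfee0

# Hypothetical helper: converts fullwidth Roman characters, numerals and
# punctuation to their ASCII equivalents and leaves everything else alone.
def fullwidth_roman_to_halfwidth(text)
  text.each_char.map do |c|
    cp = c.ord
    if cp >= 0xff01 && cp <= 0xff5e
      [cp - FULLWIDTH_OFFSET].pack("U")  # shift down into the ASCII range
    elsif cp == 0x3000
      " "                                # ideographic space (assumption)
    else
      c
    end
  end.join
end

puts fullwidth_roman_to_halfwidth("\uFF2C\uFF11")  # fullwidth L and fullwidth 1
```

Running a conversion like this before validation lets the form accept input typed with the Japanese input method enabled instead of rejecting it.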

jogojapan answered Nov 13 '22 11:11


The following code may be the quickest way to meet the exact requirement you've specified so far. It uses the Moji gem (documentation in Japanese), which provides many convenience methods for determining the content of a Japanese-language string.

It validates a maximum of 14 characters in a name that only consists of half-width characters, and a maximum of 7 characters for names otherwise (including names that contain a combination of half- and full-width characters i.e. the presence of even one full-width character in the string will make the whole string be regarded as "full-width").

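class Customer < ActiveRecord::Base

  # Up to 14 characters when the name is regarded as half-width
  validates_length_of :name, :maximum => 14,
    :if => Proc.new { |customer| customer.half_width?(customer.name) }
  # Up to 7 characters otherwise
  validates_length_of :name, :maximum => 7,
    :unless => Proc.new { |customer| customer.half_width?(customer.name) }

  # As written, this tests against Moji::HAN_KATA (half-width katakana).
  # It is called on the record, since the Proc itself runs in class scope.
  def half_width?(string)
    Moji.type?(string, Moji::HAN_KATA)
  end

end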

Assumptions made:

  1. Data encoding within the system is UTF-8, and gets stored as such in the database; any further necessary re-encoding (such as for passing the data to a legacy system etc) is done in another module.
  2. No automatic conversion of half-width to full-width characters is done before data is saved to the database, i.e. half-width characters are allowed in the database, perhaps for reasons of legacy system integration, proper preservation of actual user input(!), and/or the aesthetic value of half-width characters(!)
  3. Diacritics in half-width characters are treated as their own separate character (i.e. ｶ and ﾞ are not parsed as one character for purposes of determining string length)
  4. There is only one name field as you specify and not, say, four (for surname, surname furigana, given name, given name furigana) which is quite common nowadays.
Paul Fioravanti answered Nov 13 '22 11:11