C++ - How to read Unicode characters (e.g. Hindi script) using C++, or is there a better way through some other programming language?

I have a Hindi script file like this:

3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।

I have to write a program that appends its position to every word in each sentence. The numbering restarts at 1 on every line, and each position is written in parentheses after the word. The output should look something like this:

3.  भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)

The meaning of the above sentence is:

3.  India has a long and rich history.

Note that '।' (the Hindi full stop, equivalent to '.' in English) also gets a word position, as would other special symbols. I am working on English-Hindi word alignment (a part of Natural Language Processing, NLP), so the English full stop '.' should map to '।' in Hindi. Serial numbers are left untouched. I thought reading character by character could be a solution. Could you please help me with how to go about this in C++ if it is easy, or, if it is easier, suggest a way to do it in some other programming language such as Python or Perl?

I am able to get word positions for my English text in C++, because I can read it character by character using ASCII values, but I don't have a clue how to do the same for the Hindi text.

The final aim of all this is to see which word position in the English text maps to which position in the Hindi text. This way I can achieve bidirectional alignment.

Thank you for your time...:)

boddhisattva asked Feb 18 '10 10:02



2 Answers

Wow, already 6 answers and not a single one actually does what mgj wanted. jkp comes close, but then drops the ball by deleting the daṇḍa.

Perl to the rescue. Less code, fewer bugs.

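use utf8; use strict; use warnings;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';    # write characters back out as UTF-8
my $line = decode 'UTF-8', scalar <>;  # read one sentence from STDIN
my $index;
# split on whitespace, and also just before the daṇḍa so it becomes its own token
print join(' ', map { $index++; "$_($index)" } split /\s+|(?=।)/, $line), "\n";
# for the sample sentence (without its serial number) this prints
# भारत(1) का(2) इतिहास(3) काफी(4) समृद्ध(5) एवं(6) विस्तृत(7) है(8) ।(9)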

edit: changed to read from STDIN as per comment, added best practices pragmas

daxim answered Nov 03 '22 10:11



If you are working in C++ and decide that UTF-8 is a viable encoding for your application, you could look at utfcpp (UTF8-CPP), a small library of functions and iterator adaptors in the style of the standard library (for example utf8::next and utf8::distance) that abstracts away the difficulties of dealing with a variable-length encoding like UTF-8.
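To make the "variable-length" difficulty concrete, here is a small illustration. It is written in Python 3 only for brevity and is a sketch, not a recommendation for the C++ side; the sample string is just the last word of the question's sentence plus the danda.

```python
# Three code points: HA (U+0939), vowel sign AI (U+0948), danda (U+0964)
text = "\u0939\u0948\u0964"

# Code points in the Devanagari block take 3 bytes each in UTF-8,
# so the byte count and the character count disagree.
encoded = text.encode("utf-8")
print(len(text), len(encoded))

# Cutting the byte string at an arbitrary offset can split a character.
try:
    encoded[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken slice:", exc.reason)

# Decoding first and then iterating yields whole code points.
for ch in text:
    print("U+%04X" % ord(ch), len(ch.encode("utf-8")))
```

This is why reading "character by character" with single-byte ASCII logic stops working for Hindi text: the unit to iterate over is the decoded code point, not the byte.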

If, on the other hand, you are free to use any language, I would say that doing this in something like Python would be far easier: its Unicode support is very good, as are the bundled string processing routines.

#!/usr/bin/env python
# encoding: utf-8

string = u"भारत का इतिहास काफी समृद्ध एवं विस्तृत है।"
parts = []
for part in string.split():
    parts.extend(part.split(u"।"))
print "No of Parts: %d" % len(parts)
print "Parts: %s" % parts

Outputs (under Python 2):

No of Parts: 9
Parts: [u'\u092d\u093e\u0930\u0924', u'\u0915\u093e', u'\u0907\u0924\u093f\u0939\u093e\u0938', u'\u0915\u093e\u092b\u0940', u'\u0938\u092e\u0943\u0926\u094d\u0927', u'\u090f\u0935\u0902', u'\u0935\u093f\u0938\u094d\u0924\u0943\u0924', u'\u0939\u0948', u'']
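Note that the last element above is an empty string, because splitting on '।' discards the danda itself. A minimal Python 3 sketch that keeps it as a token and produces the numbered output from the question might look like the following. The helper name number_words is made up for illustration, and the code assumes the serial number always has the form "digits, dot" at the start of the line.

```python
import re

DANDA = "\u0964"  # the Devanagari full stop

def number_words(line):
    """Append (position) to each token, leaving a leading serial number untouched."""
    # Split off an optional leading serial number such as "3.  " (assumed format).
    serial, rest = re.match(r"^(\s*\d+\.\s*)?(.*)$", line, re.S).groups()
    # Put a space before each danda so it becomes a token of its own, then split.
    tokens = rest.replace(DANDA, " " + DANDA).split()
    numbered = " ".join("%s(%d)" % (tok, i) for i, tok in enumerate(tokens, 1))
    return (serial or "") + numbered

print(number_words("3.  भारत का इतिहास काफी समृद्ध एवं विस्तृत है।"))
```

Other punctuation that should count as a word (commas, question marks, the English full stop) could be handled the same way, by padding it with a space before splitting.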

Also, since you are doing natural language processing, you may want to take a look at the NLTK library for Python which has a wealth of tools for just this kind of job.

jkp answered Nov 03 '22 11:11
