Python - pyparsing unicode characters

I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to match Hindi characters (UTF-8).

The code specifies the grammar and parses accordingly.

671.assess  :: अहसास  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number

If the input contains only English characters, it works, so the code is correct for ASCII input, but it does not work for Unicode input.

I mean that the code works when the input has the form 671.assess :: ahsaas ::2

That is, it parses words written in English, but I am not sure how to parse and then print Unicode characters. I need this for English-Hindi word alignment.

The Python code looks like this:

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables)
trans =  Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.scanString(efiledata)  # scanString yields (tokens, start, end), which the loop below expects
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
  trans=""#translation string
  ew=""#english word
  xx=result[0]
  ew=xx[2]
  trans=xx[4]   
  edict1 = { ew:trans }
  edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2 

#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
  trans=""#translation string
  hw=""#hin word
  xx=result[0]  
  hw=xx[2]
  trans=xx[4]
  #print trans
  hdict1 = { trans:hw }
  hdict2.update(hdict1)

print len(hdict2) #no of entries in the hindi dictionary
print"hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
    if ow in d.keys():#ow=old word d=dict
        print ow , "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for" , ow , " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for" , ow , " = ", word
                return word
        return None
    else:
        print ow , "absent"
        return None

f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If an example input sentence pair in the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 
!@#$%

the output would look like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 
!@#$%

Output explanation: This achieves bidirectional alignment. The first English word, 'modern', maps to the first Hindi word, 'AddhUnIk', and vice versa. Punctuation characters are also treated as words, since they are an integral part of the bidirectional mapping. Thus the Hindi word '.' has a null alignment (0): it maps to nothing in the English sentence, which has no full stop. The third line of the output is a delimiter that separates sentence pairs when you are aligning a number of sentences.
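One way to read the aligned output, as a minimal sketch only: it assumes each index is a 1-based position in the other sentence and that 0 means no alignment, which matches the `index(otherword) + 1` in the script above. The helper names `read_alignment` and `pairs` are hypothetical and not part of the original code.

```python
eng_line = "1# 5 # modern markets : confident consumers  # 1 2 3 4 5 "
hin_line = "1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 "

def read_alignment(line):
    # Fields are separated by " # ": header, sentence, alignment indices
    header, sentence, indices = line.split(" # ")
    words = sentence.split()
    targets = [int(i) for i in indices.split()]
    return words, targets

def pairs(words, targets, other_words):
    # Assumption: index t is a 1-based position in the other sentence, 0 = null
    return [(w, other_words[t - 1] if t else None)
            for w, t in zip(words, targets)]

eng_words, eng_targets = read_alignment(eng_line)
hin_words, hin_targets = read_alignment(hin_line)

eng_pairs = pairs(eng_words, eng_targets, hin_words)
hin_pairs = pairs(hin_words, hin_targets, eng_words)

for pair in hin_pairs:
    print(pair)
```

Under that reading, the last Hindi token '.' pairs with None, which is the null alignment described above.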

What modification should I make for this to work if the Hindi sentences are in Unicode (UTF-8) format?

asked Feb 26 '10 03:02

boddhisattva

1 Answer

I was searching for how to handle French Unicode characters and came across this question. If you need French or other Latin accented characters, with pyparsing 2.3.0 you can use:

>>> import pyparsing as pp
>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
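For the Hindi text in the question, the same `pyparsing_unicode` namespace also has a `Devanagari` set. The following is a sketch rather than a tested fix for the original script: it uses `Devanagari.printables` instead of `Devanagari.alphas` on the assumption that vowel signs such as 'ा' are combining marks and so are not counted as alphabetic, and the trailing part of the grammar is simplified to match the one-line example from the question.

```python
# -*- coding: utf-8 -*-
import pyparsing as pp

# All non-whitespace characters in the Devanagari blocks, vowel signs included
hindi_word = pp.Word(pp.pyparsing_unicode.Devanagari.printables)
number = pp.Word(pp.nums)
src = pp.Word(pp.printables)  # ASCII-only, as in the question

# Simplified grammar for a line such as: 671.assess  :: अहसास  ::2
entry = number + "." + src + "::" + hindi_word + "::" + number

result = entry.parseString(u"671.assess  :: अहसास  ::2")
print(result[2])  # the English word
print(result[4])  # the Hindi word
```

On Python 2, as used in the question, the file contents would also have to be decoded to unicode before parsing, for example by reading with `io.open(path, encoding='utf-8')` instead of plain `open(path)`.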

answered Oct 07 '22 10:10

snoob dogg

Tags: python, unicode, nlp, pyparsing