I want to process the whole Tanach file, in Hebrew. For that, I chose Raku because of some of its features (grammars and Unicode support).
So, I defined some tokens to select the relevant data.
grammar HEB {
token TOP {'<hebrewname>'<t_word>'</hebrewname>'}
token t_word {<graph>+}
};
grammar CHA {
token TOP {'<c n="'<t_number>'">'}
token t_number {\d+}
};
grammar VER {
token TOP {'<v n="'<t_number>'">'}
token t_number {\d+}
};
grammar WOR {
token TOP {'<w>'<t_word>'</w>'}
token t_word {<graph>+}
};
Here is a very small part of the document (the Tanach in XML format), which is sufficient to show the problem:
<names>
<name>Genesis</name>
<abbrev>Gen</abbrev>
<number>1</number>
<filename>Genesis</filename>
<hebrewname>בראשית</hebrewname>
</names>
<c n="1">
<v n="1">
<w>בְּ/רֵאשִׁ֖ית</w>
<w>בָּרָ֣א</w>
<w>אֱלֹהִ֑ים</w>
<w>אֵ֥ת</w>
<w>הַ/שָּׁמַ֖יִם</w>
<w>וְ/אֵ֥ת</w>
<w>הָ/אָֽרֶץ׃</w>
</v>
<v n="2">
<w>וְ/הָ/אָ֗רֶץ</w>
<w>הָיְתָ֥ה</w>
<w>תֹ֙הוּ֙</w>
<w>וָ/בֹ֔הוּ</w>
<w>וְ/חֹ֖שֶׁךְ</w>
<w>עַל־</w>
<w>פְּנֵ֣י</w>
<w>תְה֑וֹם</w>
<w>וְ/ר֣וּחַ</w>
<w>אֱלֹהִ֔ים</w>
<w>מְרַחֶ֖פֶת</w>
<w>עַל־</w>
<w>פְּנֵ֥י</w>
<w>הַ/מָּֽיִם׃</w>
</v>
The problem is that the code doesn't recognize the first two words (<w>בְּ/רֵאשִׁ֖ית</w> and <w>בָּרָ֣א</w>) but seems to work fine with the following words.
Could somebody explain to me what's wrong?
The main loop is:
for $file_in.lines -> $line {
$memline = $line.trim;
if HEB.parse($memline) {
say "hebrew name of book is "~ $/<t_word>;
next;
}
if CHA.parse($memline) {
say "chapitre number is "~ $/<t_number>;
next;
}
if VER.parse($memline) {
say "verse number is "~ $/<t_number>;
next;
}
if WOR.parse($memline) {
$computed_word_value = 0;
say "word is "~ $/<t_word>;
$file_out.print("$/<t_word>");
say "numbers of graphemes of word is "~ $/<t_word>.chars;
@exploded_word = $/<t_word>.comb;
for @exploded_word {
say $_.uniname;
};
next;
}
say "not processed";
}
Output file:
Please note that after "verse number is 1", the first two words are not processed. Please ignore the distorted Hebrew (Windows console)!
not processed
not processed
not processed
not processed
not processed
hebrew name of book is ׳‘׳¨׳׳©׳™׳×
not processed
chapitre number is 1
verse number is 1
not processed
not processed
word is ׳ײ±׳œײ¹׳”ײ´ײ‘׳™׳
numbers of graphemes of word is 5
HEBREW LETTER ALEF
HEBREW LETTER LAMED
HEBREW LETTER HE
HEBREW LETTER YOD
HEBREW LETTER FINAL MEM
word is ׳ײµײ¥׳×
numbers of graphemes of word is 2
HEBREW LETTER ALEF
HEBREW LETTER TAV
not processed
word is ׳•ײ°/׳ײµײ¥׳×
numbers of graphemes of word is 4
HEBREW LETTER VAV
SOLIDUS
I hope my question is clearly stated.
I can't reproduce your problem.
About the only thing I can guess is that you didn't open the file with the correct encoding.
Or worse, you are getting the file from STDIN and don't have the proper codepage selected. (Which makes sense since your output is also mojibake.)
Rakudo doesn't really do codepages, so if you don't set your environment to utf8 you have to change the encoding of the standard input handle $*IN (and of $*OUT) to match whatever it is.
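A minimal sketch of naming the encoding explicitly, assuming the Tanach file is UTF-8. The scratch file and the variable names are hypothetical and exist only for this demonstration; the Hebrew word is written with escapes so the example does not depend on how this page itself was encoded.

```raku
# Hypothetical scratch file, created only for this demonstration.
my $path = $*TMPDIR.add("tanach-encoding-demo-{$*PID}.xml");
$path.spurt: "<w>\x[05D1]\x[05B8]\x[05BC]\x[05E8]\x[05B8]\x[05A3]\x[05D0]</w>\n", :enc<utf8>;

# Name the encoding explicitly when opening the input file.
my $file_in    = $path.open(:r, :enc<utf8>);
my $first-line = $file_in.lines.head;
$file_in.close;

# The same bytes decoded with the wrong encoding give different characters.
my $misread = $path.slurp(:enc<iso-8859-1>).lines.head;
say $first-line eq $misread;    # False

# For piped input or console output, set the encoding on the
# standard handles (Raku names them $*IN and $*OUT).
$*IN.encoding('utf8');
$*OUT.encoding('utf8');

$path.unlink;
```

If the console really uses another codepage, the argument to `.encoding` would have to name that codepage instead, which is an assumption that has to be checked on the machine in question.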
I'm now going to pretend that you posted to CodeReview.StackExchange.com instead.
First, I don't know why you are creating a whole grammar for something so small, which could easily be done with simple regexes.
my token HEB {
'<hebrewname>'
$<t_word> = [<.graph>+]
'</hebrewname>'
}
my token CHA {
'<c n="' $<t_number> = [\d+] '">'
}
my token VER {
'<v n="' $<t_number> = [\d+] '">'
}
my token WOR {
'<w>' $<t_word> = [<.graph>+] '</w>'
}
Honestly that is still more than you seem to need, as you only deal with one element per regex.
That's also ignoring that I really dislike that you are giving the elements names like t_word and t_number. That is pointless, as they are inside of $/, and Grammar doesn't have any similarly named methods, so there is no chance of them interfering with any other namespace. Give them descriptive names if you must give them names.
You can just restrict $/ to only stringifying to the part you care about with <(…)>. (It works here because you are only capturing one thing.)
<( means ignore everything before, and )> means ignore everything after.
my token HEB {
'<hebrewname>'
<( <.graph>+ )> # $/ will contain only what <.graph>+ matches
'</hebrewname>'
}
my token CHA {
'<c n="' <( \d+ )> '">'
}
my token VER {
'<v n="' <( \d+ )> '">'
}
my token WOR {
'<w>' <( <.graph>+ )> '</w>'
}
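To make the difference concrete, here is a small sketch that runs both styles against a made-up chapter line. The variable names ($line, $named, $marked) are only for illustration.

```raku
my $line = '<c n="12">';

# Named capture: the match covers the whole tag,
# and the digits are available as $<number>.
my $named = $line ~~ / '<c n="' $<number> = [\d+] '">' /;
say ~$named;            # <c n="12">
say ~$named<number>;    # 12

# Capture markers: the surrounding tag text must still be present,
# but the match object itself covers only the digits.
my $marked = $line ~~ / '<c n="' <( \d+ )> '">' /;
say ~$marked;           # 12
say +$marked;           # 12
```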
You are parsing it as if it were just a line-oriented file.
That does make a certain amount of sense, as it is formatted as one, and it results in less memory usage.
Using named regexes for that, let alone whole grammars, is a bit overkill. It also separates the logic when that isn't really necessary for such simple matches.
Here is how I would parse that file in a line oriented fashion:
my $in-names = False;
my %names;
my @chapters;
my @verses;
my @current-verse;
for $file_in.lines {
when /'<names>' / { $in-names = True }
when /'</names>'/ { $in-names = False }
# chapter
when /'<c n="' <( \d+ )> '">'/ {
@verses := @chapters[ +$/ - 1 ] //= [];
}
when /'</c>'/ {
# finalize this chapter
# for example print out statistics
# (only needed if you don't want `default` to catch it)
}
# verse
when /'<v n="' <( \d+ )> '">'/ {
@current-verse := @verses[ +$/ - 1 ] //= [];
}
when /'</v>'/ {
# finalize this verse
}
# word
when /'<w>' <( <.graph>+ )> '</w>'/ {
push @current-verse, ~$/;
}
# name tags
# must be after more specific regexes
when /'<' <tag=.ident> '>' $<value> = [<.ident>|\d+] {} "</$<tag>>"/ {
if $in-names {
%names{~$<tag>} = ~$<value>
} else {
note "not handling $<tag> => $<value> outside of <names>"
}
}
default { note "unexpected text '$_'" }
}
Note that when makes it so that you don't have to do next.
And since we just use $_ instead of $line, we can use regexes directly as the condition of those when statements.
I'm not bothering to use ^ or $, so there is no need to either trim or use ^\s* and \s*$.
It does make it a bit more fragile, so you may want to change it if it becomes a problem.
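A short sketch of both points, using made-up input lines and a hypothetical @seen array: a successful when leaves the current iteration on its own, and an unanchored regex does not care about indentation, whereas an anchored one has to allow for it.

```raku
my @seen;
for '  <c n="3">  ', '<v n="7">', 'something else' {
    # Unanchored regexes ignore the surrounding whitespace,
    # so no .trim is needed; a matching `when` skips the rest.
    when / '<c n="' <( \d+ )> '">' / { @seen.push: "chapter $/" }
    when / '<v n="' <( \d+ )> '">' / { @seen.push: "verse $/" }
    default                          { @seen.push: "unexpected" }
}
say @seen;    # [chapter 3 verse 7 unexpected]

# The stricter, anchored variant has to allow for the indentation itself.
say so '  <c n="3">  ' ~~ / ^ \s* '<c n="' \d+ '">' \s* $ /;    # True
say so '  <c n="3">  ' ~~ / ^ '<c n="' \d+ '">' $ /;            # False
```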
If you really want to just do simple line processing like you're doing, I'm sure you can alter the above to suit your needs.
I wanted to make this more useful to people who come across this in the future. So I created a data structure from the file instead of following what you were doing.
Really I probably only would have reached for a grammar if I were going to .parse() the entire file in one go.
This is what such a grammar would look like.
grammar Book {
rule TOP {
<names>
<chapter> +
# note that there needs to be a space between <chapter> and +
# so that whitespace can be between <c…>…</c> elements
}
rule names {
'<names>' ~ '</names>'
<name> +
}
token name {
'<' <tag=.ident> '>'
$<name> = [<.ident>|\d+]
{}
"</$<tag>>"
}
rule chapter {
# note space before ]
['<c n="' <number> '">' ] ~ '</c>'
<verse> +
}
rule verse {
['<v n="' <number> '">' ] ~ '</v>'
<word> +
}
token number { \d+ }
token word { '<w>' <( <.graph>+ )> '</w>' }
}
To do similar processing to what you have been doing:
class Line-Actions {
has IO::Handle:D $.file-out is required;
has $!number-type is default<chapter>;
method name ($/) {
if $<tag> eq 'hebrewname' {
say "hebrew name of book is $<name>";
}
}
# note that .chapter and .verse will run at the end
# of parsing them, which is too late for when .word is processed
# so we do it in .number instead
method number ($/) {
say "$!number-type number is $/";
$!number-type = 'verse';
}
method chapter ($/) {
# reset to default of "chapter"
# as the next .number will be for the next chapter
$!number-type = Nil;
}
method word ($/) {
say "word is $/";
$!file-out.print(~$/);
say "number of graphemes in word is $/.chars()";
.say for "$/".comb.map: *.uninames.join(', ');
}
}
Book.parsefile(
$filename,
actions => Line-Actions.new( file-out => 'outfile.txt'.IO.open(:w) )
);
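The word method above calls .uninames where the loop in the question called .uniname. A minimal sketch of the difference, using one pointed letter written with escapes (the variable name $bet is only for illustration): .chars counts graphemes, .codes counts code points, .uniname names only the first code point, and .uninames names all of them. That would explain why the output in the question lists only the letters and none of the vowel points.

```raku
# BET + QAMATS + DAGESH, written with escapes so the example
# does not depend on the encoding of the source file.
my $bet = "\x[05D1]\x[05B8]\x[05BC]";

say $bet.chars;      # 1  (graphemes)
say $bet.codes;      # 3  (code points)
say $bet.uniname;    # HEBREW LETTER BET
.say for $bet.uninames;
# HEBREW LETTER BET
# HEBREW POINT QAMATS
# HEBREW POINT DAGESH OR MAPIQ
```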
Your parsing problem seems to be somewhat confined to the example text you posted, as there appear to be forward-slashes ("solidus" characters) embedded within the snippet of Hebrew text you provided.
The script you provided was easy to fix up, and I re-worked the WOR token in your Raku script to select only <:Script<Hebrew>> Unicode characters. While this may help with stray/embedded "solidus" characters (and other, non-Hebrew characters), presumably you could re-write the script to parse faster. Here's the script:
grammar HEB {
token TOP {'<hebrewname>'<t_word>'</hebrewname>'}
token t_word {<graph>+}
};
grammar CHA {
token TOP {'<c n="'<t_number>'">'}
token t_number {\d+}
};
grammar VER {
token TOP {'<v n="'<t_number>'">'}
token t_number {\d+}
};
grammar WOR {
token TOP {'<w>'<t_word>'</w>'}
token t_word {<:Script<Hebrew>>+}
};
for $*ARGFILES.lines -> $line {
my $memline = $line.trim;
if HEB.parse($memline) {
say "hebrew name of book is "~ $/<t_word>;
next;
}
if CHA.parse($memline) {
say "chapitre number is "~ $/<t_number>;
next;
}
if VER.parse($memline) {
say "verse number is "~ $/<t_number>;
next;
}
if WOR.parse($memline) {
say "word is "~ $/<t_word>;
say "numbers of graphemes of word is "~ $/<t_word>.chars;
my @exploded_word = $/<t_word>.comb;
for @exploded_word {
say $_.uniname, ": ", $_;
};
next;
}
say "not processed";
}
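To see what the <:Script<Hebrew>> class in the WOR grammar accepts, here is a small sketch with words written as escapes. The $hebrew regex is a stand-in for the t_word token, anchored at both ends in the same way .parse requires a full match.

```raku
my $hebrew = / ^ <:Script<Hebrew>>+ $ /;

# alef, tav
say so "\x[05D0]\x[05EA]" ~~ $hebrew;             # True

# vav with sheva, then a solidus, then alef
say so "\x[05D5]\x[05B0]/\x[05D0]" ~~ $hebrew;    # False

# The solidus belongs to the Common script, not to Hebrew.
say '/'.uniprop('Script');                        # Common
```

As far as I can tell, this means a line that still contains the "/" separators, as in the sample from the question, would end up as "not processed" with this token, so it fits the test file used here better than the original sample.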
Starting with a new test file, I was able to get 124655/126663 lines of the following XML text to parse:
http://www.tanach.us/Books/Genesis.xml
Below is the parsed text from lines 103-119 (words which previously had given you problems):
hebrew name of book is בראשית
not processed
chapitre number is 1
verse number is 1
word is בְּרֵאשִׁ֖ית
numbers of graphemes of word is 6
HEBREW LETTER BET: בְּ
HEBREW LETTER RESH: רֵ
HEBREW LETTER ALEF: א
HEBREW LETTER SHIN: שִׁ֖
HEBREW LETTER YOD: י
HEBREW LETTER TAV: ת
word is בָּרָ֣א
numbers of graphemes of word is 3
HEBREW LETTER BET: בָּ
HEBREW LETTER RESH: רָ֣
HEBREW LETTER ALEF: א
HTH.