
How to count empty translations in .po with grep (or other LSB tool)?

I can search for empty translations in vim with a command like this:

/""\n\n

But my task is to find the number of non-translated strings. Any ideas how to do this with standard tools that every Linux box should have (no separate packages, please)?

Here is an example of a .po file containing 2 translated and 2 non-translated strings (a long and a short variant of each).

msgid "translated string"
msgstr "some translation"

msgid "non-translated string"
msgstr ""

msgid ""
"Some long translated string which starts from new line "
"and can last for few lines"
msgstr ""
"Translation of some long string which starts from new line "
"and lasts for few lines"

msgid ""
"Some long NON-translated string which starts from new line "
"and can last for few lines"
msgstr ""
Sergey P. aka azure asked Jan 25 '13 14:01


4 Answers

Here's one way using awk:

awk '$NF == "msgstr \"\"" { c++ } END { print c+0 }' FS="\n" RS= file

Results:

2

Explanation:

Put awk in paragraph mode (RS set to the empty string), so each blank-line-separated entry becomes one record, and set FS to a newline so that each line of the entry is a field. $NF is then the last line of the entry. If it matches msgstr "" exactly, count it; at the end of the script, print the count. If you later decide you want to count the translated strings instead, simply change == to !=. HTH.


From the comments below, to handle empty lines containing whitespace:

You'll need to use a regular expression, like: RS="\n{2,}|\n([ \t]*\n)+|\n$" (this could perhaps be simplified). Note, however, that the ability for RS to be a regex is a GNU awk extension; other awks will mishandle multi-character record separators in some way. Fortunately, the file format above looks fairly rigid, so handling lines that contain only whitespace shouldn't be necessary.

If you do face separator lines containing whitespace (spaces or tabs), the quick fix is a call to sed:

< file sed 's/^[[:space:]]*$//' | awk ...
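
If a regex RS is not available, another option is to skip paragraph mode and only remember whether the previous line was an empty msgstr. This is a minimal sketch, assuming a POSIX awk; the function name count_untranslated is made up for illustration, and plural entries (msgstr[0] "") are not handled:

```shell
# count_untranslated FILE: hypothetical helper, not part of the original answer.
count_untranslated() {
    awk '
        # A line that is empty or holds only spaces/tabs ends the current entry.
        /^[ \t]*$/ { if (pending) c++; pending = 0; next }
        # Remember whether this line is exactly an empty msgstr.
        { pending = ($0 == "msgstr \"\"") }
        # The last entry may not be followed by a blank line.
        END { if (pending) c++; print c + 0 }
    ' "$1"
}
```

Called as count_untranslated file, it counts an entry whose msgstr "" is followed by a blank line, a whitespace-only line, or the end of the file. A msgstr "" that is continued on the next line is not counted.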
Steve answered Nov 08 '22 09:11


I suggest using the available gettext tools, instead of trying to parse .po files directly:

$ msggrep -v -T -e "." test.po 
msgid "non-translated string"
msgstr ""

msgid ""
"Some long NON-translated string which starts from new line and can last for "
"few lines"
msgstr ""

The msggrep flags are:

  • -v invert match
  • -T apply next pattern to msgstr
  • -e search pattern

i.e. show any msgstr which does not match /./, and is therefore empty.

Since msggrep doesn't have -c, the count in a one-liner is:

 msggrep -v -T -e "." test.po  | grep -c ^msgstr

(msggrep has been part of the gettext package since v0.11, Jan 2002. LSB Core aka ISO/IEC 23360-1:2006(E) only mandates the gettext and msgfmt binaries, but I've yet to see a system without it, so it should hopefully meet your requirements.)

mr.spuratic answered Nov 08 '22 11:11


Since a (nice) awk solution has already been given, here are a few other ways:

All commands were tested with your sample and with a real .po file.

Using sed

sed -ne '/msgstr ""/{N;s/\n$//p}' <poFile | wc -l
2

Explanation: each time I find msgstr "", I append the next line to it (N). If I can then strip a newline that is the last character of the pattern space (s/\n$//), meaning the next line was empty, I print the result (p). Finally, wc -l counts the printed lines.
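
One caveat with the one-liner above: N needs a following line, so an empty msgstr "" on the very last line of the file (no trailing blank line) is not counted. Here is a possible variant, assuming GNU sed (other seds may want the commands on separate lines); the function name count_empty_msgstr is only illustrative:

```shell
count_empty_msgstr() {
    # $p   : on the last line there is nothing to append, so print it as is.
    # $!N  : otherwise append the next line to the pattern space.
    # /\n[[:space:]]*$/P : if the appended line is blank, print the msgstr line.
    sed -n '/^msgstr ""$/{$p;$!N;/\n[[:space:]]*$/P;}' "$1" | wc -l
}
```

Each untranslated entry produces exactly one output line, so wc -l still gives the count.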

Bash only

Without the use of any binary other than bash:

total=0
while read -r line; do
    if [ "$line" == 'msgstr ""' ]; then
        read -r line
        [ -z "$line" ] && ((total++))
    fi
done <poFile
echo $total
2

Explanation: each time I find msgstr "", I read the next line; if it is empty, I increment the counter.

Other bash way

mapfile -t line <poFile
count=0
# Walk from the last index down to 1, so that line[i-1] always exists.
for ((i=${#line[@]}; --i > 0; )); do
    [ -z "${line[i]}" ] && [ "${line[i-1]}" == 'msgstr ""' ] && ((count++))
done
echo $count
2

Explanation: read the entire .po file into an array, then walk the array looking for an empty element whose previous element contains msgstr "". Increment the counter for each one, then print it.

Perl (in command line mode)

perl -ne '$t++if/^$/&&$l=~/msgstr\s""\s*$/;$l=$_;END{printf"%d\n",$t}' <poFile
2

Explanation: each time I find an empty line and the previous line (stored in the variable $l) contains msgstr "", I increment the counter.

Dash (not bash!)

count=0
while read line ; do
    [ "$line" = "" ] && [ "$prev" = 'msgstr ""' ] && true $((count=count+1))
    prev="$line"
  done <poFile
echo $count
2

This is based on the Perl sample and works in both bash and dash.

F. Hauri answered Nov 08 '22 10:11


Try:

grep -c '^""$'

It counts the lines whose only content is two double quotes ("").

EDIT:

Following from your comment I see that the above does not meet your needs. To perform a multi-line match you could use GNU grep in the following way:

grep -Pzo '^msgstr ""\n\n' en.po | grep -c msgstr

This was tested and found to work using GNU grep 2.14. However, I do not know whether GNU grep is standard enough for you.

Explanation of 1st grep:

-P activate the Perl regex extension.

-z treat the input as NUL-separated records instead of newline-separated lines. A .po file contains no NUL bytes, so the whole file becomes one record and the pattern can match across newlines.

-o print 'only-matching', required because -z is in use; otherwise we'd print the whole file.

Explanation of 2nd grep:

-c count the number of lines matching, in this case msgstr. This has to be a separate grep invocation because, with -z, the whole file is a single record, so -c would return 1.
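
To see the effect of -z on -c in isolation, here is a small demonstration, assuming GNU grep; the variable names are only for illustration:

```shell
# Three lines of text, with no NUL bytes anywhere.
sample=$(printf 'a\nb\nc')

# Default mode: every line is its own record, so -c counts matching lines.
per_line=$(printf '%s\n' "$sample" | grep -c '')

# With -z the input is split on NUL bytes; there are none, so grep sees
# a single record and -c can report at most 1.
per_record=$(printf '%s\n' "$sample" | grep -cz '')

echo "lines: $per_line, NUL-separated records: $per_record"
```

With GNU grep this should print "lines: 3, NUL-separated records: 1", which is why the count is taken by a second, line-oriented grep.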

imp25 answered Nov 08 '22 09:11