 

Parallelize nested for loop in GNU Parallel

I have a small bash script to OCR PDF files (slightly modified this script). The basic flow for each file is:

For each page in the PDF file:

  1. Convert the page to a TIFF image (ImageMagick)
  2. OCR the image (tesseract)
  3. Append the result to a text file

Script:

FILES=/home/tgr/OCR/input/*.pdf
for f in $FILES
do

  FILENAME=$(basename "$f")
  ENDPAGE=$(pdfinfo "$f" | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
  OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
  RESOLUTION=1400
  touch "$OUTPUT"
  for i in $(seq 1 "$ENDPAGE"); do
      convert -monochrome -density "$RESOLUTION" "$f[$((i - 1))]" page.tif
      echo "processing file $f, page $i"
      tesseract page.tif tempoutput -l ces
      cat tempoutput.txt >> "$OUTPUT"
  done

  rm tempoutput.txt page.tif
done

Because of the high resolution and the fact that tesseract can only use one core, the process is extremely slow (it takes approx. 3 minutes to convert one PDF file).

Because I have thousands of PDF files, I think I could use parallel to keep all 4 cores busy, but I don't understand how to apply it. In the examples I see:

Nested for-loops like this:

  (for x in `cat xlist` ; do
    for y in `cat ylist` ; do
      do_something $x $y
    done
  done) | process_output
can be written like this:

parallel do_something {1} {2} :::: xlist ylist | process_output

Unfortunately, I was not able to figure out how to apply this to my case. How do I parallelize my script?

asked Sep 20 '13 by Tomas Greif

2 Answers

Since you have thousands of PDF files, it is probably enough to parallelize the processing of whole PDF files and not parallelize the processing of the pages within a single file.

function convert_func {
  f=$1
  FILENAME=$(basename "$f")
  ENDPAGE=$(pdfinfo "$f" | grep "^Pages: *[0-9]\+$" | sed 's/.* //')
  OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
  RESOLUTION=1400
  touch "$OUTPUT"
  for i in $(seq 1 "$ENDPAGE"); do
      convert -monochrome -density "$RESOLUTION" "$f[$((i - 1))]" "$$.tif"
      echo "processing file $f, page $i"
      tesseract "$$.tif" "$$" -l ces
      cat "$$.txt" >> "$OUTPUT"
  done

  rm "$$.txt" "$$.tif"
}

export -f convert_func

parallel convert_func ::: /home/tgr/OCR/input/*.pdf
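One detail worth noting: the $$ temp-file trick only avoids collisions because parallel starts a separate bash process for each job, so every job sees a different $$. A minimal check of that assumption, independent of the OCR pipeline:

```shell
#!/bin/bash
# parallel runs each job in its own bash process, so $$ differs per job.
# Simulate two jobs: each child shell reports its own PID.
pid1=$(bash -c 'echo $$')
pid2=$(bash -c 'echo $$')
echo "job 1 would write $pid1.tif, job 2 would write $pid2.tif"
[ "$pid1" != "$pid2" ] && echo "no collision"
```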

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial or http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.

Read the EXAMPLEs (LESS=+/EXAMPLE: man parallel).

answered Nov 07 '22 by Ole Tange


You can have a script like this.

#!/bin/bash

function convert_func {
    local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4
    local TEMP0=$(exec mktemp --suffix ".00.$PAGE_INDEX.tif")
    local TEMP1=$(exec mktemp --suffix ".01.$PAGE_INDEX")
    echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"  ## Just for debugging purposes.
    convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP0"
    echo "processing file $FILE, page $PAGE_INDEX"  ## I think you mean to place this before the line above.
    tesseract "$TEMP0" "$TEMP1" -l ces
    cat "$TEMP1".txt >> "$OUTPUT"  ## Lines may be mixed up from different processes here and a workaround may still be needed but it may no longer be necessary if outputs are small enough.
    rm -f "$TEMP0" "$TEMP1"
}

export -f convert_func

FILES=(/home/tgr/OCR/input/*.pdf)

for F in "${FILES[@]}"; do
    FILENAME=${F##*/}
    ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
    OUTPUT="/home/tgr/OCR/output/${FILENAME%.*}.txt"
    RESOLUTION=1400
    touch "$OUTPUT"  ## This may no longer be necessary. Or probably you mean to truncate it instead e.g. : > "$OUTPUT"

    for (( I = 1; I <= ENDPAGE; ++I )); do
        printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
    done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}'
done

It exports a function that parallel can call, properly sanitizes the arguments, and uses unique temporary files to make parallel processing possible.
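The \xFF-delimited, NUL-terminated record format can be sanity-checked without parallel at all: bash's read can consume the same stream, splitting records on NUL (-d '') and fields on the \xFF byte (IFS). This sketch assumes the same four-field layout as above; the file names are made up:

```shell
#!/bin/bash
export LC_ALL=C  # byte-wise splitting, so the 0xFF delimiter is safe

# Emit one record exactly as the loop above does, then parse it back.
printf '%s\xFF%s\xFF%s\xFF%s\x00' "in/a b.pdf" 1400 3 "out/a b.txt" |
while IFS=$'\xff' read -r -d '' file res page outfile; do
    echo "file=$file res=$res page=$page out=$outfile"
done
```

Because \xFF and NUL cannot appear in file names or decimal numbers, fields survive intact even with embedded spaces, which is the point of the encoding.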

Update. This version holds the output in multiple temporary files first, then concatenates them into the main output file in page order.

#!/bin/bash

shopt -s nullglob

function convert_func {
    local FILE=$1 RESOLUTION=$2 PAGE_INDEX=$3 OUTPUT=$4 TEMPLISTFILE=$5

    local TEMP_TIF=$(exec mktemp --suffix ".01.$PAGE_INDEX.tif")
    local TEMP_TXT_BASE=$(exec mktemp --suffix ".02.$PAGE_INDEX")

    echo "processing file $FILE, page $PAGE_INDEX"

    echo convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"  ## Just for debugging purposes.
    convert -monochrome -density "$RESOLUTION" "${FILE}[$(( PAGE_INDEX - 1 ))]" "$TEMP_TIF"

    tesseract "$TEMP_TIF" "$TEMP_TXT_BASE" -l ces

    echo "$PAGE_INDEX"$'\t'"${TEMP_TXT_BASE}.txt" >> "$TEMPLISTFILE"

    rm -f "$TEMP_TIF" "$TEMP_TXT_BASE"  ## The .txt file is removed later, after concatenation.
}

export -f convert_func

FILES=(/home/tgr/OCR/input/*.pdf)

for F in "${FILES[@]}"; do
    FILENAME=${F##*/}
    ENDPAGE=$(exec pdfinfo "$F" | grep '^Pages: *[0-9]\+$' | sed 's/.* //')
    BASENAME=${FILENAME%.*}
    OUTPUT="/home/tgr/OCR/output/$BASENAME.txt"
    RESOLUTION=1400

    TEMPLISTFILE=$(exec mktemp --suffix ".00.$BASENAME")
    : > "$TEMPLISTFILE"

    for (( I = 1; I <= ENDPAGE; ++I )); do
        printf "%s\xFF%s\xFF%s\xFF%s\x00" "$F" "$RESOLUTION" "$I" "$OUTPUT"
    done | parallel -0 -C $'\xFF' -j 4 -- convert_func '{1}' '{2}' '{3}' '{4}' "$TEMPLISTFILE"

    while IFS=$'\t' read -r __ FILE; do
        cat "$FILE"
        rm -f "$FILE"
    done < <(exec sort -n "$TEMPLISTFILE") > "$OUTPUT"

    rm -f "$TEMPLISTFILE"
done
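The reassembly step at the end (sort the list file numerically, then cat in that order) can be tried on its own with dummy page files; everything here is made up for illustration, under the same tab-separated list-file layout:

```shell
#!/bin/bash
# Simulate per-page text files finishing out of order, then
# reassemble them in page order via the tab-separated list file.
workdir=$(mktemp -d)
listfile="$workdir/list"
for page in 3 1 2; do                      # pages complete out of order
    pagefile="$workdir/page.$page.txt"
    echo "text of page $page" > "$pagefile"
    printf '%s\t%s\n' "$page" "$pagefile" >> "$listfile"
done
while IFS=$'\t' read -r _ pagefile; do     # sort -n restores page order
    cat "$pagefile"
    rm -f "$pagefile"
done < <(sort -n "$listfile") > "$workdir/combined.txt"
cat "$workdir/combined.txt"                # pages 1, 2, 3 in order
rm -rf "$workdir"
```

Sorting the list file rather than the outputs themselves means the per-page jobs can finish in any order without corrupting the final document.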
answered Nov 07 '22 by konsolebox