Process a large amount of data using bash

I need to process a large number of text files in a folder using bash scripting. Each file contains millions of rows formatted like this:

File #1:

en ample_1 200
it example_3 24
ar example_5 500
fr.b example_4 570
fr.c example_2 39
en.n bample_6 10

File #2:

de example_3 4
uk.n example_5 50
de.n example_4 70
uk example_2 9
en ample_1 79
en.n bample_6 1

...

I need to keep only the rows whose first column is "en" or "en.n", find duplicate occurrences in the second column, sum the third column for each of them, and get a sorted file like this:

en ample_1 279
en.n bample_6 11

Here is my script:

#! /bin/bash
clear
BASEPATH=<base_path>
FILES=<folder_with_files>
TEMP_UNZIPPED="tmp"
FINAL_RES="pg-1"
#iterate each file in folder and apply grep
INDEX=0
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" > log
for i in ${BASEPATH}${FILES}
do
FILENAME="${i%.*}"
if [ $INDEX = 0 ]; then
  VAR=$(gunzip $i)
  #-e -> multiple condition; -w exact word; -r grep recursively; -h remove file path
  FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $FINAL_RES)
  INDEX=1
  #remove file to free space
  rm $FILENAME
else
  VAR=$(gunzip $i)
  FILTER_EN=$(grep -e '^en.n\|^en ' $FILENAME > $TEMP_UNZIPPED)
  cat $TEMP_UNZIPPED >> $FINAL_RES
  #AWK BLOCK
  #create array a indexed with page title and adding frequency parameter as value.
  #eg. a['ciao']=2 -> the second time I find "ciao", I sum previous value 2 with the new. This is why i use "+=" operator
  #for each element in array I print i=page_title and array content such as frequency
  PARSING=$(awk '{  page_title=$1" "$2;
                    frequency=$3;
                    array[page_title]+=frequency
                  }END{
                    for (i in array){
                      print i,array[i] | "sort -k2,2"
                    }
                  }' $FINAL_RES)

  echo "$PARSING" > $FINAL_RES
  #END AWK BLOCK
  rm $FILENAME
  rm $TEMP_UNZIPPED
fi
done
mv $FINAL_RES $BASEPATH/06/01/
DATE=$(date "+DATE: %d/%m/%y - TIME: %H:%M:%S")
echo "$DATE" >> log

Everything works, but it takes a very long time to execute. Does anyone know how to get the same result in less time and with fewer lines of code?

JJack_ asked Feb 08 '23 10:02


1 Answer

The UNIX shell is an environment from which to manipulate files and processes and to sequence calls to tools. The UNIX tool that the shell calls to manipulate text is awk, so just use it:

$ awk '$1~/^en(\.n)?$/{tot[$1" "$2]+=$3} END{for (key in tot) print key, tot[key]}' file | sort
en ample_1 279
en.n bample_6 11
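
The command above reads a single uncompressed file. To apply the same idea to the folder of gzipped files described in the question, one option is to stream every archive through a single awk process, so that no temporary files or per-file awk runs are needed. The following is only a minimal sketch under assumptions: the function name sum_en_rows is hypothetical, the inputs are assumed to be .gz files passed as arguments, and the number of distinct "language title" keys is assumed to fit in memory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name not from the original post): sum the third column
# per "language title" pair across all gzipped files given as arguments,
# keeping only rows whose first field is exactly "en" or "en.n".
sum_en_rows() {
    # gunzip -c writes the decompressed data to stdout and leaves the
    # archives on disk untouched, so nothing has to be removed afterwards.
    gunzip -c "$@" |
        awk '$1 ~ /^en(\.n)?$/ { tot[$1 " " $2] += $3 }
             END { for (key in tot) print key, tot[key] }' |
        sort
}
```

It could be called as sum_en_rows /path/to/folder/*.gz > pg-1, where the path stands in for the real folder. Because awk sees one continuous stream, the totals are accumulated once and sorted once at the end, rather than re-aggregated after every file.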

Your script has too many issues to comment on, which indicates you are a beginner at shell programming. Get the books Bash Shell Scripting Recipes by Chris Johnson and Effective Awk Programming, 4th Edition, by Arnold Robbins.

Ed Morton answered Feb 13 '23 02:02