Bash: the quickest and most efficient array search

I need to do an array search many times in a bash process, so I need to know the quickest and most efficient way to do it. I know how to do it; the point of the question is how to do it the fastest way. Right now, I'm doing it this way:

#!/bin/bash

array_test=("text1" "text2" "text3" "text4" "text5")
text_to_search="text4"

START=$(date +%s.%N)

for item in "${array_test[@]}"; do
    if [ "${item}" = "${text_to_search}" ]; then
        echo "found!!"
        break
    fi
done

END=$(date +%s.%N)
DIFF=$(echo "$END - $START" | bc)
echo $DIFF

With this code we can measure the time.

Imagine we have 300 items or more in array. Is there a quicker way? I need to increase performance. Thank you.

EDIT: I'm using bash 4.2. The real array is written with line breaks:

array_test=(
"text1"
"text2"
"text3"
"text4"
"text5"
)
asked Feb 18 '17 by OscarAkaElvis
4 Answers

Fastest Version (for large arrays)

Use grep -qFx on "${array[*]}".
-q suppresses output and exits on the first match.
-F searches for fixed strings instead of regexes.
-x matches whole lines instead of substrings. This ensures that the search string b doesn't match the element abc.

if ( IFS=$'\n'; echo "${array[*]}" ) | grep -qFx "$text_to_search"; then
    echo "found!"
fi

Assumption: Neither the array nor the text to search contain linebreaks.

It's also possible to search texts with linebreaks, as long as there is a known unused character which can be stored inside a bash variable. I found \x1E (the ASCII control character »Record Separator«) to work out quite well. The adapted version looks as follows:

export d=$'\x1E' # unused character, here "Record Separator"
if ( IFS="$d"; echo "$d${array[*]}$d" ) | grep -qF "$d$text_to_search$d"; then
    echo "found!"
fi

In theory you might be able to speed this up even further by using multiple parallel grep on slices of the array. However, this is so fast (see results below) that you probably won't ever encounter a scenario where the parallel search pays off.
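For illustration only, here is a rough sketch of what such a parallel search over two slices could look like (the helper name search_slice is made up for this sketch, not part of any tool):

```shell
# Sketch: search two halves of the array in parallel background jobs.
array=(text1 text2 text3 text4 text5 text6 text7 text8)
text_to_search="text7"

half=$(( ${#array[@]} / 2 ))

search_slice() {
    # Print the slice (passed as arguments), one element per line,
    # and let grep look for an exact whole-line match.
    printf '%s\n' "$@" | grep -qFx "$text_to_search"
}

search_slice "${array[@]:0:$half}" &   # first half
pid1=$!
search_slice "${array[@]:$half}" &     # second half
pid2=$!

found=1
if wait "$pid1"; then found=0; fi
if wait "$pid2"; then found=0; fi
(( found == 0 )) && echo "found!"
```

With only two background jobs and the fork/pipe overhead of each grep, this only pays off for truly huge arrays, which is why the plain single-grep version is almost always the better choice.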

Testing

I used an array of size 1'000'000, which was generated as follows:

size=$((10 ** 6))
array_test=($(seq -f 'text%.0f' 1 "$size"))

(By the way: using {1..1000000} was orders of magnitude slower than seq)

The search pattern was the last entry of said array

text_to_search="text$size"

Three search methods were tested

  1. Your method using a for loop
  2. printf '%s\n' "${array[@]}" | grep -qFx
  3. (IFS=$'\n'; echo "${array[*]}";) | grep -qFx

With the following results:

  1. 65.5 seconds
  2. 59.3 seconds
  3. 00.4 seconds (yes, that's a zero before the decimal point)
answered Oct 24 '22 by Socowi

If you deeply care about performance, perhaps (a) Bash is not the best tool and (b) you should try different approaches and test them on your data. This being said, maybe associative arrays could help you.

Try this :

#!/bin/bash

declare -A array_test
array_test=(["text1"]="" ["text2"]="" ["text3"]="" ["text4"]="" ["text5"]="")
text_to_search="text4"

if
  [[ ${array_test[$text_to_search]+found} ]]
then
  echo "Found!"
fi

Please note that in this case I build the associative array with keys but empty values (there is no real use in setting the value to the same thing as the key and eating up more memory).

Another approach would be to sort your array and use some kind of binary search. That would, of course, involve more code and, if Bash associative arrays are implemented efficiently, is likely to be slower. But, again, nothing beats testing on actual data to validate performance hypotheses.
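For completeness, a minimal sketch of that binary-search idea over a lexicographically sorted array, in plain bash with no external tools (the sample array and target are made up):

```shell
# Sketch: binary search over a sorted bash array.
# Assumes the array is already sorted lexicographically.
sorted=(apple banana cherry date fig grape)
target="date"

lo=0
hi=$(( ${#sorted[@]} - 1 ))
found=""
while (( lo <= hi )); do
    mid=$(( (lo + hi) / 2 ))
    if [[ ${sorted[mid]} == "$target" ]]; then
        found=1
        break
    elif [[ ${sorted[mid]} < "$target" ]]; then
        # [[ a < b ]] compares strings lexicographically in bash
        lo=$(( mid + 1 ))
    else
        hi=$(( mid - 1 ))
    fi
done
[[ $found ]] && echo "Found!"
```

This does O(log n) comparisons instead of O(n), but each comparison happens in interpreted bash, so in practice an associative-array lookup or the grep pipe is usually still faster.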

If you have an associative array with information in the keys, you can use an expansion to iterate over the keys the same way you would over values:

for key in "${!array[@]}"
do
  do_something_with "$key"
done

Also, you can build your array with a loop, which would be useful if you read your elements from a file or from the output of a command. As an example :

declare -A array_test=()
while IFS= read -r key
do
  array_test[$key]=
done < <(command_with_output)

It is useful to note that an element set to a null (empty) value is not the same as an unset element. You can easily see that with the following expansion :

declare -A array_test=()
array_test[existing_key]=
echo ${array_test[existing_key]+found} # Echoes "found"
echo ${array_test[missing_key]+found}  # Echoes nothing

The "${var+value}" expansion is useful because it expands to nothing if the variable is unset, and to value if it is set (even to an empty string). It does not produce an error under set -u, which catches attempts to expand unset variables.

answered Oct 24 '22 by Fred


If none of your array elements contain whitespace, you can use pattern matching

array_test=("text1" "text2" "text3" "text4" "text5")
text_to_search="text4"

[[ " ${array_test[*]} " == *" $text_to_search "* ]] && echo found || echo not found

The quotes and spaces are very important there.

In bash, within [[ ]] double brackets, the == operator is a pattern matching operator, not just string equality.
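A small illustration of that difference (quoting the right-hand side suppresses pattern matching):

```shell
s="hello world"
[[ $s == hello* ]] && echo "pattern match"        # unquoted right side: glob pattern
[[ $s == "hello*" ]] || echo "no literal match"   # quoted right side: literal string comparison
```

That glob behavior is exactly what makes the *" $text_to_search "* trick above work.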

answered Oct 24 '22 by glenn jackman


A concrete example, based on @Fred's helpful answer, using an associative array:

script.sh

#!/bin/bash

mapfile -t array_test < <(seq 1 10000)
text_to_search="9999"

function from_associative_array {
  declare -A array
  for constant in "${array_test[@]}"
  do
      array[$constant]=1
  done
  [[ ${array[$text_to_search]} ]] && echo "Found in associative array"
}

function from_indexed_array {
  for item in "${array_test[@]}"; do
      if [ "${item}" = "${text_to_search}" ]; then
          echo "Found in indexed array"
          break
      fi
  done
}

time from_indexed_array
time from_associative_array

$ bash script.sh
Found in indexed array

real    0m0.611s
user    0m0.604s
sys     0m0.004s

Found in associative array

real    0m0.297s
user    0m0.296s
sys     0m0.000s
answered Oct 24 '22 by SLePort