Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find all words containing characters in UNIX

Tags:

shell

unix

Given a word W, I want to find all words containing the letters in W from /usr/dict/words. For example, "bat" should return "bat" and "tab" (but not "table").

Here is one solution which involves sorting the input word and matching:

word=$1
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`

while read line
do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
        echo $line
    fi
done < /usr/dict/words

Is there a better way? I'd prefer using basic commands (instead of perl/awk etc), but all solutions are welcome!

To clarify, I want to find all permutations of the original word. Addition or deletion of characters is not allowed.

like image 982
dogbane Avatar asked Mar 04 '10 10:03

dogbane


People also ask

How do you check if a file contains any characters in Unix?

You need to use the grep command. The grep command or egrep command searches the given input FILEs for lines containing a match or a text string.

How do I search for a file containing text in Unix?

The grep command searches through the file, looking for matches to the pattern specified. To use it type grep , then the pattern we're searching for and finally the name of the file (or files) we're searching in.

How do you grep in Unix?

To search multiple files with the grep command, insert the filenames you want to search, separated with a space character. The terminal prints the name of every file that contains the matching lines, and the actual lines that include the required string of characters. You can append as many filenames as needed.

What is $_ in Unix?

$_ (dollar underscore) is another special bash parameter and used to reference the absolute file name of the shell or bash script which is being executed as specified in the argument list. This bash parameter is also used to hold the name of mail file while checking emails.


3 Answers

here's an awk implementation. It finds the words with those letters in "W".

dict="/usr/share/dict/words"
word=$1
awk -vw="$word" 'BEGIN{
  m=split(w,c,"")
  for(p=1;p<=m;p++){ chars[c[p]]++ }
}
length($0)==length(w){
  f=0;g=0
  n=split($0,t,"")
  for(o=1;o<=n;o++){
    if (!( t[o] in chars) ){
       f=1; break
    }else{ st[t[o]]++ }
  }
  if (!f || $0==w){
      for(z in st){
        if ( st[z] != chars[z] ) { g=1 ;break}
      }
      if(!g){ print "found: "$0 }
  }
  delete st
}' $dict

output

$ wc -l < /usr/share/dict/words
479829

$ time ./shell.sh look
found: kolo
found: look

real    0m1.361s
user    0m1.074s
sys     0m0.015s

Update: change of algorithm, using sorting

dict="/usr/share/dict/words"
awk 'BEGIN{
  w="table"
  m=split(w,c,"")
  b=asort(c,chars)
}
length($0)==length(w){
  f=0
  n=split($0,t,"")
  e=asort(t,d)
  for(i=1;i<=e;i++) {
    if(d[i]!=chars[i]){
        f=1;break
    }
  }
  if(!f) print $0
}' $dict

output

$ time ./shell.sh #looking for table
ablet
batel
belat
blate
bleat
tabel
table

real    0m1.416s
user    0m1.343s
sys     0m0.014s

$ time ./shell.sh #looking for chairs
chairs
ischar
rachis

real    0m1.697s
user    0m1.660s
sys     0m0.014s

$ time perl perl.pl #using beamrider's Perl script
table
tabel
ablet
batel
blate
bleat
belat

real    0m2.680s
user    0m1.633s
sys     0m0.881s

$ time perl perl.pl # looking for chairs
chairs
ischar
rachis

real    0m14.044s
user    0m8.328s
sys     0m5.236s
like image 132
ghostdog74 Avatar answered Oct 23 '22 08:10

ghostdog74


Here's a shell solution. The best algorithm seems to be #4. It filters out all words that are of incorrect length. Then, it sums the words using a simple substitution cipher (a=1, b=2, A=27, ...). If the sums match, then it will actually do the original sort and compare. On my system, it can churn through ~235k words looking for "bat" in just under 1/2 second. I'm providing all of my solutions so you can see the different approaches.

Update: not shown, but I also tried putting the sum inside the first bin of the histogram approach I tried, but it was even slower than the histograms without. I thought it would function as a short circuit, but it didn't work.

Update2: I tried the awk solution and it runs in about 1/3 the time of my best shell solution or ~0.126s versus ~0.490s. The perl solution runs ~1.1s.

#!/bin/bash

word=$1
#dict=words
dict=/usr/share/dict/words
#dict=/usr/dict/words

alg1() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`

  while read line
  do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
      echo $line
    fi
  done < $dict
}

check_sorted_versus_not() {
    local word=$1
    local line=`echo $2 | grep -o . | sort | tr -d '\n'`
    if [ "$word" == "$line" ]
    then
        echo $2
    fi
}

# Filter out all words of incorrect length
alg2() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
      echo $line
    fi
  done
}


# Create a lot of variables like this:
# _a=1, _b=2, ... _z=26, _A=27, _B=28, ... _Z=52
gen_chars() {
#  [ -n "$GEN_CHARS" ] && return
  GEN_CHARS=1
  local alpha="abcdefghijklmnopqrstuvwxyz"
  local upperalpha=`echo -n $alpha | tr 'a-z' 'A-Z'`
  local both="$alpha$upperalpha"
  for ((i=0; i < ${#both}; i++))
  do
    ACHAR=${both:i:1}
    eval "_$ACHAR=$((i+1))"
  done
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time by building an arithmetic expression
# and then evaluate that expression.
# Requires: gen_chars
sum_word() {
  SUM=0
  local s=""
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    s="$s\$_$ACHAR+"
  done

  SUM=$(( $(eval echo -n ${s}0) ))
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time using a case statement.
sum_word2() {
  SUM=0
  local s=""
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    case $ACHAR in
    a) SUM=$((SUM+  1));;
    b) SUM=$((SUM+  2));;
    c) SUM=$((SUM+  3));;
    d) SUM=$((SUM+  4));;
    e) SUM=$((SUM+  5));;
    f) SUM=$((SUM+  6));;
    g) SUM=$((SUM+  7));;
    h) SUM=$((SUM+  8));;
    i) SUM=$((SUM+  9));;
    j) SUM=$((SUM+ 10));;
    k) SUM=$((SUM+ 11));;
    l) SUM=$((SUM+ 12));;
    m) SUM=$((SUM+ 13));;
    n) SUM=$((SUM+ 14));;
    o) SUM=$((SUM+ 15));;
    p) SUM=$((SUM+ 16));;
    q) SUM=$((SUM+ 17));;
    r) SUM=$((SUM+ 18));;
    s) SUM=$((SUM+ 19));;
    t) SUM=$((SUM+ 20));;
    u) SUM=$((SUM+ 21));;
    v) SUM=$((SUM+ 22));;
    w) SUM=$((SUM+ 23));;
    x) SUM=$((SUM+ 24));;
    y) SUM=$((SUM+ 25));;
    z) SUM=$((SUM+ 26));;
    A) SUM=$((SUM+ 27));;
    B) SUM=$((SUM+ 28));;
    C) SUM=$((SUM+ 29));;
    D) SUM=$((SUM+ 30));;
    E) SUM=$((SUM+ 31));;
    F) SUM=$((SUM+ 32));;
    G) SUM=$((SUM+ 33));;
    H) SUM=$((SUM+ 34));;
    I) SUM=$((SUM+ 35));;
    J) SUM=$((SUM+ 36));;
    K) SUM=$((SUM+ 37));;
    L) SUM=$((SUM+ 38));;
    M) SUM=$((SUM+ 39));;
    N) SUM=$((SUM+ 40));;
    O) SUM=$((SUM+ 41));;
    P) SUM=$((SUM+ 42));;
    Q) SUM=$((SUM+ 43));;
    R) SUM=$((SUM+ 44));;
    S) SUM=$((SUM+ 45));;
    T) SUM=$((SUM+ 46));;
    U) SUM=$((SUM+ 47));;
    V) SUM=$((SUM+ 48));;
    W) SUM=$((SUM+ 49));;
    X) SUM=$((SUM+ 50));;
    Y) SUM=$((SUM+ 51));;
    Z) SUM=$((SUM+ 52));;
    *) SUM=0; return;;
    esac
  done
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word by building an arithmetic expression using sed and then evaluating
# the expression.
# Requires: gen_chars
sum_word3() {
  SUM=$(( $(eval echo -n `echo -n $1 | sed -E -ne 's,.,$_&+,pg'`) 0))
  #echo "SUM($1)=$SUM"
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
alg3() {
  gen_chars
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
# Use sum_word2
alg4() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word2 $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word2 $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
# Use sum_word3
alg5() {
  gen_chars
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word3 $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word3 $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}


# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time using a case statement.
# Place results in a histogram
sum_word4() {
  SUM=(0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 
       0)
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    case $ACHAR in
    a) SUM[1]=$((SUM[ 1] + 1));;
    b) SUM[2]=$((SUM[ 2] + 1));;
    c) SUM[3]=$((SUM[ 3] + 1));;
    d) SUM[4]=$((SUM[ 4] + 1));;
    e) SUM[5]=$((SUM[ 5] + 1));;
    f) SUM[6]=$((SUM[ 6] + 1));;
    g) SUM[7]=$((SUM[ 7] + 1));;
    h) SUM[8]=$((SUM[ 8] + 1));;
    i) SUM[9]=$((SUM[ 9] + 1));;
    j) SUM[10]=$((SUM[10] + 1));;
    k) SUM[11]=$((SUM[11] + 1));;
    l) SUM[12]=$((SUM[12] + 1));;
    m) SUM[13]=$((SUM[13] + 1));;
    n) SUM[14]=$((SUM[14] + 1));;
    o) SUM[15]=$((SUM[15] + 1));;
    p) SUM[16]=$((SUM[16] + 1));;
    q) SUM[17]=$((SUM[17] + 1));;
    r) SUM[18]=$((SUM[18] + 1));;
    s) SUM[19]=$((SUM[19] + 1));;
    t) SUM[20]=$((SUM[20] + 1));;
    u) SUM[21]=$((SUM[21] + 1));;
    v) SUM[22]=$((SUM[22] + 1));;
    w) SUM[23]=$((SUM[23] + 1));;
    x) SUM[24]=$((SUM[24] + 1));;
    y) SUM[25]=$((SUM[25] + 1));;
    z) SUM[26]=$((SUM[26] + 1));;
    A) SUM[27]=$((SUM[27] + 1));;
    B) SUM[28]=$((SUM[28] + 1));;
    C) SUM[29]=$((SUM[29] + 1));;
    D) SUM[30]=$((SUM[30] + 1));;
    E) SUM[31]=$((SUM[31] + 1));;
    F) SUM[32]=$((SUM[32] + 1));;
    G) SUM[33]=$((SUM[33] + 1));;
    H) SUM[34]=$((SUM[34] + 1));;
    I) SUM[35]=$((SUM[35] + 1));;
    J) SUM[36]=$((SUM[36] + 1));;
    K) SUM[37]=$((SUM[37] + 1));;
    L) SUM[38]=$((SUM[38] + 1));;
    M) SUM[39]=$((SUM[39] + 1));;
    N) SUM[40]=$((SUM[40] + 1));;
    O) SUM[41]=$((SUM[41] + 1));;
    P) SUM[42]=$((SUM[42] + 1));;
    Q) SUM[43]=$((SUM[43] + 1));;
    R) SUM[44]=$((SUM[44] + 1));;
    S) SUM[45]=$((SUM[45] + 1));;
    T) SUM[46]=$((SUM[46] + 1));;
    U) SUM[47]=$((SUM[47] + 1));;
    V) SUM[48]=$((SUM[48] + 1));;
    W) SUM[49]=$((SUM[49] + 1));;
    X) SUM[50]=$((SUM[50] + 1));;
    Y) SUM[51]=$((SUM[51] + 1));;
    Z) SUM[52]=$((SUM[52] + 1));;
    *) SUM[53]=-1; return;;
    esac
  done

 #echo ${SUM[*]}
}

# Check if two histograms are equal
hist_are_equal() {
  # Array sizes differ?
  [ ${#_h1[*]} != ${#SUM[*]} ] && return 1

  # parsing input one index at a time
  for ((i=0; i < ${#_h1[*]}; i++))
  do
    [ ${_h1[i]} != ${SUM[i]} ] && return 1
  done

  return 0
}

# Check if two histograms are equal
hist_are_equal2() {
  # Array sizes differ?
  local size=${#_h1[*]}
  [ $size != ${#SUM[*]} ] && return 1

  # parsing input one index at a time
  for ((i=0; i < $size; i++))
  do
    [ ${_h1[i]} != ${SUM[i]} ] && return 1
  done

  return 0
}

# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg6() {
  sum_word4 $word
  _h1=${SUM[*]}
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word4 $line
    if hist_are_equal
    then
      echo $line
    fi
  done
}

# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg7() {
  sum_word4 $word
  _h1=${SUM[*]}
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word4 $line
    if hist_are_equal2
    then
      echo $line
    fi
  done
}

run_test() {
  echo alg$1
  eval time alg$1
}

#run_test 1
#run_test 2
#run_test 3
run_test 4
#run_test 5
run_test 6
#run_test 7
like image 30
Harvey Avatar answered Oct 23 '22 08:10

Harvey


#!/usr/bin/perl
$myword=join("", sort split (//, $ARGV[0]));
shift;
while (<>) {
    chomp;
    print "$_\n" if (join("", sort split (//)) eq $myword);
}

Use it like this: bla.pl < /usr/dict/words searchword

like image 27
WMR Avatar answered Oct 23 '22 08:10

WMR