Using awk to find a domain name containing the longest repeated word

Tags:

regex

bash

awk

For example, let's say there is a file called domains.csv with the following:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org

I'm trying to use awk regular expressions on Linux to find the line that contains the longest repeated¹ word, so in this case it should return the line

5,letswelcomewelcomeyou.org

How do I do that?

¹ Meaning "immediately repeated", i.e., abcabc, but not abcXabc.

asked Dec 14 '22 by cullan

1 Answer

A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
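
To illustrate the backreference mechanism: in the ERE (.*)\1, the \1 matches another copy of whatever the parenthesized group matched, so the expression as a whole matches any string immediately followed by itself. A minimal demo (assuming GNU grep):

$ echo 'abcabcX' | grep -Eo '(.*)\1'
abcabc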

I've added one line to the example input file for the case of multiple longest words:

1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org

And this gets the lines with the longest repeated sequence:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile

Since this is far from obvious, let's split up what this does and look at the output at each stage:

  1. Remove the first column with the line number to avoid matches for line numbers with repeating digits:

    $ cut -d ',' -f 2 infile
    helloguys.ca
    byegirls.com
    hellohelloboys.ca
    hellobyebyedad.com
    letswelcomewelcomeyou.org
    letscomewelcomewelyou.org
    
  2. Get all lines with a repeated sequence and extract just the repeated sequences. Note that a single line can contribute more than one match (hellobyebyedad.com yields both ll and byebye), while byegirls.com contributes none:

    ... | grep -Eo '(.*)\1'
    ll
    hellohello
    ll
    byebye
    welcomewelcome
    comewelcomewel
    
  3. Get the length of each of those lines:

    ... | awk '{ print length(), $0 }'
    2 ll
    10 hellohello
    2 ll
    6 byebye
    14 welcomewelcome
    14 comewelcomewel
    
  4. Sort by the first column, numerically, descending:

    ...| sort -k 1,1 -nr
    14 welcomewelcome
    14 comewelcomewel
    10 hellohello
    6 byebye
    2 ll
    2 ll
    
  5. Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:

    ... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
    welcomewelcome
    comewelcomewel
    
  6. Pipe this into grep, using the -f - argument to read the patterns from stdin:

    ... | grep -f - infile
    5,letswelcomewelcomeyou.org
    6,letscomewelcomewelyou.org
    

Limitations

While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
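
For example:

$ echo 'welwelcomewelcome' | grep -Eo '(.*)\1'
welwel

This is because grep returns the leftmost match, and welwel starts earlier in the string than welcomewelcome.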

Alternative solution with more awk, less sort

As pointed out by tripleee in comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:

$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
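
On the extended six-line example file, this again prints both lines containing a longest repeated word:

5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org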

Let's look at that awk step in more detail, with expanded variable names for clarity:

{
    # New longest match: throw away stored longest matches, reset index
    if (length() > max_len) {
        max_len = length()
        delete arr_longest
        idx = 1
    }

    # Add line to longest matches
    if (length() >= max_len)
        arr_longest[idx++] = $0
}

# Print all the longest matches
END {
    for (idx in arr_longest)
        print arr_longest[idx]
}

Benchmarking

I've timed the two solutions on the top one million domains file mentioned in the comments:

  • First solution (with sort and two awk steps):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.742s
    user    1m57.873s
    sys     0m0.045s
    
  • Second solution (just one awk step, no sort):

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    1m55.603s
    user    1m56.514s
    sys     0m0.045s
    
  • And the Perl solution by Casimir et Hippolyte:

    964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
    
    real    0m5.249s
    user    0m5.234s
    sys     0m0.000s
    

What we learn from this: ask for a Perl solution next time ;)

Interestingly, if we know that there will be just one longest match, the commands can be simplified accordingly (just head -1 instead of the second awk command in the first solution, or no bookkeeping for multiple longest matches in the second solution), but the time gained is only in the range of a few seconds; see the sketch below.
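
A minimal sketch of that simplified variant of the second solution; it keeps only the first longest match it sees, so ties (like line 6 in the extended example) are silently dropped:

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk 'length() > max_len { max_len = length(); best = $0 } END { print best }' |
grep -f - infile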

Portability remark

Apparently, BSD grep can't do grep -f - to read patterns from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file then used with grep -f.
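
A sketch of that workaround for the first solution (patterns.tmp is just an illustrative name; mktemp would be more robust):

cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' > patterns.tmp
grep -f patterns.tmp infile
rm patterns.tmp

Alternatively, in bash, process substitution avoids the temp file, since <(...) expands to a file name grep can open:

grep -f <(cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
    awk '{ print length(), $0 }' | sort -k 1,1 -nr |
    awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}') infile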

answered May 14 '23 by Benjamin W.