
Regular expression to search for Gadaffi [closed]

Tags: regex, search

I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?

This is a list of 30 variants:

Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi

My best attempt so far is:

\b[KG]h?add?af?fi$\b

But I still seem to be missing some variants. Any suggestions?

SiggyF asked Mar 19 '11 22:03


4 Answers

Easy... (Qadaffi|Khadafy|Qadafi|...)... It's self-documenting, maintainable, and, assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.

Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
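
As a rough illustration of the plain-alternation approach, here is a minimal Python sketch. The short NAMES list, the helper name build_pattern and the sample sentence are assumptions made for the example, not part of the answer. Note that Python's re module is a backtracking engine, so the point about compiling to a DFA does not carry over to it; the readability argument does.

```python
import re

# A few spellings from the question (illustrative subset; extend with the full list).
NAMES = ["Gaddafi", "Khadafy", "Qadhafi", "Kad'afi"]

def build_pattern(names):
    """Join the literal spellings into one word-bounded alternation."""
    # re.escape keeps any regex metacharacters in a name literal.
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?:" + alternation + r")\b")

PATTERN = build_pattern(NAMES)

# "Gadfly" is not in the list, so it is not reported.
found = PATTERN.findall("Reports spell it Gaddafi, Khadafy or Kad'afi; not Gadfly.")
print(found)
```

Adding a newly discovered spelling is then a one-word change to the list rather than a rewrite of the pattern.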

Chris Pacejo answered Oct 14 '22 07:10


\b[KGQ]h?add?h?af?fi\b

The Arabic transliteration is (Wikipedia says) "Qaḏḏāfī", so maybe add a Q. And one optional H ("Gadhafi", as the article mentioned below spells it).

Btw, why is there a $ at the end of the regex?
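
To see what that $ does, here is a small Python check; the two sample sentences are made up for the illustration. Because $ only matches at the end of the text (or, in Python, just before a trailing newline), the question's pattern can only find the name when nothing follows it.

```python
import re

# Pattern exactly as written in the question, and the same pattern without "$".
original = re.compile(r"\b[KG]h?add?af?fi$\b")
without_dollar = re.compile(r"\b[KG]h?add?af?fi\b")

end_of_text = "the colonel is named Gaddafi"  # name is the last word
mid_text = "Gaddafi gave a speech"            # name is followed by more text

at_end = bool(original.search(end_of_text))
in_middle = bool(original.search(mid_text))
fixed_in_middle = bool(without_dollar.search(mid_text))

print(at_end, in_middle, fixed_in_middle)
```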


Btw, nice article on the topic:

Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?


EDIT

This should match all the names in the article you mentioned later. Let's just hope it won't match a lot of other stuff :D

\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
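
A quick way to check a pattern like this is to run it over the 30 spellings listed in the question. The Python sketch below does that; the two probe words at the end are invented to show the kind of "other stuff" a generalized pattern can also accept.

```python
import re

# The 30 spellings listed in the question.
NAMES = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

PATTERN = re.compile(r"\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b")

# Spellings from the list that the pattern fails to match in full.
missed = [name for name in NAMES if not PATTERN.fullmatch(name)]

# Made-up words that are not in the list but still satisfy the pattern.
extra = [word for word in ("Quezzhaffy", "Ghutafy") if PATTERN.fullmatch(word)]

print(len(NAMES), missed, extra)
```

An empty missed list means every listed spelling is covered; the non-empty extra list shows the price paid for the compact form.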

Czechnology answered Oct 14 '22 06:10


One interesting thing to note from your list of potential spellings is that there are only three Soundex values for the whole list (if you ignore the outlier 'Kazzafi'):

G310, K310, Q310

Now, there are false positives in there ('Godby' is also G310), but by also requiring a hit against a short list of Metaphone keys, you can eliminate them.

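<?php
// Soundex codes shared by the listed spellings (the outlier 'Kazzafi' is ignored).
$soundexMatch = array('G310','K310','Q310');
// Metaphone keys accepted as a match.
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');

$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";

// Split on whitespace and basic punctuation; apostrophes stay inside words.
$wordArray = preg_split('/[\s,.;-]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
$matches = array();
foreach ($wordArray as $item){
    // Each test contributes 0 or 1; a word must pass both to count as a match.
    $rate = (int) in_array(soundex($item),$soundexMatch) + (int) in_array(metaphone($item),$metaphoneMatch);
    if ($rate > 1){
        // Quote the word so it is safe to embed in the pattern below.
        $matches[] = preg_quote($item, '/');
    }
}
// Skip the replace when nothing matched: an empty pattern would match everywhere.
if ($matches){
    $pattern = implode("|", array_unique($matches));
    $text = preg_replace("/($pattern)/","<b>$1</b>",$text);
}
echo $text;
?>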

A few tweaks, and let's say some Cyrillic transliteration, and you'll have a fairly robust solution.
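
The Soundex observation can be checked outside PHP as well. The sketch below is a hand-written American Soundex in Python, written for this illustration (it is not PHP's soundex() itself, though it is meant to follow the same standard rules), applied to the 30 spellings from the question plus the false positive 'Godby'.

```python
# Consonant classes of American Soundex; vowels, y, h and w carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word (non-letters are ignored)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate two consonants of the same class
        code = CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated class is coded again
    return (result + "000")[:4]

NAMES = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

codes = {soundex(name) for name in NAMES}
outliers = [name for name in NAMES if soundex(name) not in {"G310", "K310", "Q310"}]
print(sorted(codes), outliers, soundex("Godby"))
```

Soundex alone is too coarse, which is why the answer pairs it with a second phonetic key before accepting a word.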

tomwalsham answered Oct 14 '22 06:10


Using CPAN module Regexp::Assemble:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';    # say() is not enabled by default

use Regexp::Assemble;

# Feed every known spelling to the assembler and print the combined regex.
my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
                    Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
                    Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
                    Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
                    Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
                    Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;

This produces the following regular expression:

(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
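
The generated expression is hard to read, so it is worth checking mechanically. The Python sketch below copies the expression from above, minus the outer Perl-specific (?-xism: ... ) wrapper (dropped here as a precaution for portability), and confirms that it accepts every listed spelling. The three near-miss words are invented for the test.

```python
import re

# The 30 spellings fed to Regexp::Assemble above.
NAMES = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

# Assembled expression with the outer (?-xism: ... ) wrapper removed.
ASSEMBLED = re.compile(
    r"(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)"
    r"|h(?:ad(?:daf[iy]|af?fi)|eddafi))"
    r"|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))"
    r"|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)"
)

# Listed spellings the assembled expression fails to match in full.
missed = [name for name in NAMES if not ASSEMBLED.fullmatch(name)]

# Plausible-looking spellings that were never added, so they should be rejected.
rejected = [word for word in ("Gaddafee", "Kadaffy", "Qadafy")
            if not ASSEMBLED.fullmatch(word)]

print(missed, rejected)
```

Since the expression is built only from the words that were added, it stays strict: a spelling that is not in the list has to be added explicitly.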

Prakash K answered Oct 14 '22 06:10