I have two lists of song titles, each in a plain text file, which are the filenames of licensed lyric files - I want to check if the shorter list titles (needle) are in the longer list (haystack). The script/app should return the list of titles in the needle that aren't in the haystack.
I'd prefer to use Python or a shell script (BASH) or just use visual diff program that can handle the fuzziness needed.
The main problem is that the titles need to be fuzzy matched to account for data entry errors and possibly also word ordering.
Haystack sample (note some duplicate and near duplicate lines, matches highlighted):
Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom
Needle sample:
You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**
Note that the needle title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It's better to fail towards not matching and therefore any uncertain title matches should appear in the output.
comm -1 -3 <(sort haystack.txt) <(sort needle.txt)
doesn't find any of the matches. diff
or grep
seems they'd have the same problem with not being fuzzy enough. Kdiff3
and diffnow.com
were as quick as a manual comparison as I had to still scan through for nearly all matches, they could only cope with whitespace and letter-case differences.
ExamDiffPro
from prestosoft.com looks like a possibility but is MS Windows only and I'd prefer a native Linux solution before I go messing with WINE or VirtualBox.
The needle is actually a CSV so I've thought about using LibreOffice and treating it as a database and doing SQL queries or using a spreadsheet with hlookup or something ... Another question led me to OpenRefine (formerly google-refine)
Seems like this is a common category of problem (it's basically "record linkage" which often uses [Levenshtein] edit-distance calculation), how should I approach it? Suggestions please?
Open both of the files that you want to compare. On the View tab, in the Window group, click View Side by Side. in the Window group on the View tab.
To compare two files or groups of files at a local site, you can use the Fc.exe and the Comp.exe file compare commands. Both commands are run from a command prompt. You can use Fc.exe to compare two ASCII or binary files on a line-by-line basis.
I did something similar to this in MySQL. I used the following code to define Levenshtein distance and ratio functions (which I got from the answer to this question):
DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 text;
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END;
DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN
SET max_len = s1_len;
ELSE
SET max_len = s2_len;
END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END;
Assuming you import your lists as the following two tables:
needle (title VARCHAR)
haystack (title VARCHAR)
You can then use a query like the following one to compare the two tables.
SELECT title, best_match,
levenshtein_ratio(TRIM(LOWER(title)), TRIM(LOWER(best_match))) AS ratio
FROM (
SELECT n.title AS title, (
SELECT h.title
FROM haystack h
ORDER BY levenshtein_ratio(TRIM(LOWER(n.title)), TRIM(LOWER(h.title))) DESC
LIMIT 1
) AS best_match
FROM needle n
) x
ORDER BY ratio DESC
Pick a cutoff value, all rows below that value didn't have good matches. You can use levenshtein() instead of levenshtein_ratio() if you want to work with edit distances directly, in this case the ORDER BY to ASC.
Note that this doesn't have any special provision for word order differences. Also, if your lists are large, the comparison can be slow.
If you want to go the OpenRefine way, the best is set up a local reconciliation server - I recommand reconcile csv.
Load your haystack in the reconciliation server and have OpenRefine processing your needle file and sending it to the reconciliation service.
Reconciliation server return a score with each proposal so you can (I just use those score as example, don't take them for granted):
You might want to look at fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy).
from fuzzywuzzy import process
needles = open('needle').read().split("\n")
haystack = open('haystack').read().split("\n")
for a in needles:
print a + ' -> ',
print process.extractBests(a, haystack, score_cutoff=90)
Usefull params for the extract function are limit, scorer, and processor.
Fuzzywuzzy can make your dreams come true:
import csv
from fuzzywuzzy import fuzz
# Grab CSV data:
with open('needles.csv', 'U') as z:
reader = csv.reader(z)
needles = list(reader)
with open('haystack.csv', 'U') as w:
reader = csv.reader(w)
haystack = list(reader)
# Calculate matches and append to list
NeedlesNotInHaystack = []
Fuzziness = 80 # ADJUST THIS VALUE TO FINE-TUNE RESULTS
for x in needle:
for y in haystack:
if fuzz.ratio(x,y) > Fuzziness:
NeedlesNotInHaystack.append(x)
#Export results to CSV:
with open('Results', 'wb') as csvfile:
temp = csv.writer(csvfile)
temp.writerows(NeedlesNotInHaystack)
Rather than fuzz.ratio, you may get beter results using:
fuzz.token_sort_ratio OR fuzz.token_set_ratio
IMO, this tutorial is the best + most concise overview of fuzzy matching in Python
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With