Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Easiest way to compare two files with lists of song titles

I have two lists of song titles, each in a plain text file, which are the filenames of licensed lyric files - I want to check if the shorter list titles (needle) are in the longer list (haystack). The script/app should return the list of titles in the needle that aren't in the haystack.

I'd prefer to use Python or a shell script (BASH) or just use visual diff program that can handle the fuzziness needed.

The main problem is that the titles need to be fuzzy matched to account for data entry errors and possibly also word ordering.

Haystack sample (note some duplicate and near duplicate lines, matches highlighted):

Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom

Needle sample:

You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**

Note that the needle title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It's better to fail towards not matching and therefore any uncertain title matches should appear in the output.

comm -1 -3 <(sort haystack.txt) <(sort needle.txt)

doesn't find any of the matches. diff or grep seems they'd have the same problem with not being fuzzy enough. Kdiff3 and diffnow.com were as quick as a manual comparison as I had to still scan through for nearly all matches, they could only cope with whitespace and letter-case differences.

ExamDiffPro from prestosoft.com looks like a possibility but is MS Windows only and I'd prefer a native Linux solution before I go messing with WINE or VirtualBox.

The needle is actually a CSV so I've thought about using LibreOffice and treating it as a database and doing SQL queries or using a spreadsheet with hlookup or something ... Another question led me to OpenRefine (formerly google-refine)

Seems like this is a common category of problem (it's basically "record linkage" which often uses [Levenshtein] edit-distance calculation), how should I approach it? Suggestions please?

like image 670
pbhj Avatar asked Jun 30 '15 21:06

pbhj


People also ask

How do I compare files side by side?

Open both of the files that you want to compare. On the View tab, in the Window group, click View Side by Side. in the Window group on the View tab.

How do I compare two files in Windows?

To compare two files or groups of files at a local site, you can use the Fc.exe and the Comp.exe file compare commands. Both commands are run from a command prompt. You can use Fc.exe to compare two ASCII or binary files on a line-by-line basis.


4 Answers

I did something similar to this in MySQL. I used the following code to define Levenshtein distance and ratio functions (which I got from the answer to this question):

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR; 
    DECLARE cv0, cv1 text; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN
            SET c = c_temp;
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
          IF c > c_temp THEN  
            SET c = c_temp;  
          END IF; 
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
END;

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
END;

Assuming you import your lists as the following two tables:

needle (title VARCHAR)
haystack (title VARCHAR)

You can then use a query like the following one to compare the two tables.

SELECT title, best_match,
    levenshtein_ratio(TRIM(LOWER(title)), TRIM(LOWER(best_match))) AS ratio
FROM (
    SELECT n.title AS title, (
                SELECT h.title
                FROM haystack h
                ORDER BY levenshtein_ratio(TRIM(LOWER(n.title)), TRIM(LOWER(h.title))) DESC
                LIMIT 1
           ) AS best_match
    FROM needle n
) x
ORDER BY ratio DESC

Pick a cutoff value, all rows below that value didn't have good matches. You can use levenshtein() instead of levenshtein_ratio() if you want to work with edit distances directly, in this case the ORDER BY to ASC.

Note that this doesn't have any special provision for word order differences. Also, if your lists are large, the comparison can be slow.

like image 101
reaanb Avatar answered Nov 16 '22 02:11

reaanb


If you want to go the OpenRefine way, the best is set up a local reconciliation server - I recommand reconcile csv.

Load your haystack in the reconciliation server and have OpenRefine processing your needle file and sending it to the reconciliation service.

Reconciliation server return a score with each proposal so you can (I just use those score as example, don't take them for granted):

  • bulk accept everything over 0.9
  • review manually everything between 0.8 and 0.9
  • discard everything below 0.8
like image 42
magdmartin Avatar answered Nov 16 '22 03:11

magdmartin


You might want to look at fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy).

from fuzzywuzzy import process
needles = open('needle').read().split("\n")
haystack = open('haystack').read().split("\n")
for a in needles:
    print a + ' -> ',
    print process.extractBests(a, haystack, score_cutoff=90)

Usefull params for the extract function are limit, scorer, and processor.

like image 20
4lk4tr43 Avatar answered Nov 16 '22 03:11

4lk4tr43


Fuzzywuzzy can make your dreams come true:

import csv
from fuzzywuzzy import fuzz

# Grab CSV data: 
with open('needles.csv', 'U') as z:
    reader = csv.reader(z)
    needles = list(reader)

with open('haystack.csv', 'U') as w:
    reader = csv.reader(w)
    haystack = list(reader)

# Calculate matches and append to list
NeedlesNotInHaystack = []

Fuzziness = 80 # ADJUST THIS VALUE TO FINE-TUNE RESULTS 
for x in needle:
    for y in haystack:
        if fuzz.ratio(x,y) > Fuzziness: 
            NeedlesNotInHaystack.append(x)

#Export results to CSV:
with open('Results', 'wb') as csvfile:
    temp = csv.writer(csvfile)
    temp.writerows(NeedlesNotInHaystack)

Rather than fuzz.ratio, you may get beter results using:

fuzz.token_sort_ratio OR fuzz.token_set_ratio

IMO, this tutorial is the best + most concise overview of fuzzy matching in Python

like image 33
Benjamin James Avatar answered Nov 16 '22 04:11

Benjamin James