Easiest way to compare two files with lists of song titles

Tags:

I have two lists of song titles, each in a plain text file, which are the filenames of licensed lyric files - I want to check if the shorter list titles (needle) are in the longer list (haystack). The script/app should return the list of titles in the needle that aren't in the haystack.

I'd prefer to use Python or a shell script (BASH) or just use visual diff program that can handle the fuzziness needed.

The main problem is that the titles need to be fuzzy matched to account for data entry errors and possibly also word ordering.

Haystack sample (note some duplicate and near duplicate lines, matches highlighted):

Yearn
Yesterday, Today And Forever
Yesterday, Today, Forever
You
You Alone
You Are Here (The Same Power)
You Are Holy
You Are Holy (Prince Of Peace)
You Are Mighty
You Are Mine
You Are My All In All
You Are My Hiding Place
You Are My King (Amazing Love)
You Are Righteous (Hope)
You Are So Faithful
You Are So Good to Me
You Are Worthy Of My Praise
You Have Been Good
You Led Me To The Cross
You Reign
You Rescued Me
You Said
You Sent Your Own
You Set Me Apart (Dwell In Your House)
You alone are worthy (Glory in the highest)
You are God in heaven (let my words be few)
You are always fighting for us (Hallelujah you have overcome)
You are beautiful (I stand in awe)
You are beautiful beyond description
You are mighty
You are my all in all
You are my hiding place
You are my passion
You are still Holy
You are the Holy One (We exalt Your name)
You are the mighty King
You are the mighty warrior
You are the vine
**You chose the cross (Lost in wonder)**
You have shown me favour unending
You hold the broken hearted
You laid aside Your majesty
You said
You're Worthy Of My Praise
You're calling me (Unashamed love)
You're the God of this city
You're the Lion of Judah
You're the word of God the Father (Across the lands)
You've put a new song in my heart
Your Beloved
Your Grace is Enough
Your Great Name We Praise
Your Great Name We Praise-2
Your Light (You Have Turned)
Your Light Is Over Me (His Love)
**Your Love**
**Your Love Is Amazing**
Your Love Is Deep
Your Love Is Deeper - Jesus, Lord of Heaven (Phil Wickham)
Your Love Oh Lord
Your Love Oh Lord (Psalm 36)
Your Love is Extravagant
Your Power (Send Me)
Your blood speaks a better word
Your everlasting love
**Your grace is enough**
**Your grace is enough (Great is Your faithfulness)**
Your mercy is falling
Your mercy taught us how to dance (Dancing generation)
Your voice stills the oceans (nothing is impossible)
Yours Is The Kingdom

Needle sample:

You Are Good (I Want To Scream It Out)
You Are My Strength (In The Fullness)
You Are My Vision O King Of My Heart
You Are The King Of Glory (Hosanna To The Son)
**You Chose The Cross (Lost In Wonder)**
**Your Grace Is Enough (This Is Our God)**
**Your Love Is Amazing Steady And Unchanging**
**Your Love Shining Like The Sun**

Note that the needle title "Your Love Shining Like The Sun" is only a possible match for "Your Love". It's better to fail towards not matching and therefore any uncertain title matches should appear in the output.

comm -1 -3 <(sort haystack.txt) <(sort needle.txt)

doesn't find any of the matches. diff or grep seems they'd have the same problem with not being fuzzy enough. Kdiff3 and diffnow.com were as quick as a manual comparison as I had to still scan through for nearly all matches, they could only cope with whitespace and letter-case differences.

ExamDiffPro from prestosoft.com looks like a possibility but is MS Windows only and I'd prefer a native Linux solution before I go messing with WINE or VirtualBox.

The needle is actually a CSV so I've thought about using LibreOffice and treating it as a database and doing SQL queries or using a spreadsheet with hlookup or something ... Another question led me to OpenRefine (formerly google-refine)

Seems like this is a common category of problem (it's basically "record linkage" which often uses [Levenshtein] edit-distance calculation), how should I approach it? Suggestions please?

670

asked Jun 30 '15 21:06

pbhj

4 Answers

I did something similar to this in MySQL. I used the following code to define Levenshtein distance and ratio functions (which I got from the answer to this question):

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR; 
    DECLARE cv0, cv1 text; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN
            SET c = c_temp;
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
          IF c > c_temp THEN  
            SET c = c_temp;  
          END IF; 
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
END;

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(s1 text, s2 text) RETURNS int(11) DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
END;

Assuming you import your lists as the following two tables:

needle (title VARCHAR)
haystack (title VARCHAR)

You can then use a query like the following one to compare the two tables.

SELECT title, best_match,
    levenshtein_ratio(TRIM(LOWER(title)), TRIM(LOWER(best_match))) AS ratio
FROM (
    SELECT n.title AS title, (
                SELECT h.title
                FROM haystack h
                ORDER BY levenshtein_ratio(TRIM(LOWER(n.title)), TRIM(LOWER(h.title))) DESC
                LIMIT 1
           ) AS best_match
    FROM needle n
) x
ORDER BY ratio DESC

Pick a cutoff value, all rows below that value didn't have good matches. You can use levenshtein() instead of levenshtein_ratio() if you want to work with edit distances directly, in this case the ORDER BY to ASC.

Note that this doesn't have any special provision for word order differences. Also, if your lists are large, the comparison can be slow.

101

answered Nov 16 '22 02:11

reaanb

If you want to go the OpenRefine way, the best is set up a local reconciliation server - I recommand reconcile csv.

Load your haystack in the reconciliation server and have OpenRefine processing your needle file and sending it to the reconciliation service.

Reconciliation server return a score with each proposal so you can (I just use those score as example, don't take them for granted):

bulk accept everything over 0.9
review manually everything between 0.8 and 0.9
discard everything below 0.8

answered Nov 16 '22 03:11

magdmartin

You might want to look at fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy).

from fuzzywuzzy import process
needles = open('needle').read().split("\n")
haystack = open('haystack').read().split("\n")
for a in needles:
    print a + ' -> ',
    print process.extractBests(a, haystack, score_cutoff=90)

Usefull params for the extract function are limit, scorer, and processor.

answered Nov 16 '22 03:11

4lk4tr43

Fuzzywuzzy can make your dreams come true:

import csv
from fuzzywuzzy import fuzz

# Grab CSV data: 
with open('needles.csv', 'U') as z:
    reader = csv.reader(z)
    needles = list(reader)

with open('haystack.csv', 'U') as w:
    reader = csv.reader(w)
    haystack = list(reader)

# Calculate matches and append to list
NeedlesNotInHaystack = []

Fuzziness = 80 # ADJUST THIS VALUE TO FINE-TUNE RESULTS 
for x in needle:
    for y in haystack:
        if fuzz.ratio(x,y) > Fuzziness: 
            NeedlesNotInHaystack.append(x)

#Export results to CSV:
with open('Results', 'wb') as csvfile:
    temp = csv.writer(csvfile)
    temp.writerows(NeedlesNotInHaystack)

Rather than fuzz.ratio, you may get beter results using:

fuzz.token_sort_ratio OR fuzz.token_set_ratio

IMO, this tutorial is the best + most concise overview of fuzzy matching in Python

answered Nov 16 '22 04:11

Benjamin James

Related questions
                            
                                What are com.sec.android.provider.* apps exactly?
                            
                                Mongodb find inside sub array
                            
                                Postgres Materialize causes poor performance in delete query
                            
                                SQLite connection not appearing in Entity Data Model Wizard
                            
                                How to execute an .sql file in pymssql
                            
                                Optimize large IN condition for Redshift query
                            
                                Merging Complicated Tables
                            
                                Need advice in data model
                            
                                Firebase - Targeting Specific Firestore Document Field with Cloud Functions
                            
                                Do Django Model Managers require using=self._db
                            
                                Performance of large EAV/open schema systems on SQL Server
                            
                                Secure JDBC connection
                            
                                Hibernate mapping a second @Embeddable field in a subclass
                            
                                MySQL Workbench symbol reference
                            
                                Tips for writing a DBMS
                            
                                State list of world with country-code [closed]
                            
                                Data normalization and writing queries
                            
                                Simple "SELECT" with variable but without "INTO"
                            
                                Database storage design of large amounts of heterogeneous data
                            
                                mysql_install_db gives FATAL ERROR: Could not find my-default.cnf

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Easiest way to compare two files with lists of song titles

Tags:

database

list

comparison

fuzzy-comparison

recordset