
Trouble with a join statement in SQLite

I have two data files that I'm working with. One contains a list of words with some additional information about those words and the other one contains word pairs (where words are listed by their word IDs from the first table) and their frequencies.

Lexicon file (sample output)

('wID', 'w1', 'w1cs', 'L1', 'c1')
('-----', '-----', '-----', '-----', '-----')
(1, ',', ',', ',', 'y')
(2, '.', '.', '.', 'y')
(3, 'the', 'the', 'the', 'at')
(4, 'and', 'and', 'and', 'cc')
(5, 'of', 'of', 'of', 'io')

Bigram file (sample output)

('freq', 'w1', 'w2')
(4, 22097, 161)
(1, 98664, 1320)
(1, 426515, 1345)
(1, 483675, 747)
(19, 63, 15496)
(2, 3011, 7944)
(1, 27985, 27778)

I created two tables using SQLite and uploaded the data from the files above.

conn = sqlite3.connect('bigrams.db')
conn.text_factory = str
c = conn.cursor()
c.execute('pragma foreign_keys=ON')

Lexicon table

c.execute('''CREATE TABLE lex
            (wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT)''')

#I removed this index as per CL.'s suggestion
#c.execute('''DROP INDEX IF EXISTS lex_index''') 
#c.execute('''CREATE INDEX lex_index ON lex (wID, w1, c1)''')

#and added this one
c.execute('''CREATE INDEX lex_w1_index ON lex (w1)''')

Insert data into the lexicon table

#I replaced this code
# with open('/Users/.../lexicon.txt', "rb") as lex_file:
#    for line in lex_file:
#        currentRow = line.split('\t')
#        try:
#            data = [currentRow[0], currentRow[1], currentRow[2], currentRow[3], str(currentRow[4].strip('\r\n'))]
#           c.executemany ('insert or replace into lex values (?, ?, ?, ?, ?)', (data,))
#        except IndexError:
#            pass   


#with the one that Julian wrote

blocksize = 100000

with open('/Users/.../lexicon.txt', "rb") as lex_file:
    data = []
    line_counter = 0
    for line in lex_file:
        data.append(line.strip().split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print('Lex error lines {}-{}'.format(block_start, line_counter))
            finally:
                data = []
    # flush the final partial block (rows left over when the line count
    # is not an exact multiple of blocksize)
    if data:
        c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
        conn.commit()

Bigram table

#I replaced this code to create table x2 
#c.execute('''CREATE TABLE x2
#             (freq INT, w1 INT, w2 INT, FOREIGN KEY(w1) REFERENCES lex(wID), FOREIGN KEY(w2) REFERENCES lex(wID))''')

#with the code that Julian suggested
c.execute('''CREATE TABLE x2
             (freq INT, w1 INT, w2 INT,
              FOREIGN KEY(w1) REFERENCES lex(wID),
              FOREIGN KEY(w2) REFERENCES lex(wID),
              PRIMARY KEY(w1, w2) )''')

Insert data into the bigram table

#Replaced this code
#with open('/Users/.../x2.txt', "rb") as x2_file:
#    for line in x2_file:
#        currentRow = line.split('\t')
#        try:
#            data = [str(currentRow[0].replace('\x00','').replace('\xff\xfe','')), str(currentRow[1].replace('\x00','')), str(currentRow[2].replace('\x00','').strip('\r\n'))]
#           c.executemany('insert or replace into x2 values (?, ?, ?)', (data,))
#        except IndexError:
#            pass

#with this one suggested by Julian 
with open('/Users/.../x2.txt', "rb") as x2_file:
    data = []
    line_counter = 0
    for line in x2_file:
        data.append(line.strip().replace('\x00','').replace('\xff\xfe','').split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into x2 values (?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print('x2 error lines {}-{}'.format(block_start, line_counter))
            finally:
                data = []
    # flush the final partial block
    if data:
        c.executemany('insert or replace into x2 values (?, ?, ?)', data)
        conn.commit()

conn.close()

I want to be able to check whether a given word pair exists in the data -- for example, "like new".

When I specify the first word only, the program works just fine.

cur.execute('''SELECT lex1.w1, lex2.w1 FROM x2
                INNER JOIN lex AS lex1 ON lex1.wID = x2.w1
                INNER JOIN lex AS lex2 ON lex2.wID = x2.w2
                WHERE lex1.w1 = 'like' ''')

But when I want to search for a pair of words, the code is painfully slow.

cur.execute('''SELECT lex1.w1, lex2.w1 FROM x2
                    INNER JOIN lex AS lex1 ON lex1.wID = x2.w1
                    INNER JOIN lex AS lex2 ON lex2.wID = x2.w2
                    WHERE lex1.w1 = 'like' AND lex2.w1 = 'new' ''')

I can't figure out what I'm doing wrong. Any help would be much appreciated.

asked Mar 19 '23 by user3854605


1 Answer

The EXPLAIN QUERY PLAN output shows that the database scanned the x2 table first, and then looked up the corresponding lex rows for each x2 row, checking whether the words match. The lex lookups were done with a temporary index, but doing this lookup twice for each row in x2 still made the entire query slow.
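You can reproduce this yourself; a minimal sketch using an in-memory database with the question's schema (no data is needed, since EXPLAIN QUERY PLAN only reports the plan, and the exact wording of the output varies between SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE lex (wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT)')
c.execute('CREATE TABLE x2 (freq INT, w1 INT, w2 INT)')

# Without extra indexes, the plan shows a scan of x2 with two lex probes per row.
for row in c.execute('''EXPLAIN QUERY PLAN
                        SELECT lex1.w1, lex2.w1 FROM x2
                        INNER JOIN lex AS lex1 ON lex1.wID = x2.w1
                        INNER JOIN lex AS lex2 ON lex2.wID = x2.w2
                        WHERE lex1.w1 = 'like' AND lex2.w1 = 'new' '''):
    print(row)
```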

The query would be fast if the database could look up the IDs of the two words first, and then search for a row with these two IDs in x2. This requires some new indexes. (The lex_index index would be useful only for lookups beginning with the wID column, and such lookups could already use the primary key index.)

You need to create an index that allows searching for w1:

CREATE INDEX lex_w1_index ON lex(w1);

To look up any x2 row(s) containing the two word IDs, you need an index with these two columns as its leftmost columns:

CREATE INDEX x2_w1_w2_index ON x2(w1, w2);

Alternatively, make these two columns the primary key (see Julian's answer).


To force the database to do the word ID lookups first, you could move them into subqueries:

SELECT freq
FROM x2
WHERE w1 = (SELECT wID FROM lex WHERE w1 = 'like')
  AND w2 = (SELECT wID FROM lex WHERE w1 = 'new')

However, this should not be necessary; with the new indexes, the optimizer should be able to find the optimal query plan automatically. (But you can still use this query if you think it's more readable.)
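For completeness, here is a runnable sketch of that lookup from Python, with toy data standing in for the real files. It uses `?` placeholders rather than pasting the words into the SQL string; parameter binding is not part of the original post, but it avoids the quoting problems seen in the question:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
# Reduced versions of the question's tables, with the composite primary key on x2.
c.execute('CREATE TABLE lex (wID INT PRIMARY KEY, w1 TEXT)')
c.execute('CREATE TABLE x2 (freq INT, w1 INT, w2 INT, PRIMARY KEY (w1, w2))')
c.executemany('INSERT INTO lex VALUES (?, ?)', [(1, 'like'), (2, 'new')])
c.execute('INSERT INTO x2 VALUES (19, 1, 2)')

# The '?' placeholders let sqlite3 handle quoting/escaping of the words.
c.execute('''SELECT freq FROM x2
             WHERE w1 = (SELECT wID FROM lex WHERE w1 = ?)
               AND w2 = (SELECT wID FROM lex WHERE w1 = ?)''', ('like', 'new'))
print(c.fetchone())  # -> (19,)
```

An empty result (`None` from `fetchone()`) means the pair is not in the data.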

answered Mar 23 '23 by CL.