I have two data files that I'm working with. One contains a list of words with some additional information about those words; the other contains word pairs (where the words are given by their word IDs from the first file) and their frequencies.
Lexicon file (sample output)
('wID', 'w1', 'w1cs', 'L1', 'c1')
('-----', '-----', '-----', '-----', '-----')
(1, ',', ',', ',', 'y')
(2, '.', '.', '.', 'y')
(3, 'the', 'the', 'the', 'at')
(4, 'and', 'and', 'and', 'cc')
(5, 'of', 'of', 'of', 'io')
Bigram file (sample output)
('freq', 'w1', 'w2')
(4, 22097, 161)
(1, 98664, 1320)
(1, 426515, 1345)
(1, 483675, 747)
(19, 63, 15496)
(2, 3011, 7944)
(1, 27985, 27778)
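To illustrate how the two files relate (a toy example with made-up IDs, not rows from the actual files): if 'like' had wID 100 and 'new' had wID 200, then the pair "like new" occurring 7 times would appear in the bigram file as (7, 100, 200). In Python terms:

# Toy illustration of the two-file layout (made-up IDs, not real data)
lexicon = {100: 'like', 200: 'new'}       # wID -> w1
freq, id1, id2 = (7, 100, 200)            # one bigram row: (freq, w1, w2)
print '{} {}: {}'.format(lexicon[id1], lexicon[id2], freq)   # like new: 7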
I created two tables in SQLite and loaded the data from the files above.
import sqlite3

conn = sqlite3.connect('bigrams.db')
conn.text_factory = str
c = conn.cursor()
c.execute('pragma foreign_keys=ON')
Lexicon table
c.execute('''CREATE TABLE lex
             (wID INT PRIMARY KEY, w1 TEXT, w1cs TEXT, L1 TEXT, c1 TEXT)''')
#I removed this index as per CL.'s suggestion
#c.execute('''DROP INDEX IF EXISTS lex_index''')
#c.execute('''CREATE INDEX lex_index ON lex (wID, w1, c1)''')
#and added this one
c.execute('''CREATE INDEX lex_w1_index ON lex (w1)''')
Insert data into the lexicon table
#I replaced this code
#with open('/Users/.../lexicon.txt', "rb") as lex_file:
#    for line in lex_file:
#        currentRow = line.split('\t')
#        try:
#            data = [currentRow[0], currentRow[1], currentRow[2], currentRow[3], str(currentRow[4].strip('\r\n'))]
#            c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', (data,))
#        except IndexError:
#            pass
#with the one that Julian wrote
blocksize = 100000

with open('/Users/.../lexicon.txt', "rb") as lex_file:
    data = []
    line_counter = 0
    for line in lex_file:
        data.append(line.strip().split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'Lex error lines {}-{}'.format(block_start, line_counter)
            finally:
                data = []
    # Flush the last partial block; without this, any rows after the
    # final full block of `blocksize` lines would never be inserted.
    if data:
        c.executemany('insert or replace into lex values (?, ?, ?, ?, ?)', data)
        conn.commit()
Bigram table
#I replaced this code to create table x2
#c.execute('''CREATE TABLE x2
#    (freq INT, w1 INT, w2 INT, FOREIGN KEY(w1) REFERENCES lex(wID), FOREIGN KEY(w2) REFERENCES lex(wID))''')
#with the code that Julian suggested
c.execute('''CREATE TABLE x2
             (freq INT, w1 INT, w2 INT,
              FOREIGN KEY(w1) REFERENCES lex(wID),
              FOREIGN KEY(w2) REFERENCES lex(wID),
              PRIMARY KEY(w1, w2))''')
Insert data into the bigram table
#Replaced this code
#with open('/Users/.../x2.txt', "rb") as x2_file:
#    for line in x2_file:
#        currentRow = line.split('\t')
#        try:
#            data = [str(currentRow[0].replace('\x00','').replace('\xff\xfe','')), str(currentRow[1].replace('\x00','')), str(currentRow[2].replace('\x00','').strip('\r\n'))]
#            c.executemany('insert or replace into x2 values (?, ?, ?)', (data,))
#        except IndexError:
#            pass
#with this one suggested by Julian
with open('/Users/.../x2.txt', "rb") as x2_file:
    data = []
    line_counter = 0
    for line in x2_file:
        data.append(line.strip().replace('\x00','').replace('\xff\xfe','').split('\t'))
        line_counter += 1
        if line_counter % blocksize == 0:
            try:
                c.executemany('insert or replace into x2 values (?, ?, ?)', data)
                conn.commit()
            except IndexError:
                block_start = line_counter - blocksize + 1
                print 'x2 error lines {}-{}'.format(block_start, line_counter)
            finally:
                data = []
    # Flush the last partial block of rows, as in the lexicon loader above.
    if data:
        c.executemany('insert or replace into x2 values (?, ?, ?)', data)
        conn.commit()
conn.close()
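As a quick sanity check that both loads worked (this snippet is not part of the original script; it just reopens the same database), one can count the rows in each table:

# Sanity check: reopen the database and count the loaded rows
conn = sqlite3.connect('bigrams.db')
c = conn.cursor()
c.execute('SELECT COUNT(*) FROM lex')
print 'lex rows:', c.fetchone()[0]
c.execute('SELECT COUNT(*) FROM x2')
print 'x2 rows:', c.fetchone()[0]
conn.close()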
I want to be able to check if a given word pair exists in the data -- for example, "like new".
When I specify the first word only, the program works just fine.
cur.execute('''SELECT lex1.w1, lex2.w1 from x2
INNER JOIN lex as lex1 ON lex1.wID=x2.w1
INNER JOIN lex as lex2 ON lex2.wID=x2.w2
WHERE lex1.w1='like' ''')
But when I want to search for a pair of words, the code is painfully slow.
cur.execute('''SELECT lex1.w1, lex2.w1 from x2
INNER JOIN lex as lex1 ON lex1.wID=x2.w1
INNER JOIN lex as lex2 ON lex2.wID=x2.w2
WHERE lex1.w1='like' AND lex2.w1='new' ''')
I can't figure out what I'm doing wrong. Any help would be much appreciated.
The EXPLAIN QUERY PLAN output shows that the database scanned the x2 table first, and then looked up the corresponding lex rows for each x2 row, checking if the words match. The lex lookups were done with a temporary index, but doing this lookup twice for each row in x2 still made the entire query slow.
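If you want to see this for yourself, prefix the query with EXPLAIN QUERY PLAN and print what it returns (a minimal sketch; the exact wording of the plan rows varies between SQLite versions):

# Prefix a query with EXPLAIN QUERY PLAN to see how SQLite will run it.
cur.execute('''EXPLAIN QUERY PLAN
               SELECT lex1.w1, lex2.w1 from x2
               INNER JOIN lex as lex1 ON lex1.wID=x2.w1
               INNER JOIN lex as lex2 ON lex2.wID=x2.w2
               WHERE lex1.w1='like' AND lex2.w1='new' ''')
for row in cur.fetchall():
    print row   # SCAN means a full-table pass; SEARCH means an index lookup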
The query would be fast if the database could look up the IDs of the two words first, and then search for a row with these two IDs in x2.
This requires some new indexes.
(The lex_index index would be useful only for lookups beginning on the wID column, and such lookups could already use the primary key index.)
You need to create an index that allows searching for w1:
CREATE INDEX lex_w1_index ON lex(w1);
To look up any x2 row(s) containing the two word IDs, you need some index with these two columns in the leftmost position:
CREATE INDEX x2_w1_w2_index ON x2(w1, w2);
Alternatively, make these two columns the primary index (see Julian's answer).
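Either way, once the two ID columns are indexed, looking up a specific pair of IDs becomes a single index search. A small sketch (the IDs 3 and 5 are placeholders, not real pair data):

# With PRIMARY KEY(w1, w2) or x2_w1_w2_index in place, this is one index
# search instead of a table scan; 3 and 5 are placeholder word IDs.
cur.execute('SELECT freq FROM x2 WHERE w1 = ? AND w2 = ?', (3, 5))
print cur.fetchone()   # a (freq,) tuple if the pair exists, otherwise None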
To force the database to do the word ID lookups first, you could move them into subqueries:
SELECT freq
FROM x2
WHERE w1 = (SELECT wID FROM lex WHERE w1 = 'like')
AND w2 = (SELECT wID FROM lex WHERE w1 = 'new')
However, this should not be necessary; with the new indexes, the optimizer should be able to find the optimal query plan automatically. (But you can still use this query if you think it's more readable.)
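In Python you would normally pass the two words as query parameters instead of pasting them into the SQL string, which also sidesteps the quoting problems in the queries above (a sketch using the same tables and columns):

# Parameter placeholders (?) avoid manual quoting of the search words
cur.execute('''SELECT freq
               FROM x2
               WHERE w1 = (SELECT wID FROM lex WHERE w1 = ?)
                 AND w2 = (SELECT wID FROM lex WHERE w1 = ?)''',
            ('like', 'new'))
row = cur.fetchone()
if row:
    print 'frequency:', row[0]
else:
    print 'pair not found'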