I have the following information in a Python sqlite3 database, which has around 4 million records.
Term            No of articles   Article Ids
Obama           300              [411, 523, 534, ..., 846]
Gandhi          3900             [23, 32, 33, ..., 4578]
Mandela         3900             [21, 14, 56, 145, ..., 4536]
George Bush     450              [230, 310, ..., 700]
Tony Blair      350              [225, 320, ..., 800]
Justin Bieber   25               [401, 420, 690, ..., 904]
Lionel Messi    150              [23, 78, ..., 570]
'Article Ids' is a BLOB holding the list of ids (as returned by an API).
My task is to find the common ids in the id lists for each pair of terms and save them in 'relationships.db'.
How do I establish relationships where I find which articles talk about both Gandhi & Mandela together (intersecting article ids)?
The relationships.db should look like this:
Term 1         Term 2          No of Common Article Ids   Common Article IDS
Obama          Gandhi          17                         [34, 123, 25, ...]
Obama          Mandela         43                         [145, 111, 234, 456, ...]
Obama          George Bush     46
Obama          Tony Blair      2
Obama          Justin Bieber   36
Obama          Lionel Messi    3
Gandhi         Mandela         40
Gandhi         George Bush     41
Gandhi         Tony Blair      32
Gandhi         Justin Bieber   31
Gandhi         Lionel Messi    20
Mandela        George Bush     20
Mandela        Tony Blair      11
Mandela        Justin Bieber   19
Mandela        Lionel Messi    39
George Bush    Tony Blair      46
George Bush    Justin Bieber   49
George Bush    Lionel Messi    2
Tony Blair     Justin Bieber   50
Tony Blair     Lionel Messi    3
Justin Bieber  Lionel Messi    6
Using a for loop to compute the intersection for every pair of terms is painful. Is there an efficient way to do this? Will there be trade-offs between memory and speed?
As the comments hint, it is impossible (or at least very hard) to process the information while it is still in a BLOB. You could make more progress if you restructure your database:
Term      Article_id
Gandhi    33
Obama     411
Obama     523
Gandhi    23
Obama     846
...
Mandela   23
This "flat" representation has a few advantages. First, it makes adding new articles easy, you don't have to extract the blob. Secondly, with proper indexing, your original counts can be recovered easily:
SELECT COUNT(*) FROM news WHERE Term="Gandhi"
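If you run such counts often, an index on Term keeps them fast. A minimal sketch (table and column names follow the restructured schema above; the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (Term TEXT, Article_id INTEGER)")

# An index on Term turns per-term counts into quick lookups,
# and one on Article_id speeds up the self-join discussed below.
conn.execute("CREATE INDEX idx_news_term ON news (Term)")
conn.execute("CREATE INDEX idx_news_article ON news (Article_id)")

conn.executemany("INSERT INTO news VALUES (?, ?)",
                 [("Gandhi", 33), ("Gandhi", 23), ("Obama", 411)])

count = conn.execute(
    "SELECT COUNT(*) FROM news WHERE Term = ?", ("Gandhi",)
).fetchone()[0]
print(count)  # -> 2
```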
These counts can be stored in a separate table if you want. To extract joint articles, you would search for something like:
SELECT A.Article_id, A.Term, B.Term FROM news AS A
JOIN news AS B ON A.Article_id = B.Article_id
AND A.Term != B.Term
Note that this will double count each pair, but that can easily be corrected. Here is a complete minimal working example with a modified table to show more matches:
import sqlite3

conn = sqlite3.connect(":memory:")

raw_items = '''
Gandhi 33
Obama 411
Obama 521
Gandhi 23
Obama 21
Gandhi 411
Mandela 21'''

conn.executescript('''
CREATE TABLE news (
    Term TEXT,
    Article_id INTEGER
);''')

items = [line.split() for line in raw_items.strip().split('\n')]
conn.executemany("INSERT INTO news VALUES (?, ?)", items)

cmd = '''SELECT COUNT(*) FROM news WHERE Term="Obama"'''
print("Obama size:", conn.execute(cmd).fetchone())

cmd = '''
SELECT A.Article_id, A.Term, B.Term FROM news AS A
JOIN news AS B ON A.Article_id = B.Article_id
AND A.Term != B.Term '''
for result in conn.execute(cmd).fetchall():
    print(result)
This gives:
Obama size: (3,)
(411, 'Obama', 'Gandhi')
(21, 'Obama', 'Mandela')
(411, 'Gandhi', 'Obama')
(21, 'Mandela', 'Obama')
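To remove the double counting and produce rows shaped like the desired relationships table, you can require an ordering on the pair (A.Term < B.Term) and aggregate. A sketch continuing the same toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (Term TEXT, Article_id INTEGER)")
conn.executemany("INSERT INTO news VALUES (?, ?)", [
    ("Gandhi", 33), ("Obama", 411), ("Obama", 521),
    ("Gandhi", 23), ("Obama", 21), ("Gandhi", 411), ("Mandela", 21),
])

# A.Term < B.Term keeps each unordered pair exactly once;
# COUNT(*) and group_concat() recover the count and id-list columns.
cmd = '''
SELECT A.Term, B.Term, COUNT(*), group_concat(A.Article_id)
FROM news AS A
JOIN news AS B ON A.Article_id = B.Article_id
AND A.Term < B.Term
GROUP BY A.Term, B.Term
ORDER BY A.Term, B.Term'''
results = list(conn.execute(cmd))
for row in results:
    print(row)
# -> ('Gandhi', 'Obama', 1, '411')
# -> ('Mandela', 'Obama', 1, '21')
```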
You could find a workaround with pandas:
1) Create a pandas DataFrame with pandas.read_sql.
2) Then you can get a cross join like the one suggested by @logc.
3) After that, convert the lists to sets and apply an intersection.
If you need help implementing it, I'll help you later; I'm in a bit of a hurry now.
EDIT:
OK, it is very simple indeed, but I don't know if it has the performance you need; you may have to read the table in chunks instead of all at once:
import sqlite3

import pandas

conn = sqlite3.connect(databaseFilePath)  # path to your database
df = pandas.read_sql('SELECT * FROM Terms;', conn)

# The BLOB column holds a text representation of a list; turn it into a set.
df['Article Ids'] = df['Article Ids'].apply(eval).apply(set)

# A constant 'key' column makes the merge a cross join.
df['key'] = False
df2 = pandas.merge(df, df, on='key')
df2 = df2[df2.Term_x != df2.Term_y]

df2['Common Article IDS'] = df2.apply(
    lambda row: set.intersection(row['Article Ids_x'], row['Article Ids_y']),
    axis=1)
df2['No of Common Article Ids'] = df2['Common Article IDS'].apply(len)
df2['Common Article IDS'] = df2['Common Article IDS'].apply(list).apply(str)

df2[['Term_x', 'Term_y', 'No of Common Article Ids', 'Common Article IDS']].to_sql(
    outputTableName, conn)  # outputTableName: name for the relationships table
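On the memory/speed trade-off the question raises: the cross join materializes every pair at once, which can be heavy for many terms. A plain-Python alternative walks the pairs one at a time with itertools.combinations, holding only the id sets in memory. A sketch with made-up sample data (the Terms schema and relationships column names are assumed from the question; ast.literal_eval is used instead of eval as a safer parser):

```python
import sqlite3
from ast import literal_eval
from itertools import combinations

conn = sqlite3.connect(":memory:")

# Toy stand-in for the Terms table described in the question.
conn.execute('CREATE TABLE Terms (Term TEXT, "Article Ids" TEXT)')
conn.executemany("INSERT INTO Terms VALUES (?, ?)", [
    ("Obama", "[411, 523, 21]"),
    ("Gandhi", "[33, 23, 411]"),
    ("Mandela", "[21, 14]"),
])

# Load each term's id list once, parsed into a set.
sets = {term: set(literal_eval(ids))
        for term, ids in conn.execute('SELECT Term, "Article Ids" FROM Terms')}

rows = []
for a, b in combinations(sorted(sets), 2):  # each unordered pair once
    common = sets[a] & sets[b]
    rows.append((a, b, len(common), str(sorted(common))))

conn.execute('''CREATE TABLE relationships
                ("Term 1" TEXT, "Term 2" TEXT,
                 "No of Common Article Ids" INTEGER,
                 "Common Article IDS" TEXT)''')
conn.executemany("INSERT INTO relationships VALUES (?, ?, ?, ?)", rows)
for row in conn.execute("SELECT * FROM relationships"):
    print(row)
```

This is slower per pair than the vectorized pandas version, but its memory footprint stays proportional to the id sets rather than to the number of term pairs.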