Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

establishing a many-to-many relationship for millions of terms

I have the following information in a python sqlite3 database which has around 4 million records.

Term         No of articles      Article Ids
Obama           300            [411,523,534, …. 846]
Gandhi         3900            [23,32,33…..4578]
Mandela        3900            [21,14,56,145 …4536]
George Bush     450            [230,310 … 700]
Tony Blair      350            [225,320 … 800]
Justin Bieber   25             [401 , 420, 690 …. 904]
Lionel Messi    150            [23, 78, …… 570]

'Article Ids' is a blob holding the list of ids (as returned by an API)

I have the task of finding the common-ids from the Id List for each term and save them in 'relationships.db'

How do I establish relationships where I find which articles talk both about Gandhi & Mandela together (intersecting article ids) ?

The relationships.db should look like this;

Term 1              Term 2          No of Common Article Ids    Common Article IDS
Obama               Gandhi                17                    [34,123,25 ...]
Obama               Mandela               43                    [145,111,234,456 ....]
Obama               George Bush           46
Obama               Tony Blair             2
Obama               Justin Bieber         36
Obama               Lionel Messi           3
Gandhi              Mandela               40
Gandhi              George Bush           41
Gandhi              Tony Blair            32
Gandhi              Justin Bieber         31
Gandhi              Lionel Messi          20
Mandela             George Bush           20
Mandela             Tony Blair            11
Mandela             Justin Bieber         19
Mandela             Lionel Messi          39
George Bush         Tony Blair            46
George Bush         Justin Bieber         49
George Bush         Lionel Messi           2
Tony Blair          Justin Bieber         50
Tony Blair          Lionel Messi           3
Justin Bieber       Lionel Messi           6

Using a 'for loop' to loop through every term to get the intersection is painful. Is there an efficient way to do this? Will there be trade-offs between 'memory' and 'speed'?

like image 305
richie Avatar asked Oct 20 '22 04:10

richie


2 Answers

This is hinted at by the comments, but it is impossible (or at least very hard), to process the information while it is still in a BLOB. You could make more progress if you restructure your database:

Term      Article_id
Ghandi    33
Obama     411
Obama     523
Ghandi    23
Obama     846
...
Mandela   23

This "flat" representation has a few advantages. First, it makes adding new articles easy, you don't have to extract the blob. Secondly, with proper indexing, your original counts can be recovered easily:

SELECT COUNT(*) FROM news WHERE Term="Ghandi"`

These can be stored in a separate table if you want. To extract joint articles, you would search for something like:

SELECT A.Article_id, A.Term, B.Term FROM news AS A 
JOIN news AS B ON A.Article_id = B.Article_id 
AND A.Term != B.Term

Note that this will double count, but this easily be corrected. Here is a complete minimal working example with a modified table to show more matches:

import sqlite3

conn = sqlite3.connect(":memory:")

raw_items = '''
Ghandi    33
Obama     411
Obama     521
Ghandi    23
Obama     21
Ghandi    411
Mandela   21'''

script = '''
CREATE TABLE news ( 
Term STRING,
Article_id INTEGER
);'''

conn.executescript(script)
items = [line.split() for line in raw_items.strip().split('\n')]
conn.executemany("INSERT INTO news VALUES (?,?)", items)

cmd = '''SELECT COUNT(*) FROM news WHERE Term="Obama"'''
print "Obama size: ", conn.execute(cmd).fetchone()

cmd = '''
SELECT A.Article_id, A.Term, B.Term FROM news AS A 
JOIN news AS B ON A.Article_id = B.Article_id 
AND A.Term != B.Term '''

for result in conn.execute(cmd).fetchall():
    print result

This gives:

Obama size:  (3,)
(411, u'Obama', u'Ghandi')
(21, u'Obama', u'Mandela')
(411, u'Ghandi', u'Obama')
(21, u'Mandela', u'Obama')
like image 150
Hooked Avatar answered Oct 22 '22 17:10

Hooked


You could find some workaround with pandas.

1) Create a pandas DataFrame with pandas.read_sql

2) Then you can get a cross-join like this one suggested by @logc

3) After that you can convert lists to sets and apply a intersection.

If you need help to implement, I'll help you latter, in a bit hurry now.

EDIT:

Ok, it is very simple indeed, but I don't know if it has the performance you need, maybe you will need to read step by step the csv file:

import pandas, sqlite3

conn = sqlite3.connect(databaseFilePath)
df=pandas.read_sql('SELECT * FROM Terms;',conn)
df['Article Ids'] = df['Article Ids'].apply(eval).apply(set)
df['key'] = False
df2 = pandas.merge(df,df,on='key')
df2 = df2[df2.Term_x!=df2.Term_y]
df2['Common Articles IDS'] = df2.apply(lambda row:set.intersection(row['Article Ids_x'], row['Article Ids_y']), axis=1)
df2['No of Common Articles Ids'] = df2['Common Articles IDS'].apply(len)
df2['Common Articles IDS'] = df2['Common Articles IDS'].apply(list).apply(str)
df2[['Term_x','Term_y', 'No of Common Articles Ids', 'Common Articles IDS']].to_sql(outputTableName, conn) 
like image 45
caiohamamura Avatar answered Oct 22 '22 19:10

caiohamamura