Minimizing the performance issues of loading a many to many relationship

I've been tokenizing an extremely large corpus. Each Unigram can occur in multiple Comments multiple times. I'm storing the Comment.ids in a list that is attached to the Unigram in the database every 250K newly counted unigrams or so. What I'm wondering is if there is a way to extend the comment id list--or a similar data structure--without querying and loading the existing list of comments tied to the Unigram (it can number in the thousands). Or is there no way around the slow IO?

Here is my model code:

# association table linking unigrams to the comments they occur in
comments = db.Table('ngrams',
    db.Column('unigram_id', db.String, db.ForeignKey('unigram.id')),
    db.Column('comment_id', db.String, db.ForeignKey('comment.id')))

class Unigram(db.Model):
    id = db.Column(db.String, primary_key=True, unique=True)
    times_occurred = db.Column(db.Integer)
    occurs_in = db.relationship('Comment', secondary=comments,
                    backref=db.backref('unigrams', lazy='dynamic'))

class Comment(db.Model):
    id = db.Column(db.String, primary_key=True, unique=True)
    creation_time = db.Column(db.DateTime)

as well as the code that adds in new counts and Comment.ids:

current = Unigram.query.filter(Unigram.id == ngram).first()
if current:
    current.times_occurred += counts[ngram]['count']
    current.occurs_in.extend(counts[ngram]['occurences'])
else:
    current = Unigram(ngram, counts[ngram]['count'],
                  counts[ngram]['occurences'])
    db.session.add(current)
asked Jan 27 '14 by wegry

1 Answer

The answer to your specific question (I think): http://docs.sqlalchemy.org/en/rel_0_7/orm/collections.html#dynamic-relationship-loaders

The default behavior of relationship() is to fully load the collection of items in ... A key feature to enable management of a large collection is the so-called “dynamic” relationship. This is an optional form of relationship() which returns a Query object in place of a collection when accessed.

It looks like SQLAlchemy does indeed support modifying a collection without having to read it first, so lazy='dynamic' is correct. The problem may be that you have it only on the backref. Try these two variants:

occurs_in = db.relationship('Comment', secondary=comments, 
    lazy='dynamic', backref=db.backref('unigrams'))

occurs_in = db.relationship('Comment', secondary=comments, 
    lazy='dynamic', backref=db.backref('unigrams', lazy='dynamic'))

Also, you might try lazy='noload' instead. Since you are just writing to the tables during indexing, this will work the same.
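
To make that concrete, here is a sketch of how the update loop from your question might look once occurs_in is declared with lazy='dynamic'. It reuses the names from your code (counts, ngram, the 'occurences' key) and assumes, as your extend() call does, that counts[ngram]['occurences'] holds Comment objects; a dynamic relationship gives you an appendable query object instead of a fully loaded list:

current = Unigram.query.filter(Unigram.id == ngram).first()
if current:
    current.times_occurred += counts[ngram]['count']
    for comment in counts[ngram]['occurences']:
        # appending to a dynamic relationship queues an INSERT into the
        # association table without loading the existing collection
        current.occurs_in.append(comment)
else:
    current = Unigram(ngram, counts[ngram]['count'],
                      counts[ngram]['occurences'])
    db.session.add(current)
db.session.commit()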

Now, for the broader question: why do this at all? Doing it this way will be frustrating, even after you figure out this little problem. Some ideas...

Use the right tool for the job: Sphinx, ElasticSearch, Lucene, Solr, Xapian -- any one of these will handle the problem of text indexing quite thoroughly, and much better than you can without a specialized tool. Sphinx in particular is insanely fast: its indexing speed is hundreds of megabytes per second, and a query for how many documents contain a word usually takes a millisecond or two (regardless of corpus size).

If you are doing a one-off script or test code, rather than setting up a production system, and for some reason don't want to use the right tool, then do it all in memory and don't use SQL. Use plain dictionaries in Python, and save them as pickle files on a ramdisk between runs. Buy more memory; it's cheaper than your time. This is not a bad way to test statistical ideas on a text corpus.
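
A minimal sketch of that in-memory approach; the ramdisk path, file name, and field names are made up for illustration:

import pickle
from collections import defaultdict

# unigram -> total count and the ids of the comments it occurs in
index = defaultdict(lambda: {'count': 0, 'comment_ids': []})

def add_comment(comment_id, tokens):
    for token in tokens:
        index[token]['count'] += 1
        index[token]['comment_ids'].append(comment_id)

# convert to a plain dict so it pickles cleanly (lambdas can't be pickled),
# then save it on a ramdisk between runs ('/mnt/ramdisk' is hypothetical)
with open('/mnt/ramdisk/unigram_index.pkl', 'wb') as f:
    pickle.dump(dict(index), f)

# next run: load it straight back
with open('/mnt/ramdisk/unigram_index.pkl', 'rb') as f:
    index = pickle.load(f)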

If you really MUST put a text index in a SQL database for some reason (why?), then save yourself a lot of pain and don't use an object-relational mapper like SQLAlchemy. The best way to do this is to prepare a data dump in a suitable format (as a text file) and load it into the database in one shot (using something like LOAD DATA INFILE in MySQL, or the equivalent in your database). This is several orders of magnitude faster. It can easily be 1000x the speed of running individual INSERT queries for every unigram. You can still access the data later through SQLAlchemy, provided that you organized your tables the right way, but while you are indexing your text you want to bypass that.
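
As a rough sketch of the dump-then-bulk-load approach: it assumes the ngrams association table from your models, the in-memory index dict from the sketch above, a MySQL server that allows LOCAL INFILE, and an arbitrary file name; the exact bulk-load statement varies by database:

import csv
from sqlalchemy import text

# write every (unigram_id, comment_id) pair to a tab-separated file
with open('ngrams.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    for unigram_id, data in index.items():
        for comment_id in data['comment_ids']:
            writer.writerow([unigram_id, comment_id])

# then load the whole file in one statement instead of millions of INSERTs
db.session.execute(text("""
    LOAD DATA LOCAL INFILE 'ngrams.tsv'
    INTO TABLE ngrams
    FIELDS TERMINATED BY '\\t'
    (unigram_id, comment_id)
"""))
db.session.commit()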

answered Oct 29 '22 by Alex I