How do I load data in many-to-many relation using pony orm?

Question

Here are my entities:

class Article(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_time = Required(datetime)
    last_modification_time = Optional(datetime, default=datetime.now)
    title = Required(str)
    contents = Required(str)
    authors = Set('Author')


class Author(db.Entity):
    id = PrimaryKey(int, auto=True)
    first_name = Required(str)
    last_name = Required(str)
    articles = Set(Article)

And here is the code I'm using to get some data:

return left_join((article, author) for article in entities.Article
                 for author in article.authors).prefetch(entities.Author)[:]

Whether I'm using the prefetch method or not, the generated sql always looks the same:

SELECT DISTINCT "article"."id", "t-1"."author"
FROM "article" "article"
  LEFT JOIN "article_author" "t-1"
    ON "article"."id" = "t-1"."article"

And then when I iterated over the results, pony is issuing yet another query (queries):

SELECT "id", "creation_time", "last_modification_time", "title", "contents"
FROM "article"
WHERE "id" = %(p1)s

SELECT "id", "first_name", "last_name"
FROM "author"
WHERE "id" IN (%(p1)s, %(p2)s)

The desired behavior for me would be if the orm would issue just one query that would load all the data needed. So how do I achieve that?

Alexander Kozlovsky · Accepted Answer

Author of PonyORM is here. We don't want to load all this objects using just one query, because this is inefficient.

The only benefit of using a single query to load many-to-many relation is to reduce the number of round-trips to the database. But if we would replace three queries with one, this is not a major improvement. When your database server located near your application server these round-trips are actually very fast, comparing with the processing the resulted data in Python.

On the other side, when both sides of many-to-many relation are loaded using the same query, it is inevitable that the same object's data will be repeated over and over in multiple rows. This has many drawbacks:

The size of data transferred from the database became much larger as compared to situation when no duplicate information is transferred. In your example, if you have ten articles, and each is written by three authors, the single query will return thirty rows, with large fields like article.contents duplicated multiple times. Separate queries will transfer the minimum amount of data possible, the difference in size may easily be an order of magnitude depending on specific many-to-many relation.
The database server is usually written in compiled language like C and works very fast. The same is true for networking layer. But Python code is interpreted, and the time consumed by Python code is (contrary to some opinions) usually much more than the time which is spent in the database. You can see profiling tests that was performed by the SQLAlchemy author Mike Bayer after which he came to conclusion:

A great misconception I seem to encounter often is the notion that communication with the database takes up a majority of the time spent in a database-centric Python application. This perhaps is a common wisdom in compiled languages such as C or maybe even Java, but generally not in Python. Python is very slow, compared to such systems (...) Whether a database driver (DBAPI) is written in pure Python or in C will incur significant additional Python-level overhead. For just the DBAPI alone, this can be as much as an order of magnitude slower.

When all data of many-to-many relation are loaded using the same query and the same data is repeated in many rows, it is necessary to parse all of this repeated data in Python just to throw out most of them. As Python is the slowest part of process, such "optimization" may lead to decreased performance.

As a support to my words I can point to Django ORM. This ORM has two methods which can be used to query optimization. The first one, called select_related loads all related objects in a single query, while the more recently added method called prefetch_related loads objects in a way Pony does by default. According to Django users the second method works much faster:

In some scenarios, we have found up to a 30% speed improvement.
The database is required to perform joins which consume precious resources of the database server.

While Python code is the slowest part when processing a single request, the database server CPU time is a shared resource which is used by all parallel requests. You can scale Python code easily by starting multiple Python processes on different servers, but it is much harder to scale the database. Because of this, in high-load application it is better to offload useful work from the database server to application server, so this work can be done in parallel by multiple application servers.

When database performs join it needs to spend additional time for doing it. But for Pony it is irrelevant if database make join or not, because in any case an object will be interlinked inside ORM identity map. So the work that database doing when perform join is just useless spend of database time. On the other hand, using identity map pattern Pony can link objects equally fast regardless of whether they are provided in the same database row or not.

Returning to the number of round-trips, Pony have dedicated mechanism to eliminate "N+1 query" problem. The "N+1 query" anti-pattern arises when an ORM sends hundreds of very similar queries each of them loads separate object from the database. Many ORMs suffers from this problem. But Pony can detect it and replace repeated N queries with a single query which loads all necessary objects at once. This mechanism is very efficient and can greatly reduce the number of round-trips. But when we speak about loading many-to-many relation, there are no N queries here, there are just three queries which are more efficient when executed separately, so there are no benefit in trying to execute single query instead.

To summarize, I need to say that the ORM performance is a very important to us, Pony ORM developers. And because of that, we don't want to implement loading many-to-many relation in a single query, as it most certainly will be slower than our current solution.

So, to answer your question, you cannot load both side of many-to-many relation in a single query. And I think this is a good thing.

How do I load data in many-to-many relation using pony orm?

Tags:

python

sql

ponyorm

Marek M.

1 Answers

Alexander Kozlovsky

Recent Activity

Donate For Us

How do I load data in many-to-many relation using pony orm?

Tags:

python

sql

ponyorm

Marek M.

1 Answers

Alexander Kozlovsky

Related questions

Recent Activity

Donate For Us