Searching for items in a many-to-many relationship





I'm currently writing an application that allows one to store images, and then tag these images. I'm using Python and the Peewee ORM (http://charlesleifer.com/docs/peewee/), which is very similar to Django's ORM.

My data model looks like this (simplified):

class Image(BaseModel):
    key = CharField()

class Tag(BaseModel):
    tag = CharField()

class TagRelationship(BaseModel):
    relImage = ForeignKeyField(Image)
    relTag   = ForeignKeyField(Tag)

Now, I understand conceptually how to query for all Images that have a given set of tags:

SELECT Image.key
  FROM Image
INNER JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
INNER JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 WHERE Tag.tag
       IN ( 'A' , 'B' )     -- list of multiple tags
GROUP BY Image.key
HAVING COUNT(*) = 2         -- where 2 == the number of tags specified, above

However, I also want to be able to do more complex searches. Specifically, I'd like to be able to specify a list of "all tags" - i.e. an image must have all of the specified tags to be returned, along with a list of "any" and a list of "none".

EDIT: I'd like to clarify this a bit. Specifically, the above query is an "all tags"-style query. It returns Images that have all the given tags. I want to be able to specify something like: "Give me all images that have the tags (green, mountain), any one of the tags (background, landscape) but not the tags (digital, drawing)".

Now, ideally, I'd like this to be one SQL query, because pagination then becomes very easy with LIMIT and OFFSET. I've actually got an implementation working whereby I just load everything into Python sets and then use the various intersection operators. What I'm wondering is if there's a method of doing this all at once?
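For reference, the set-based approach amounts to roughly the following. This is only a sketch: `search_in_memory`, `paginate` and the `image_tags` dictionary are hypothetical names for illustration, and it assumes every image's tags have already been loaded into a Python set.

```python
def search_in_memory(image_tags, all_tags=(), any_tags=(), none_tags=()):
    """Filter images by tag using set operations.

    image_tags maps an image key to the set of its tag strings
    (a hypothetical in-memory copy of the three tables).
    """
    all_tags, any_tags, none_tags = set(all_tags), set(any_tags), set(none_tags)
    matches = []
    for key, tags in image_tags.items():
        if not all_tags <= tags:               # must have every "all" tag
            continue
        if any_tags and not (any_tags & tags):  # must have at least one "any" tag
            continue
        if none_tags & tags:                    # must have no "none" tag
            continue
        matches.append(key)
    return sorted(matches)


def paginate(keys, page, page_size):
    """Slice an already-filtered list; the full list has to be built first."""
    start = page * page_size
    return keys[start:start + page_size]


image_tags = {
    "alps": {"green", "mountain", "background", "landscape"},
    "hill": {"green", "mountain", "landscape"},
    "sketch": {"green", "mountain", "background", "drawing"},
    "peak": {"green", "mountain"},
}
found = search_in_memory(
    image_tags,
    all_tags=["green", "mountain"],
    any_tags=["background", "landscape"],
    none_tags=["digital", "drawing"],
)
first_page = paginate(found, page=0, page_size=1)
```

The drawback is visible in `paginate`: every image has to be loaded and filtered before a single page can be returned, which is why a single SQL query with LIMIT and OFFSET would be preferable.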

Also, for those interested, I've emailed the author of Peewee about how to represent the above query using Peewee, and he responded with the following solution:

Image.select(['key']).group_by('key').join(TagRelationship).join(Tag).where(tag__in=['tag1', 'tag2']).having('count(*) = 2')

Or, alternatively, a shorter version:

Image.filter(tagrelationship_set__relTag__tag__in=['tag1', 'tag2']).group_by(Image).having('count(*) = 2')

Thanks in advance for your time.

2 Answers

SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING SUM(Tag.tag IN (mandatory tags )) = N  /*the number of mandatory tags*/
   AND SUM(Tag.tag IN (optional tags  )) > 0
   AND SUM(Tag.tag IN (prohibited tags)) = 0


The query above relies on the database treating the boolean result of IN as an integer (MySQL does this). A more portable version converts the boolean results of the IN predicates into integers using CASE expressions:

SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING SUM(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 ELSE 0 END) = N  /*the number of mandatory tags*/
   AND SUM(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 ELSE 0 END) = 0

or with COUNTs instead of SUMs:

SELECT Image.key
  FROM Image
  JOIN TagRelationship
    ON Image.ID = TagRelationship.ImageID
  JOIN Tag
    ON TagRelationship.TagID = Tag.ID
 GROUP BY Image.key
HAVING COUNT(CASE WHEN Tag.tag IN (mandatory tags ) THEN 1 END) = N  /*the number of mandatory tags*/
   AND COUNT(CASE WHEN Tag.tag IN (optional tags  ) THEN 1 END) > 0
   AND COUNT(CASE WHEN Tag.tag IN (prohibited tags) THEN 1 END) = 0
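
The COUNT variant can be tried out from Python with the standard-library sqlite3 module. The following is a minimal sketch, not a Peewee solution: `find_images` is a hypothetical helper, the table and column names are the ones used in the SQL above, the sample data is made up, and it assumes each (image, tag) pair is stored only once and that the tag lists contain no duplicates. The LIMIT and OFFSET at the end give the pagination asked about in the question.

```python
import sqlite3


def find_images(conn, all_tags=(), any_tags=(), none_tags=(), limit=20, offset=0):
    """Return image keys matching the all/any/none tag lists in one query."""
    def count_in(tags):
        # COUNT ignores NULLs, so this counts joined rows whose tag is in `tags`.
        marks = ", ".join("?" for _ in tags)
        return f"COUNT(CASE WHEN Tag.tag IN ({marks}) THEN 1 END)"

    conditions, params = [], []
    if all_tags:
        conditions.append(f"{count_in(all_tags)} = ?")
        params += [*all_tags, len(all_tags)]
    if any_tags:
        conditions.append(f"{count_in(any_tags)} > 0")
        params += list(any_tags)
    if none_tags:
        conditions.append(f"{count_in(none_tags)} = 0")
        params += list(none_tags)
    having = ("HAVING " + " AND ".join(conditions)) if conditions else ""
    sql = f"""
        SELECT Image.key
          FROM Image
          JOIN TagRelationship ON Image.ID = TagRelationship.ImageID
          JOIN Tag ON TagRelationship.TagID = Tag.ID
         GROUP BY Image.key
         {having}
         ORDER BY Image.key
         LIMIT ? OFFSET ?
    """
    return [row[0] for row in conn.execute(sql, params + [limit, offset])]


# Made-up sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Image (ID INTEGER PRIMARY KEY, key TEXT);
    CREATE TABLE Tag (ID INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE TagRelationship (ImageID INTEGER, TagID INTEGER);
""")
sample = {
    "alps": ["green", "mountain", "background", "landscape"],
    "hill": ["green", "mountain", "landscape"],
    "sketch": ["green", "mountain", "background", "drawing"],
    "sea": ["green", "background"],
    "peak": ["green", "mountain"],
}
tag_ids = {}
for key, tags in sample.items():
    image_id = conn.execute("INSERT INTO Image (key) VALUES (?)", (key,)).lastrowid
    for tag in tags:
        if tag not in tag_ids:
            tag_ids[tag] = conn.execute(
                "INSERT INTO Tag (tag) VALUES (?)", (tag,)).lastrowid
        conn.execute("INSERT INTO TagRelationship VALUES (?, ?)",
                     (image_id, tag_ids[tag]))

results = find_images(
    conn,
    all_tags=["green", "mountain"],
    any_tags=["background", "landscape"],
    none_tags=["digital", "drawing"],
)
second_page = find_images(
    conn, ["green", "mountain"], ["background", "landscape"],
    ["digital", "drawing"], limit=1, offset=1)
```

Because the joins are inner joins, as in the queries above, an image with no tags at all never appears in the result, even when only a "none" list is given.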

The top half gets the images that match all of the mandatory tags and none of the prohibited ones. The bottom half gets the images that have at least one of the optional tags. The bottom query doesn't have a GROUP BY because I want to know whether an image appears twice: if it does, it has both background and landscape. The outer COUNT(*) per image is the relevance, so ordering by it in descending order makes pictures with BOTH background and landscape appear at the top. Images tagged green, mountain, background and landscape are therefore the most relevant, followed by images tagged green, mountain and either background OR landscape.

SELECT mandatory_images.key, COUNT(*) AS relevance
  FROM
    -- Top half: good image candidates (green AND mountain, no bad tags)
    (SELECT Image.key
       FROM Image
      INNER JOIN TagRelationship
         ON Image.ID = TagRelationship.ImageID
      INNER JOIN Tag
         ON TagRelationship.TagID = Tag.ID
      WHERE Tag.tag
            IN ('green', 'mountain')
        AND Image.key NOT IN
            -- Bad images
            (SELECT DISTINCT Image.key   -- DISTINCT removes duplicates, reducing the size of the set
               FROM Image
              INNER JOIN TagRelationship
                 ON Image.ID = TagRelationship.ImageID
              INNER JOIN Tag
                 ON TagRelationship.TagID = Tag.ID
              WHERE Tag.tag
                    IN ('digital', 'drawing'))
      GROUP BY Image.key
     HAVING COUNT(*) = 2   -- we need green AND mountain, so 2 == number of mandatory tags
    ) AS mandatory_images
 INNER JOIN
    -- Bottom half: one row per optional tag an image has (deliberately no GROUP BY)
    (SELECT Image.key
       FROM Image
      INNER JOIN TagRelationship
         ON Image.ID = TagRelationship.ImageID
      INNER JOIN Tag
         ON TagRelationship.TagID = Tag.ID
      WHERE Tag.tag
            IN ('background', 'landscape')
    ) AS optional_matches
    ON mandatory_images.key = optional_matches.key
 GROUP BY mandatory_images.key
 ORDER BY relevance DESC
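
To check the relevance ordering described in this answer, here is a small sketch that runs a two-half query of that shape against an in-memory SQLite database using Python's sqlite3 module. It is one reading of the approach, not a tested production query: the sample images are made up, the aliases `required` and `optional_hits` are hypothetical, and it assumes image keys are unique and each (image, tag) pair is stored once. A LIMIT/OFFSET clause is added at the end for pagination.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Image (ID INTEGER PRIMARY KEY, key TEXT);
    CREATE TABLE Tag (ID INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE TagRelationship (ImageID INTEGER, TagID INTEGER);
""")

# Made-up sample data: image key -> tags.
sample = {
    "alps": ["green", "mountain", "background", "landscape"],
    "hill": ["green", "mountain", "landscape"],
    "sketch": ["green", "mountain", "background", "drawing"],
    "peak": ["green", "mountain"],
}
tag_ids = {}
for key, tags in sample.items():
    image_id = conn.execute("INSERT INTO Image (key) VALUES (?)", (key,)).lastrowid
    for tag in tags:
        if tag not in tag_ids:
            tag_ids[tag] = conn.execute(
                "INSERT INTO Tag (tag) VALUES (?)", (tag,)).lastrowid
        conn.execute("INSERT INTO TagRelationship VALUES (?, ?)",
                     (image_id, tag_ids[tag]))

RANKED_SQL = """
SELECT required.key, COUNT(*) AS relevance
  FROM (SELECT Image.key                      -- top half: all mandatory, no prohibited
          FROM Image
          JOIN TagRelationship ON Image.ID = TagRelationship.ImageID
          JOIN Tag ON TagRelationship.TagID = Tag.ID
         WHERE Tag.tag IN ('green', 'mountain')
           AND Image.key NOT IN (SELECT Image.key
                                   FROM Image
                                   JOIN TagRelationship ON Image.ID = TagRelationship.ImageID
                                   JOIN Tag ON TagRelationship.TagID = Tag.ID
                                  WHERE Tag.tag IN ('digital', 'drawing'))
         GROUP BY Image.key
        HAVING COUNT(*) = 2) AS required
  JOIN (SELECT Image.key                      -- bottom half: one row per optional tag
          FROM Image
          JOIN TagRelationship ON Image.ID = TagRelationship.ImageID
          JOIN Tag ON TagRelationship.TagID = Tag.ID
         WHERE Tag.tag IN ('background', 'landscape')) AS optional_hits
    ON required.key = optional_hits.key
 GROUP BY required.key
 ORDER BY relevance DESC, required.key
 LIMIT 10 OFFSET 0
"""
ranked = conn.execute(RANKED_SQL).fetchall()
for key, relevance in ranked:
    print(key, relevance)
```

With this data, "sketch" is dropped because of its drawing tag and "peak" is dropped because it has neither optional tag; the remaining images are ordered by how many optional tags they carry.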
