mercredi 28 octobre 2015

SQL - Computing overlap between Interests

I have a schema (millions of records with proper indexes in place) that looks like this:

groups    |  interests
------    |  ---------
user_id   |  user_id
group_id  |  interest_id

A user can like 0..many interests and belong to 0..many groups.

Problem: Given a group ID, I want to get all the interests for all the users that do not belong to that group, and, that share at least one interest with anyone that belongs to the same provided group.

Since the above might be confusing, here's a straightforward example (SQLFiddle):

| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |

In the above example users are labeled with numbers while interests have characters.

If we assume that users 1 and 2 belong to group -1, then users 3 and 5 would be interesting:

user_id  interest_id
-------  -----------
      3            A
      3            B
      3            D
      5            B

I already wrote a dumb and very inefficient query that correctly returns the above:

SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);

But all my attempts to translate that into a proper joined query revealed themselves fruitless: either the query returns way more rows than it should or it just takes 10x as long as the sub-query, like:

SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);

I've been struggling trying to optimize this query for the past two nights...

Any help or insight that gets me in the right direction would be greatly appreciated. :)


PS: Ideally, one query that returns an aggregated count of common interests would be even nicer:

user_id  totalInterests  commonInterests
-------  --------------  ---------------
      3               3              1/2 (either is fine, but 2 is better)
      5               1                1

However, I'm not sure how much slower it would be compared to doing it in code.

Aucun commentaire:

Enregistrer un commentaire