Sunday, May 1, 2011

MSSQL JOIN ON GROUP BY is too slow

I have the following query in MSSQL

SELECT TOP 50 CustomerID FROM Ratings
WHERE CustomerID != 915
AND MovieID IN (SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915)
GROUP BY CustomerID
ORDER BY count(*) DESC

It is super fast. When I try to use it in a subquery like this.

SELECT * FROM Ratings
WHERE MovieID = 1 AND
CustomerID IN (SELECT TOP 50 CustomerID FROM Ratings
    WHERE CustomerID != 915
    AND MovieID IN (SELECT DISTINCT MovieID FROM Ratings WHERE CustomerID = 915)
    GROUP BY CustomerID
    ORDER BY count(*) DESC)

Any ideas on why this is so slow and how I can speed it up? My primary key is (MovieID-CustomerID) and I added a index on CustomerID

From stackoverflow
  • You'll need to create an extra UNIQUE index on (CustomerID, MovieID) (in this order) to improve this query.

    See the article in my blog for performance details:

    Since your subqueries return UNIQUE sets of values, the query may be rewritten as a JOIN:

    SELECT  r2.*
    FROM    (
            SELECT  TOP 50 CustomerID
            FROM    (
                    SELECT  MovieID
                    FROM    Ratings
                    WHERE   CustomerID = 915
                    ) q
            JOIN    Ratings r
            ON      r.MovieID = q.MovieID
                    AND CustomerID <> 915
            GROUP BY
                    CustomerID
            ORDER BY
                    COUNT(*) DESC
            ) ro
    JOIN    Ratings r2
    ON      r2.MovieID = 1
            AND r2.CustomerID = ro.CustomerID
    

    To select TOP 50 customers of those who have rated Movie 1, use:

    SELECT  r2.*
    FROM    (
            SELECT  TOP 50 CustomerID
            FROM    (
                    SELECT  MovieID
                    FROM    Ratings
                    WHERE   CustomerID = 915
                    ) q
            JOIN    Ratings r
            ON      r.MovieID = q.MovieID
                    AND CustomerID <> 915
                    AND EXISTS
                    (
                    SELECT  1
                    FROM    Ratings re
                    WHERE   re.MovieID = 1
                            AND re.CustomerID = r.CustomerID
                    )
            GROUP BY
                    CustomerID
            ORDER BY
                    COUNT(*) DESC
            ) ro
    JOIN    Ratings r2
    ON      r2.MovieID = 1
            AND r2.CustomerID = ro.CustomerID
    
    James Van Boxtel : Awesome! Only thing is it sometimes returns slightly less than 50, probably cause some of the top people haven't rated MovieID 1. But I should be able to figure out the rest.
    James Van Boxtel : Sweet I got my thing fully working. BTW the query analyzer suggested the same index as you and it helped a lot.

0 comments:

Post a Comment