Thursday, March 3, 2011

Are inline queries a bad idea?

I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.

I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.

The table is as follows:

CREATE TABLE generator_logs (
    id integer NOT NULL,
    site_id character varying(4) NOT NULL,
    start timestamp without time zone NOT NULL,
    "end" timestamp without time zone NOT NULL,
    duration integer NOT NULL
);

And the query:

SELECT id, site_id, start, "end", duration 
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start 
                FROM generator_logs 
                GROUP BY site_id) 
ORDER BY start DESC

There isn't a huge amount of data, so I'm not worried about optimizing the query. However, I do have to do similar things on tables with 10s of millions of rows, (big tables as far as I'm concerned!) and there optimisation is more important.

So is there a better query for this, and are inline queries generally a bad idea?

From stackoverflow
  • In MYSQL it could be problematic because Last i Checked it was unable to optimise subqueries effectively ( Ie: by query-rewriting )

    Many DBMS's have Genetic Query planners which will do the same thing regardless of your input queries structure.

    MYSQL will in some cases for that situation create a temp table, other times not, and depending on the circumstances, indexing, condtions, subqueries can still be rather quick.

    Some complain that subqueries are hard to read, but they're perfectly fine if you fork them into local variables.

    $maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id';
    $q ="     
        SELECT id, site_id, start, \"end\", duration 
           FROM generator_logs
           WHERE start IN ($maxids) 
           ORDER BY start DESC
    ";
    
  • Should your query not be correlated? i.e.:

    SELECT id, site_id, start, "end", duration 
    FROM generator_logs g1
    WHERE start = (SELECT MAX(g2.start) AS start 
                   FROM generator_logs  g2
                   WHERE g2.site_id = g1.site_id) 
    ORDER BY start DESC
    

    Otherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.

    Or alternatively:

    SELECT id, site_id, start, "end", duration 
    FROM generator_logs g1
    WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start 
                               FROM generator_logs  g2
                               GROUP BY site_id)
    ORDER BY start DESC
    
  • I would use joins as they perform much better then "IN" clause:

    select gl.id, gl.site_id, gl.start, gl."end", gl.duration 
    from 
        generator_logs gl
        inner join (
         select max(start) as start, site_id
            from generator_logs 
            group by site_id
        ) gl2
         on gl.site_id = gl2.site_id
         and gl.start = gl2.start
    

    Also as Tony pointed out you were missing correlation in your original query

0 comments:

Post a Comment