I have a table containing the runtimes for generators on different sites, and I want to select the most recent entry for each site. Each generator is run once or twice a week.
I have a query that will do this, but I wonder if it's the best option. I can't help thinking that using WHERE x IN (SELECT ...) is lazy and not the best way to formulate the query - any query.
The table is as follows:
CREATE TABLE generator_logs (
id integer NOT NULL,
site_id character varying(4) NOT NULL,
start timestamp without time zone NOT NULL,
"end" timestamp without time zone NOT NULL,
duration integer NOT NULL
);
And the query:
SELECT id, site_id, start, "end", duration
FROM generator_logs
WHERE start IN (SELECT MAX(start) AS start
FROM generator_logs
GROUP BY site_id)
ORDER BY start DESC
There isn't a huge amount of data, so I'm not worried about optimizing the query. However, I do have to do similar things on tables with 10s of millions of rows, (big tables as far as I'm concerned!) and there optimisation is more important.
So is there a better query for this, and are inline queries generally a bad idea?
-
In MYSQL it could be problematic because Last i Checked it was unable to optimise subqueries effectively ( Ie: by query-rewriting )
Many DBMS's have Genetic Query planners which will do the same thing regardless of your input queries structure.
MYSQL will in some cases for that situation create a temp table, other times not, and depending on the circumstances, indexing, condtions, subqueries can still be rather quick.
Some complain that subqueries are hard to read, but they're perfectly fine if you fork them into local variables.
$maxids = 'SELECT MAX(start) AS start FROM generator_logs GROUP BY site_id'; $q =" SELECT id, site_id, start, \"end\", duration FROM generator_logs WHERE start IN ($maxids) ORDER BY start DESC "; -
Should your query not be correlated? i.e.:
SELECT id, site_id, start, "end", duration FROM generator_logs g1 WHERE start = (SELECT MAX(g2.start) AS start FROM generator_logs g2 WHERE g2.site_id = g1.site_id) ORDER BY start DESCOtherwise you will potentially pick up non-latest logs whose start value happens to match the latest start for a different site.
Or alternatively:
SELECT id, site_id, start, "end", duration FROM generator_logs g1 WHERE (site_id, start) IN (SELECT site_id, MAX(g2.start) AS start FROM generator_logs g2 GROUP BY site_id) ORDER BY start DESC -
I would use joins as they perform much better then "IN" clause:
select gl.id, gl.site_id, gl.start, gl."end", gl.duration from generator_logs gl inner join ( select max(start) as start, site_id from generator_logs group by site_id ) gl2 on gl.site_id = gl2.site_id and gl.start = gl2.startAlso as Tony pointed out you were missing correlation in your original query
0 comments:
Post a Comment