How to filter SQL results in a has-many-through relation

Assuming I have the tables student, club, and student_club:

    student {
        id
        name
    }
    club {
        id
        name
    }
    student_club {
        student_id
        club_id
    }

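I want to know how to find all students who are in both the soccer (30) and the baseball (50) club.
This query does not work, but it is the closest I have come so far: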

    SELECT student.*
    FROM student
    INNER JOIN student_club sc ON student.id = sc.student_id
    LEFT JOIN club c ON c.id = sc.club_id
    WHERE c.id = 30 AND c.id = 50
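To see why that attempt comes back empty, here is a minimal sketch using Python's built-in sqlite3 module. The three students and their memberships are made-up sample rows, not data from the question. Each joined row carries exactly one club id, so `c.id = 30 AND c.id = 50` can never be true for a single row, while swapping in `OR` matches members of either club.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE club (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        student_id INTEGER REFERENCES student (id),
        club_id INTEGER REFERENCES club (id),
        PRIMARY KEY (student_id, club_id)
    );
    -- Hypothetical sample rows: Ann is in both clubs, Bob and Cy in one each.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO club VALUES (30, 'Soccer'), (50, 'Baseball');
    INSERT INTO student_club VALUES (1, 30), (1, 50), (2, 30), (3, 50);
""")

template = """
    SELECT student.*
    FROM student
    INNER JOIN student_club sc ON student.id = sc.student_id
    LEFT JOIN club c ON c.id = sc.club_id
    WHERE c.id = 30 {op} c.id = 50
"""

# AND: a single joined row cannot have two different club ids at once.
rows_and = conn.execute(template.format(op="AND")).fetchall()

# OR: too broad, it returns one row per membership in either club.
rows_or = conn.execute(template.format(op="OR")).fetchall()
```

The condition has to be checked across several student_club rows per student, which is what the answers in this thread do in different ways.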

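I was curious. And as we all know, curiosity has a reputation for killing cats.

So, which is the fastest way to skin a cat?

The precise cat-skinning environment for this test:

PostgreSQL 9.0 on Debian Squeeze with decent RAM and settings.
6,000 students, 24,000 club memberships (data copied from a similar database with real-life data).
Slight deviation from the naming schema in the question: student.id is student.stud_id and club.id is club.club_id here.
I named the queries after their authors in this thread, with an index where an author has two.
I ran all queries a couple of times to populate the cache, then picked the best of 5 with EXPLAIN ANALYZE.
Relevant indexes (should be the optimum, as long as we lack foreknowledge of which clubs will be queried):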
```
ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY (stud_id);
ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY (stud_id, club_id);
ALTER TABLE club ADD CONSTRAINT club_pkey PRIMARY KEY (club_id);
CREATE INDEX sc_club_id_idx ON student_club (club_id);
```
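Most queries here do not need club_pkey.
Primary keys implement unique indexes automatically in PostgreSQL.
The last index makes up for this known shortcoming of multi-column indexes in PostgreSQL:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.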

Results:

Total runtimes from EXPLAIN ANALYZE.

1) Martin 2: 44.594 ms

    SELECT s.stud_id, s.name
    FROM student s
    JOIN student_club sc USING (stud_id)
    WHERE sc.club_id IN (30, 50)
    GROUP BY 1, 2
    HAVING COUNT(*) > 1;

2) Erwin 1: 33.217 ms

    SELECT s.stud_id, s.name
    FROM student s
    JOIN (
        SELECT stud_id
        FROM student_club
        WHERE club_id IN (30, 50)
        GROUP BY 1
        HAVING COUNT(*) > 1
        ) sc USING (stud_id);

3) Martin 1: 31.735 ms

    SELECT s.stud_id, s.name
    FROM student s
    WHERE s.stud_id IN (
        SELECT stud_id FROM student_club WHERE club_id = 30
        INTERSECT
        SELECT stud_id FROM student_club WHERE club_id = 50);

4) Derek: 2.287 ms

    SELECT s.stud_id, s.name
    FROM student s
    WHERE s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
    AND s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);

5) Erwin 2: 2.181 ms

    SELECT s.stud_id, s.name
    FROM student s
    WHERE EXISTS (SELECT * FROM student_club WHERE stud_id = s.stud_id AND club_id = 30)
    AND EXISTS (SELECT * FROM student_club WHERE stud_id = s.stud_id AND club_id = 50);

6) Sean: 2.043 ms

    SELECT s.stud_id, s.name
    FROM student s
    JOIN student_club x ON s.stud_id = x.stud_id
    JOIN student_club y ON s.stud_id = y.stud_id
    WHERE x.club_id = 30 AND y.club_id = 50;

The last three perform pretty much the same. 4) and 5) result in the same query plan.
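For readers who want to convince themselves that the three fastest variants return the same rows, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on PostgreSQL, and the four students are made-up sample rows, so it says nothing about timings. It only checks that the IN, EXISTS, and self-join forms agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (stud_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        stud_id INTEGER REFERENCES student (stud_id),
        club_id INTEGER,
        PRIMARY KEY (stud_id, club_id)
    );
    CREATE INDEX sc_club_id_idx ON student_club (club_id);
    -- Hypothetical rows: only Ann and Di belong to both club 30 and club 50.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di');
    INSERT INTO student_club VALUES
        (1, 30), (1, 50), (2, 30), (3, 50), (4, 30), (4, 50), (4, 70);
""")

variants = {
    "in": """
        SELECT s.stud_id, s.name FROM student s
        WHERE s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
          AND s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50)
    """,
    "exists": """
        SELECT s.stud_id, s.name FROM student s
        WHERE EXISTS (SELECT 1 FROM student_club
                      WHERE stud_id = s.stud_id AND club_id = 30)
          AND EXISTS (SELECT 1 FROM student_club
                      WHERE stud_id = s.stud_id AND club_id = 50)
    """,
    "self_join": """
        SELECT s.stud_id, s.name FROM student s
        JOIN student_club x ON s.stud_id = x.stud_id
        JOIN student_club y ON s.stud_id = y.stud_id
        WHERE x.club_id = 30 AND y.club_id = 50
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in variants.items()}
```

The self-join cannot produce duplicate students here because (stud_id, club_id) is the primary key of student_club.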

Late additions:

Fancy SQL, but the performance can't keep up.

7) ypercube 1: 148.649 ms

    SELECT s.stud_id, s.name
    FROM student AS s
    WHERE NOT EXISTS (
        SELECT *
        FROM club AS c
        WHERE c.club_id IN (30, 50)
        AND NOT EXISTS (
            SELECT *
            FROM student_club AS sc
            WHERE sc.stud_id = s.stud_id
            AND sc.club_id = c.club_id
            )
        );

8) ypercube 2: 147.497 ms

    SELECT s.stud_id, s.name
    FROM student AS s
    WHERE NOT EXISTS (
        SELECT *
        FROM (
            SELECT 30 AS club_id
            UNION ALL
            SELECT 50
            ) AS c
        WHERE NOT EXISTS (
            SELECT *
            FROM student_club AS sc
            WHERE sc.stud_id = s.stud_id
            AND sc.club_id = c.club_id
            )
        );

As expected, these two perform almost the same. The query plan results in table scans; the planner does not find a way to use the indexes here.
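The double NOT EXISTS in 7) and 8) is the classic relational-division pattern: keep a student if there is no wanted club for which no membership row exists. A minimal sketch of that logic, again on SQLite via Python's sqlite3 with made-up sample rows; the helper name `in_all_clubs` is hypothetical and not part of the thread.

```python
import sqlite3

def in_all_clubs(conn, club_ids):
    """Return students for whom no wanted club lacks a membership row."""
    placeholders = ", ".join("?" for _ in club_ids)
    sql = f"""
        SELECT s.stud_id, s.name
        FROM student AS s
        WHERE NOT EXISTS (
            SELECT 1
            FROM club AS c
            WHERE c.club_id IN ({placeholders})
              AND NOT EXISTS (
                  SELECT 1
                  FROM student_club AS sc
                  WHERE sc.stud_id = s.stud_id
                    AND sc.club_id = c.club_id
              )
        )
        ORDER BY s.stud_id
    """
    # Note: ids that are absent from the club table are silently ignored.
    return conn.execute(sql, list(club_ids)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (stud_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE club (club_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        stud_id INTEGER, club_id INTEGER, PRIMARY KEY (stud_id, club_id)
    );
    -- Hypothetical sample rows.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO club VALUES (30, 'Soccer'), (50, 'Baseball'), (70, 'Chess');
    INSERT INTO student_club VALUES
        (1, 30), (1, 50), (2, 30), (3, 30), (3, 50), (3, 70);
""")

both = in_all_clubs(conn, [30, 50])
all_three = in_all_clubs(conn, [30, 50, 70])
```

The pattern extends to any number of clubs without changing the query shape, which is its main appeal despite the slower plans measured above.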

9) wildplasser 1: 49.849 ms

    WITH RECURSIVE two AS (
        SELECT 1::int AS level, stud_id
        FROM student_club sc1
        WHERE sc1.club_id = 30
        UNION
        SELECT two.level + 1 AS level, sc2.stud_id
        FROM student_club sc2
        JOIN two USING (stud_id)
        WHERE sc2.club_id = 50
        AND two.level = 1
        )
    SELECT s.stud_id, s.name
    FROM student s
    JOIN two USING (stud_id)
    WHERE two.level > 1;

Fancy SQL, decent performance for a CTE. Very exotic query plan.
Again, it would be interesting to see how 9.1 handles this. I am going to upgrade the db cluster used here to 9.1 soon. Maybe I'll rerun the whole shebang ...

10) wildplasser 2: 36.986 ms

    WITH sc AS (
        SELECT stud_id
        FROM student_club
        WHERE club_id IN (30, 50)
        GROUP BY stud_id
        HAVING COUNT(*) > 1
        )
    SELECT s.*
    FROM student s
    JOIN sc USING (stud_id);

CTE variant of query 2). Surprisingly, it can result in a slightly different query plan with the exact same data. I found a sequential scan on student where the subquery variant used the index.

11) ypercube 3: 101.482 ms

Another late addition by @ypercube. It is positively amazing how many ways there are.

    SELECT s.stud_id, s.name
    FROM student s
    JOIN student_club sc USING (stud_id)
    WHERE sc.club_id = 10                   -- member in 1st club ...
    AND NOT EXISTS (
        SELECT *
        FROM (SELECT 14 AS club_id) AS c    -- can't be excluded for missing the 2nd
        WHERE NOT EXISTS (
            SELECT *
            FROM student_club AS d
            WHERE d.stud_id = sc.stud_id
            AND d.club_id = c.club_id
            )
        )

12) erwin 3: 2.377 ms

@ypercube's 11) is actually just the mind-twisting reverse of this simpler variant, which was also still missing. It performs almost as fast as the top cats.

    SELECT s.*
    FROM student s
    JOIN student_club x USING (stud_id)
    WHERE x.club_id = 10              -- member in 1st club ...
    AND EXISTS (                      -- ... and membership in 2nd exists
        SELECT *
        FROM student_club AS y
        WHERE y.stud_id = s.stud_id
        AND y.club_id = 14
        )

13) erwin 4: 2.375 ms

Hard to believe, but here is another, genuinely new variant. I see potential for more than two memberships, but it also ranks among the top cats with just two.

    SELECT s.*
    FROM student AS s
    WHERE EXISTS (
        SELECT *
        FROM student_club AS x
        JOIN student_club AS y USING (stud_id)
        WHERE x.stud_id = s.stud_id
        AND x.club_id = 14
        AND y.club_id = 10
        )

    SELECT s.*
    FROM student s
    INNER JOIN student_club sc_soccer ON s.id = sc_soccer.student_id
    INNER JOIN student_club sc_baseball ON s.id = sc_baseball.student_id
    WHERE sc_baseball.club_id = 50
    AND sc_soccer.club_id = 30

    select *
    from student
    where id in (select student_id from student_club where club_id = 30)
    and id in (select student_id from student_club where club_id = 50)

If you just want the student_id, then:

    Select student_id
    from student_club
    where club_id in ( 30, 50 )
    group by student_id
    having count( student_id ) = 2

If you also need the student's name, then:

    Select s.id, s.name
    from student s
    where exists(
        select *
        from student_club sc
        where s.id = sc.student_id
        and club_id in ( 30, 50 )
        group by sc.student_id
        having count( sc.student_id ) = 2
        )

If you have more than two clubs, stored in a club_selection table, then:

    Select s.id, s.name
    from student s
    where exists(
        select *
        from student_club sc
        where s.id = sc.student_id
        and exists(
            select *
            from club_selection cs
            where sc.club_id = cs.club_id
            )
        group by sc.student_id
        having count( sc.student_id ) = ( select count( * ) from club_selection )
        )

    SELECT *
    FROM student
    WHERE id IN (
        SELECT student_id FROM student_club WHERE club_id = 30
        INTERSECT
        SELECT student_id FROM student_club WHERE club_id = 50)

Or a more general solution that is easier to extend to n clubs and that avoids INTERSECT (not available in MySQL) and IN (which performs poorly in MySQL):

    SELECT s.id, s.name
    FROM student s
    JOIN student_club sc ON s.id = sc.student_id
    WHERE sc.club_id IN (30, 50)
    GROUP BY s.id, s.name
    HAVING COUNT(DISTINCT sc.club_id) = 2
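Extending the GROUP BY form to n clubs mostly means keeping the IN list and the HAVING count in sync. A minimal sketch of that idea, assuming Python's sqlite3 module and made-up sample rows; the helper name `students_in_all` is hypothetical. Duplicate ids in the input are removed first, since otherwise the count could never match.

```python
import sqlite3

def students_in_all(conn, club_ids):
    """Return students who belong to every distinct club in club_ids."""
    wanted = sorted(set(club_ids))  # de-duplicate so the count can match
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT s.id, s.name
        FROM student s
        JOIN student_club sc ON s.id = sc.student_id
        WHERE sc.club_id IN ({placeholders})
        GROUP BY s.id, s.name
        HAVING COUNT(DISTINCT sc.club_id) = ?
        ORDER BY s.id
    """
    # The last parameter is the number of distinct clubs required.
    return conn.execute(sql, wanted + [len(wanted)]).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        student_id INTEGER, club_id INTEGER,
        PRIMARY KEY (student_id, club_id)
    );
    -- Hypothetical sample rows.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO student_club VALUES
        (1, 30), (1, 50), (2, 30), (3, 30), (3, 50), (3, 70);
""")

two_clubs = students_in_all(conn, [30, 50])
with_duplicate = students_in_all(conn, [30, 30, 50])
three_clubs = students_in_all(conn, [30, 50, 70])
```

COUNT(DISTINCT ...) keeps the query correct even if student_club has no unique constraint on (student_id, club_id).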

Another CTE. It looks clean, but it will probably generate the same plan as a normal subquery.

    WITH two AS (
        SELECT student_id
        FROM tmp.student_club
        WHERE club_id IN (30, 50)
        GROUP BY student_id
        HAVING COUNT(*) > 1
        )
    SELECT st.*
    FROM tmp.student st
    JOIN two ON (two.student_id = st.id);

For those who want to test, here is a copy of my generate-testdata thingy:

    DROP SCHEMA tmp CASCADE;
    CREATE SCHEMA tmp;
    CREATE TABLE tmp.student
        ( id INTEGER NOT NULL PRIMARY KEY
        , sname VARCHAR
        );
    CREATE TABLE tmp.club
        ( id INTEGER NOT NULL PRIMARY KEY
        , cname VARCHAR
        );
    CREATE TABLE tmp.student_club
        ( student_id INTEGER NOT NULL REFERENCES tmp.student(id)
        , club_id INTEGER NOT NULL REFERENCES tmp.club(id)
        );
    INSERT INTO tmp.student(id)
        SELECT generate_series(1,1000);
    INSERT INTO tmp.club(id)
        SELECT generate_series(1,100);
    INSERT INTO tmp.student_club(student_id, club_id)
        SELECT st.id, cl.id
        FROM tmp.student st, tmp.club cl;
    DELETE FROM tmp.student_club
        WHERE random() < 0.8;
    UPDATE tmp.student SET sname = 'Student#' || id::text;
    UPDATE tmp.club SET cname = 'Soccer' WHERE id = 30;
    UPDATE tmp.club SET cname = 'Baseball' WHERE id = 50;
    ALTER TABLE tmp.student_club
        ADD PRIMARY KEY (student_id, club_id);

So there is more than one way to skin a cat.
I'll add two more to make it, well, more complete.

1) GROUP first, JOIN later

Assuming a sane data model where (student_id, club_id) is unique in student_club. Martin Smith's second version is somewhat similar, but he joins first and groups later. This should be faster:

    SELECT s.id, s.name
    FROM student s
    JOIN (
        SELECT student_id
        FROM student_club
        WHERE club_id IN (30, 50)
        GROUP BY 1
        HAVING COUNT(*) > 1
        ) sc ON sc.student_id = s.id;

2) EXISTS

And of course, there is the classic EXISTS. Similar to Derek's variant with IN. Simple and fast. (In MySQL, this should be quite a bit faster than the variant with IN):

    SELECT s.id, s.name
    FROM student s
    WHERE EXISTS (SELECT 1 FROM student_club WHERE student_id = s.id AND club_id = 30)
    AND EXISTS (SELECT 1 FROM student_club WHERE student_id = s.id AND club_id = 50);

Since no one has added this (classic) version:

    SELECT s.*
    FROM student AS s
    WHERE NOT EXISTS (
        SELECT *
        FROM club AS c
        WHERE c.id IN (30, 50)
        AND NOT EXISTS (
            SELECT *
            FROM student_club AS sc
            WHERE sc.student_id = s.id
            AND sc.club_id = c.id
            )
        )

Or similar:

    SELECT s.*
    FROM student AS s
    WHERE NOT EXISTS (
        SELECT *
        FROM (
            SELECT 30 AS club_id
            UNION ALL
            SELECT 50
            ) AS c
        WHERE NOT EXISTS (
            SELECT *
            FROM student_club AS sc
            WHERE sc.student_id = s.id
            AND sc.club_id = c.club_id
            )
        )

One more try with a slightly different approach, inspired by an article on Explain Extended: "Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS":

    SELECT s.*
    FROM student_club AS sc
    JOIN student AS s ON s.id = sc.student_id
    WHERE sc.club_id = 50                   --- one option here
    AND NOT EXISTS (
        SELECT *
        FROM (
            SELECT 30 AS club_id            --- all the rest in here
                                            --- as in previous query
            ) AS c
        WHERE NOT EXISTS (
            SELECT *
            FROM student_club AS scc
            WHERE scc.student_id = sc.student_id
            AND scc.club_id = c.club_id
            )
        )

Another approach:

    SELECT s.stud_id
    FROM student s
    EXCEPT
    SELECT stud_id
    FROM (
        SELECT s.stud_id, c.club_id
        FROM student s
        CROSS JOIN (VALUES (30), (50)) c (club_id)
        EXCEPT
        SELECT stud_id, club_id
        FROM student_club
        WHERE club_id IN (30, 50)   -- optional. Not needed but may affect performance
        ) x;
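The EXCEPT query reads more easily in two steps: first build every (student, wanted club) pair that has no membership row, then subtract the students that appear in that set. A minimal sketch on SQLite via Python's sqlite3 with made-up sample rows. SQLite does not accept the `c (club_id)` column alias list, so a CTE named `wanted` stands in for the VALUES list here; that substitution is mine, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (stud_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        stud_id INTEGER, club_id INTEGER, PRIMARY KEY (stud_id, club_id)
    );
    -- Hypothetical sample rows: only Ann (1) is in both clubs.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO student_club VALUES (1, 30), (1, 50), (2, 30), (3, 50);
""")

# Step 1: required pairs minus existing memberships = missing pairs.
missing_pairs = sorted(conn.execute("""
    WITH wanted (club_id) AS (VALUES (30), (50))
    SELECT s.stud_id AS stud_id, w.club_id AS club_id
    FROM student s
    CROSS JOIN wanted w
    EXCEPT
    SELECT stud_id, club_id FROM student_club
""").fetchall())

# Step 2: all students minus those with at least one missing pair.
members_of_both = sorted(conn.execute("""
    WITH wanted (club_id) AS (VALUES (30), (50))
    SELECT stud_id FROM student
    EXCEPT
    SELECT stud_id FROM (
        SELECT s.stud_id AS stud_id, w.club_id AS club_id
        FROM student s
        CROSS JOIN wanted w
        EXCEPT
        SELECT stud_id, club_id FROM student_club
    )
""").fetchall())
```

Bob is missing club 50 and Cy is missing club 30, so both drop out in step 2.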

    WITH RECURSIVE two AS (
        SELECT 1::integer AS level, student_id
        FROM tmp.student_club sc0
        WHERE sc0.club_id = 30
        UNION
        SELECT 1 + two.level AS level, sc1.student_id
        FROM tmp.student_club sc1
        JOIN two ON (two.student_id = sc1.student_id)
        WHERE sc1.club_id = 50
        AND two.level = 1
        )
    SELECT st.*
    FROM tmp.student st
    JOIN two ON (two.student_id = st.id)
    WHERE two.level > 1;

This seems to perform reasonably well, since the CTE scan avoids the need for two separate subqueries.
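To make the mechanics of the recursive CTE visible, here is a minimal sketch that dumps the intermediate rows. It assumes Python's sqlite3 module and made-up sample rows, and it drops the PostgreSQL-specific `::integer` cast and the `tmp` schema. Level 1 holds the members of club 30; the recursive step promotes a row to level 2 only if the same student is also in club 50.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE student_club (
        student_id INTEGER, club_id INTEGER,
        PRIMARY KEY (student_id, club_id)
    );
    -- Hypothetical sample rows: only student 1 is in both clubs.
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO student_club VALUES (1, 30), (1, 50), (2, 30), (3, 50);
""")

cte = """
    WITH RECURSIVE two (level, student_id) AS (
        SELECT 1, student_id
        FROM student_club
        WHERE club_id = 30
        UNION
        SELECT two.level + 1, sc1.student_id
        FROM student_club sc1
        JOIN two ON two.student_id = sc1.student_id
        WHERE sc1.club_id = 50
          AND two.level = 1      -- stops the recursion after one step
    )
"""

# All rows the CTE produces, level by level.
cte_rows = conn.execute(
    cte + "SELECT level, student_id FROM two ORDER BY level, student_id"
).fetchall()

# The final answer keeps only the promoted (level 2) rows.
members_of_both = conn.execute(
    cte + """SELECT st.id, st.sname
             FROM student st
             JOIN two ON two.student_id = st.id
             WHERE two.level > 1"""
).fetchall()
```

The `two.level = 1` condition is what keeps the recursion from running more than once.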

There is always a reason to misuse recursive queries!

(BTW: MySQL does not seem to have recursive queries.)

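Different query plans in queries 2) and 10)

I tested in a real-life db, so the names differ from the catskin list. It is a backup copy, so nothing changed during all test runs (except for minor changes to the catalogs).

Query 2)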

    SELECT a.*
    FROM ef.adr a
    JOIN (
        SELECT adr_id
        FROM ef.adratt
        WHERE att_id IN (10,14)
        GROUP BY adr_id
        HAVING COUNT(*) > 1) t using (adr_id);

    Merge Join  (cost=630.10..1248.78 rows=627 width=295) (actual time=13.025..34.726 rows=67 loops=1)
      Merge Cond: (a.adr_id = adratt.adr_id)
      ->  Index Scan using adr_pkey on adr a  (cost=0.00..523.39 rows=5767 width=295) (actual time=0.023..11.308 rows=5356 loops=1)
      ->  Sort  (cost=630.10..636.37 rows=627 width=4) (actual time=12.891..13.004 rows=67 loops=1)
            Sort Key: adratt.adr_id
            Sort Method:  quicksort  Memory: 28kB
            ->  HashAggregate  (cost=450.87..488.49 rows=627 width=4) (actual time=12.386..12.710 rows=67 loops=1)
                  Filter: (count(*) > 1)
                  ->  Bitmap Heap Scan on adratt  (cost=97.66..394.81 rows=2803 width=4) (actual time=0.245..5.958 rows=2811 loops=1)
                        Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
                        ->  Bitmap Index Scan on adratt_att_id_idx  (cost=0.00..94.86 rows=2803 width=0) (actual time=0.217..0.217 rows=2811 loops=1)
                              Index Cond: (att_id = ANY ('{10,14}'::integer[]))
    Total runtime: 34.928 ms

Query 10)

    WITH two AS (
        SELECT adr_id
        FROM ef.adratt
        WHERE att_id IN (10,14)
        GROUP BY adr_id
        HAVING COUNT(*) > 1
        )
    SELECT a.*
    FROM ef.adr a
    JOIN two using (adr_id);

    Hash Join  (cost=1161.52..1261.84 rows=627 width=295) (actual time=36.188..37.269 rows=67 loops=1)
      Hash Cond: (two.adr_id = a.adr_id)
      CTE two
        ->  HashAggregate  (cost=450.87..488.49 rows=627 width=4) (actual time=13.059..13.447 rows=67 loops=1)
              Filter: (count(*) > 1)
              ->  Bitmap Heap Scan on adratt  (cost=97.66..394.81 rows=2803 width=4) (actual time=0.252..6.252 rows=2811 loops=1)
                    Recheck Cond: (att_id = ANY ('{10,14}'::integer[]))
                    ->  Bitmap Index Scan on adratt_att_id_idx  (cost=0.00..94.86 rows=2803 width=0) (actual time=0.226..0.226 rows=2811 loops=1)
                          Index Cond: (att_id = ANY ('{10,14}'::integer[]))
      ->  CTE Scan on two  (cost=0.00..50.16 rows=627 width=4) (actual time=13.065..13.677 rows=67 loops=1)
      ->  Hash  (cost=384.68..384.68 rows=5767 width=295) (actual time=23.097..23.097 rows=5767 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 1153kB
            ->  Seq Scan on adr a  (cost=0.00..384.68 rows=5767 width=295) (actual time=0.005..10.955 rows=5767 loops=1)
    Total runtime: 37.482 ms

@erwin-brandstetter, please benchmark this:

    SELECT s.stud_id, s.name
    FROM student s, student_club x, student_club y
    WHERE x.club_id = 30
    AND s.stud_id = x.stud_id
    AND y.club_id = 50
    AND s.stud_id = y.stud_id;

It is like number 6) by @sean, just cleaner, I guess.

    -- EXPLAIN ANALYZE
    WITH two AS (
        SELECT c0.student_id
        FROM tmp.student_club c0
           , tmp.student_club c1
        WHERE c0.student_id = c1.student_id
        AND c0.club_id = 30
        AND c1.club_id = 50
        )
    SELECT st.*
    FROM tmp.student st
    JOIN two ON (two.student_id = st.id);

The query plan:

    Hash Join  (cost=1904.76..1919.09 rows=337 width=15) (actual time=6.937..8.771 rows=324 loops=1)
      Hash Cond: (two.student_id = st.id)
      CTE two
        ->  Hash Join  (cost=849.97..1645.76 rows=337 width=4) (actual time=4.932..6.488 rows=324 loops=1)
              Hash Cond: (c1.student_id = c0.student_id)
              ->  Bitmap Heap Scan on student_club c1  (cost=32.76..796.94 rows=1614 width=4) (actual time=0.667..1.835 rows=1646 loops=1)
                    Recheck Cond: (club_id = 50)
                    ->  Bitmap Index Scan on sc_club_id_idx  (cost=0.00..32.36 rows=1614 width=0) (actual time=0.473..0.473 rows=1646 loops=1)
                          Index Cond: (club_id = 50)
              ->  Hash  (cost=797.00..797.00 rows=1617 width=4) (actual time=4.203..4.203 rows=1620 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 57kB
                    ->  Bitmap Heap Scan on student_club c0  (cost=32.79..797.00 rows=1617 width=4) (actual time=0.663..3.596 rows=1620 loops=1)
                          Recheck Cond: (club_id = 30)
                          ->  Bitmap Index Scan on sc_club_id_idx  (cost=0.00..32.38 rows=1617 width=0) (actual time=0.469..0.469 rows=1620 loops=1)
                                Index Cond: (club_id = 30)
      ->  CTE Scan on two  (cost=0.00..6.74 rows=337 width=4) (actual time=4.935..6.591 rows=324 loops=1)
      ->  Hash  (cost=159.00..159.00 rows=8000 width=15) (actual time=1.979..1.979 rows=8000 loops=1)
            Buckets: 1024  Batches: 1  Memory Usage: 374kB
            ->  Seq Scan on student st  (cost=0.00..159.00 rows=8000 width=15) (actual time=0.093..0.759 rows=8000 loops=1)
    Total runtime: 8.989 ms
    (20 rows)

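So it still seems to want the seq scan on student.

    SELECT s.stud_id, s.name
    FROM student s,
        (
        SELECT x.stud_id
        FROM student_club x
        JOIN student_club y ON x.stud_id = y.stud_id
        WHERE x.club_id = 30
        AND y.club_id = 50
        ) tmp_tbl
    WHERE tmp_tbl.stud_id = s.stud_id;

This uses the fastest variant (Mr. Sean in Mr. Brandstetter's chart). Perhaps a variant with only one join, against the student_club matrix alone, has a right to live. That way the longest part of the query has only two columns to work with; the idea is to keep the query thin.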