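Simple way to calculate median with MySQL

What is the simplest (and hopefully not too slow) way to calculate the median with MySQL? I have used AVG(x) to find the mean, but I am having a hard time finding a simple way to calculate the median. For now, I return all the rows to PHP, sort them, and pick the middle row, but surely there must be some simple way of doing it in a single MySQL query.

Example data: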

    id | val
    --------
     1    4
     2    7
     3    2
     4    2
     5    9
     6    8
     7    3

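Sorting on val gives 2 2 3 4 7 8 9, so the median should be 4, versus SELECT AVG(val), which == 5.

The problem with the proposed solution (TheJacobTaylor) is runtime. Joining the table to itself is too slow for large data sets. My proposed alternative runs in MySQL, sorts the rows once instead of self-joining (so it is much faster on large tables), uses an explicit ORDER BY so you don't have to hope your indexes happen to return the rows in the right order, and is easy to unroll for debugging.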

    SELECT avg(t1.val) as median_val FROM (
        SELECT @rownum:=@rownum+1 as `row_number`, d.val
        FROM data d, (SELECT @rownum:=0) r
        WHERE 1
        -- put some where clause here
        ORDER BY d.val
    ) as t1,
    (
        SELECT count(*) as total_rows
        FROM data d
        WHERE 1
        -- put same where clause here
    ) as t2
    WHERE 1
    AND t1.row_number in ( floor((total_rows+1)/2), floor((total_rows+2)/2) );

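Edit: AVG() was added around t1.val, and the row_number test was changed to IN (...), so that the query correctly produces a median when there is an even number of records. Reasoning:

    SELECT floor((3+1)/2), floor((3+2)/2); # total_rows is 3, so avg row_numbers 2 and 2
    SELECT floor((4+1)/2), floor((4+2)/2); # total_rows is 4, so avg row_numbers 2 and 3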
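The reasoning above can be checked outside the database. The following Python sketch is not part of the original answer, and the helper names `middle_rows` and `median_by_row_number` are made up for illustration. It evaluates the same two expressions for a given row count and averages the selected rows, the way AVG() does in the query.

```python
import statistics

def middle_rows(total_rows):
    """1-based row numbers matched by the IN (...) clause of the query."""
    return ((total_rows + 1) // 2, (total_rows + 2) // 2)

def median_by_row_number(values):
    ordered = sorted(values)
    lo, hi = middle_rows(len(ordered))
    # For an odd count lo == hi, so IN (...) matches a single row.
    picked = [ordered[row - 1] for row in {lo, hi}]
    return sum(picked) / len(picked)

sample = [4, 7, 2, 2, 9, 8, 3]  # the val column from the question
print(middle_rows(3), middle_rows(4))
print(median_by_row_number(sample), statistics.median(sample))
```

For three rows both expressions give row 2, and for four rows they give rows 2 and 3, which matches the comments in the two SELECT statements.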

I just found another answer online, in the comments:

For medians in almost any SQL:

    SELECT x.val from data x, data y
    GROUP BY x.val
    HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

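Make sure your columns are well indexed and that the index is used for filtering and sorting. Verify with the explain plans.

    select count(*) from table -- find the number of rows

Calculate the "median" row number. Maybe use: median_row = floor(count / 2).

Then pick it out of the list:

    select val from table order by val asc limit median_row,1

This should return one row containing just the value you want.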
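A minimal sketch of these steps, run from Python against an in-memory SQLite database purely so that it can be tried without a MySQL server. The function name `median_via_limit` is hypothetical, and `LIMIT 1 OFFSET n` is used as the equivalent of `LIMIT n, 1`. Note that floor(count / 2) is a zero-based offset, and that for an even row count it selects the upper of the two middle values rather than their average.

```python
import sqlite3

def median_via_limit(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (val INTEGER)")
    conn.executemany("INSERT INTO data (val) VALUES (?)", [(v,) for v in values])
    # Step 1: find the number of rows.
    (count,) = conn.execute("SELECT COUNT(*) FROM data").fetchone()
    # Step 2: zero-based offset of the "median" row.
    median_row = count // 2
    # Step 3: skip median_row rows and return the next one.
    (val,) = conn.execute(
        "SELECT val FROM data ORDER BY val ASC LIMIT 1 OFFSET ?", (median_row,)
    ).fetchone()
    conn.close()
    return val

print(median_via_limit([4, 7, 2, 2, 9, 8, 3]))  # odd count: the middle value
print(median_via_limit([1, 2, 3, 4]))           # even count: upper middle value
```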

Jacob

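I found that the accepted solution did not work on my MySQL installation (it returned an empty set), but this query worked in every situation I tested it on:

    SELECT x.val from data x, data y
    GROUP BY x.val
    HAVING SUM(SIGN(1-SIGN(y.val-x.val)))/COUNT(*) > .5
    LIMIT 1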

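Unfortunately, neither TheJacobTaylor's nor Velcro's answer returns accurate results for current versions of MySQL.

Velcro's answer above is close, but it does not calculate correctly for result sets with an even number of rows. The median is defined as either 1) the middle number of an odd-sized set, or 2) the average of the two middle numbers of an even-sized set.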

So, here is Velcro's solution patched to handle both odd-sized and even-sized sets:

    SELECT AVG(middle_values) AS 'median' FROM (
      SELECT t1.median_column AS 'middle_values' FROM
        (
          SELECT @row:=@row+1 as `row`, x.median_column
          FROM median_table AS x, (SELECT @row:=0) AS r
          WHERE 1
          -- put some where clause here
          ORDER BY x.median_column
        ) AS t1,
        (
          SELECT COUNT(*) as 'count'
          FROM median_table x
          WHERE 1
          -- put same where clause here
        ) AS t2
        -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
        WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;

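To use this, follow these 3 easy steps:

  1. Replace "median_table" (2 occurrences) in the code above with the name of your table
  2. Replace "median_column" (3 occurrences) with the name of the column you want the median of
  3. If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your condition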
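The row filter in the patched query can be checked with a few lines of Python. This is only an illustration of the condition `row >= count/2 AND row <= count/2 + 1` with 1-based row numbers; the function name `rows_selected` is made up.

```python
def rows_selected(count):
    """1-based row numbers that satisfy the patched WHERE condition."""
    half = count / 2  # decimal division, as with / in MySQL
    return [row for row in range(1, count + 1) if half <= row <= half + 1]

for count in (1, 2, 4, 7):
    print(count, rows_selected(count))
```

Odd counts leave exactly one row and even counts leave two, so the outer AVG() returns either the middle value or the mean of the two middle values.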

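I propose a faster way.

Get the row count:

    SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery, where @middlevalue stands for the number returned by the first query (outside a prepared statement or stored program, LIMIT needs a literal integer here, not a user variable):

    SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 data set of random numbers, and it finds the median in under 10 seconds.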
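Here is a small sketch of the same two queries, using Python with an in-memory SQLite database as a stand-in for MySQL. The function name `median_via_max_of_first_half` is hypothetical, and the count is passed to LIMIT as a bound parameter. For an even number of rows this variant returns the lower of the two middle values, not their average.

```python
import math
import sqlite3

def median_via_max_of_first_half(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (val INTEGER)")
    conn.executemany("INSERT INTO data (val) VALUES (?)", [(v,) for v in values])
    # First query: CEIL(COUNT(*)/2), computed here in Python.
    (count,) = conn.execute("SELECT COUNT(*) FROM data").fetchone()
    middle = math.ceil(count / 2)
    # Second query: largest value among the first `middle` sorted rows.
    (val,) = conn.execute(
        "SELECT MAX(val) FROM (SELECT val FROM data ORDER BY val LIMIT ?) x",
        (middle,),
    ).fetchone()
    conn.close()
    return val

print(median_via_max_of_first_half([4, 7, 2, 2, 9, 8, 3]))
print(median_via_max_of_first_half([1, 2, 3, 4]))  # lower middle value
```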

A comment on this page in the MySQL documentation has the following suggestion:

    -- (mostly) High Performance scaling MEDIAN function per group
    -- Median defined in http://en.wikipedia.org/wiki/Median
    --
    -- by Peter Hlavac
    -- 06.11.2008
    --
    -- Example Table:

    DROP table if exists table_median;
    CREATE TABLE table_median (id INTEGER(11),val INTEGER(11));
    COMMIT;

    INSERT INTO table_median (id, val) VALUES
    (1, 7), (1, 4), (1, 5), (1, 1), (1, 8), (1, 3), (1, 6),
    (2, 4),
    (3, 5), (3, 2),
    (4, 5), (4, 12), (4, 1), (4, 7);

    -- Calculating the MEDIAN
    SELECT @a := 0;
    SELECT
      id,
      AVG(val) AS MEDIAN
    FROM (
      SELECT
        id,
        val
      FROM (
        SELECT
          -- Create an index n for every id
          @a := (@a + 1) mod o.c AS shifted_n,
          IF(@a mod o.c=0, o.c, @a) AS n,
          o.id,
          o.val,
          -- the number of elements for every id
          o.c
        FROM (
          SELECT
            t_o.id,
            val,
            c
          FROM table_median t_o INNER JOIN
            (SELECT
               id,
               COUNT(1) AS c
             FROM table_median
             GROUP BY id
            ) t2
            ON (t2.id = t_o.id)
          ORDER BY t_o.id,val
        ) o
      ) a
      WHERE IF(
        -- if there is an even number of elements
        -- take the lower and the upper median
        -- and use AVG(lower,upper)
        c MOD 2 = 0,
        n = c DIV 2 OR n = (c DIV 2)+1,

        -- if its an odd number of elements
        -- take the first if its only one element
        -- or take the one in the middle
        IF(
          c = 1,
          n = 1,
          n = c DIV 2 + 1
        )
      )
    ) a
    GROUP BY id;

    -- Explanation:
    -- The Statement creates a helper table like
    --
    -- n id val count
    -- ----------------
    -- 1, 1, 1, 7
    -- 2, 1, 3, 7
    -- 3, 1, 4, 7
    -- 4, 1, 5, 7
    -- 5, 1, 6, 7
    -- 6, 1, 7, 7
    -- 7, 1, 8, 7
    --
    -- 1, 2, 4, 1
    -- 1, 3, 2, 2
    -- 2, 3, 5, 2
    --
    -- 1, 4, 1, 4
    -- 2, 4, 5, 4
    -- 3, 4, 7, 4
    -- 4, 4, 12, 4
    -- from there we can select the n-th element on the position: count div 2 + 1

Building on Velcro's answer, for those of you who need the median of something that is grouped by another parameter:

    SELECT t1.grp_field, t1.val FROM (
       -- number the rows of each group, starting at 1
       SELECT grp_field, @rownum:=IF(@s = grp_field, @rownum + 1, 1) AS `row_number`,
       @s:=IF(@s = grp_field, @s, grp_field) AS sec, d.val
      FROM data d,  (SELECT @rownum:=0, @s:=0) r
      ORDER BY grp_field, d.val
    ) as t1 JOIN (
      SELECT grp_field, count(*) as total_rows
      FROM data d
      GROUP BY grp_field
    ) as t2
    ON t1.grp_field = t2.grp_field
    WHERE t1.row_number=floor(total_rows/2)+1;

Most of the solutions above work for only one field of the table; you might need the median (50th percentile) of many fields in one query.

I use this:

    SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
        GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
        ',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`
    FROM table_name;

You can replace the "50" in the example above with any percentile; it is very efficient.

Just make sure you have enough memory for the GROUP_CONCAT. You can change the limit with:

    SET group_concat_max_len = 10485760; #10MB max length

More details: http://web.performancerasta.com/metrics-tips-calculating-95th-99th-or-any-percentile-with-single-mysql-query/

You could use the user-defined function that is found here.

This handles an even number of values: in that case it gives the average of the two values in the middle.

    SELECT AVG(val) FROM
    ( SELECT x.id, x.val from data x, data y
        GROUP BY x.id, x.val
        HAVING SUM(SIGN(1-SIGN(IF(y.val-x.val=0 AND x.id != y.id, SIGN(x.id-y.id), y.val-x.val)))) IN (ROUND((COUNT(*))/2), ROUND((COUNT(*)+1)/2))
    ) sq

Install and use these MySQL statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/

After that, calculating the median is easy:

    SELECT median(x) FROM t1

    SELECT SUBSTRING_INDEX(
             SUBSTRING_INDEX(
               GROUP_CONCAT(field ORDER BY field),
               ',',
               (( ROUND(
                    LENGTH(GROUP_CONCAT(field)) -
                    LENGTH( REPLACE( GROUP_CONCAT(field), ',', '' ) )
                  ) / 2) + 1 )
             ),
             ',',
             -1
           )
    FROM table

The above seems to work for me.

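I have the code below, which I found on HackerRank. It is very simple. Note that it relies on some row having as many smaller values as larger ones, so it can return an empty set, for example when there is an even number of distinct values.

    SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M
    WHERE
      (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL)
      =
      (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL);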
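To see how this query behaves on odd-sized and even-sized inputs, here is a sketch that runs it unchanged from Python against an in-memory SQLite database (used only as a convenient stand-in for MySQL; the helper name `run` is made up).

```python
import sqlite3

QUERY = """
SELECT M.MEDIAN_COL FROM MEDIAN_TABLE M
WHERE (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL < M.MEDIAN_COL)
    = (SELECT COUNT(MEDIAN_COL) FROM MEDIAN_TABLE WHERE MEDIAN_COL > M.MEDIAN_COL)
"""

def run(values):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MEDIAN_TABLE (MEDIAN_COL INTEGER)")
    conn.executemany("INSERT INTO MEDIAN_TABLE VALUES (?)", [(v,) for v in values])
    rows = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return rows

print(run([4, 7, 2, 2, 9, 8, 3]))  # seven rows: one row comes back
print(run([1, 2, 3, 4]))           # four distinct values: no row qualifies
```

With the sample data from the question the value 4 has three smaller and three larger values, so it is returned. With four distinct values no row has equal counts on both sides, so the result is empty.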

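I used a two-query approach:

  • the first one gets the count, min, max and avg
  • the second one (a prepared statement) uses "LIMIT @count/2, 1" and "ORDER BY .." clauses to get the median value

These are wrapped in a function definition, so all the values can be returned from one call.

If your ranges are static and your data does not change often, it may be more efficient to precompute and store these values and use the stored values, instead of querying from scratch every time.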

My code, efficient without tables or additional variables:

    SELECT
      ((SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', floor(1+((count(val)-1) / 2))), ',', -1))
      +
      (SUBSTRING_INDEX(SUBSTRING_INDEX(group_concat(val order by val), ',', ceiling(1+((count(val)-1) / 2))), ',', -1)))/2
      as median
    FROM table;

Optionally, you could also do this in a stored procedure:

    DROP PROCEDURE IF EXISTS median;
    DELIMITER //
    CREATE PROCEDURE median (table_name VARCHAR(255), column_name VARCHAR(255), where_clause VARCHAR(255))
    BEGIN
      -- Set default parameters
      IF where_clause IS NULL OR where_clause = '' THEN
        SET where_clause = 1;
      END IF;

      -- Prepare statement
      SET @sql = CONCAT(
        "SELECT AVG(middle_values) AS 'median' FROM (
          SELECT t1.", column_name, " AS 'middle_values' FROM
            (
              SELECT @row:=@row+1 as `row`, x.", column_name, "
              FROM ", table_name," AS x, (SELECT @row:=0) AS r
              WHERE ", where_clause, " ORDER BY x.", column_name, "
            ) AS t1,
            (
              SELECT COUNT(*) as 'count'
              FROM ", table_name, " x
              WHERE ", where_clause, "
            ) AS t2
            -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
            WHERE t1.row >= t2.count/2
              AND t1.row <= ((t2.count/2)+1)) AS t3
        ");

      -- Execute statement
      PREPARE stmt FROM @sql;
      EXECUTE stmt;
    END//
    DELIMITER ;

    -- Sample usage:
    -- median(table_name, column_name, where_condition);
    CALL median('products', 'price', NULL);

Since I just needed a median AND percentile solution, I made a simple and quite flexible function based on the findings in this thread. I know I am happy myself when I find "ready-made" functions that are easy to include in my projects, so I decided to share it quickly:

    function mysql_percentile($table, $column, $where, $percentile = 0.5) {

        $sql = "
                SELECT `t1`.`".$column."` as `percentile` FROM (
                    SELECT @rownum:=@rownum+1 as `row_number`, `d`.`".$column."`
                    FROM `".$table."` `d`, (SELECT @rownum:=0) `r`
                    ".$where."
                    ORDER BY `d`.`".$column."`
                ) as `t1`,
                (
                    SELECT count(*) as `total_rows`
                    FROM `".$table."` `d`
                    ".$where."
                ) as `t2`
                WHERE 1
                AND `t1`.`row_number`=floor(`total_rows` * ".$percentile.")+1;
            ";

        // sql() is the author's own query helper; it is not shown in the post
        $result = sql($sql, 1);

        if (!empty($result)) {
            return $result['percentile'];
        } else {
            return 0;
        }

    }

Usage is very easy. An example from my current project:

    ...
    $table = DBPRE."zip_".$slug;
    $column = 'seconds';
    $where = "WHERE `reached` = '1' AND `time` >= '".$start_time."'";

    $reaching['median'] = mysql_percentile($table, $column, $where, 0.5);
    $reaching['percentile25'] = mysql_percentile($table, $column, $where, 0.25);
    $reaching['percentile75'] = mysql_percentile($table, $column, $where, 0.75);
    ...

This is my way. Of course, you could put it into a procedure :-)

    -- zero-based offset of the middle row (the lower middle row when the count is even)
    SET @median_counter = (SELECT CEIL(COUNT(*)/2) - 1 AS `median_counter` FROM `data`);
    SET @median = CONCAT('SELECT `val` FROM `data` ORDER BY `val` LIMIT ', @median_counter, ', 1');

    PREPARE median FROM @median;
    EXECUTE median;

You can avoid the variable @median_counter if you substitute it:

    SET @median = CONCAT(
        'SELECT `val` FROM `data` ORDER BY `val` LIMIT ',
        (SELECT CEIL(COUNT(*)/2) - 1 AS `median_counter` FROM `data`),
        ', 1'
    );

    PREPARE median FROM @median;
    EXECUTE median;

The solution presented below works in a single query, without creating a table, variables, or even a subquery. In addition, it lets you get the median of each group in a group-by query (which is what I needed):

    SELECT `columnA`,
           SUBSTRING_INDEX(SUBSTRING_INDEX(GROUP_CONCAT(`columnB` ORDER BY `columnB`), ',', CEILING((COUNT(`columnB`)/2))), ',', -1) medianOfColumnB
    FROM `tableC`
    -- some where clause if you want
    GROUP BY `columnA`;

It works because of a smart use of group_concat and substring_index.

However, to allow a large group_concat, you have to set group_concat_max_len to a higher value (it is 1024 characters by default). You can set it like this (for the current SQL session):

    SET SESSION group_concat_max_len = 10000; -- up to 4294967295 in 32-bits platform.

More information on group_concat_max_len: https://dev.mysql.com/doc/refman/5.1/en/server-system-variables.html#sysvar_group_concat_max_len
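The string manipulation behind this approach can be followed step by step in Python. The sketch below is an emulation, not MySQL itself: `substring_index` is a rough, hand-written stand-in for MySQL's SUBSTRING_INDEX with an integer count, and `median_per_group` is a made-up name for the GROUP BY query. As in the query, the result is a string, and for an even-sized group it is the lower of the two middle values.

```python
import math
from collections import defaultdict

def substring_index(text, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX (integer count only)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of it, counting from the end
    return ""

def median_per_group(rows):
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        # GROUP_CONCAT(columnB ORDER BY columnB)
        concatenated = ",".join(str(v) for v in sorted(values))
        # CEILING(COUNT(columnB) / 2)
        position = math.ceil(len(values) / 2)
        head = substring_index(concatenated, ",", position)
        result[key] = substring_index(head, ",", -1)
    return result

rows = [("book", v) for v in (4, 7, 2, 2, 9, 8, 3)] + [("bike", 22), ("bike", 26), ("note", 11)]
print(median_per_group(rows))
```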

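Another riff on Velcrow's answer, but using a single intermediate table and taking advantage of the variable used for row numbering to get the count, rather than running an extra query to calculate it. It also starts the count so that the first row is row 0, which allows simply using Floor and Ceil to select the median row(s).

    SELECT Avg(tmp.val) as median_val
        FROM (SELECT inTab.val, @rows := @rows + 1 as rowNum
                  FROM data as inTab,  (SELECT @rows := -1) as init
                  -- Replace with better where clause or delete
                  WHERE 2 > 1
                  ORDER BY inTab.val) as tmp
        WHERE tmp.rowNum in (Floor(@rows / 2), Ceil(@rows / 2));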
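The zero-based variant can be checked the same way as the one-based formulas. In this Python sketch (illustrative only; the function names are made up), `last_row` plays the role of @rows after the inner query has numbered every row, that is, the row count minus one.

```python
import math
import statistics

def middle_rows_zero_based(row_count):
    last_row = row_count - 1  # final value of @rows when numbering starts at 0
    return (math.floor(last_row / 2), math.ceil(last_row / 2))

def median_zero_based(values):
    ordered = sorted(values)
    picked = [ordered[i] for i in set(middle_rows_zero_based(len(ordered)))]
    return sum(picked) / len(picked)

print(middle_rows_zero_based(7), middle_rows_zero_based(4))
print(median_zero_based([4, 7, 2, 2, 9, 8, 3]))
```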

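If MySQL has ROW_NUMBER (MySQL 8.0 and later does), then the median is (inspired by this SQL Server query):

    WITH Numbered AS
    (
    SELECT *, COUNT(*) OVER () AS Cnt,
        ROW_NUMBER() OVER (ORDER BY val) AS RowNum
    FROM yourtable
    )
    SELECT id, val
    FROM Numbered
    -- DIV is integer division; in MySQL a plain / would give fractional row numbers
    WHERE RowNum IN ((Cnt+1) DIV 2, (Cnt+2) DIV 2)
    ;

The IN is used for the case where you have an even number of entries.

If you want to find the median per group, just PARTITION BY the group in your OVER clauses.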
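As a sketch of the per-group variant, the following Python snippet runs a window-function query against an in-memory SQLite database. This assumes the bundled SQLite is version 3.25 or later (the first with window functions). The table `yourtable` with columns `grp` and `val` and the function name `median_per_group` are illustrative, and AVG() is added so that each group yields one row. In SQLite, `/` between integers is already integer division.

```python
import sqlite3

QUERY = """
WITH Numbered AS (
    SELECT grp, val,
           COUNT(*) OVER (PARTITION BY grp) AS Cnt,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) AS RowNum
    FROM yourtable
)
SELECT grp, AVG(val)
FROM Numbered
WHERE RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
GROUP BY grp
ORDER BY grp
"""

def median_per_group(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourtable (grp TEXT, val INTEGER)")
    conn.executemany("INSERT INTO yourtable VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

rows = [("book", v) for v in (4, 7, 2, 2, 9, 8, 3)] + [("bike", 22), ("bike", 26), ("note", 11)]
print(median_per_group(rows))
```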

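After reading all the previous answers, none of them matched my actual requirement, so I implemented my own, which needs no procedure or complicated statements. I just GROUP_CONCAT all the values of the column whose median I want and, using COUNT divided by 2, extract the value from the middle of the list, as the following query does:

(POS is the name of the column whose median I want)

    SELECT
        SUBSTRING_INDEX(
            SUBSTRING_INDEX(
                GROUP_CONCAT(pos ORDER BY CAST(pos AS SIGNED INTEGER) desc SEPARATOR ';')
            , ';', COUNT(*)/2 )
        , ';', -1 ) AS `pos_med`
    FROM table_name
    GROUP BY any_criterial

I hope this is useful to someone, in the same way that many other comments on this website were useful to me.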

If you know the exact row count, you can use this query:

    SELECT <value> AS VAL FROM <table> ORDER BY VAL LIMIT 1 OFFSET <half>

where <half> = ceiling(<size> / 2.0) - 1

I have a database containing about 1 billion rows, for which we need to determine the median age in the set. Sorting a billion rows is hard, but if you aggregate the distinct values that can be found (ages range from 0 to 100), you can sort THAT list and use some arithmetic magic to find any percentile you want, as follows:

    with rawData(count_value) as
    (
        select p.YEAR_OF_BIRTH
            from dbo.PERSON p
    ),
    overallStats (avg_value, stdev_value, min_value, max_value, total) as
    (
      select avg(1.0 * count_value) as avg_value,
        stdev(count_value) as stdev_value,
        min(count_value) as min_value,
        max(count_value) as max_value,
        count(*) as total
      from rawData
    ),
    aggData (count_value, total, accumulated) as
    (
      select count_value,
        count(*) as total,
        SUM(count(*)) OVER (ORDER BY count_value ROWS UNBOUNDED PRECEDING) as accumulated
      FROM rawData
      group by count_value
    )
    select o.total as count_value,
      o.min_value,
      o.max_value,
      o.avg_value,
      o.stdev_value,
      MIN(case when d.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
      MIN(case when d.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
      MIN(case when d.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
      MIN(case when d.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
      MIN(case when d.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
    from aggData d
    cross apply overallStats o
    GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
    ;

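This query depends on your database supporting window functions (including ROWS UNBOUNDED PRECEDING). If you do not have them, it is a simple matter to join the aggData CTE with itself and aggregate all prior totals into the "accumulated" column, which is used to determine which value contains the specified percentile. The sample above calculates p10, p25, p50 (median), p75 and p90.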
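The idea of aggregating distinct values and walking a running total can be sketched in a few lines of Python. This is an illustration of the technique rather than a translation of the query: the function name `percentile_from_counts` is made up, and it returns the smallest value whose accumulated count reaches the requested fraction of the total, like the MIN(case ...) expressions above.

```python
from collections import Counter
from itertools import accumulate

def percentile_from_counts(values, fraction):
    # (value, number of rows with that value), sorted by value: the aggData rows
    counts = sorted(Counter(values).items())
    total = sum(n for _, n in counts)
    # running total of the counts: the "accumulated" column
    running = accumulate(n for _, n in counts)
    for (value, _), accumulated in zip(counts, running):
        if accumulated >= fraction * total:
            return value
    return counts[-1][0]  # fall back to the maximum value

sample = [4, 7, 2, 2, 9, 8, 3]
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(p, percentile_from_counts(sample, p))
```

Only the list of distinct values has to be sorted, which is what makes the approach practical when the column has few distinct values but very many rows.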

-Chris

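Taken from: http://mdb-blog.blogspot.com/2015/06/mysql-find-median-nth-element-without.html

I would suggest another way, without a join, but working with strings.

I have not checked it against tables with large data, but it works just fine for small and medium tables.

The good thing here is that it also works with GROUPING, so it can return the median for several items.

Here is the test code for a test table: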

    DROP TABLE IF EXISTS test.test_median;
    CREATE TABLE test.test_median AS
    SELECT 'book' AS grp, 4 AS val UNION ALL
    SELECT 'book', 7 UNION ALL
    SELECT 'book', 2 UNION ALL
    SELECT 'book', 2 UNION ALL
    SELECT 'book', 9 UNION ALL
    SELECT 'book', 8 UNION ALL
    SELECT 'book', 3 UNION ALL

    SELECT 'note', 11 UNION ALL

    SELECT 'bike', 22 UNION ALL
    SELECT 'bike', 26;

And the code for finding the median of each group:

    SELECT grp,
           SUBSTRING_INDEX( SUBSTRING_INDEX( GROUP_CONCAT(val ORDER BY val), ',', COUNT(*)/2 ), ',', -1) as the_median,
           GROUP_CONCAT(val ORDER BY val) as all_vals_for_debug
    FROM test.test_median
    GROUP BY grp

Output:

    grp  | the_median | all_vals_for_debug
    bike | 22         | 22,26
    book | 4          | 2,2,3,4,7,8,9
    note | 11         | 11

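In some cases the median is calculated as follows:

The "median" is the "middle" value in a list of numbers when they are ordered by value. For even-sized sets, the median is the average of the two middle values. I have created some simple code for that:

    $midValue = 0;
    // Assumption: db_query() returns the same fetch()-style object here
    // as it does for the median query below.
    $countDAO = db_query("SELECT count(*) as count {$fromClause} {$whereClause}");
    $countDAO->fetch();
    $rowCount = $countDAO->count;

    $even = FALSE;
    $offset = 1;
    $medianRow = floor($rowCount / 2);
    if ($rowCount % 2 == 0 && !empty($medianRow)) {
      $even = TRUE;
      $offset++;
      $medianRow--;
    }

    $medianValue = "SELECT column as median
                   {$fromClause} {$whereClause}
                   ORDER BY median
                   LIMIT {$medianRow},{$offset}";

    $medianValDAO = db_query($medianValue);
    while ($medianValDAO->fetch()) {
      if ($even) {
        $midValue = $midValue + $medianValDAO->median;
      }
      else {
        $median = $medianValDAO->median;
      }
    }
    if ($even) {
      $median = $midValue / 2;
    }
    return $median;

The $median returned is the required result :-)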

Medians grouped by dimension:

    SELECT your_dimension, avg(t1.val) as median_val FROM (
    -- restart the numbering at 1 whenever the dimension changes
    SELECT @rownum := IF(@dim = d.your_dimension, @rownum + 1, 1) AS `row_number`,
       @dim := d.your_dimension AS your_dimension,
       d.val
       -- the init subquery is aliased dim_init, because the alias d is already used for data
       FROM data d,  (SELECT @rownum:=0) r, (SELECT @dim := 'something_unreal') dim_init
      WHERE 1
      -- put some where clause here
      ORDER BY d.your_dimension, d.val
    ) as t1
    INNER JOIN
    (
      SELECT d.your_dimension,
        count(*) as total_rows
      FROM data d
      WHERE 1
      -- put same where clause here
      GROUP BY d.your_dimension
    ) as t2 USING(your_dimension)
    WHERE 1
    AND t1.row_number in ( floor((total_rows+1)/2), floor((total_rows+2)/2) )

    GROUP BY your_dimension;

This way seems to handle even counts without a subquery.

    SELECT AVG(t1.x)
    FROM table t1, table t2
    GROUP BY t1.x
    HAVING SUM(SIGN(t1.x - t2.x)) = 0

Based on @bob's answer, this generalizes the query so that it can return multiple medians, grouped by some criteria.

Think of, for example, the median sale price of used cars in a car lot, grouped by year and month.

    SELECT period, AVG(middle_values) AS 'median' FROM (
      SELECT t1.sale_price AS 'middle_values', t1.row_num, t1.period, t2.count
      FROM (
        SELECT
            @last_period:=@period AS 'last_period',
            @period:=DATE_FORMAT(sale_date, '%Y-%m') AS 'period',
            IF (@period<>@last_period, @row:=1, @row:=@row+1) as `row_num`,
            x.sale_price
          FROM listings AS x, (SELECT @row:=0) AS r
          WHERE 1
            -- where criteria goes here
          ORDER BY DATE_FORMAT(sale_date, '%Y%m'), x.sale_price
        ) AS t1
      LEFT JOIN (
          SELECT COUNT(*) as 'count', DATE_FORMAT(sale_date, '%Y-%m') AS 'period'
          FROM listings x
          WHERE 1
            -- same where criteria goes here
          GROUP BY DATE_FORMAT(sale_date, '%Y%m')
        ) AS t2
        ON t1.period = t2.period
      ) AS t3
    WHERE
      row_num >= (count/2)
      AND row_num <= ((count/2) + 1)
    GROUP BY t3.period
    ORDER BY t3.period;

These methods select from the same table twice. If the source data come from an expensive query, this is a way to avoid running it twice:

    select KEY_FIELD, AVG(VALUE_FIELD) MEDIAN_VALUE
    from (
        select KEY_FIELD, VALUE_FIELD, RANKF
        , @rownumr := IF(@prevrowidr=KEY_FIELD,@rownumr+1,1) RANKR
        , @prevrowidr := KEY_FIELD
        FROM (
            SELECT KEY_FIELD, VALUE_FIELD, RANKF
            FROM (
                SELECT KEY_FIELD, VALUE_FIELD
                , @rownumf := IF(@prevrowidf=KEY_FIELD,@rownumf+1,1) RANKF
                , @prevrowidf := KEY_FIELD
                FROM (
                    SELECT KEY_FIELD, VALUE_FIELD
                    FROM (
                        -- some expensive query
                    ) B
                    ORDER BY KEY_FIELD, VALUE_FIELD
                ) C
                , (SELECT @rownumf := 1) t_rownum
                , (SELECT @prevrowidf := '*') t_previd
            ) D
            ORDER BY KEY_FIELD, RANKF DESC
        ) E
        , (SELECT @rownumr := 1) t_rownum
        , (SELECT @prevrowidr := '*') t_previd
    ) F
    WHERE RANKF-RANKR BETWEEN -1 and 1
    GROUP BY KEY_FIELD