在Sql Server中计算中位数的函数

根据MSDN ,Median在Transact-Sql中不可用作为聚合函数。 但是,我想知道是否有可能创建此功能(使用创建聚合函数,用户定义函数或其他方法)。

什么是最好的方法(如果可能)做到这一点 – 允许在一个聚合查询中计算一个中值(假设一个数字数据类型)?

    有很多方法可以做到这一点,表现有很大的变化。 这是一个特别优化的解决方案,来自http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx 。 对于执行期间生成的实际I / O,这是一个特别优化的解决方案 – 看起来比其他解决方案成本更高,但实际上要快得多。

    该页面还包含对其他解决方案和性能测试细节的讨论。 请注意,如果有多个行的中值列的值相同,则使用唯一的列作为消除歧义。

    和所有的数据库性能场景一样,总是试着用实际硬件上的真实数据来测试一个解决方案 – 你永远不知道什么时候改变SQL Server的优化器或者在你的环境中的一个特性会使速度较慢的解决方案变慢。

    SELECT CustomerId, AVG(TotalDue) FROM ( SELECT CustomerId, TotalDue, ROW_NUMBER() OVER ( PARTITION BY CustomerId ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY CustomerId ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc FROM Sales.SalesOrderHeader SOH ) x WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1) GROUP BY CustomerId ORDER BY CustomerId; 

    如果您使用的是SQL 2005或更高版本,那么对于表中的单个列来说,这是一个很好的简单的中值计算:

     SELECT ( (SELECT MAX(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf) + (SELECT MIN(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf) ) / 2 AS Median 

    在SQL Server 2012中,您应该使用PERCENTILE_CONT :

     SELECT SalesOrderID, OrderQty, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OrderQty) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC 

    另请参阅: http : //blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/

    我原来的回答是:

     select max(my_column) as [my_column], quartile from (select my_column, ntile(4) over (order by my_column) as [quartile] from my_table) i --where quartile = 2 group by quartile 

    这将一举让你的中位数和四分位数范围。 如果你真的只想要一行是中位数,那么取消注释where子句。

    当你把这个插入一个解释计划时,60%的工作是在计算位置依赖的统计数据时排序不可避免的数据。

    我已经修改了答案,以遵循RobertŠevčk-Robajz在以下评论中的出色建议:

     ;with PartitionedData as (select my_column, ntile(10) over (order by my_column) as [percentile] from my_table), MinimaAndMaxima as (select min(my_column) as [low], max(my_column) as [high], percentile from PartitionedData group by percentile) select case when b.percentile = 10 then cast(b.high as decimal(18,2)) else cast((a.low + b.high) as decimal(18,2)) / 2 end as [value], --b.high, a.low, b.percentile from MinimaAndMaxima a join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10) --where b.percentile = 5 

    这应该计算正确的中位数和百分位数值,当你有一个偶数的数据项。 再次,取消注释最后的where子句,如果你只想要中位数而不是整个百分位数分布。

    更好:

     SELECT @Median = AVG(1.0 * val) FROM ( SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), cc FROM dbo.EvenRows AS o CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c ) AS x WHERE rn IN ((c + 1)/2, (c + 2)/2); 

    从主人本人, 伊兹克本甘 !

    MS SQL Server 2012(及更高版本)具有PERCENTILE_DISC函数,该函数计算排序值的特定百分比。 PERCENTILE_DISC(0.5)将计算中位数 – https://msdn.microsoft.com/zh-cn/library/hh231327.aspx

    我刚刚遇到这个页面,同时寻找一组基于解决方案的中位数。 在看了这里的一些解决方案之后,我想出了以下内容。 希望是帮助/工作。

     DECLARE @test TABLE( i int identity(1,1), id int, score float ) INSERT INTO @test (id,score) VALUES (1,10) INSERT INTO @test (id,score) VALUES (1,11) INSERT INTO @test (id,score) VALUES (1,15) INSERT INTO @test (id,score) VALUES (1,19) INSERT INTO @test (id,score) VALUES (1,20) INSERT INTO @test (id,score) VALUES (2,20) INSERT INTO @test (id,score) VALUES (2,21) INSERT INTO @test (id,score) VALUES (2,25) INSERT INTO @test (id,score) VALUES (2,29) INSERT INTO @test (id,score) VALUES (2,30) INSERT INTO @test (id,score) VALUES (3,20) INSERT INTO @test (id,score) VALUES (3,21) INSERT INTO @test (id,score) VALUES (3,25) INSERT INTO @test (id,score) VALUES (3,29) DECLARE @counts TABLE( id int, cnt int ) INSERT INTO @counts ( id, cnt ) SELECT id, COUNT(*) FROM @test GROUP BY id SELECT drv.id, drv.start, AVG(t.score) FROM ( SELECT MIN(ti)-1 AS start, t.id FROM @test t GROUP BY t.id ) drv INNER JOIN @test t ON drv.id = t.id INNER JOIN @counts c ON t.id = c.id WHERE ti = ((c.cnt+1)/2)+drv.start OR ( ti = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0 ) GROUP BY drv.id, drv.start 

    简单,快速,准确

     SELECT x.Amount FROM (SELECT amount, Count(1) OVER (partition BY 'A') AS TotalRows, Row_number() OVER (ORDER BY Amount ASC) AS AmountOrder FROM facttransaction ft) x WHERE x.AmountOrder = Round(x.TotalRows / 2.0, 0) 

    如果你想在SQL Server中使用Create Aggregate函数,那么就是这样做的。 这样做的好处是能够编写干净的查询。 请注意,这个过程可以很容易地用来计算百分比值。

    创建一个新的Visual Studio项目并将目标框架设置为.NET 3.5(这适用于SQL 2008,在SQL 2012中可能不同)。 然后创建一个类文件,并放入下面的代码,或c#等效:

     Imports Microsoft.SqlServer.Server Imports System.Data.SqlTypes Imports System.IO <Serializable> <SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _ IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)> Public Class Median Implements IBinarySerialize Private _items As List(Of Decimal) Public Sub Init() _items = New List(Of Decimal)() End Sub Public Sub Accumulate(value As SqlDecimal) If Not value.IsNull Then _items.Add(value.Value) End If End Sub Public Sub Merge(other As Median) If other._items IsNot Nothing Then _items.AddRange(other._items) End If End Sub Public Function Terminate() As SqlDecimal If _items.Count <> 0 Then Dim result As Decimal _items = _items.OrderBy(Function(i) i).ToList() If _items.Count Mod 2 = 0 Then result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@ Else result = _items((_items.Count - 1) / 2) End If Return New SqlDecimal(result) Else Return New SqlDecimal() End If End Function Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read 'deserialize it from a string Dim list = r.ReadString() _items = New List(Of Decimal) For Each value In list.Split(","c) Dim number As Decimal If Decimal.TryParse(value, number) Then _items.Add(number) End If Next End Sub Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write 'serialize the list to a string Dim list = "" For Each item In _items If list <> "" Then list += "," End If list += item.ToString() Next w.Write(list) End Sub End Class 

    然后编译它并将DLL和PDB文件复制到您的SQL Server计算机并在SQL Server中运行以下命令:

     CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}' WITH PERMISSION_SET=SAFE; GO CREATE AGGREGATE Median(@value decimal(9, 3)) RETURNS decimal(9, 3) EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median]; GO 

    然后,您可以编写一个查询来计算中位数,如下所示:SELECT dbo.Median(Field)FROM Table

    贾斯汀上面的例子非常好。 但是,主键需要说清楚。 我已经看到了没有密钥的野外代码,结果是不好的。

    我得到的关于Percentile_Cont的抱怨是它不会给你一个来自数据集的实际值。 从数据集中得到一个“中位数”是一个实际值,使用Percentile_Disc。

     SELECT SalesOrderID, OrderQty, PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY OrderQty) OVER (PARTITION BY SalesOrderID) AS MedianCont FROM Sales.SalesOrderDetail WHERE SalesOrderID IN (43670, 43669, 43667, 43663) ORDER BY SalesOrderID DESC 

    以下查询从一列中的值列表中返回中位数 。 它不能作为聚合函数使用,也可以在内部select中使用WHERE子句作为子查询。

    SQL Server 2005+:

     SELECT TOP 1 value from ( SELECT TOP 50 PERCENT value FROM table_name ORDER BY value )for_median ORDER BY value DESC 

    请参阅SQL中的中值计算的其他解决方案:“ 使用MySQL计算中值的简单方法 ”(解决方案大多与供应商无关)。

    在UDF中,写

      Select Top 1 medianSortColumn from Table T Where (Select Count(*) from Table Where MedianSortColumn < (Select Count(*) From Table) / 2) Order By medianSortColumn 

    对于来自“table1”的连续变量/度量“col1”

     select col1 from (select top 50 percent col1, ROW_NUMBER() OVER(ORDER BY col1 ASC) AS Rowa, ROW_NUMBER() OVER(ORDER BY col1 DESC) AS Rowd from table1 ) tmp where tmp.Rowa = tmp.Rowd 

    虽然贾斯汀授予的解决方案似乎很稳固,但我发现,如果在给定的分区键中有多个重复的值,那么ASC重复值的行号就会出现乱序,因此它们不能正确对齐。

    这是我的结果中的一个片段:

     KEY VALUE ROWA ROWD 13 2 22 182 13 1 6 183 13 1 7 184 13 1 8 185 13 1 9 186 13 1 10 187 13 1 11 188 13 1 12 189 13 0 1 190 13 0 2 191 13 0 3 192 13 0 4 193 13 0 5 194 

    我用Justin的代码作为这个解决方案的基础。 尽管使用多个派生表的效率不高,但它解决了我遇到的行排序问题。 任何改进都是值得欢迎的,因为我没有T-SQL的经验。

     SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE FROM ( SELECT PKEY,VALUE,ROWA,ROWD, 'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END) FROM ( SELECT PKEY, cast(VALUE as decimal(5,2)) as VALUE, ROWA, ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD FROM ( SELECT PKEY, VALUE, ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA FROM [MTEST] )T1 )T2 )T3 WHERE FLAG = '1' GROUP BY PKEY ORDER BY PKEY 

    我想自己找出一个解决方案,但是我的大脑绊倒了,并且在路上倒下了。 我认为它是有效的,但不要求我在早上解释它。 :P

     DECLARE @table AS TABLE ( Number int not null ); insert into @table select 2; insert into @table select 4; insert into @table select 9; insert into @table select 15; insert into @table select 22; insert into @table select 26; insert into @table select 37; insert into @table select 49; DECLARE @Count AS INT SELECT @Count = COUNT(*) FROM @table; WITH MyResults(RowNo, Number) AS ( SELECT RowNo, Number FROM (SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo ) SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2) 
     --Create Temp Table to Store Results in DECLARE @results AS TABLE ( [Month] datetime not null ,[Median] int not null ); --This variable will determine the date DECLARE @IntDate as int set @IntDate = -13 WHILE (@IntDate < 0) BEGIN --Create Temp Table DECLARE @table AS TABLE ( [Rank] int not null ,[Days Open] int not null ); --Insert records into Temp Table insert into @table SELECT rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank] ,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open] FROM mdbrpt.dbo.View_Request SVR LEFT OUTER JOIN dbo.dtv_apps_systems vapp on SVR.category = vapp.persid LEFT OUTER JOIN dbo.prob_ctg pctg on SVR.category = pctg.persid Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause] on [SVR].[rootcause]=[Root Cause].[id] Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status] on [SVR].[status]=[Status].[code] LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net] on [net].[id]=SVR.[affected_rc] WHERE SVR.Type IN ('P') AND SVR.close_date IS NOT NULL AND [Status].[SYM] = 'Closed' AND SVR.parent is null AND [Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown') AND ( [vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS') OR pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log') AND [Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS') ) AND DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) ORDER BY [Days Open] DECLARE @Count AS INT SELECT @Count = COUNT(*) FROM @table; WITH MyResults(RowNo, [Days Open]) AS ( SELECT RowNo, [Days Open] FROM (SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo ) insert into @results SELECT DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month] ,AVG([Days Open])as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2) set @IntDate = @IntDate+1 DELETE FROM @table END select * from @results order by [Month] 

    这适用于SQL 2000:

     DECLARE @testTable TABLE ( VALUE INT ) --INSERT INTO @testTable -- Even Test --SELECT 3 UNION ALL --SELECT 5 UNION ALL --SELECT 7 UNION ALL --SELECT 12 UNION ALL --SELECT 13 UNION ALL --SELECT 14 UNION ALL --SELECT 21 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 29 UNION ALL --SELECT 40 UNION ALL --SELECT 56 -- --INSERT INTO @testTable -- Odd Test --SELECT 3 UNION ALL --SELECT 5 UNION ALL --SELECT 7 UNION ALL --SELECT 12 UNION ALL --SELECT 13 UNION ALL --SELECT 14 UNION ALL --SELECT 21 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 23 UNION ALL --SELECT 29 UNION ALL --SELECT 39 UNION ALL --SELECT 40 UNION ALL --SELECT 56 DECLARE @RowAsc TABLE ( ID INT IDENTITY, Amount INT ) INSERT INTO @RowAsc SELECT VALUE FROM @testTable ORDER BY VALUE ASC SELECT AVG(amount) FROM @RowAsc ra WHERE ra.id IN ( SELECT ID FROM @RowAsc WHERE ra.id - ( SELECT MAX(id) / 2.0 FROM @RowAsc ) BETWEEN 0 AND 1 ) 

    对于像我这样正在学习基础知识的新手来说,我个人觉得这个例子更容易理解,因为更容易理解到底发生了什么,以及中值的来源。

     select ( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1] ,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2] from (select datediff(dd,startdate,enddate) as [Value1] ,xxxxxxxxxxxxxx as [Value2] from dbo.table1 )a 

    在绝对敬畏上面的一些代码虽然!

    这是我能想出的一个简单的答案。 我的数据工作得很好。 如果你想排除某些值只是添加一个where子句的内部选择。

     SELECT TOP 1 ValueField AS MedianValue FROM (SELECT TOP(SELECT COUNT(1)/2 FROM tTABLE) ValueField FROM tTABLE ORDER BY ValueField) A ORDER BY ValueField DESC 

    以下解决方案在这些假设下工作:

    • 没有重复的值
    • 没有NULL

    码:

     IF OBJECT_ID('dbo.R', 'U') IS NOT NULL DROP TABLE dbo.R CREATE TABLE R ( A FLOAT NOT NULL); INSERT INTO R VALUES (1); INSERT INTO R VALUES (2); INSERT INTO R VALUES (3); INSERT INTO R VALUES (4); INSERT INTO R VALUES (5); INSERT INTO R VALUES (6); -- Returns Median(R) select SUM(A) / CAST(COUNT(A) AS FLOAT) from R R1 where ((select count(A) from R R2 where R1.A > R2.A) = (select count(A) from R R2 where R1.A < R2.A)) OR ((select count(A) from R R2 where R1.A > R2.A) + 1 = (select count(A) from R R2 where R1.A < R2.A)) OR ((select count(A) from R R2 where R1.A > R2.A) = (select count(A) from R R2 where R1.A < R2.A) + 1) ; 
     DECLARE @Obs int DECLARE @RowAsc table ( ID INT IDENTITY, Observation FLOAT ) INSERT INTO @RowAsc SELECT Observations FROM MyTable ORDER BY 1 SELECT @Obs=COUNT(*)/2 FROM @RowAsc SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs 

    对于大规模数据集,你可以试试这个GIST:

    https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2

    它通过聚合在你的集合中找到的不同的值(比如年龄,出生年份等),并使用SQL窗口函数来查找你在查询中指定的百分位数。

    我尝试了几个选择,但由于我的数据记录具有重复的值,ROW_NUMBER版本似乎不是我的选择。 所以在这里我使用的查询(与NTILE版本):

     SELECT distinct CustomerId, ( MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) + MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) )/2 MEDIAN FROM ( SELECT CustomerId, TotalDue, NTILE(2) OVER ( PARTITION BY CustomerId ORDER BY TotalDue ASC) AS Percent50_Asc, NTILE(2) OVER ( PARTITION BY CustomerId ORDER BY TotalDue DESC) AS Percent50_desc FROM Sales.SalesOrderHeader SOH ) x ORDER BY CustomerId; 

    在Jeff Atwood的回答的基础上,用GROUP BY和一个相关的子查询来得到每个组的中位数。

     SELECT TestID, ( (SELECT MAX(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf) + (SELECT MIN(Score) FROM (SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf) ) / 2 AS MedianScore, AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore FROM Posts_parent GROUP BY Posts_parent.TestID 

    通常情况下,我们可能需要计算中间值不仅仅是整个表,而是关于某个ID的聚合。 换句话说,计算我们表中每个ID的中位数,其中每个ID都有很多记录。 (基于@gdoron编辑的解决方案:良好的性能,并在许多SQL中工作)

     SELECT our_id, AVG(1.0 * our_val) as Median FROM ( SELECT our_id, our_val, COUNT(*) OVER (PARTITION BY our_id) AS cnt, ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rnk FROM our_table ) AS x WHERE rnk IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id; 

    希望它有帮助。

    对于你的问题,杰夫·阿特伍德已经给出了简单而有效的解决方案。 但是,如果你正在寻找一些替代方法来计算中位数,下面的SQL代码将帮助你。

     create table employees(salary int); insert into employees values(8); insert into employees values(23); insert into employees values(45); insert into employees values(123); insert into employees values(93); insert into employees values(2342); insert into employees values(2238); select * from employees; declare @odd_even int; declare @cnt int; declare @middle_no int; set @cnt=(select count(*) from employees); set @middle_no=(@cnt/2)+1; select @odd_even=case when (@cnt%2=0) THEN -1 ELse 0 END ; select AVG(tbl.salary) from (select salary,ROW_NUMBER() over (order by salary) as rno from employees group by salary) tbl where tbl.rno=@middle_no or tbl.rno=@middle_no+@odd_even;