SQL to determine a minimum number of consecutive days of access?

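The following User History table contains one record for every day a given user has accessed a website (in a 24 hour UTC period). It has many thousands of records, but only one record per day per user. If the user has not accessed the website on a given day, no record is generated.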

 Id UserId CreationDate
 ------ ------ ------------
 750997 12 2009-07-07 18:42:20.723
 750998 15 2009-07-07 18:42:20.927
 751000 19 2009-07-07 18:42:22.283

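What I'm looking for is a SQL query on this table, with good performance, that tells me which user IDs have accessed the website for n consecutive days without missing a day.

In other words, how many users have records in this table with sequential (day-before or day-after) dates? If any day is missing from the sequence, the sequence is broken and should restart at 1; we're looking for users who have reached a given number of consecutive days with no gaps.

Any resemblance between this query and a particular Stack Overflow badge is purely coincidental, of course.. 🙂

The answer is obviously: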

 SELECT DISTINCT UserId
 FROM UserHistory uh1
 WHERE (
        SELECT COUNT(*)
        FROM UserHistory uh2
        WHERE uh2.CreationDate
              BETWEEN uh1.CreationDate AND DATEADD(d, @days, uh1.CreationDate)
       ) = @days
    OR UserId = 52551

Edit:

OK, here's my serious answer:

 DECLARE @days int
 DECLARE @seconds bigint
 SET @days = 30
 SET @seconds = (@days * 24 * 60 * 60) - 1

 SELECT DISTINCT UserId
 FROM (
     SELECT uh1.UserId, Count(uh1.Id) as Conseq
     FROM UserHistory uh1
     INNER JOIN UserHistory uh2
         ON uh2.CreationDate BETWEEN uh1.CreationDate
            AND DATEADD(s, @seconds, DATEADD(dd, DATEDIFF(dd, 0, uh1.CreationDate), 0))
        AND uh1.UserId = uh2.UserId
     GROUP BY uh1.Id, uh1.UserId
 ) as Tbl
 WHERE Conseq >= @days

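Edit:

[Jeff Atwood] This is a great, fast solution and deserves to be accepted, but Rob Farley's solution is also excellent and arguably even faster (!). Please check it out too!

How about (and please make sure the previous statement ended with a semicolon):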

 WITH numberedrows AS (
     SELECT ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY CreationDate)
            - DATEDIFF(day, '19000101', CreationDate) AS TheOffset,
            CreationDate,
            UserID
     FROM tablename
 )
 SELECT MIN(CreationDate), MAX(CreationDate), COUNT(*) AS NumConsecutiveDays, UserID
 FROM numberedrows
 GROUP BY UserID, TheOffset

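The idea is that if we have the list of days (as numbers) and a row_number, then each missed day makes the offset between these two lists slightly bigger. So we're looking for ranges that have a consistent offset.

You could add "ORDER BY NumConsecutiveDays DESC" at the end of this, or say "HAVING count(*) > 14" for a threshold.

I haven't tested this, I'm just writing it off the top of my head. Hopefully it works in SQL2005 and later.

...and it would be helped a great deal by an index on tablename(UserID, CreationDate).

Edited: It turns out Offset is a reserved word, so I used TheOffset instead.

Edited: The suggestion to use COUNT(*) is very valid. I should have done that in the first place but wasn't really thinking. Previously it was using datediff(day, min(CreationDate), max(CreationDate)) instead.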
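To make the offset trick concrete, here is a minimal Python sketch of the same idea outside the database. It assumes the rows are available in memory as (user_id, date) pairs with at most one row per user per day; the function name `longest_runs` and the sample data are hypothetical and are not part of the answer above.

```python
from collections import Counter
from datetime import date


def longest_runs(visits):
    """Return {user_id: length of the longest gap-free run of days}.

    visits: iterable of (user_id, date) pairs, at most one per user per day
    (the same constraint the question states).
    """
    by_user = {}
    for user_id, day in visits:
        by_user.setdefault(user_id, []).append(day)

    longest = {}
    for user_id, days in by_user.items():
        days.sort()
        # Row number minus day number stays constant inside a gap-free run
        # and changes whenever a day is skipped.
        offsets = Counter(
            row - day.toordinal() for row, day in enumerate(days, start=1)
        )
        longest[user_id] = max(offsets.values())
    return longest


visits = [
    (12, date(2009, 7, 7)), (12, date(2009, 7, 8)), (12, date(2009, 7, 9)),
    (15, date(2009, 7, 7)), (15, date(2009, 7, 9)),
]
print(longest_runs(visits))
```

User 12 has three rows sharing one offset, while user 15's skipped day puts its two rows under different offsets, which mirrors grouping by UserID and TheOffset in the query.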

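If you can change the table schema, I'd suggest adding a LongestStreak column to the table, set to the number of consecutive days ending at CreationDate. It's easy to update the table at login time (similar to what you are doing already: if no row exists for the current day, you check whether a row exists for the previous day. If it does, you increment LongestStreak in the new row; otherwise, you set it to 1).

The query becomes obvious after adding this column: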

 if exists(select * from table where LongestStreak >= 30 and UserId = @UserId) -- award the Woot badge. 
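The login-time update described in this answer can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary `history` is a hypothetical stand-in for the table (keyed by user and day, holding the LongestStreak value), and `record_login` is an invented name, not an existing API.

```python
from datetime import date, timedelta


def record_login(history, user_id, today):
    """Insert today's row for the user, if missing, and return its streak.

    history: dict mapping (user_id, date) -> streak length ending on that
    date (hypothetical in-memory stand-in for the LongestStreak column).
    """
    key = (user_id, today)
    if key in history:
        # A row already exists for today; nothing to update.
        return history[key]
    previous = history.get((user_id, today - timedelta(days=1)))
    # Extend yesterday's streak if there was one, otherwise restart at 1.
    history[key] = previous + 1 if previous is not None else 1
    return history[key]


history = {}
first = record_login(history, 12, date(2009, 7, 7))
second = record_login(history, 12, date(2009, 7, 8))
repeat = record_login(history, 12, date(2009, 7, 8))
after_gap = record_login(history, 12, date(2009, 7, 10))
```

With this bookkeeping the badge check only has to look for a row whose streak value reaches the threshold, which is what the `if exists` query does.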

Some nicely expressive SQL along the lines of:

 select userId, dbo.MaxConsecutiveDates(CreationDate) as blah from dbo.Logins group by userId 

Assuming you have a user-defined aggregate function something like this (beware, this is buggy):

 using System;
 using System.Data.SqlTypes;
 using Microsoft.SqlServer.Server;
 using System.Runtime.InteropServices;

 namespace SqlServerProject1
 {
     [StructLayout(LayoutKind.Sequential)]
     [Serializable]
     internal struct MaxConsecutiveState
     {
         public int CurrentSequentialDays;
         public int MaxSequentialDays;
         public SqlDateTime LastDate;
     }

     [Serializable]
     [SqlUserDefinedAggregate(
         Format.Native,
         IsInvariantToNulls = true,       // optimizer property
         IsInvariantToDuplicates = false, // optimizer property
         IsInvariantToOrder = false)      // optimizer property
     ]
     [StructLayout(LayoutKind.Sequential)]
     public class MaxConsecutiveDates
     {
         /// <summary>
         /// The variable that holds the intermediate result of the concatenation
         /// </summary>
         private MaxConsecutiveState _intermediateResult;

         /// <summary>
         /// Initialize the internal data structures
         /// </summary>
         public void Init()
         {
             _intermediateResult = new MaxConsecutiveState
             {
                 LastDate = SqlDateTime.MinValue,
                 CurrentSequentialDays = 0,
                 MaxSequentialDays = 0
             };
         }

         /// <summary>
         /// Accumulate the next value, not if the value is null
         /// </summary>
         /// <param name="value"></param>
         public void Accumulate(SqlDateTime value)
         {
             if (value.IsNull)
             {
                 return;
             }

             int sequentialDays = _intermediateResult.CurrentSequentialDays;
             int maxSequentialDays = _intermediateResult.MaxSequentialDays;
             DateTime currentDate = value.Value.Date;

             // Compare against the previously stored date (LastDate.Value),
             // not against its TimeTicks, which only covers the time of day.
             if (currentDate.AddDays(-1).Equals(_intermediateResult.LastDate.Value))
                 sequentialDays++;
             else
             {
                 maxSequentialDays = Math.Max(sequentialDays, maxSequentialDays);
                 sequentialDays = 1;
             }

             _intermediateResult = new MaxConsecutiveState
             {
                 CurrentSequentialDays = sequentialDays,
                 LastDate = currentDate,
                 MaxSequentialDays = maxSequentialDays
             };
         }

         /// <summary>
         /// Merge the partially computed aggregate with this aggregate.
         /// </summary>
         /// <param name="other"></param>
         public void Merge(MaxConsecutiveDates other)
         {
             // add stuff for two separate calculations
         }

         /// <summary>
         /// Called at the end of aggregation, to return the results of the aggregation.
         /// </summary>
         /// <returns></returns>
         public SqlInt32 Terminate()
         {
             // Plain int comparison; the sbyte casts would overflow past 127 days.
             int max = Math.Max(_intermediateResult.CurrentSequentialDays,
                                _intermediateResult.MaxSequentialDays);
             return new SqlInt32(max);
         }
     }
 }

It seems you could take advantage of the fact that being continuous over n days requires there to be n rows.

So something like:

 SELECT users.UserId, count(1) as cnt
 FROM users
 WHERE users.CreationDate > now() - INTERVAL 30 DAY
 GROUP BY UserId
 HAVING cnt = 30

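Doing this with a single SQL query seems overly complicated to me. Let me break this answer down into two parts.

  1. What you should have been doing until now, and should start doing now:
    Run a daily cron job that checks, for every user, whether he has logged in today, and then increments a counter if he has, or sets it to 0 if he hasn't.
  2. What you should do now:
    – Export this table to a server that doesn't run your website and won't be needed for a while. ;)
    – Sort it by user, then date.
    – Go through it sequentially, keeping a counter…

If this is that important to you, source this event and maintain a table that gives you this information. There is no need to kill the machine with all those crazy queries.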
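The "sort by user, then date, and walk through keeping a counter" step might look like the following Python sketch. It assumes the exported rows are (user_id, date) pairs with at most one row per user per day; the name `users_with_streak` is hypothetical.

```python
from datetime import date, timedelta


def users_with_streak(rows, n):
    """Return the set of users with at least n consecutive days of visits.

    rows: iterable of (user_id, date) pairs, one per user per day.
    """
    qualifying = set()
    prev_user, prev_day, counter = None, None, 0
    # Sorting tuples orders by user first, then by date.
    for user_id, day in sorted(rows):
        if user_id == prev_user and day == prev_day + timedelta(days=1):
            counter += 1   # the streak continues
        else:
            counter = 1    # new user or a gap: restart at 1
        if counter >= n:
            qualifying.add(user_id)
        prev_user, prev_day = user_id, day
    return qualifying


rows = [
    (12, date(2009, 7, 7)), (12, date(2009, 7, 8)), (12, date(2009, 7, 9)),
    (15, date(2009, 7, 7)), (15, date(2009, 7, 9)),
]
print(users_with_streak(rows, 3))
```

This is a single pass after the sort, so the cost is dominated by the sort itself rather than by any self-join.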

You could use a recursive CTE (SQL Server 2005+):

 WITH recur_date AS (
     SELECT t.userid,
            t.creationDate,
            DATEADD(day, 1, t.creationDate) 'nextDay',
            1 'level'
     FROM TABLE t
     UNION ALL
     SELECT t.userid,
            t.creationDate,
            DATEADD(day, 1, t.creationDate) 'nextDay',
            rd.level + 1 'level'
     FROM TABLE t
     JOIN recur_date rd ON t.creationDate = rd.nextDay
                       AND t.userid = rd.userid
 )
 SELECT t.*
 FROM recur_date t
 WHERE t.level = @numDays
 ORDER BY t.userid

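Joe Celko has a complete chapter on this in SQL for Smarties (he calls it Runs and Sequences). I don't have that book at home, so I'll answer this properly when I get to work. (This assumes the history table is called dbo.UserHistory and the number of days is @Days.)

Another lead comes from the SQL Team blog.

The other idea I've had, though I don't have a SQL Server handy here to try it on, is to use a CTE with a partitioned ROW_NUMBER like this: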

 WITH Runs AS (
     SELECT UserID,
            CreationDate,
            ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY CreationDate)
            - ROW_NUMBER() OVER (PARTITION BY UserId, NoBreak ORDER BY CreationDate) AS RunNumber
     FROM (
         SELECT UH.UserID,
                UH.CreationDate,
                ISNULL((SELECT TOP 1 1
                        FROM dbo.UserHistory AS Prior
                        WHERE Prior.UserId = UH.UserId
                          AND Prior.CreationDate
                              BETWEEN DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), -1)
                                  AND DATEADD(dd, DATEDIFF(dd, 0, UH.CreationDate), 0)), 0) AS NoBreak
         FROM dbo.UserHistory AS UH
     ) AS Consecutive
 )
 SELECT UserID, MIN(CreationDate) AS RunStart, MAX(CreationDate) AS RunEnd
 FROM Runs
 GROUP BY UserID, RunNumber
 HAVING DATEDIFF(dd, MIN(CreationDate), MAX(CreationDate)) >= @Days

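The above is likely far harder than it needs to be, but I'll leave it as a brain teaser for when you have some other definition of "a run" than just dates.

A couple of SQL Server 2012 options (assuming N = 100 below).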

 ;WITH T(UserID, NRowsPrevious) AS (
     SELECT UserID,
            DATEDIFF(DAY,
                     LAG(CreationDate, 100) OVER (PARTITION BY UserID ORDER BY CreationDate),
                     CreationDate)
     FROM UserHistory
 )
 SELECT DISTINCT UserID
 FROM T
 WHERE NRowsPrevious = 100

Though with my sample data, the following worked out to be more efficient:

 ;WITH U AS (
     SELECT DISTINCT UserId
     FROM UserHistory
 ) /*Ideally replace with Users table*/
 SELECT UserId
 FROM U
 CROSS APPLY (
     SELECT TOP 1 *
     FROM (
         SELECT DATEDIFF(DAY,
                         LAG(CreationDate, 100) OVER (ORDER BY CreationDate),
                         CreationDate)
         FROM UserHistory UH
         WHERE U.UserId = UH.UserID
     ) T(NRowsPrevious)
     WHERE NRowsPrevious = 100
 ) O

Both rely on the constraint stated in the question that there is at most one record per day per user.
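The look-back idea behind LAG can be illustrated in Python for a single user's dates. This is a sketch under my own counting convention, not a transcription of the queries: it looks back n - 1 rows and requires that row to be exactly n - 1 days earlier, so that a run of exactly n days qualifies. The function name `has_streak` is hypothetical, and the sketch depends on the same one-row-per-day constraint.

```python
from datetime import date, timedelta


def has_streak(days, n):
    """True if the distinct dates in `days` contain n consecutive days.

    With one row per day, the row n - 1 positions back can only be exactly
    n - 1 days earlier if no day in between is missing.
    """
    days = sorted(days)
    span = timedelta(days=n - 1)
    return any(
        days[i] - days[i - (n - 1)] == span
        for i in range(n - 1, len(days))
    )


unbroken = [date(2009, 7, 7), date(2009, 7, 8), date(2009, 7, 9)]
with_gap = [date(2009, 7, 7), date(2009, 7, 9), date(2009, 7, 10)]
print(has_streak(unbroken, 3), has_streak(with_gap, 3))
```

Each row needs only one comparison against a fixed earlier row, which is why this form avoids the self-joins used in several other answers.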

Something like this?

 select distinct userid
 from table t1, table t2
 where t1.UserId = t2.UserId
   AND trunc(t1.CreationDate) = trunc(t2.CreationDate) + n
   AND (
         select count(*)
         from table t3
         where t1.UserId = t3.UserId
           and CreationDate between trunc(t1.CreationDate) and trunc(t1.CreationDate) + n
       ) = n

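I used a simple mathematical property to identify who has accessed the site consecutively: the difference in days between the first access and the last one should equal the number of records in your access log table.

Here is the SQL script, which I tested on an Oracle database (it should work on other databases too):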

 -- show basic understand of the math properties
 select ceil(max(creation_date) - min(creation_date)) max_min_days_diff,
        count(*) real_day_count
 from user_access_log
 group by user_id;

 -- select all users that have consecutively accessed the site
 select user_id
 from user_access_log
 group by user_id
 having ceil(max(creation_date) - min(creation_date)) / count(*) = 1;

 -- get the count of all users that have consecutively accessed the site
 select count(user_id) user_count
 from user_access_log
 group by user_id
 having ceil(max(creation_date) - min(creation_date)) / count(*) = 1;

Table preparation script:

 -- create table
 create table user_access_log (id number, user_id number, creation_date date);

 -- insert seed data
 insert into user_access_log (id, user_id, creation_date) values (1, 12, sysdate);
 insert into user_access_log (id, user_id, creation_date) values (2, 12, sysdate + 1);
 insert into user_access_log (id, user_id, creation_date) values (3, 12, sysdate + 2);
 insert into user_access_log (id, user_id, creation_date) values (4, 16, sysdate);
 insert into user_access_log (id, user_id, creation_date) values (5, 16, sysdate + 1);
 insert into user_access_log (id, user_id, creation_date) values (6, 16, sysdate + 5);

 declare @startdate as datetime, @days as int
 set @startdate = cast('11 Jan 2009' as datetime) -- The startdate
 set @days = 5 -- The number of consecutive days

 SELECT userid, count(1) as [Number of Consecutive Days]
 FROM UserHistory
 WHERE creationdate >= @startdate
   AND creationdate < dateadd(dd, @days, cast(convert(char(11), @startdate, 113) as datetime))
 GROUP BY userid
 HAVING count(1) >= @days

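The expression cast(convert(char(11), @startdate, 113) as datetime) removes the time part of the date, so we start at midnight.

I would also assume that the creationdate and userid columns are indexed.

I just realized that this won't tell you all the users and their total consecutive days. It will, however, tell you which users have visited for a set number of days starting from a date of your choosing.

Revised solution: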

 declare @days as int
 set @days = 30

 select t1.userid
 from UserHistory t1
 where (select count(1)
        from UserHistory t3
        where t3.userid = t1.userid
          and t3.creationdate >= DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate), 0)
          and t3.creationdate < DATEADD(dd, DATEDIFF(dd, 0, t1.creationdate) + @days, 0)
        group by t3.userid
       ) >= @days
 group by t1.userid

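I've checked this, and it queries across all users and all dates. It is based on Spencer's first (joke?) solution, but mine works.

Update: improved the date handling in the second solution.

This should do what you want, but I don't have enough data to test its efficiency. The convoluted CONVERT/FLOOR stuff strips the time portion off the datetime field. If you're using SQL Server 2008, you could use CAST(x.CreationDate AS DATE) instead.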

 DECLARE @Range AS INT
 SET @Range = 10

 SELECT DISTINCT UserId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
   FROM tblUserLogin a
  WHERE EXISTS
    (SELECT 1
       FROM tblUserLogin b
      WHERE a.userId = b.userId
        AND (SELECT COUNT(DISTINCT(CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, CreationDate)))))
               FROM tblUserLogin c
              WHERE c.userid = b.userid
                AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, c.CreationDate)))
                    BETWEEN CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate)))
                        AND CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, a.CreationDate))) + @Range - 1) = @Range)

Creation script

 CREATE TABLE [dbo].[tblUserLogin](
     [Id] [int] IDENTITY(1,1) NOT NULL,
     [UserId] [int] NULL,
     [CreationDate] [datetime] NULL
 ) ON [PRIMARY]

Spencer almost had it, but this should be the working code:

 SELECT DISTINCT UserId
 FROM History h1
 WHERE (
        SELECT COUNT(*)
        FROM History
        WHERE UserId = h1.UserId
          AND CreationDate BETWEEN h1.CreationDate AND DATEADD(d, @n-1, h1.CreationDate)
       ) >= @n

Off the top of my head, MySQLish:

 SELECT start.UserId
 FROM UserHistory AS start
 LEFT OUTER JOIN UserHistory AS pre_start
     ON pre_start.UserId = start.UserId
    AND DATE(pre_start.CreationDate) = DATE_SUB(DATE(start.CreationDate), INTERVAL 1 DAY)
 LEFT OUTER JOIN UserHistory AS subsequent
     ON subsequent.UserId = start.UserId
    AND DATE(subsequent.CreationDate) <= DATE_ADD(DATE(start.CreationDate), INTERVAL 30 DAY)
 WHERE pre_start.Id IS NULL
 GROUP BY start.Id
 HAVING COUNT(subsequent.Id) = 30

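Untested, and it almost certainly needs some conversion for MSSQL, but I think it gives some ideas.

How about using a Tally table? It follows a more algorithmic approach, and the execution plan is a breeze. Populate the tally table with the numbers from 1 to MaxDaysBehind, the number of days back you want to scan the table (for example, 90 looks back 3 months, and so on).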

 declare @ContinousDays int
 set @ContinousDays = 30 -- select those that have 30 consecutive days

 create table #tallyTable (Tally int)
 insert into #tallyTable values (1)
 ...
 insert into #tallyTable values (90) -- insert numbers for as many days behind as you want to scan

 select [UserId], count(*), t.Tally
 from HistoryTable
 join #tallyTable as t on t.Tally > 0
 where [CreationDate] > getdate() - @ContinousDays - t.Tally
   and [CreationDate] < getdate() - t.Tally
 group by [UserId], t.Tally
 having count(*) >= @ContinousDays

 delete #tallyTable

Tweaking Bill's query a bit. You might have to truncate the date before grouping so that each day's login is counted only once…

 SELECT UserId from History WHERE CreationDate > ( now() - n ) GROUP BY UserId, DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) AS TruncatedCreationDate HAVING COUNT(TruncatedCreationDate) >= n 

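Edited to use DATEADD(dd, DATEDIFF(dd, 0, CreationDate), 0) instead of convert(char(10), CreationDate, 101).

@IDisposable I was looking to use datepart earlier, but I was too lazy to look up the syntax, so I figured I'd use convert instead. I didn't know it had a significant impact. Thanks! Now I know.

Assuming a schema like this: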

 create table dba.visits ( id integer not null, user_id integer not null, creation_date date not null ); 

This will extract the contiguous ranges from a date sequence with gaps.

 select l.creation_date as start_d, -- Get first date in contiguous range
        ( select min(a.creation_date) as creation_date
            from "DBA"."visits" a
            left outer join "DBA"."visits" b
              on a.creation_date = dateadd(day, -1, b.creation_date)
             and a.user_id = b.user_id
           where b.creation_date is null
             and a.creation_date >= l.creation_date
             and a.user_id = l.user_id
        ) as end_d -- Get last date in contiguous range
   from "DBA"."visits" l
   left outer join "DBA"."visits" r
     on r.creation_date = dateadd(day, -1, l.creation_date)
    and r.user_id = l.user_id
  where r.creation_date is null