请教SQL一个query的问题，average daily account active time

大约 4 年

楼主 (未名空间)

我现在有一个project，是计算用户账号平均每天的使用时间。
原始数据类似于这样子：
UserID Event Date_time
1 Login 2021-01-05 08:25: 33
1 Logout 2021-01-05 08:35: 47
1 Login 2021-01-05 08:36: 12
1 Logout 2021-01-05 18:22: 05
2 Login 2021-01-05 08:25: 33
2 Logout 2021-01-05 08:35: 47
2 Login 2021-01-05 08:36: 12
2 Logout 2021-01-05 18:22: 05
3 Login 2021-01-05 08:25: 33
3 Logout 2021-01-05 08:35: 47
3 Login 2021-01-05 08:36: 12
3 Logout 2021-01-05 18:22: 05
.
.
.

我本来觉得是一个简单的问题，现在的想法是做一个pivot，变成类似于：
UserID log_date Log_sequence Login_time Logout_time
这样子两个column相减就得到了这一次login的使用时间。

难点是：
1. 用户每天有多次login和logout，需要把login logout一一对应好才行，这个怎么办到？
2. 用户可能在前一天的夜晚12点前login，但是在第二天的凌晨某个时候logout。为了方便，对于这样的case，我把用户的使用时间归为login那一点的那天，而不是login-
12：00am， 12：00am-logout这样子分开算。
3. 我在计算的时候必须要设定一个时间范围，比如从01/01到03/31。但是对于某些用
户，他可能存在的情况是类似于第二点里那样子，他在01/01这一天的第一个event是个logout，而不是login。在03/31这一天的最后一个event是个login，而不是logout。这种的case要掐头或者去尾才行。

请问这个该如何解决？谢谢了

TheMatrix

大约 4 年

这个如果用通用编程语言比如python思考的话，不难，就是个list processing，多遍
pass。

SQL query现在已经是图灵完备的了，其他语言能处理，SQL query就能处理。而SQL
query很接近list processing以及functional programming，对这类问题处理起来应该都没啥问题。

【在 robustsong (莫非) 的大作中提到: 】
: 我现在有一个project，是计算用户账号平均每天的使用时间。
: 原始数据类似于这样子：
: UserID Event Date_time
: 1 Login 2021-01-05 08:25: 33
: 1 Logout 2021-01-05 08:35: 47
: 1 Login 2021-01-05 08:36: 12
: 1 Logout 2021-01-05 18:22: 05
: 2 Login 2021-01-05 08:25: 33
: 2 Logout 2021-01-05 08:35: 47
: 2 Login 2021-01-05 08:36: 12
: ...................

robustsong

大约 4 年

谢谢解释。你能不能帮我把具体的query大概写一下呢？我实在是没有办法解决我的难
点。
【在 TheMatrix (TheMatrix) 的大作中提到: 】
: 这个如果用通用编程语言比如python思考的话，不难，就是个list processing，多遍
: pass。
: SQL query现在已经是图灵完备的了，其他语言能处理，SQL query就能处理。而SQL : query很接近list processing以及functional programming，对这类问题处理起来应该
: 都没啥问题。

ztk

大约 4 年

大概写个oracle query，没实际测试，主要试图解决问题1&2，其他细节你自己补充，
如有不对欢迎指正

with sortedLog as
(select rowNum as logId, log.* from log
order by userId, date_time
)
select (logouts.date_time-logins.date_time) as usageTime, to_date(logins.
date_
time,'YYYY/MM/DD') as eventDate, ...
from
( select * from sortedLog
where event = 'login'
) as logins
join
( select * from sortedLog
where event = 'logout'
) as logouts
on logins.userId=logouts.userId and
logins.logid=logouts.logid-1

【在 robustsong (莫非) 的大作中提到: 】
: 谢谢解释。你能不能帮我把具体的query大概写一下呢？我实在是没有办法解决我的难
: 点。

TheMatrix

大约 4 年

我是用SQL Server的SQL，因为用到时间差函数，各个SQL不一样，SQLServer里是
datediff。我的code在codeshare:
https://codeshare.io/X8AZXz

有一个对数据的假定是，一个用户的login-logout period不能overlap。这是合理的假定。

结果有3个字段：
UserID
EventDate - 就是Login Date
DurationSec - 一天中全部login时长之和，in seconds.

【在 robustsong (莫非) 的大作中提到: 】
: 我现在有一个project，是计算用户账号平均每天的使用时间。
: 原始数据类似于这样子：
: UserID Event Date_time
: 1 Login 2021-01-05 08:25: 33
: 1 Logout 2021-01-05 08:35: 47
: 1 Login 2021-01-05 08:36: 12
: 1 Logout 2021-01-05 18:22: 05
: 2 Login 2021-01-05 08:25: 33
: 2 Logout 2021-01-05 08:35: 47
: 2 Login 2021-01-05 08:36: 12
: ...................

TheMatrix

大约 4 年

这个query也解决了你说的“掐头去尾”。每一个login和相应的logout pair在一起，
最后只取全部login，并且去掉无相应logout的。这就实现了掐头去尾。

【在 TheMatrix (TheMatrix) 的大作中提到: 】
: 我是用SQL Server的SQL，因为用到时间差函数，各个SQL不一样，SQLServer里是
: datediff。我的code在codeshare:
: https://codeshare.io/X8AZXz
: 有一个对数据的假定是，一个用户的login-logout period不能overlap。这是合理的假
: 定。
: 结果有3个字段：
: UserID
: EventDate - 就是Login Date
: DurationSec - 一天中全部login时长之和，in seconds.

rtn

大约 4 年

学习了lead() function，非常感谢。

如果不用analytic function, could I get the list of time used with the
following two joins:

select a.*, datediff('ss', a.eventtime, b.eventtime) from data_1 a, data_1 b, data_1 c
where a.userid = b.userid
and a.userid = c.userid(+)
and a.eventtime < b.eventtime
and a.eventtime < c.eventtime(+)
and a.event = 'Login'
and a.event != b.event
and b.eventtime >= c.eventtime(+)
and c.eventtime(+) is null
order by a.userid, a.eventtime;

【在 TheMatrix (TheMatrix) 的大作中提到: 】
: 这个query也解决了你说的“掐头去尾”。每一个login和相应的logout pair在一起，
: 最后只取全部login，并且去掉无相应logout的。这就实现了掐头去尾。

TheMatrix

大约 4 年

analytic function可以用join实现，比较麻烦，效率也要低很多。我的data_2 CTE改
成用join实现是这样的。见附图。

你的这个实现，我理解你的意思是让a和b两个之间紧挨着，也就是中间隔着一个c，但
是c必须是null。Oracle的outer join的(+)方法我不熟悉，不好说你的对不对。但是我感觉你这个很难对。

【在 rtn (rtn) 的大作中提到: 】
: 学习了lead() function，非常感谢。
: 如果不用analytic function, could I get the list of time used with the
: following two joins:
: select a.*, datediff('ss', a.eventtime, b.eventtime) from data_1 a, data_1 b
: , data_1 c
: where a.userid = b.userid
: and a.userid = c.userid(+)
: and a.eventtime < b.eventtime
: and a.eventtime < c.eventtime(+)
: and a.event = 'Login'
: ...................

ztk

大约 4 年

不是很明白你query中c的用法。感觉靠比时间大小来join不牢靠。

我那个query是用join实现的，懒得去查函数细节。
(只列出每次login的时间，每天平均之类的细节就留给楼主自己补充了）

数据假设和TheMatrix指出的一样，同一个用户login/logout按时间顺序配对。（没有
同一个用户开好多浏览器同时login之类的情况）
有错的话欢迎指正，可以一起提高。

【在 rtn (rtn) 的大作中提到: 】
: 学习了lead() function，非常感谢。
: 如果不用analytic function, could I get the list of time used with the
: following two joins:
: select a.*, datediff('ss', a.eventtime, b.eventtime) from data_1 a, data_1 b
: , data_1 c
: where a.userid = b.userid
: and a.userid = c.userid(+)
: and a.eventtime < b.eventtime
: and a.eventtime < c.eventtime(+)
: and a.event = 'Login'
: ...................

rtn

大约 4 年

谢谢两位回复。

To TheMatrix:
我这个left outer join是用来找到满总某种要求的第一行，跟first_value()有些像。以前不知道这些Analytic function，就用这个方法。
你的left outer join用了min()，这样就不需要额外的join，也是一个好方法。

Why you need 1=1 in your data_3 where condition?

To ztk:
table c 是用来确保the row selected in the table b is the first row
satisfying the conditions by requiring the row in c is null.

你的方法是排序后，给每个相应的行一个行号，然后join，我觉得应该行。就是不知道加行号和join比，哪个比较好？不过感觉using analytic function should be a
better approach.

多谢两位提出的方法，学习了。

【在 TheMatrix (TheMatrix) 的大作中提到: 】
: analytic function可以用join实现，比较麻烦，效率也要低很多。我的data_2 CTE改
: 成用join实现是这样的。见附图。
: 你的这个实现，我理解你的意思是让a和b两个之间紧挨着，也就是中间隔着一个c，但
: 是c必须是null。Oracle的outer join的(+)方法我不熟悉，不好说你的对不对。但是我
: 感觉你这个很难对。
: b

【在 ztk (ztk) 的大作中提到: 】
: 不是很明白你query中c的用法。感觉靠比时间大小来join不牢靠。
: 我那个query是用join实现的，懒得去查函数细节。
: (只列出每次login的时间，每天平均之类的细节就留给楼主自己补充了）
: 数据假设和TheMatrix指出的一样，同一个用户login/logout按时间顺序配对。（没有
: 同一个用户开好多浏览器同时login之类的情况）
: 有错的话欢迎指正，可以一起提高。
: b

TheMatrix

大约 4 年

你这个方法我用 join on condition 的方法写了一下，可以。
先考虑两个copy的join
from data_1 a
left join data_1 b on a.UserID = b.UserID
and a.EventTime < b.EventTime
...
where a.Event = 'Login' and a.Event != b.Event
得到的是同一UserID的login-logout pair，但是可以跳跃。

然后再考虑第三个copy的join，目的是选出不跳跃的：
from data_1 a
left join data_1 b on a.UserID = b.UserID
and a.EventTime < b.EventTime
left join data_1 c on a.UserID = c.UserID
and a.EventTime < c.EventTime and c.EventTime < b.EventTime
where a.Event = 'Login' and a.Event != b.Event
and c.EventTime is null

我附图中的query注释了"c.EventTime is null"，
是为了看清楚结果哪些c.EventTime为null，哪些不为null。

不过你条件里的">="好像有点问题。

另外1=1 always evaluate to true。这样后面跟几个and condition都可以，包括一个都没有也可以。我也是跟别人学的。

【在 rtn (rtn) 的大作中提到: 】
: 谢谢两位回复。
: To TheMatrix:
: 我这个left outer join是用来找到满总某种要求的第一行，跟first_value()有些像。
: 以前不知道这些Analytic function，就用这个方法。
: 你的left outer join用了min()，这样就不需要额外的join，也是一个好方法。
: Why you need 1=1 in your data_3 where condition?
: To ztk:
: table c 是用来确保the row selected in the table b is the first row
: satisfying the conditions by requiring the row in c is null.
: 你的方法是排序后，给每个相应的行一个行号，然后join，我觉得应该行。就是不知道
: ...................

rtn

大约 4 年

应该只是 b.eventtime > c.eventtime(+) and c.eventtime is null，谢谢纠正。

【在 TheMatrix (TheMatrix) 的大作中提到: 】
: 你这个方法我用 join on condition 的方法写了一下，可以。
: 先考虑两个copy的join
: from data_1 a
: left join data_1 b on a.UserID = b.UserID
: and a.EventTime < b.EventTime
: ...
: where a.Event = 'Login' and a.Event != b.Event
: 得到的是同一UserID的login-logout pair，但是可以跳跃。
: 然后再考虑第三个copy的join，目的是选出不跳跃的：
: from data_1 a
: ...................

robustsong

大约 4 年

非常感谢帮忙。学到了不少有用的办法