SQL

SQL:SQL 窗口函数 & 聚合函数

RANK & DENSE RANK & PRECEDING & UNBOUNDED & LAG & LEAD & GROUP_CONCAT

1. 窗口函数

  • 窗口函数,也叫 OLAP 函数(Online Anallytical Processing,联机分析处理),可以对数据库数据进行实时分析处理。

  • 基本语法
      <窗口函数> over (partition by <用于分组的列名>
                      order by <用于排序的列名>)
    
  • 窗口函数分类
    • 专用窗口函数,包括 rank, dense_rank, row_number 等专用窗口函数;
    • 聚合函数,如 sum、avg、count、max、min 等;
  • 因为窗口函数是对 where 或者 group by 子句处理后的结果进行操作,所以窗口函数原则上只能写在 select 子句中。

2. 窗口函数 VS GROUP BY

  • 窗口函数具备了我们之前学过的 group by 子句分组的功能和 order by 子句排序的功能,但是二者具有完全不同的应用场景。

  • 窗口函数有以下功能:

    • 同时具有分组和排序的功能
    • 不减少原表的行数

3. RANK

  • 语法

      select *,
         rank() over (partition by 分组字段
                       order by 排序字段 desc) as ranking
      from table
    
    • partition by 用来对表分组;
    • order by 子句的功能是对分组后的结果进行排序;
  • 示例

      select *,
         rank() over (partition by t.department_id
                       order by t.salary_amount desc) as ranking
      from (
          SELECT s.employee_id,e.employee_name,e.department_id as department_id,s.salary_amount  
          FROM test.employee e join test.salary s  on s.employee_id=e.employee_id
      ) t;
    

    sql-2

4. DENSE RANK

  • dense_rank,排序时会将相同值的排名排为同一个值;

  • 语法

      select *,
         dense_rank() over (partition by 分组字段
                       order by 排序字段 desc) as ranking
      from table
    
    • partition by 用来对表分组;
    • order by 子句的功能是对分组后的结果进行排序;
  • 示例

      select *,
         dense_rank() over (partition by t.department_id
                       order by t.salary_amount desc) as ranking
      from (
          SELECT s.employee_id,e.employee_name,e.department_id as department_id,s.salary_amount  
          FROM test.employee e join test.salary s  on s.employee_id=e.employee_id
      ) t;
    

    sql-2

窗口聚合函数

  • 示例
      select *,
         sum(salary_amount) over (partition by t.department_id) as dep_sum_salary,
         avg(salary_amount) over (partition by t.department_id) as dep_avg_salary,
         count(salary_amount) over (partition by t.department_id) as dep_count,
         max(salary_amount) over (partition by t.department_id) as dep_max_salary,
      from (
          SELECT s.employee_id,e.employee_name,e.department_id as department_id,s.salary_amount  
          FROM test.employee e join test.salary s  on s.employee_id=e.employee_id
      ) t;
    

    sql-2

TOP N 问题

  • 示例

      select * from (     -- 注意:这里需要使用嵌套才能使用排名字段
          select *,        
             dense_rank() over (PARTITION by t.department_id order by t.salary_amount desc) as 排名
          from (
              SELECT s.employee_id,e.employee_name,e.department_id as department_id,s.salary_amount  
              FROM test.employee e join test.salary s  on s.employee_id=e.employee_id
          ) t
      ) t1
      where t1.排名<=10
      ;
    

    sql-2

组内比较问题

  • 示例

      select * from (
          select *,        
             avg(salary_amount) over (partition by t.department_id) as dep_avg_salary
          from (
              SELECT s.employee_id,e.employee_name,e.department_id as department_id,s.salary_amount  
              FROM test.employee e join test.salary s  on s.employee_id=e.employee_id
          ) t
      ) t1
      where t1.salary_amount>=t1.dep_avg_salary
      ;
    

    sql-2

PRECEDING

  • 往前 n 行数据

最近 N (小时/天/周/月)平均统计问题

  • 示例:
    • 查找不同产品每个月销量、以及截至当前月最近3个月的平均销售额;

        SELECT m.product,m.ym,m.amount,
        AVG(m.amount) OVER(
         PARTITION BY m.product
         ORDER BY m.ym
         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        )
        FROM sales_monthly m
        ORDER BY m.product,m.ym
      
      • AVG 函数 OVER 子句中的 PARTITION BY 选项表示按照产品进行分区。
      • ORDER BY选项表示按照月份进行排序;
      • ROWS BETWEEN 2 PRECEDING AND CURRENT ROW 表示窗口从当前行的前2行开始,直到当前行结束。

FOLLOWING

  • 往后 n 行数据
  • 使用方法,与 PRECEDING 类似,只不过对象是统计到当前行开始往后。

UNBOUNDED

  • UNBOUNDED PRECEDING 表示从前面的起点;

累计求和问题(ROWS BETWEEN)

  • 示例:
    • 查找不同产品截至当前月份的累计销售额

        SELECT m.product,m.ym,m.amount,
        SUM(m.amount) OVER(
         PARTITION BY m.product
         ORDER BY m.ym
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        )
        FROM sales_monthly m
        ORDER BY m.product,m.ym
      
      • SUM函数OVER子句中的PARTITION BY选项表示按照产品进行分区。
      • ORDER BY选项表示按照月份进行排序。
      • ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW 表示窗口从当前分区第1行开始,直到当前行结束。
  • UNBOUNDED FOLLOWING 表示到后面的终点。

累计求和问题(RANGE BETWEEN)

  • 示例:
    • 查找短期之内(5天)累计转账超过100万元的账户。

        SELECT log_ts,from_user,total_amount FROM (
            SELECT to_char(t.log_ts,'yyyy-mm-dd hh24:mi:ss') log_ts,t.from_user,t.amount,
              SUM(t.amount) OVER(
                PARTITION BY t.from_user
                ORDER BY t.log_ts
                RANGE INTERVAL '5' DAY PRECEDING
              ) AS total_amount
            FROM transfer_log t 
            WHERE t.type = '转账'
        )
        WHERE total_amount >= 1000000;
      

LAG

  • 往前第 n 行数据
  • 语法:
      Lag ( scalar_expression [ ,offset ] , [ default ] ) OVER ( [ partition_by_clause ] order_by_clause )
    

去年同期问题

TODO

LEAD

  • 往后第 n 行数据
  • 基本语法:
      LAG( 想要的返回值(可以是各种函数) 
            [,偏移量(从1开始)] 
            [,如果偏移后没有选定的行,则返回的默认值])  
      OVER ( [ partition_by 分区 ] order_by 顺序 ) 
    
  • 高级语法:
      lag(列名,1,0) over (partition by 分组列 order by 排序列 rows between 开始位置 preceding and 结束位置 following)
    

用户访问时长问题

  • 有一个日志登陆列表,获取用户在某个页面停留时长
      +------------------+----------------------+---------------+--+
      | userid           |             time     | url           |
      +------------------+----------------------+---------------+--+
      | Peter            | 2015-10-12 01:10:00  | url1          |
      | Peter            | 2015-10-12 01:15:10  | url2          |
      | Peter            | 2015-10-12 01:16:40  | url3          |
      | Peter            | 2015-10-12 02:13:00  | url4          |
      | Peter            | 2015-10-12 03:14:30  | url5          |
      | Marry            | 2015-11-12 01:10:00  | url1          |
      | Marry            | 2015-11-12 01:15:10  | url2          |
      | Marry            | 2015-11-12 01:16:40  | url3          |
      | Marry            | 2015-11-12 02:13:00  | url4          |
      | Marry            | 2015-11-12 03:14:30  | url5          |
      +------------------+----------------------+---------------+--+
    
  • 代码
      select
          userid
          ,time
          ,UNIX_TIMESTAMP(
              lead(time,1) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss'
          ) 
          - 
          UNIX_TIMESTAMP(
              time,'yyyy-MM-dd HH:mm:ss'
          ) as period
          ,url
      from 
          user_log
    
  • 结果
      +---------+----------------------+----------------------+---------+-------+--+
      | userid  |        stime         |        etime         | period  |  url  |
      +---------+----------------------+----------------------+---------+-------+--+
      | Marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | 310     | url1  |
      | Marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | 90      | url2  |
      | Marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | 3380    | url3  |
      | Marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | 3690    | url4  |
      | Marry   | 2015-11-12 03:14:30  | NULL                 | NULL    | url5  |
      | Peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | 310     | url1  |
      | Peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | 90      | url2  |
      | Peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | 3380    | url3  |
      | Peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | 3690    | url4  |
      | Peter   | 2015-10-12 03:14:30  | NULL                 | NULL    | url5  |
      +---------+----------------------+----------------------+---------+-------+--+
    

连续登录问题

  • 寻找至少连续出现3次的数字
      +--------------+----+
      | id           |num |
      +--------------+----+
      | 1            | 1  | 
      | 2            | 1  | 
      | 3            | 1  | 
      | 4            | 1  | 
      | 5            | 2  | 
      | 6            | 2  | 
      | 7            | 3  | 
      | 8            | 3  | 
      | 9            | 3  | 
      | 10           | 4  | 
      +--------------+----+
    
  • 思路:增加两列,使用lag函数-把下面的数据往上错位一个,错位2个,判断num和错位的两列是否相等
  • 代码
       select
          id
          ,distinct num
      from
      (
          select
              id
              ,num
              ,lag(num,1) over(partition by id) as lag1
              ,lag(num,2) over(partition by id) as lag2
          from 
              log_table
      ) a
      where num=lag1 and lag1=lag2
    

用户先后进行某项操作问题

  • 统计每天符合以下条件的用户数:
    • A 操作(opr_id)之后是 B 操作,AB 操作必须相邻
      select date,count(*)
      from(
          select user_id
          from(
              select user_id,
                  convert(log_time,date) date,
                  opr_id f,
                  lag(opr_id,1) over(
                      partition by user_id,convert(log_time,date) order by log_time
                  ) l
              from tracking_log
              ) a
          where f='A' and l='B'
          ) b
      group by date;
    

最近 N 天(小时/周/月)重复行为问题

  • 获取在 48 小时之内重复的记录

      SELECT  *
        
      FROM (
              SELECT b.* ,
                  LAG(b.OperatorTime, 1, b.OperatorTime) OVER ( PARTITION BY b.No ORDER BY b.OperatorTime ) AS BeforTime ,
                  LEAD(b.OperatorTime, 1, b.OperatorTime) OVER ( PARTITION BY b.No ORDER BY b.OperatorTime ) AS NextTime
              FROM Test b
           ) a
      WHERE 
          DATEDIFF(HH, a.BeforTime, a.OperatorTime) < 24
          AND DATEDIFF(HH, a.OperatorTime, a.NextTime) < 24
          AND a.No IN (
                       SELECT   c.No
                       FROM     dbo.Test c
                       GROUP BY c.No
                       HAVING   COUNT(c.No) > 1
                      )
    

NTILE(n)

  • 把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE 返回此行所属的组的编号。
  • 注意:n 必须为 int 类型。

GROUP_CONCAT

  • 示例 1:

    • 有下面的数据表,要求按分数 score 进行分组,并将分组后的学生姓名打印下来;
              
        |id    |subject |student|teacher|score|
        ---------------------------------------
        |1	   |数学    |小红	|王老师	|80   |
        |2	   |数学    |小李	|王老师	|80   |
        |3	   |数学    |小王	|王老师	|70   |
        |4	   |数学    |小张	|王老师	|90   |
        |5	   |数学    |小赵	|王老师	|70   |
        |6	   |数学    |小孙	|王老师	|80   |
        |7	   |数学    |小钱	|王老师	|90   |
        |8	   |数学    |小高	|王老师	|70   |
        |9	   |数学    |小秦	|王老师	|80   |
        |10	   |数学    |小马	|王老师	|90   |
        |11	   |数学    |小朱	|王老师	|90   |
        |12	   |语文    |小高	|李老师	|70   |
        |15	   |语文    |小秦	|李老师	|70   |
        |18	   |语文    |小马	|李老师	|80   |
        |21	   |语文    |小朱	|李老师	|90   |
        |24	   |语文    |小钱	|李老师	|90   |
      
    • 代码:

        select score,group_concat(student) from exam group by score;
      
    • 执行结果为

        |score |group_concat(student)		|
        -------------------------------------
        |70	   |小王,小赵,小高,小高,小秦		|
        |80	   |小红,小李,小孙,小秦,小马		|
        |90	   |小张,小钱,小马,小朱,小朱,小钱	|
      
    • 如果我们需要去重,则需要给函数中加一个distinct参数:
      select score,group_concat(distinct student) from exam group by score;
    
    • 执行结果为:

        |score |group_concat(student)	|
        ---------------------------------
        |70	   |小王,小赵,小高,小秦		|
        |80	   |小红,小李,小孙,小秦,小马	|
        |90	   |小张,小钱,小马,小朱		|
      
  • 示例 2:

    • 统计用户行为序列为 A-B-D 的用户数
      • 其中:A-B 之间可以有任何其他浏览记录(如 C、E 等);
      • B-D 之间除了 C 记录可以有任何其他浏览记录(如 A、E 等)。
          select count(*)
          from(
          select user_id,group_concat(opr_id) ubp
          from tracking_log
          group by user_id
          ) a
          where ubp like '%A%B%D%' and ubp not like '%A%B%C%D%'
        

参考链接

  • https://blog.csdn.net/weixin_34280060/article/details/123130155