article/SQL-窗口函数.md
2023-06-04 15:59:11 +08:00

13 KiB
Raw Blame History

SQL窗口函数

一. 什么是窗口函数

基本含义

窗口限定一个范围,它可以理解为满足某些条件的记录集合,窗口函数也就是在窗口范围内执行的函数。

基本语法

窗口函数有over关键字指定函数执行的范围可分为三部分分组子句(partition by),排序子句(order by),窗口子句(rows)

<函数名> over (partition by <分组的列> order by <排序的列> rows between <起始行> and <终止行>)

注意Mysql8才支持窗口函数

演示表格

cid班级id) sname学生姓名 score分数
001 张三 78
001 李四 82
002 小明 90
001 王五 67
002 小红 85
002 小刚 62

演示脚本

CREATE TABLE SQL_5 (
       cid varchar(4),
       sname varchar(4),
       score int
);

insert into SQL_5 (cid, sname, score) values ('001', '张三', 78);
insert into SQL_5 (cid, sname, score) values ('001', '李四', 82);
insert into SQL_5 (cid, sname, score) values ('002', '小明', 90);
insert into SQL_5 (cid, sname, score) values ('001', '王五', 67);
insert into SQL_5 (cid, sname, score) values ('002', '小红', 85);
insert into SQL_5 (cid, sname, score) values ('002', '小刚', 62);

二. 窗口的确定

例子:

select *, sum(score) over (partition by cid order by score rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;

分组子句(partition by)

不分组可以写成partition by null或者直接不写

后面可以跟多个列, 如 partition by cid, sname

注意 partition by与group by的区别

1前者不会压缩行数但是后者会

2后者只能选取分组的列和聚合的列

也就是说group by 后生成的结果集与原表的行数和列数都不同

排序子句(order by)

不排序可以写成order by null 或者直接不写

asc或不写表示升序desc表示降序

后面可以跟多个列, 如 order by cid, sname

窗口子句(rows)

窗口子句的描述

  1. 起始行: N preceding/unbounded preceding

  2. 当前行: current row

  3. 终止行: N following/unbounded following

举例:

rows between unbounded preceding and current row 从之前所有的行到当前行

rows between 2 preceding and current row 从前面两行到当前行

rows between current row and unbounded following 从当前行到之后所有的行

rows between current row and 1following 从当前行到后面一行

注意:

排序子句后面缺少窗口子句,窗口规范默认是 rows between unbounded preceding and current row

排序子句和窗口子句都缺失,窗口规范默认是 rows between unbounded preceding and unbounded following

总体流程

1 通过partition by 和 order by 子句确定大窗口( 定义出上界unbounded preceding和下界unbounded following

2 通过row 子句针对每一行数据确定小窗口(滑动窗口)

3 对每行的小窗口内的数据执行函数并生成新的列

三. 函数分类

排序类

rank, dense_rank, row_number

-- 【排序类】
-- 按班级分组后打上序号 不考虑并列
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' from SQL_5;
-- 按班级分组后作跳跃排名 考虑并列
select *, rank() over (partition by cid order by score desc) as '跳跃可并列排名' from SQL_5;
-- 按班级分组后作连续排名 考虑并列
select *, dense_rank() over (partition by cid order by score desc) as '连续可并列排名' from SQL_5;
-- 合并起来对比
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' ,
          rank() over (partition by cid order by score desc) as '跳跃可并列排名',
          dense_rank() over (partition by cid order by score desc) as '连续可并列排名'
from SQL_5;

聚合类

sum. avg, count, max, min

-- 【聚合类】
-- 让同一班级每个学生都知道班级总分是多少
select *, sum(score) over (partition by cid) as '班级总分' from SQL_5;
-- 或者可以写成
select *, sum(score) over (partition by cid rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;

-- 计算同一班级,每个同学和比他分数低的同学的累计总分是多少
select *, sum(score) over (partition by cid order by score) '累加分数' from SQL_5;
-- 或者可以写成 其中rows between ... and 是规定窗口大小
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and current row) as '累加分数' from SQL_5;

跨行类

lag, lead

-- 【跨行类】
-- lag/lead 函数 参数1:比较的列 参数2: 偏移量 参数3:找不到的默认值
-- 同一班级内,成绩比自己低一名的分数是多少
select *, lag(score, 1) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- 或者写成
select *, lag(score, 1, 0) over (partition by cid order by score) as '低一名的分数' from SQL_5;

-- 同一班级内成绩比自己高2名的分数是多少
select *, lead(score, 2) over (partition by cid order by score) as '高两名的分数' from SQL_5;

四. 相关题目

表格

cid sname course score
001 张三 语文 78
002 小刚 语文 71
001 李四 数学 56
002 小明 数学 54
... ... ... ...

脚本

CREATE TABLE SQL_6 (
       cid varchar(4),
       sname varchar(4),
       course varchar(10),
       score int
);

insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '语文', 78);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '语文', 71);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '数学', 56);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '数学', 97);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '数学', 54);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '数学', 67);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '数学', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '语文', 80);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '数学', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '语文', 58);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '语文', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '语文', 60);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '英语', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '英语', 50);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '地理', 59);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '地理', 88);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '地理', 45);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '地理', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '地理', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '英语', 81);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '地理', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '英语', 55);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '英语', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '英语', 61);
分组内topN

问题1求出每个学生成绩最高的三条记录

select * from 
(
    select *, row_number() over (partition by sname order by score desc) as rn from SQL_6   
) temp
where rn <= 3
公式:
select * from
(
	select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;
汇总分析

问题2找出每门课程都高于班级课程平均分的学生

可以拆解成以下几个问题:

1求出每个班级每门课程的平均分

with
-- 1) 求出每个班级,每门课程的平均分
t1 as
(
select *,
       avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),

2将学生每门课程的成绩与所在班级的对应课程平均分相减结果大于0就说明该学生的这门成绩高于课程平均分

t2 as (
    select *,
           score - avg  as 'del'
    from t1
)

3“找出每门课程都高于班级课程平均分的学生”说明对于学生来说最小的“相减结果”都是大于0的

select sname from t2
group by sname
having min(del) > 0;

合并后的SQL语句

with
t1 as
(
select *,
       avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
t2 as (
    select *,
           score - avg  as 'del'
    from t1
)
select sname from t2
group by sname
having min(del) > 0;

-- 或者
select sname from (
                      select *,
                             score - avg  as 'del'
                      from (
                               select *,
                                      avg(score) over (partition by cid, course) as 'avg'
                               from SQL_6
                           ) t1
                  ) t2
group by sname
having min(del) > 0;

表格

empno ename hire_date salary dept_no
001 Adam 2018-03-01 1000 A
002 Bill 2021-03-01 1200 A
003 Cindy 2016-03-01 1500 A
004 Danny 2020-03-01 5000 A
005 Eason 2020-03-01 4000 B
006 Fred 2018-03-01 3500 B
007 Gary 2017-03-01 1800 B
008 Hugo 2020-03-01 2500 B

脚本

CREATE TABLE SQL_7 (
       empno varchar(4),
       ename varchar(10),
       hire_date varchar(10),
       salary int,
       dept_no varchar(2)
);
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('001', 'Adam', '2018-03-01', 1000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('002', 'Bill', '2021-03-01', 1200, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('003', 'Cindy', '2016-03-01', 1500, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('004', 'Danny', '2020-03-01', 5000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('005', 'Eason', '2020-03-01', 4000, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('006', 'Fred', '2018-03-01', 3500, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('007', 'Gary', '2017-03-01', 1800, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('008', 'Hugo', '2020-03-01', 4500, 'B');

select * from SQL_7;
分组内topN

问题一:求出每个部门工资最高的前三名员工

select * from
    (
        select *, row_number() over (partition by dept_no order by salary desc) as rn from SQL_7
    ) as tmp
where rn <= 3;
汇总分析

问题二:计算这些员工的工资占所属部门总工资的百分比

with
t1 as (
    select * , sum(salary) over (partition by dept_no) as 'sum_sal' from SQL_7
),
t2 as (
    select *, round(salary*100/sum_sal,2) as 'percentage' from t1
)
select * from t2;

问题三对各部门员工的工资进行从小到大排序排名前30%为低层30%-80%为中层高于80%为高层,并打上标签

with
    t1 as (
        select * , row_number() over (partition by dept_no order by salary) as cnt,
               count(empno) over (partition by dept_no) as 'sum' from SQL_7
    ),
    t2 as (
        select *, round(cnt/sum,2) as 'percentage' from t1
    ),
    t3 as (
        select *, case when percentage <= 0.3 then '低层'
                       when percentage <= 0.8 then '中层'
                       when percentage <= 1 then '高层' end as 'label'
        from t2
    )
select empno, ename, hire_date, salary, dept_no, label from t3;

问题四:统计每年入职总数以及截至本年累计入职总人数(本年总入职人数 + 本年之前所有年的总入职人数之和)

select year(hire_date) as hire_year, count(empno) as cnt
    from SQL_7
    group by year(hire_date) order by hire_year;
with t1 as (
    select year(hire_date) as hire_year, count(empno) as cnt from SQL_7 group by year(hire_date) order by hire_year
)

select *, sum(cnt) over(partition by null rows between unbounded preceding and current row) as sum from t1;

五. 技巧

1分组内topN公式

select * from
(
	select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;

2 窗口函数 -> 生成辅助列(相当于高级语言的临时变量)

  1. with 语句 -> 生成临时表(相当于高级语言的局部方法)

把复杂的问题拆分成多个子问题并用临时表去表达