article/SQL-窗口函数.md
2023-06-04 15:59:11 +08:00

459 lines
13 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# SQL窗口函数
### 一. 什么是窗口函数
#### 基本含义
窗口限定一个范围,它可以理解为满足某些条件的记录集合,窗口函数也就是在窗口范围内执行的函数。
#### 基本语法
窗口函数有over关键字指定函数执行的范围可分为三部分分组子句(partition by),排序子句(order by),窗口子句(rows)
```sql
<函数名> over (partition by <分组的列> order by <排序的列> rows between <起始行> and <终止行>)
```
**注意Mysql8才支持窗口函数**
#### 演示表格
| cid班级id) | sname学生姓名 | score分数 |
| ------------ | ----------------- | ------------- |
| 001 | 张三 | 78 |
| 001 | 李四 | 82 |
| 002 | 小明 | 90 |
| 001 | 王五 | 67 |
| 002 | 小红 | 85 |
| 002 | 小刚 | 62 |
#### 演示脚本
```sql
CREATE TABLE SQL_5 (
cid varchar(4),
sname varchar(4),
score int
);
insert into SQL_5 (cid, sname, score) values ('001', '张三', 78);
insert into SQL_5 (cid, sname, score) values ('001', '李四', 82);
insert into SQL_5 (cid, sname, score) values ('002', '小明', 90);
insert into SQL_5 (cid, sname, score) values ('001', '王五', 67);
insert into SQL_5 (cid, sname, score) values ('002', '小红', 85);
insert into SQL_5 (cid, sname, score) values ('002', '小刚', 62);
```
### 二. 窗口的确定
例子:
```sql
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;
```
#### 分组子句(partition by)
不分组可以写成partition by null或者直接不写
后面可以跟多个列, 如 partition by cid, sname
**注意 partition by与group by的区别**
1前者不会压缩行数但是后者会
2后者只能选取分组的列和聚合的列
也就是说group by 后生成的结果集与原表的行数和列数都不同
#### 排序子句(order by)
不排序可以写成order by null 或者直接不写
asc或不写表示升序desc表示降序
后面可以跟多个列, 如 order by cid, sname
#### 窗口子句(rows)
窗口子句的描述
1) 起始行: N preceding/unbounded preceding
2) 当前行: current row
3) 终止行: N following/unbounded following
举例:
rows between unbounded preceding and current row 从之前所有的行到当前行
rows between 2 preceding and current row 从前面两行到当前行
rows between current row and unbounded following 从当前行到之后所有的行
rows between current row and 1following 从当前行到后面一行
**注意:**
**排序子句后面缺少窗口子句,窗口规范默认是 rows between unbounded preceding and current row**
**排序子句和窗口子句都缺失,窗口规范默认是 rows between unbounded preceding and unbounded following**
#### 总体流程
1 通过partition by 和 order by 子句确定大窗口( 定义出上界unbounded preceding和下界unbounded following
2 通过row 子句针对每一行数据确定小窗口(滑动窗口)
3 对每行的小窗口内的数据执行函数并生成新的列
### 三. 函数分类
#### 排序类
rank, dense_rank, row_number
```sql
-- 【排序类】
-- 按班级分组后打上序号 不考虑并列
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' from SQL_5;
-- 按班级分组后作跳跃排名 考虑并列
select *, rank() over (partition by cid order by score desc) as '跳跃可并列排名' from SQL_5;
-- 按班级分组后作连续排名 考虑并列
select *, dense_rank() over (partition by cid order by score desc) as '连续可并列排名' from SQL_5;
-- 合并起来对比
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' ,
rank() over (partition by cid order by score desc) as '跳跃可并列排名',
dense_rank() over (partition by cid order by score desc) as '连续可并列排名'
from SQL_5;
```
#### 聚合类
sum. avg, count, max, min
```sql
-- 【聚合类】
-- 让同一班级每个学生都知道班级总分是多少
select *, sum(score) over (partition by cid) as '班级总分' from SQL_5;
-- 或者可以写成
select *, sum(score) over (partition by cid rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;
-- 计算同一班级,每个同学和比他分数低的同学的累计总分是多少
select *, sum(score) over (partition by cid order by score) '累加分数' from SQL_5;
-- 或者可以写成 其中rows between ... and 是规定窗口大小
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and current row) as '累加分数' from SQL_5;
```
#### 跨行类
lag, lead
```sql
-- 【跨行类】
-- lag/lead 函数 参数1:比较的列 参数2: 偏移量 参数3:找不到的默认值
-- 同一班级内,成绩比自己低一名的分数是多少
select *, lag(score, 1) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- 或者写成
select *, lag(score, 1, 0) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- 同一班级内成绩比自己高2名的分数是多少
select *, lead(score, 2) over (partition by cid order by score) as '高两名的分数' from SQL_5;
```
### 四. 相关题目
#### 表格
| cid | sname | course | score |
| ---- | ----- | ------ | ----- |
| 001 | 张三 | 语文 | 78 |
| 002 | 小刚 | 语文 | 71 |
| 001 | 李四 | 数学 | 56 |
| 002 | 小明 | 数学 | 54 |
| ... | ... | ... | ... |
#### 脚本
```sql
CREATE TABLE SQL_6 (
cid varchar(4),
sname varchar(4),
course varchar(10),
score int
);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '语文', 78);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '语文', 71);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '数学', 56);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '数学', 97);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '数学', 54);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '数学', 67);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '数学', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '语文', 80);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '数学', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '语文', 58);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '语文', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '语文', 60);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '英语', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '英语', 50);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '地理', 59);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '地理', 88);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '地理', 45);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '地理', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '地理', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '英语', 81);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '地理', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '英语', 55);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '英语', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '英语', 61);
```
##### 分组内topN
问题1求出每个学生成绩最高的三条记录
```sql
select * from
(
select *, row_number() over (partition by sname order by score desc) as rn from SQL_6
) temp
where rn <= 3
```
###### 公式:
```sql
select * from
(
select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;
```
##### 汇总分析
问题2找出每门课程都高于班级课程平均分的学生
可以拆解成以下几个问题:
1求出每个班级每门课程的平均分
```sql
with
-- 1) 求出每个班级,每门课程的平均分
t1 as
(
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
```
2将学生每门课程的成绩与所在班级的对应课程平均分相减结果大于0就说明该学生的这门成绩高于课程平均分
```sql
t2 as (
select *,
score - avg as 'del'
from t1
)
```
3“找出每门课程都高于班级课程平均分的学生”说明对于学生来说最小的“相减结果”都是大于0的
```sql
select sname from t2
group by sname
having min(del) > 0;
```
合并后的SQL语句
```sql
with
t1 as
(
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
t2 as (
select *,
score - avg as 'del'
from t1
)
select sname from t2
group by sname
having min(del) > 0;
-- 或者
select sname from (
select *,
score - avg as 'del'
from (
select *,
avg(score) over (partition by cid, course) as 'avg'
from SQL_6
) t1
) t2
group by sname
having min(del) > 0;
```
#### 表格
| empno | ename | hire_date | salary | dept_no |
| ----- | ----- | ---------- | ------ | ------- |
| 001 | Adam | 2018-03-01 | 1000 | A |
| 002 | Bill | 2021-03-01 | 1200 | A |
| 003 | Cindy | 2016-03-01 | 1500 | A |
| 004 | Danny | 2020-03-01 | 5000 | A |
| 005 | Eason | 2020-03-01 | 4000 | B |
| 006 | Fred | 2018-03-01 | 3500 | B |
| 007 | Gary | 2017-03-01 | 1800 | B |
| 008 | Hugo | 2020-03-01 | 2500 | B |
#### 脚本
```sql
CREATE TABLE SQL_7 (
empno varchar(4),
ename varchar(10),
hire_date varchar(10),
salary int,
dept_no varchar(2)
);
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('001', 'Adam', '2018-03-01', 1000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('002', 'Bill', '2021-03-01', 1200, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('003', 'Cindy', '2016-03-01', 1500, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('004', 'Danny', '2020-03-01', 5000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('005', 'Eason', '2020-03-01', 4000, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('006', 'Fred', '2018-03-01', 3500, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('007', 'Gary', '2017-03-01', 1800, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('008', 'Hugo', '2020-03-01', 4500, 'B');
select * from SQL_7;
```
##### 分组内topN
问题一:求出每个部门工资最高的前三名员工
```sql
select * from
(
select *, row_number() over (partition by dept_no order by salary desc) as rn from SQL_7
) as tmp
where rn <= 3;
```
##### 汇总分析
问题二:计算这些员工的工资占所属部门总工资的百分比
```sql
with
t1 as (
select * , sum(salary) over (partition by dept_no) as 'sum_sal' from SQL_7
),
t2 as (
select *, round(salary*100/sum_sal,2) as 'percentage' from t1
)
select * from t2;
```
问题三对各部门员工的工资进行从小到大排序排名前30%为低层30%-80%为中层高于80%为高层,并打上标签
```label
with
t1 as (
select * , row_number() over (partition by dept_no order by salary) as cnt,
count(empno) over (partition by dept_no) as 'sum' from SQL_7
),
t2 as (
select *, round(cnt/sum,2) as 'percentage' from t1
),
t3 as (
select *, case when percentage <= 0.3 then '低层'
when percentage <= 0.8 then '中层'
when percentage <= 1 then '高层' end as 'label'
from t2
)
select empno, ename, hire_date, salary, dept_no, label from t3;
```
问题四:统计每年入职总数以及截至本年累计入职总人数(本年总入职人数 + 本年之前所有年的总入职人数之和)
```sql
select year(hire_date) as hire_year, count(empno) as cnt
from SQL_7
group by year(hire_date) order by hire_year;
```
```sql
with t1 as (
select year(hire_date) as hire_year, count(empno) as cnt from SQL_7 group by year(hire_date) order by hire_year
)
select *, sum(cnt) over(partition by null rows between unbounded preceding and current row) as sum from t1;
```
### 五. 技巧
1分组内topN公式
```sql
select * from
(
select *, row_number() over (partition by 分组列 order by 比较列) as rn from table
) as tmp
where rn <= N;
```
2 窗口函数 -> 生成辅助列(相当于高级语言的临时变量)
3) with 语句 -> 生成临时表(相当于高级语言的局部方法)
把复杂的问题拆分成多个子问题并用临时表去表达