
# SQL Window Functions

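### 1. What Is a Window Function

#### Basic Idea

A window defines a range: think of it as the set of rows that satisfy certain conditions. A window function is a function evaluated over the rows in that window.

#### Basic Syntax

A window function is marked by the `over` keyword, which specifies the range the function runs over. The specification has three parts: the partitioning clause (`partition by`), the ordering clause (`order by`), and the frame clause (`rows`).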

```sql
<function> over (partition by <partition columns> order by <ordering columns> rows between <frame start> and <frame end>)
```


**Note: MySQL supports window functions only from version 8.0 onward.**


#### Demo Table

| cid (class id) | sname (student name) | score |
| ------------ | ----------------- | ------------- |
| 001          | 张三              | 78            |
| 001          | 李四              | 82            |
| 002          | 小明              | 90            |
| 001          | 王五              | 67            |
| 002          | 小红              | 85            |
| 002          | 小刚              | 62            |

#### Demo Script

```sql
CREATE TABLE SQL_5 (
       cid varchar(4),
       sname varchar(4),
       score int
);

insert into SQL_5 (cid, sname, score) values ('001', '张三', 78);
insert into SQL_5 (cid, sname, score) values ('001', '李四', 82);
insert into SQL_5 (cid, sname, score) values ('002', '小明', 90);
insert into SQL_5 (cid, sname, score) values ('001', '王五', 67);
insert into SQL_5 (cid, sname, score) values ('002', '小红', 85);
insert into SQL_5 (cid, sname, score) values ('002', '小刚', 62);
```


### 2. Determining the Window

Example:

```sql
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;
```

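#### Partitioning Clause (partition by)

To skip partitioning, write `partition by null` or omit the clause altogether.

The clause accepts multiple columns, e.g. `partition by cid, sname`.

**Note the difference between partition by and group by**

1) `partition by` does not collapse rows, whereas `group by` does.

2) With `group by`, the select list can contain only the grouping columns and aggregated columns.

In other words, the result set produced by `group by` differs from the original table in both row count and column count.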
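
A minimal sketch of that contrast, run against the SQL_5 demo table defined earlier. The alias `class_total` is made up for this illustration.

```sql
-- group by: the 6 rows collapse to one row per class
select cid, sum(score) as class_total
from SQL_5
group by cid;
-- 001 -> 227
-- 002 -> 237

-- partition by: all 6 rows are kept, and each row carries its class total
select cid, sname, score,
       sum(score) over (partition by cid) as class_total
from SQL_5;
```

The second query can still select `sname` and `score`, which the `group by` version cannot do without aggregating them.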

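#### Ordering Clause (order by)

To skip ordering, write `order by null` or omit the clause altogether.

`asc` (the default when nothing is written) sorts in ascending order; `desc` sorts in descending order.

The clause accepts multiple columns, e.g. `order by cid, sname`.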

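#### Frame Clause (rows)

The frame boundaries are described as follows:

1) Start row: `N preceding` / `unbounded preceding`

2) Current row: `current row`

3) End row: `N following` / `unbounded following`

Examples:

`rows between unbounded preceding and current row`: from the first row of the partition through the current row

`rows between 2 preceding and current row`: from two rows before the current row through the current row

`rows between current row and unbounded following`: from the current row through the last row of the partition

`rows between current row and 1 following`: from the current row through the next row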
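
As an illustration of a frame that slides with the current row, the sketch below sums each score with its immediate neighbours in score order. It runs against the SQL_5 demo table; the alias `neighbour_sum` is made up for this example.

```sql
-- Within each class, ordered by score, add up the previous row,
-- the current row, and the next row (where they exist).
select cid, sname, score,
       sum(score) over (
           partition by cid
           order by score
           rows between 1 preceding and 1 following
       ) as neighbour_sum
from SQL_5;
-- class 001: 67 -> 145 (67+78), 78 -> 227 (67+78+82), 82 -> 160 (78+82)
-- class 002: 62 -> 147 (62+85), 85 -> 237 (62+85+90), 90 -> 175 (85+90)
```

At the edges of a partition the frame is simply cut short, which is why the first and last rows of each class sum only two values.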

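**Note:**

**When the ordering clause is present but the frame clause is omitted, the default frame is `range between unbounded preceding and current row`. It behaves like `rows between unbounded preceding and current row` as long as the ordering column has no ties; with ties, every row tied with the current row is included as well.**

**When both the ordering clause and the frame clause are omitted, the frame is the entire partition, equivalent to `rows between unbounded preceding and unbounded following`.**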
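
The default frame matters once the ordering column contains ties. In MySQL 8, `order by` without a frame clause is documented to use `range between unbounded preceding and current row`, so tied rows are summed together. The sketch below uses hypothetical inline data (the `demo` CTE and its values are not part of the article's tables) because SQL_5 has no tied scores.

```sql
-- Hypothetical data: rows b and c tie on score = 80.
with demo as (
    select 'a' as sname, 70 as score
    union all select 'b', 80
    union all select 'c', 80
    union all select 'd', 90
)
select sname, score,
       sum(score) over (order by score) as default_frame,
       sum(score) over (order by score
                        rows between unbounded preceding and current row) as rows_frame,
       sum(score) over (order by score
                        range between unbounded preceding and current row) as range_frame
from demo;
-- default_frame and range_frame: 70, 230, 230, 320
-- rows_frame:                    70, 150, 230, 320
--   (which of the two tied rows gets 150 is not deterministic)
```

When a strict row-by-row running total is wanted, writing the `rows` frame explicitly avoids the surprise.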

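#### Overall Flow

1) The `partition by` and `order by` clauses determine the large window, which fixes the upper bound (`unbounded preceding`) and the lower bound (`unbounded following`).

2) The `rows` clause determines a small (sliding) window for each row.

3) The function is evaluated over each row's small window, and the results form a new column.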


### 3. Function Categories

#### Ranking

rank, dense_rank, row_number

```sql
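-- [Ranking]
-- Number the rows within each class; tied rows still get distinct numbers
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' from SQL_5;
-- Rank within each class; tied rows share a rank and the following rank is skipped
select *, rank() over (partition by cid order by score desc) as '跳跃可并列排名' from SQL_5;
-- Rank within each class; tied rows share a rank and no rank is skipped
select *, dense_rank() over (partition by cid order by score desc) as '连续可并列排名' from SQL_5;
-- All three side by side for comparison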
select *, row_number() over (partition by cid order by score desc) as '不可并列排名' ,
          rank() over (partition by cid order by score desc) as '跳跃可并列排名',
          dense_rank() over (partition by cid order by score desc) as '连续可并列排名'
from SQL_5;
```
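
SQL_5 has no tied scores, so the three functions return identical numbers on it. The sketch below uses hypothetical inline data with one tie (the `demo` CTE and the aliases `rn`, `rk`, `drk` are made up for this example) to show where they diverge.

```sql
-- Hypothetical data: rows b and c tie on score = 85.
with demo as (
    select 'a' as sname, 90 as score
    union all select 'b', 85
    union all select 'c', 85
    union all select 'd', 70
)
select sname, score,
       row_number() over (order by score desc) as rn,
       rank()       over (order by score desc) as rk,
       dense_rank() over (order by score desc) as drk
from demo;
-- score 90: rn 1,      rk 1, drk 1
-- score 85: rn 2 or 3, rk 2, drk 2   (both tied rows)
-- score 70: rn 4,      rk 4, drk 3
```

`rank` jumps from 2 to 4 after the tie, `dense_rank` continues with 3, and `row_number` breaks the tie arbitrarily.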

#### Aggregate

sum, avg, count, max, min

```sql
-- [Aggregate]
-- Show every student the total score of their class
select *, sum(score) over (partition by cid) as '班级总分' from SQL_5;
-- Equivalent, with the frame written out explicitly
select *, sum(score) over (partition by cid rows between unbounded preceding and unbounded following) as '班级总分' from SQL_5;

-- Within each class, the running total of each student's score plus the scores of all lower-scoring classmates
select *, sum(score) over (partition by cid order by score) as '累加分数' from SQL_5;
-- Same result here because SQL_5 has no tied scores; rows between ... and ... sets the frame size
select *, sum(score) over (partition by cid order by score rows between unbounded preceding and current row) as '累加分数' from SQL_5;
```
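
The other aggregates work the same way. As a sketch against SQL_5 (the aliases `class_size`, `class_avg`, and `gap_to_top` are made up here), each row can be compared with statistics of its own class:

```sql
select cid, sname, score,
       count(*)   over (partition by cid) as class_size,
       avg(score) over (partition by cid) as class_avg,
       max(score) over (partition by cid) - score as gap_to_top
from SQL_5;
-- class 001: class_size 3, class_avg 227/3 (about 75.67), top score 82
-- class 002: class_size 3, class_avg 237/3 = 79,          top score 90
-- e.g. 张三 (78, class 001) -> gap_to_top = 4
```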

#### Offset (Cross-Row)

lag, lead

```sql
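-- [Offset]
-- lag/lead arguments: 1) the column to read, 2) the offset, 3) the default returned when no such row exists
-- Within each class, the score ranked one place below the current row
select *, lag(score, 1) over (partition by cid order by score) as '低一名的分数' from SQL_5;
-- Same query, but returns 0 instead of NULL when there is no lower-ranked row
select *, lag(score, 1, 0) over (partition by cid order by score) as '低一名的分数' from SQL_5;

-- Within each class, the score ranked two places above the current row
select *, lead(score, 2) over (partition by cid order by score) as '高两名的分数' from SQL_5;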
```
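
A common use of `lag` is computing the difference between a row and its predecessor. A minimal sketch against SQL_5 (the alias `gap_to_lower` is made up for this example):

```sql
-- How far ahead each student is of the classmate ranked one place below.
select cid, sname, score,
       score - lag(score, 1) over (partition by cid order by score) as gap_to_lower
from SQL_5;
-- class 001: 王五 67 -> NULL, 张三 78 -> 11, 李四 82 -> 4
-- class 002: 小刚 62 -> NULL, 小红 85 -> 23, 小明 90 -> 5
```

The lowest-scoring student in each class has no predecessor, so `lag` returns NULL and the subtraction is NULL as well.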


### 4. Practice Problems

#### Table

| cid  | sname | course | score |
| ---- | ----- | ------ | ----- |
| 001  | 张三  | 语文   | 78    |
| 002  | 小刚  | 语文   | 71    |
| 001  | 李四  | 数学   | 56    |
| 002  | 小明  | 数学   | 54    |
| ...  | ...   | ...    | ...   |

#### Script

```sql
CREATE TABLE SQL_6 (
       cid varchar(4),
       sname varchar(4),
       course varchar(10),
       score int
);

insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '语文', 78);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '语文', 71);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '数学', 56);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '数学', 97);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '数学', 54);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '数学', 67);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '数学', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '语文', 80);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '数学', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '语文', 58);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '语文', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '语文', 60);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '英语', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '英语', 50);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '地理', 59);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '地理', 88);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '地理', 45);
insert into SQL_6 (cid, sname, course, score) values ('002', '小刚', '地理', 66);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '地理', 82);
insert into SQL_6 (cid, sname, course, score) values ('001', '王五', '英语', 81);
insert into SQL_6 (cid, sname, course, score) values ('001', '张三', '地理', 77);
insert into SQL_6 (cid, sname, course, score) values ('002', '小明', '英语', 55);
insert into SQL_6 (cid, sname, course, score) values ('002', '小红', '英语', 87);
insert into SQL_6 (cid, sname, course, score) values ('001', '李四', '英语', 61);
```

##### Top N per Group

Problem 1: For each student, find the three records with the highest scores.

```sql
select * from
(
    select *, row_number() over (partition by sname order by score desc) as rn from SQL_6
) temp
where rn <= 3;
```

###### Formula:

```sql
select * from
(
    select *, row_number() over (partition by <partition column> order by <ordering column> desc) as rn from <table>
) as tmp
where rn <= N;
```
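
`row_number` cuts off at exactly N rows per group, so when scores tie at the boundary one of the tied rows is dropped arbitrarily. If tied rows should all be kept, one possible variation (a sketch, not part of the original formula) swaps in `dense_rank`. The example below runs against SQL_6 and uses the made-up alias `rk`.

```sql
-- For each course, keep every record holding one of the two highest distinct scores.
select * from
(
    select *, dense_rank() over (partition by course order by score desc) as rk from SQL_6
) as tmp
where rk <= 2;
-- 数学: 王五 97 (rk 1), 小红 82 (rk 2)
```

With this variant a group can return more than N rows when ties exist, which is the intended difference from `row_number`.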


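##### Summary Analysis

Problem 2: Find the students whose score in every course is above their class's average for that course.

The problem breaks down into the following steps:

1) Compute the average score for each course in each class.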

```sql
with
-- 1) Average score for each course in each class
t1 as
(
select *,
       avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
```

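2) Subtract the class's average for the course from each of the student's course scores; a result greater than 0 means the student scored above the class average in that course.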

```sql
t2 as (
    select *,
           score - avg  as 'del'
    from t1
)
```

3) "Above the class average in every course" means that, for a given student, even the smallest of these differences is greater than 0.

```sql
select sname from t2
group by sname
having min(del) > 0;
```

The combined SQL statement:

```sql
with
t1 as
(
select *,
       avg(score) over (partition by cid, course) as 'avg'
from SQL_6
),
t2 as (
    select *,
           score - avg  as 'del'
    from t1
)
select sname from t2
group by sname
having min(del) > 0;

-- Alternatively, with nested subqueries instead of CTEs
select sname from (
                      select *,
                             score - avg  as 'del'
                      from (
                               select *,
                                      avg(score) over (partition by cid, course) as 'avg'
                               from SQL_6
                           ) t1
                  ) t2
group by sname
having min(del) > 0;
```


#### Table

| empno | ename | hire_date  | salary | dept_no |
| ----- | ----- | ---------- | ------ | ------- |
| 001   | Adam  | 2018-03-01 | 1000   | A       |
| 002   | Bill  | 2021-03-01 | 1200   | A       |
| 003   | Cindy | 2016-03-01 | 1500   | A       |
| 004   | Danny | 2020-03-01 | 5000   | A       |
| 005   | Eason | 2020-03-01 | 4000   | B       |
| 006   | Fred  | 2018-03-01 | 3500   | B       |
| 007   | Gary  | 2017-03-01 | 1800   | B       |
| 008   | Hugo  | 2020-03-01 | 4500   | B       |

#### Script

```sql
CREATE TABLE SQL_7 (
       empno varchar(4),
       ename varchar(10),
       hire_date varchar(10),
       salary int,
       dept_no varchar(2)
);
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('001', 'Adam', '2018-03-01', 1000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('002', 'Bill', '2021-03-01', 1200, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('003', 'Cindy', '2016-03-01', 1500, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('004', 'Danny', '2020-03-01', 5000, 'A');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('005', 'Eason', '2020-03-01', 4000, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('006', 'Fred', '2018-03-01', 3500, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('007', 'Gary', '2017-03-01', 1800, 'B');
insert into SQL_7 (empno, ename, hire_date, salary, dept_no) values ('008', 'Hugo', '2020-03-01', 4500, 'B');

select * from SQL_7;
```

##### Top N per Group

Problem 1: Find the three highest-paid employees in each department.

```sql
select * from
    (
        select *, row_number() over (partition by dept_no order by salary desc) as rn from SQL_7
    ) as tmp
where rn <= 3;
```


##### Summary Analysis

Problem 2: Compute each employee's salary as a percentage of their department's total salary.

```sql
with
t1 as (
    select * , sum(salary) over (partition by dept_no) as 'sum_sal' from SQL_7
),
t2 as (
    select *, round(salary*100/sum_sal,2) as 'percentage' from t1
)
select * from t2;
```

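Problem 3: Sort each department's employees by salary in ascending order, then label the bottom 30% as low tier (低层), those between 30% and 80% as middle tier (中层), and those above 80% as top tier (高层).

```sql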
with
    t1 as (
        select * , row_number() over (partition by dept_no order by salary) as cnt,
               count(empno) over (partition by dept_no) as 'sum' from SQL_7
    ),
    t2 as (
        select *, round(cnt/sum,2) as 'percentage' from t1
    ),
    t3 as (
        select *, case when percentage <= 0.3 then '低层'
                       when percentage <= 0.8 then '中层'
                       when percentage <= 1 then '高层' end as 'label'
        from t2
    )
select empno, ename, hire_date, salary, dept_no, label from t3;
```
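
The ratio of `row_number` to the partition's row count can also be obtained from the built-in `cume_dist` function, which returns the fraction of rows in the partition whose ordering value is less than or equal to the current row's. The sketch below is an alternative under that substitution, not the article's original method; the alias `pct` is made up here. The two approaches agree when salaries within a department are distinct, but `cume_dist` gives tied salaries the same fraction.

```sql
select empno, ename, hire_date, salary, dept_no,
       case when pct <= 0.3 then '低层'
            when pct <= 0.8 then '中层'
            else '高层' end as label
from (
    select *, cume_dist() over (partition by dept_no order by salary) as pct
    from SQL_7
) as t;
-- dept A: Adam 0.25 -> 低层, Bill 0.5 -> 中层, Cindy 0.75 -> 中层, Danny 1 -> 高层
```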

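Problem 4: Count the number of hires in each year and the cumulative number of hires up to and including that year (hires in that year plus the total for all earlier years).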

```sql
select year(hire_date) as hire_year, count(empno) as cnt
    from SQL_7
    group by year(hire_date) order by hire_year;
```

```sql
with t1 as (
    select year(hire_date) as hire_year, count(empno) as cnt from SQL_7 group by year(hire_date)
)
-- order by inside over() makes the running total accumulate in year order
select *, sum(cnt) over (order by hire_year rows between unbounded preceding and current row) as cumulative_cnt from t1 order by hire_year;
```
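
Because window functions are evaluated after `group by`, the grouping and the running total can also be combined in one query by applying the window `sum` to the aggregate `count`. This is a sketch of that single-query form against SQL_7; the alias `running_total` is made up for this example.

```sql
select year(hire_date) as hire_year,
       count(empno) as cnt,
       -- window sum over the per-year counts, accumulated in year order
       sum(count(empno)) over (order by year(hire_date)) as running_total
from SQL_7
group by year(hire_date)
order by hire_year;
-- 2016 -> cnt 1, running_total 1
-- 2017 -> cnt 1, running_total 2
-- 2018 -> cnt 2, running_total 4
-- 2020 -> cnt 3, running_total 7
-- 2021 -> cnt 1, running_total 8
```

The CTE version is easier to read step by step; this form trades that for brevity.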


### 5. Tips

1) Top-N-per-group formula

```sql
select * from
(
    select *, row_number() over (partition by <partition column> order by <ordering column> desc) as rn from <table>
) as tmp
where rn <= N;
```

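2) Window function -> generates a helper column (comparable to a temporary variable in a general-purpose language)

3) `with` statement -> generates a temporary table (comparable to a local function in a general-purpose language)

   Break a complex problem into several subproblems and express each one as a temporary table.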