Hive Data Query Explained
README.md
1. [Introduction to the Hive Data Warehouse](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive.md)
2. [Installing and Deploying Hive on Linux](https://github.com/heibaiying/BigData-Notes/blob/master/notes/installation/Linux%E7%8E%AF%E5%A2%83%E4%B8%8BHive%E7%9A%84%E5%AE%89%E8%A3%85%E9%83%A8%E7%BD%B2.md)
3. Three Ways to Connect to Hive
4. [Hive Shell Basics](https://github.com/heibaiying/BigData-Notes/blob/master/notes/HiveShell基本使用.md)
5. [Hive Core Concepts](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive核心概念讲解.md)
6. [Common Hive DDL Operations](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive常用DDL操作.md)
7. [Hive Partitioned and Bucketed Tables](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive分区表和分桶表.md)
8. [Hive Views and Indexes](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive数据查询详解.md)
9. [Common Hive DML Operations](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive常用DML操作.md)
10. [Hive Data Query Explained](https://github.com/heibaiying/BigData-Notes/blob/master/notes/Hive数据查询详解.md)

## 3. Spark
notes/Hive常用DML操作.md
When loading file data into a table, Hive does not transform it in any way: the load is a pure copy/move operation that moves the data file into the storage location defined by the Hive table.

```shell
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
```

- The LOCAL keyword means the file is loaded from the local filesystem; if it is omitted, the file is loaded from HDFS:
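The statements below sketch both variants; the local path reuses the emp.txt file from this note's data-preparation section, and the HDFS statement is the example used elsewhere in this note:

```sql
-- load from the local filesystem (LOCAL present)
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp;

-- load from HDFS (LOCAL omitted)
LOAD DATA INPATH "hdfs://hadoop001:8020/mydir/emp.txt" OVERWRITE INTO TABLE emp;
```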
### 2.1 Syntax

```sql
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)
[IF NOT EXISTS]] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
```

+ As of Hive 0.13.0, a table can be made immutable at creation time with TBLPROPERTIES ("immutable"="true"). If an immutable table already contains data, INSERT INTO fails; INSERT OVERWRITE statements are not affected by the `immutable` property (see the sketch below);
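A minimal sketch of the `immutable` property; the table name and columns here are invented for illustration:

```sql
-- hypothetical table, created only to demonstrate the property
CREATE TABLE emp_immutable (empno INT, ename STRING)
TBLPROPERTIES ("immutable"="true");

-- succeeds while emp_immutable is empty, fails once it already holds data
INSERT INTO TABLE emp_immutable SELECT empno, ename FROM emp;

-- always allowed: INSERT OVERWRITE ignores the immutable property
INSERT OVERWRITE TABLE emp_immutable SELECT empno, ename FROM emp;
```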
Hive also supports a multi-insert form, in which a single scan of the source feeds several INSERT clauses:

```sql
FROM from_statement
INSERT OVERWRITE TABLE tablename1
[PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
```
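A sketch of this form, reusing the `emp` and `emp_ptn` tables from these notes (the partition values are assumptions):

```sql
-- scan emp once and feed two partitions of emp_ptn from the same pass
FROM emp
INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno=20)
    SELECT empno, ename, job, mgr, hiredate, sal, comm WHERE deptno = 20
INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno=30)
    SELECT empno, ename, job, mgr, hiredate, sal, comm WHERE deptno = 30;
```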
### 2.2 Dynamic Partition Inserts

```sql
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
select_statement FROM from_statement;

INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...)
select_statement FROM from_statement;
```

When inserting into a partitioned table, the partition column names are required, but their values are optional. If a partition column's value is given, it is called a static partition; otherwise it is a dynamic partition. Dynamic partition columns must come last among the SELECT columns, in the same order as they appear in the PARTITION() clause.
3. Static partition demo: query the employees of department 20 from the `emp` table and insert them into the `emp_ptn` table:

```sql
INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno=20)
SELECT empno,ename,job,mgr,hiredate,sal,comm FROM emp WHERE deptno=20;
```

After it completes, the `emp_ptn` table contains the following data:
4. Dynamic partition demo:

```sql
set hive.exec.dynamic.partition.mode=nonstrict;

-- dynamic partition: the last column of the query becomes the dynamic partition column, i.e. deptno
INSERT OVERWRITE TABLE emp_ptn PARTITION (deptno)
SELECT empno,ename,job,mgr,hiredate,sal,comm,deptno FROM emp WHERE deptno=30;
```

After it completes, the `emp_ptn` table contains the following data:
## 3. Inserting Values with SQL Statements

```sql
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)]
VALUES ( value [, value ...] )
```

+ A value must be provided for every column of the table; inserting into only a subset of the columns is not supported (you can work around this by providing NULL for the columns that should take a default), as in the sketch below;
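A minimal sketch with invented row values, assuming the three-column `dept` table used in these notes:

```sql
-- hypothetical row; a value must be supplied for every column of dept
INSERT INTO TABLE dept VALUES (50, "TRAINING", "SEATTLE");
```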
notes/Hive数据查询详解.md
# Hive Data Query Explained

<nav>
<a href="#1-data-preparation">1. Data Preparation</a><br/>
<a href="#2-single-table-queries">2. Single-Table Queries</a><br/>
<a href="#21-select">2.1 SELECT</a><br/>
<a href="#22-where">2.2 WHERE</a><br/>
<a href="#23-distinct">2.3 DISTINCT</a><br/>
<a href="#24-partition-queries">2.4 Partition Queries</a><br/>
<a href="#25-limit">2.5 LIMIT</a><br/>
<a href="#26-group-by">2.6 GROUP BY</a><br/>
<a href="#27-order-and-sort">2.7 ORDER AND SORT</a><br/>
<a href="#28-having">2.8 HAVING</a><br/>
<a href="#29-distribute-by">2.9 DISTRIBUTE BY</a><br/>
<a href="#210-cluster-by">2.10 CLUSTER BY</a><br/>
<a href="#3-multi-table-joins">3. Multi-Table Joins</a><br/>
<a href="#31-inner-join">3.1 INNER JOIN</a><br/>
<a href="#32-left-outer-join">3.2 LEFT OUTER JOIN</a><br/>
<a href="#33-right-outer-join">3.3 RIGHT OUTER JOIN</a><br/>
<a href="#34-full-outer-join">3.4 FULL OUTER JOIN</a><br/>
<a href="#35-left-semi-join">3.5 LEFT SEMI JOIN</a><br/>
<a href="#36-join">3.6 JOIN</a><br/>
<a href="#4-join-optimization">4. JOIN Optimization</a><br/>
<a href="#41-streamtable">4.1 STREAMTABLE</a><br/>
<a href="#42-mapjoin">4.2 MAPJOIN</a><br/>
<a href="#5-other-uses-of-select">5. Other Uses of SELECT</a><br/>
<a href="#6-local-mode">6. Local Mode</a><br/>
</nav>
## 1. Data Preparation

To demonstrate the query operations, we first create two tables and load data into them.

> The table structures are modeled on Oracle's built-in practice tables, the emp and dept tables. The data files emp.txt and dept.txt can be downloaded from the resources directory of this repository.

### 1.1 Employee Table
```sql
-- create the table
CREATE TABLE emp(
    empno INT,           -- employee number
    ename STRING,        -- employee name
    job STRING,          -- job title
    mgr INT,             -- manager's employee number
    hiredate TIMESTAMP,  -- hire date
    sal DECIMAL(7,2),    -- salary
    comm DECIMAL(7,2),   -- commission
    deptno INT)          -- department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

-- load the data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp;
```
### 1.2 Department Table

```sql
-- create the table
CREATE TABLE dept(
    deptno INT,    -- department number
    dname STRING,  -- department name
    loc STRING     -- city the department is located in
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

-- load the data
LOAD DATA LOCAL INPATH "/usr/file/dept.txt" OVERWRITE INTO TABLE dept;
```
### 1.3 Partitioned Table

We also need an extra partitioned table, mainly to demonstrate partition queries:

```sql
CREATE EXTERNAL TABLE emp_ptn(
    empno INT,
    ename STRING,
    job STRING,
    mgr INT,
    hiredate TIMESTAMP,
    sal DECIMAL(7,2),
    comm DECIMAL(7,2)
)
PARTITIONED BY (deptno INT)  -- partitioned by department number
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

-- load the data
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=20);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=30);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=40);
LOAD DATA LOCAL INPATH "/usr/file/emp.txt" OVERWRITE INTO TABLE emp_ptn PARTITION (deptno=50);
```
## 2. Single-Table Queries

### 2.1 SELECT

```sql
-- query all rows of the table
SELECT * FROM emp;
```
### 2.2 WHERE

```sql
-- employees in department 10 whose employee number is greater than 7782
SELECT * FROM emp WHERE empno > 7782 AND deptno = 10;
```
### 2.3 DISTINCT

Hive supports deduplication with the DISTINCT keyword.

```sql
-- list all job types
SELECT DISTINCT job FROM emp;
```
### 2.4 Partition Queries

Partition-based queries can target a specific partition or a range of partitions.

```sql
-- employees in the partitions whose department number lies in [20, 40]
SELECT emp_ptn.* FROM emp_ptn
WHERE emp_ptn.deptno >= 20 AND emp_ptn.deptno <= 40;
```
### 2.5 LIMIT

```sql
-- the five highest-paid employees
SELECT * FROM emp ORDER BY sal DESC LIMIT 5;
```
### 2.6 GROUP BY

Hive supports grouped aggregation with GROUP BY.

```sql
set hive.map.aggr=true;

-- total salary per department
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno;
```

`hive.map.aggr` controls how the aggregation is carried out. The default is false. When set to true, Hive performs a first, partial aggregation inside the map tasks, which makes aggregation more efficient but consumes more memory.
### 2.7 ORDER AND SORT

Query results can be sorted with ORDER BY or SORT BY. The sort column may be an integer or a string: integers are sorted numerically, strings lexicographically. ORDER BY and SORT BY differ as follows:

+ ORDER BY routes all results through a single reducer, which guarantees a globally ordered result;
+ SORT BY sorts only within each reducer, so each reducer's output is ordered, but there is no global order.

Because an ORDER BY can run for a long time, strict mode (hive.mapred.mode = strict) requires it to be followed by a `LIMIT` clause.

> Note: hive.mapred.mode defaults to nonstrict, i.e. non-strict mode.
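A small sketch of the strict-mode rule; the session-level settings here are shown only for illustration:

```sql
SET hive.mapred.mode=strict;

-- rejected in strict mode: ORDER BY without LIMIT
SELECT * FROM emp ORDER BY sal DESC;

-- accepted: the LIMIT clause bounds the single-reducer sort
SELECT * FROM emp ORDER BY sal DESC LIMIT 10;
```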
```sql
-- employee salaries, ordered by department ascending, then salary descending
SELECT empno, deptno, sal FROM emp ORDER BY deptno ASC, sal DESC;
```
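For contrast, a SORT BY sketch; forcing two reducers is an assumption made here so that the per-reducer ordering becomes visible:

```sql
-- assumed setup: two reducers, so each emits its own sorted output
SET mapred.reduce.tasks=2;

-- each reducer's output is sorted by sal, but the combined result is not globally ordered
SELECT empno, deptno, sal FROM emp SORT BY sal DESC;
```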
### 2.8 HAVING

HAVING filters grouped data.

```sql
-- departments whose total salary exceeds 9000
SELECT deptno,SUM(sal) FROM emp GROUP BY deptno HAVING SUM(sal)>9000;
```
### 2.9 DISTRIBUTE BY

By default, a MapReduce program hashes the keys of the map output and distributes them evenly across all reducers. If you want rows with the same key to be processed by the same reducer, use the DISTRIBUTE BY clause.

Note that DISTRIBUTE BY guarantees that rows with the same key go to the same reducer, but it does not guarantee that the data on each reducer is ordered. Consider the following situation:

Suppose these five rows are sent to two reducers for processing:

```properties
k1
k2
k4
k3
k1
```

Reducer1 receives the following out-of-order data:

```properties
k1
k2
k1
```

Reducer2 receives:

```properties
k4
k3
```

If you want the data on each reducer to be ordered, combine DISTRIBUTE BY with `SORT BY` (as in the example below), or use CLUSTER BY, which is introduced next.

```sql
-- distribute rows to reducers by department, sorting within each reducer
SELECT empno, deptno, sal FROM emp DISTRIBUTE BY deptno SORT BY deptno ASC;
```
### 2.10 CLUSTER BY

If `SORT BY` and `DISTRIBUTE BY` specify the same column and the sort order is ascending, the pair can be replaced with `CLUSTER BY`, which distributes the data by that column and sorts each reducer's output by it.

```sql
SELECT empno, deptno, sal FROM emp CLUSTER BY deptno;
```
## 3. Multi-Table Joins

Hive supports inner joins and outer joins (left, right, and full), as well as Cartesian joins, with the same semantics as in traditional databases. The figure below illustrates these concepts.

One point deserves special emphasis: a JOIN's join condition must be specified with ON, not with WHERE. Otherwise the Cartesian product is computed first and filtered afterwards, and you will not get the result you expect (the demonstration below shows why).

<div align="center"> <img width="700px" src="https://github.com/heibaiying/BigData-Notes/blob/master/pictures/sql-join.jpg"/> </div>
### 3.1 INNER JOIN

```sql
-- details of the employee whose employee number is 7369
SELECT e.*,d.* FROM
emp e JOIN dept d
ON e.deptno = d.deptno
WHERE empno=7369;

-- joining three or more tables uses the following syntax
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
```
### 3.2 LEFT OUTER JOIN

LEFT OUTER JOIN and LEFT JOIN are equivalent.

```sql
-- left join
SELECT e.*,d.*
FROM emp e LEFT OUTER JOIN dept d
ON e.deptno = d.deptno;
```
### 3.3 RIGHT OUTER JOIN

```sql
-- right join
SELECT e.*,d.*
FROM emp e RIGHT OUTER JOIN dept d
ON e.deptno = d.deptno;
```

After the right join, department 40 has no employees, so its employee columns are NULL. This query also backs up the earlier point that join conditions must be given with ON rather than WHERE: if you change the ON to WHERE, department 40 will never show up in the result, because the Cartesian product contains no (NULL, 40) row.

<div align="center"> <img width="700px" src="https://github.com/heibaiying/BigData-Notes/blob/master/pictures/hive-right-join.png"/> </div>
### 3.4 FULL OUTER JOIN

```sql
SELECT e.*,d.*
FROM emp e FULL OUTER JOIN dept d
ON e.deptno = d.deptno;
```
### 3.5 LEFT SEMI JOIN

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery pattern.

+ The right-hand table may only be filtered in the ON clause;
+ The result contains only data from the left table, so only left-table columns may appear in the SELECT list.

```sql
-- all employees who work in New York
SELECT emp.*
FROM emp LEFT SEMI JOIN dept
ON emp.deptno = dept.deptno AND dept.loc="NEW YORK";

-- the statement above is equivalent to
SELECT emp.* FROM emp
WHERE emp.deptno IN (SELECT deptno FROM dept WHERE loc="NEW YORK");
```
### 3.6 JOIN

A plain JOIN without a condition is a Cartesian product join. You will rarely meet it in day-to-day development, and it is expensive to compute; for this reason, Hive refuses to run it in strict mode (hive.mapred.mode = strict).

```sql
SELECT * FROM emp JOIN dept;
```
## 4. JOIN Optimization

### 4.1 STREAMTABLE

When joining multiple tables, if every ON clause uses a common join column (such as `b.key1` below), Hive optimizes the query by running the whole multi-table join as a single map/reduce job. It also assumes that the last table in the query (table c below) is the largest: while joining row by row, it tries to buffer the other tables in memory and then streams through that last table. Users should therefore make sure that the tables in a chained join grow in size from left to right.

```sql
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
```

However, the largest table does not always have to be placed at the end of the query: Hive provides the `/*+ STREAMTABLE() */` hint to mark which table is the largest, for example:

```sql
SELECT /*+ STREAMTABLE(d) */ e.*,d.*
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
```
### 4.2 MAPJOIN

If only one of the tables involved is small, it can simply be loaded into memory. The job then matches the other table's rows against the in-memory table directly in the map phase; because the join happens in the mappers, the reduce phase is skipped entirely, which improves efficiency considerably. Hive provides the `/*+ MAPJOIN() */` hint to mark the small table, for example:

```sql
SELECT /*+ MAPJOIN(d) */ e.*,d.*
FROM emp e JOIN dept d
ON e.deptno = d.deptno
WHERE job='CLERK';
```
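As a side note, newer Hive releases can perform this conversion automatically when one side of the join is small enough; the settings below are a sketch to verify against your version's defaults:

```sql
-- assumed settings; names and defaults may vary by Hive version
SET hive.auto.convert.join=true;                -- convert eligible joins to map joins automatically
SET hive.mapjoin.smalltable.filesize=25000000;  -- size threshold (in bytes) for the small table
```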
## 5. Other Uses of SELECT

Show the current database:

```sql
SELECT current_database()
```
## 6. Local Mode

Most of the statements demonstrated above trigger MapReduce jobs, while a few do not. For example, `select * from emp limit 5` runs no MR job at all: Hive simply reads the data file and formats the output. For the queries that do need MapReduce, execution can take quite a long time; in that case you can enable local mode.

```sql
-- local mode is off by default and must be enabled manually
SET hive.exec.mode.local.auto=true;
```

Once enabled, Hive analyzes the size of each map-reduce job in a query and runs it locally when all of the following hold (a combined sketch of these settings follows the list):

- the job's total input size is below hive.exec.mode.local.auto.inputbytes.max (default 128MB);
- the total number of map tasks is below hive.exec.mode.local.auto.tasks.max (default 4);
- the total number of reduce tasks required is 1 or 0.
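Restating those thresholds as an explicit sketch (the values are simply the defaults quoted above):

```sql
-- a sketch of the settings named above, set to their stated defaults
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128MB
SET hive.exec.mode.local.auto.tasks.max=4;
```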
Because our test data set is small, rerunning the earlier queries that involve MR jobs should now show a significant speed-up.
## References

1. [LanguageManual Select](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select)
2. [LanguageManual Joins](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)
3. [LanguageManual GroupBy](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy)
4. [LanguageManual SortBy](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy)
resources/dept.txt
```
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
```