Hadoop Application Architectures (.pdf)
Translator's Preface
Foreword
Preface
Part I. Architectural Considerations for Hadoop Applications
Chapter 1. Data Modeling in Hadoop
1.1 Data Storage Options
1.1.1 Standard File Formats
1.1.2 Hadoop File Types
1.1.3 Serialization Formats
1.1.4 Columnar Formats
1.1.5 Compression
1.2 HDFS Schema Design
1.2.1 Location of Files in HDFS
1.2.2 Advanced HDFS Schema Design
1.2.3 HDFS Schema Design Summary
1.3 HBase Schema Design
1.3.1 Row Key
1.3.2 Timestamp
1.3.3 Hops
1.3.4 Tables and Regions
1.3.5 Using Columns
Exclusive to Turing Community member largelove (largelove@163.com). Please respect copyright.
1.3.6 Column Families
1.3.7 TTL
1.4 Managing Metadata
1.4.1 What Is Metadata?
1.4.2 Why Metadata Matters
1.4.3 Where to Store Metadata
1.4.4 Examples of Managing Metadata
1.4.5 Limitations of the Hive Metastore and HCatalog
1.4.6 Other Ways of Storing Metadata
1.5 Conclusion
Chapter 2. Data Movement in Hadoop
2.1 Data Ingestion Considerations
2.1.1 Timeliness of Data Ingestion
2.1.2 Incremental Updates
2.1.3 Access Patterns
2.1.4 Source Systems and Data Structures
2.1.5 Transformations
2.1.6 Network Bottlenecks
2.1.7 Network Security
2.1.8 Push versus Pull
2.1.9 Failure Handling
2.1.10 Complexity
2.2 Data Ingestion Options
2.2.1 File Transfers
2.2.2 Considerations for File Transfers versus Other Ingestion Methods
2.2.3 Sqoop: Batch Transfer Between Hadoop and Relational Databases
2.2.4 Flume: Event-Based Data Collection and Processing
2.2.5 Kafka
2.3 Data Export
2.4 Summary
Chapter 3. Data Processing in Hadoop
3.1 MapReduce
3.1.1 MapReduce Overview
3.1.2 MapReduce Example
3.1.3 MapReduce Use Cases
3.2 Spark
3.2.1 Spark Overview
3.2.2 Overview of Spark Components
3.2.3 Basic Spark Concepts
3.2.4 Benefits of Spark
3.2.5 Spark Example
3.2.6 Spark Use Cases
3.3 Abstractions
3.3.1 Pig
3.3.2 Pig Example
3.3.3 Pig Use Cases
3.4 Crunch
3.4.1 Crunch Example
3.4.2 Crunch Use Cases
3.5 Cascading
3.5.1 Cascading Example
3.5.2 Cascading Use Cases
3.6 Hive
3.6.1 Hive Overview
3.6.2 Hive Example
3.6.3 Hive Use Cases
3.7 Impala
3.7.1 Impala Overview
3.7.2 Speed-Oriented Design
3.7.3 Impala Example
3.7.4 Impala Use Cases
3.8 Summary
Chapter 4. Common Hadoop Processing Patterns
4.1 Pattern 1: Removing Duplicate Records by Primary Key
4.1.1 Data Generation for the Deduplication Example
4.1.2 Code Example: Spark Deduplication in Scala
4.1.3 Code Example: Deduplication in SQL
4.2 Pattern 2: Windowing Analysis
4.2.1 Data Generation for the Windowing Analysis Example
4.2.2 Code Example: Peaks and Valleys in Spark
4.2.3 Code Example: Peaks and Valleys in SQL
4.3 Pattern 3: Time Series Updates
4.3.1 Using HBase Versioning
4.3.2 Using the Record Key and Start Time as the HBase Row Key
4.3.3 Rewriting HDFS Data to Update the Whole Table
4.3.4 Using Partitions on HDFS for Current and Historical Records
4.3.5 Data Generation for the Time Series Example
4.3.6 Code Example: Updating Time Series Data in Spark
4.3.7 Code Example: Updating Time Series Data in SQL
4.4 Summary
Chapter 5. Graph Processing on Hadoop
5.1 What Is a Graph?
5.2 What Is Graph Processing?
5.3 Graph Processing in Distributed Systems
5.3.1 The Bulk Synchronous Parallel Model
5.3.2 BSP by Example
5.4 Giraph
5.4.1 Reading and Partitioning the Data
5.4.2 Batch Processing the Graph with BSP
5.4.3 Writing the Graph Back to Disk
5.4.4 Overall Flow Control
5.4.5 When to Use Giraph
5.5 GraphX
5.5.1 Another Kind of RDD
5.5.2 The GraphX Pregel Interface
5.5.3 vprog()
5.5.4 sendMessage()
5.5.5 mergeMessage()
5.6 Choosing a Tool
5.7 Summary
Chapter 6. Orchestration
6.1 Why Workflow Orchestration Is Needed
6.2 The Limits of Scripting
6.3 Enterprise Job Schedulers and Hadoop
6.4 Workflow Frameworks in the Hadoop Ecosystem
6.5 Oozie Terminology
6.6 Oozie Overview
6.7 Oozie Workflows
6.8 Workflow Patterns
6.8.1 Point-to-Point Workflow
6.8.2 Fan-Out Workflow
6.8.3 Branching Decision Workflow
6.9 Parameterizing Workflows
6.10 Classpath Definition
6.11 Scheduling Patterns
6.11.1 Frequency Scheduling
6.11.2 Time and Data Triggers
6.12 Executing Workflows
6.13 Summary
Chapter 7. Near-Real-Time Processing with Hadoop
7.1 Stream Processing
7.2 Apache Storm
7.2.1 Storm High-Level Architecture
7.2.2 Storm Topologies
7.2.3 Tuples and Streams
7.2.4 Spouts and Bolts
7.2.5 Stream Groupings
7.2.6 Reliability of Storm Applications
7.2.7 Exactly-Once Processing
7.2.8 Fault Tolerance
7.2.9 Integrating Storm with HDFS
7.2.10 Integrating Storm with HBase
7.2.11 Storm Example: Simple Moving Average
7.2.12 Evaluating Storm
7.3 The Trident Interface
7.3.1 Trident Example: Simple Moving Average
7.3.2 Evaluating Trident
7.4 Spark Streaming
7.4.1 Spark Streaming Overview
7.4.2 Spark Streaming Example: Simple Sum
7.4.3 Spark Streaming Example: Multiple Inputs
7.4.4 Spark Streaming Example: Maintaining State
7.4.5 Spark Streaming Example: Window Functions
7.4.6 Spark Streaming Example: Streaming versus ETL Code
7.4.7 Evaluating Spark Streaming
7.5 Flume Interceptors
7.6 Choosing a Tool
7.6.1 Low-Latency Enrichment, Validation, Alerting, and Ingestion
7.6.2 NRT Techniques, Rolling Averages, and Iterative Processing
7.6.3 Complex Data Flows
7.7 Summary
Part II. Case Studies
Chapter 8. Clickstream Analysis
8.1 Defining the Use Case
8.2 Using Hadoop for Clickstream Analysis
8.3 Design Overview
8.4 Data Storage
8.5 Data Ingestion
8.5.1 The Client Tier
8.5.2 The Collector Tier
8.6 Data Processing
8.6.1 Data Deduplication
8.6.2 Sessionization
8.7 Data Analysis
8.8 Orchestration
8.9 Summary
Chapter 9. Fraud Detection
9.1 Continuous Improvement
9.2 Taking Action
9.3 Architectural Requirements of Fraud Detection Systems
9.4 Introducing the Use Case
9.5 Architecture Design
9.6 Client Architecture
9.7 Profile Storage and Retrieval
9.7.1 Caching
9.7.2 HBase Data Definition
9.7.3 Updating Transaction Status: Approved or Denied
9.8 Data Ingestion
9.9 Near-Real-Time Processing and Exploratory Analytics
9.10 Near-Real-Time Processing
9.11 Exploratory Analytics
9.12 Comparison with Other Architectures
9.12.1 Flume Interceptors
9.12.2 Kafka to Storm or Spark Streaming
9.12.3 Extended Business Rules Engine
9.13 Summary
Chapter 10. Data Warehouse
10.1 Using Hadoop to Build a Data Warehouse
10.2 Defining the Use Case
10.3 OLTP Schema
10.4 Data Warehouse: Introduction to Terminology
10.5 Data Warehousing with Hadoop in Practice
10.6 Architecture Design
10.6.1 Data Modeling and Storage
10.6.2 Data Ingestion
10.6.3 Data Processing and Access
10.6.4 Data Aggregation
10.6.5 Data Export
10.6.6 Workflow Scheduling
10.7 Summary
Appendix A. Joins in Impala
About the Authors
About the Cover