dmz Community (dmz社区)


Views: 1703 | Replies: 17

[Big Data & Cloud Computing] Hadoop Application Architectures (.pdf)


    Posted on 2022-8-14 22:00:08

    All resources on this site are free. Reply to view the download link!


    Translator's Preface
    Foreword
    Preface
    Part I: Architectural Considerations for Hadoop Applications
    Chapter 1 Hadoop Data Modeling
    1.1 Data Storage Options
    1.1.1 Standard File Formats
    1.1.2 Hadoop File Types
    1.1.3 Serialization Formats
    1.1.4 Columnar Formats
    1.1.5 Compression
    1.2 HDFS Schema Design
    1.2.1 Location of Files in HDFS
    1.2.2 Advanced HDFS Schema Design
    1.2.3 HDFS Schema Design Summary
    1.3 HBase Schema Design
    1.3.1 Row Key
    1.3.2 Timestamp
    1.3.3 Hops
    1.3.4 Tables and Regions
    1.3.5 Using Columns
    1.3.6 Column Families
    1.3.7 TTL
    1.4 Managing Metadata
    1.4.1 What Is Metadata
    1.4.2 Why Metadata Matters
    1.4.3 Where to Store Metadata
    1.4.4 Examples of Managing Metadata
    1.4.5 Limitations of the Hive Metastore and HCatalog
    1.4.6 Other Ways of Storing Metadata
    1.5 Conclusion
    Chapter 2 Hadoop Data Movement
    2.1 Data Ingestion Considerations
    2.1.1 Timeliness of Data Ingestion
    2.1.2 Incremental Updates
    2.1.3 Access Patterns
    2.1.4 Source Systems and Data Structures
    2.1.5 Transformations
    2.1.6 Network Bottlenecks
    2.1.7 Network Security
    2.1.8 Push versus Pull
    2.1.9 Error Handling
    2.1.10 Complexity
    2.2 Data Ingestion Options
    2.2.1 File Transfers
    2.2.2 Considerations for File Transfers versus Other Ingestion Methods
    2.2.3 Sqoop: Batch Transfer Between Hadoop and Relational Databases
    2.2.4 Flume: Event-Based Data Collection and Processing
    2.2.5 Kafka
    2.3 Data Export
    2.4 Summary
    Chapter 3 Hadoop Data Processing
    3.1 MapReduce
    3.1.1 MapReduce Overview
    3.1.2 MapReduce Example
    3.1.3 MapReduce Use Cases
    3.2 Spark
    3.2.1 Spark Overview
    3.2.2 Overview of Spark Components
    3.2.3 Basic Spark Concepts
    3.2.4 Benefits of Spark
    3.2.5 Spark Example
    3.2.6 Spark Use Cases
    3.3 Abstraction Layers
    3.3.1 Pig
    3.3.2 Pig Example
    3.3.3 Pig Use Cases
    3.4 Crunch
    3.4.1 Crunch Example
    3.4.2 Crunch Use Cases
    3.5 Cascading
    3.5.1 Cascading Example
    3.5.2 Cascading Use Cases
    3.6 Hive
    3.6.1 Hive Overview
    3.6.2 Hive Example
    3.6.3 Hive Use Cases
    3.7 Impala
    3.7.1 Impala Overview
    3.7.2 Design for High-Speed Queries
    3.7.3 Impala Example
    3.7.4 Impala Use Cases
    3.8 Summary
    Chapter 4 Common Hadoop Data Processing Patterns
    4.1 Pattern 1: Removing Duplicate Records by Primary Key
    4.1.1 Generating Test Data for the Deduplication Example
    4.1.2 Code Example: Spark Deduplication in Scala
    4.1.3 Code Example: Deduplication in SQL
    4.2 Pattern 2: Windowing Analysis
    4.2.1 Generating Sample Data for Windowing Analysis
    4.2.2 Code Example: Finding Peaks and Valleys with Spark
    4.2.3 Code Example: Finding Peaks and Valleys with SQL
    4.3 Pattern 3: Time-Series-Based Updates
    4.3.1 Using HBase Versioning
    4.3.2 Using the Record Key and Start Time as the HBase Row Key
    4.3.3 Rewriting HDFS Data to Update the Whole Table
    4.3.4 Using HDFS Partitions to Store Current and Historical Records
    4.3.5 Generating Sample Time Series Data
    4.3.6 Code Example: Updating Time Series Data with Spark
    4.3.7 Code Example: Updating Time Series Data with SQL
    4.4 Summary
    Chapter 5 Graph Processing on Hadoop
    5.1 What Is a Graph
    5.2 What Is Graph Processing
    5.3 Graph Processing in Distributed Systems
    5.3.1 The Bulk Synchronous Parallel Model
    5.3.2 BSP by Example
    5.4 Giraph
    5.4.1 Data Input and Partitioning
    5.4.2 Batch Processing the Graph with BSP
    5.4.3 Writing the Graph Back to Disk
    5.4.4 Overall Flow Control
    5.4.5 When to Use Giraph
    5.5 GraphX
    5.5.1 Another Kind of RDD
    5.5.2 The GraphX Pregel Interface
    5.5.3 vprog()
    5.5.4 sendMessage()
    5.5.5 mergeMessage()
    5.6 Choosing a Tool
    5.7 Summary
    Chapter 6 Orchestration
    6.1 Why Workflow Orchestration Is Needed
    6.2 The Limits of Scripting
    6.3 Enterprise Job Schedulers and Hadoop
    6.4 Workflow Frameworks in the Hadoop Ecosystem
    6.5 Oozie Terminology
    6.6 Oozie Overview
    6.7 Oozie Workflows
    6.8 Workflow Patterns
    6.8.1 Point-to-Point Workflow
    6.8.2 Fan-Out Workflow
    6.8.3 Branching Decision Workflow
    6.9 Parameterizing Workflows
    6.10 Classpath Definition
    6.11 Scheduling Patterns
    6.11.1 Frequency-Based Scheduling
    6.11.2 Time or Data Triggers
    6.12 Executing Workflows
    6.13 Summary
    Chapter 7 Near-Real-Time Processing with Hadoop
    7.1 Stream Processing
    7.2 Apache Storm
    7.2.1 Storm High-Level Architecture
    7.2.2 Storm Topologies
    7.2.3 Tuples and Streams
    7.2.4 Spouts and Bolts
    7.2.5 Stream Groupings
    7.2.6 Reliability of Storm Applications
    7.2.7 Exactly-Once Processing
    7.2.8 Fault Tolerance
    7.2.9 Integrating Storm with HDFS
    7.2.10 Integrating Storm with HBase
    7.2.11 Storm Example: Simple Moving Average
    7.2.12 Evaluating Storm
    7.3 The Trident Interface
    7.3.1 Trident Example: Simple Moving Average
    7.3.2 Evaluating Trident
    7.4 Spark Streaming
    7.4.1 Spark Streaming Overview
    7.4.2 Spark Streaming Example: Simple Sum
    7.4.3 Spark Streaming Example: Multiple Inputs
    7.4.4 Spark Streaming Example: Maintaining State
    7.4.5 Spark Streaming Example: Window Functions
    7.4.6 Spark Streaming Example: Streaming versus ETL Code
    7.4.7 Evaluating Spark Streaming
    7.5 Flume Interceptors
    7.6 Choosing a Tool
    7.6.1 Low-Latency Enrichment, Validation, Alerting, and Ingestion
    7.6.2 NRT Counting, Rolling Averages, and Iterative Processing
    7.6.3 Complex Data Pipelines
    7.7 Summary
    Part II: Case Studies
    Chapter 8 Clickstream Analysis
    8.1 Defining the Use Case
    8.2 Using Hadoop for Clickstream Analysis
    8.3 Design Overview
    8.4 Data Storage
    8.5 Data Ingestion
    8.5.1 The Client Tier
    8.5.2 The Collector Tier
    8.6 Data Processing
    8.6.1 Data Deduplication
    8.6.2 Sessionization
    8.7 Data Analysis
    8.8 Orchestration
    8.9 Summary
    Chapter 9 Fraud Detection
    9.1 Continuous Improvement
    9.2 Taking Action
    9.3 Architectural Requirements of Fraud Detection Systems
    9.4 Introducing the Use Case
    9.5 Architecture Design
    9.6 Client Architecture
    9.7 Profile Storage and Access
    9.7.1 Caching
    9.7.2 HBase Data Definition
    9.7.3 Transaction Status Updates: Approved or Denied
    9.8 Data Ingestion
    9.9 Near-Real-Time Processing and Exploratory Analysis
    9.10 Near-Real-Time Processing
    9.11 Exploratory Analysis
    9.12 Comparison with Other Architectures
    9.12.1 Flume Interceptors
    9.12.2 Kafka to Storm or Spark Streaming
    9.12.3 Extended Business Rules Engine
    9.13 Summary
    Chapter 10 Data Warehouse
    10.1 Building a Data Warehouse with Hadoop
    10.2 Defining the Use Case
    10.3 OLTP Schema
    10.4 Data Warehouse: Introduction to Terminology
    10.5 Data Warehousing with Hadoop in Practice
    10.6 Architecture Design
    10.6.1 Data Modeling and Storage
    10.6.2 Data Ingestion
    10.6.3 Data Processing and Access
    10.6.4 Data Aggregation
    10.6.5 Data Export
    10.6.6 Workflow Scheduling
    10.7 Summary
    Appendix A Joins in Impala
    About the Authors
    About the Cover


    Posted on 2022-8-14 22:08:04
    Enough said. Thanks for sharing, OP!
    Posted on 2022-8-14 23:14:09
    Enough said. Thanks for sharing, OP!
    Posted on 2022-8-15 08:04:57
    Enough said. Thanks for sharing, OP!
    Posted on 2022-8-15 17:03:27
    Enough said. Thanks for sharing, OP!
    Posted on 2022-8-15 21:53:04
    Just what I needed. Thanks and support to the OP!
    Posted on 2022-8-16 05:26:09
    Enough said. Thanks for sharing, OP!
    Posted on 2022-8-16 05:35:51
    Just what I needed. Thanks and support to the OP!
    Posted on 2022-8-16 09:35:14
    Just what I needed. Thanks and support to the OP!
    Posted on 2022-8-18 09:55:53
    Not many replies yet, so I'll give this a little bump.