搬运工 posted on 2022-8-4 00:00:00

Advanced Analytics with Spark, 2nd Edition (Chinese edition: Spark高级数据分析(第2版))



Recommendation
Translator's preface
Foreword
Preface
Chapter 1: Big data analytics
1.1 Challenges facing data science
1.2 Getting to know Apache Spark
1.3 About this book
1.4 Notes on the 2nd edition
Chapter 2: Data analysis with Scala and Spark
2.1 Scala for data scientists
2.2 The Spark programming model
2.3 The record linkage problem
2.4 Getting started: the Spark shell and SparkContext
2.5 Bringing data from the cluster to the client
2.6 Shipping code from the client to the cluster
2.7 From RDD to DataFrame
2.8 Analyzing data with the DataFrame API
2.9 Summary statistics for DataFrames
2.10 Pivoting and reshaping DataFrames
2.11 Joining DataFrames and selecting features
2.12 Preparing models for production
2.13 Evaluating models
2.14 Summary
Chapter 3: Music recommendation and the Audioscrobbler dataset
3.1 The dataset
3.2 The alternating least squares recommendation algorithm
3.3 Preparing the data
3.4 Building a first model
3.5 Spot-checking recommendations
3.6 Evaluating recommendation quality
3.7 Computing AUC
3.8 Choosing hyperparameters
3.9 Making recommendations
3.10 Summary
Chapter 4: Predicting forest cover with decision trees
4.1 Introduction to regression
4.2 Vectors and features
4.3 Training examples
4.4 Decision trees and decision forests
4.5 The Covtype dataset
4.6 Preparing the data
4.7 A first decision tree
4.8 Decision tree hyperparameters
4.9 Tuning decision trees
4.10 Categorical features revisited
4.11 Random decision forests
4.12 Making predictions
4.13 Summary
Chapter 5: Anomaly detection in network traffic with K-means clustering
5.1 Anomaly detection
5.2 K-means clustering
5.3 Network intrusion
5.4 The KDD Cup 1999 dataset
5.5 A first attempt at clustering
5.6 Choosing k
5.7 Visualization with SparkR
5.8 Feature normalization
5.9 Categorical variables
5.10 Using labels with entropy
5.11 Clustering in action
5.12 Summary
Chapter 6: Analyzing Wikipedia with latent semantic analysis
6.1 The document-term matrix
6.2 Getting the data
6.3 Parsing and preparing the data
6.4 Lemmatization
6.5 Computing TF-IDF
6.6 Singular value decomposition
6.7 Finding important concepts
6.8 Querying and scoring with a low-dimensional approximation
6.9 Term-term relevance
6.10 Document-document relevance
6.11 Document-term relevance
6.12 Multiple-term queries
6.13 Summary
Chapter 7: Analyzing co-occurrence networks with GraphX
7.1 Network analysis of the MEDLINE citation index
7.2 Getting the data
7.3 Parsing XML documents with Scala's XML tools
7.4 Analyzing MeSH major topics and their co-occurrences
7.5 Building a co-occurrence network with GraphX
7.6 Understanding the structure of networks
7.6.1 Connected components
7.6.2 Degree distribution
7.7 Filtering out noisy edges
7.7.1 Processing EdgeTriplets
7.7.2 Analyzing the subgraph with noisy edges removed
7.8 Small-world networks
7.8.1 Cliques and clustering coefficients
7.8.2 Computing average path length with Pregel
7.9 Summary
Chapter 8: Spatial and temporal analysis of New York City taxi trip data
8.1 Getting the data
8.2 Working with third-party libraries in Spark
8.3 Geospatial data processing with the Esri Geometry API and Spray
8.3.1 Getting to know the Esri Geometry API
8.3.2 Introduction to GeoJSON
8.4 Preprocessing the New York City taxi trip data
8.4.1 Handling invalid records at scale
8.4.2 Geospatial analysis
8.5 Sessionization in Spark
8.6 Summary
Chapter 9: Estimating financial risk with Monte Carlo simulation
9.1 Terminology
9.2 Methods for calculating VaR
9.2.1 Variance-covariance
9.2.2 Historical simulation
9.2.3 Monte Carlo simulation
9.3 Our model
9.4 Getting the data
9.5 Preprocessing the data
9.6 Determining the market factor weights
9.7 Sampling
9.8 Running the trials
9.9 Visualizing the distribution of returns
9.10 Evaluating the results
9.11 Summary
Chapter 10: Analyzing genomics data and the BDG project
10.1 Decoupling storage from modeling
10.2 Ingesting genomics data with the ADAM CLI
10.3 Predicting transcription factor binding sites from ENCODE data
10.4 Querying genotypes from the 1000 Genomes Project
10.5 Summary
Chapter 11: Analyzing neuroimaging data with PySpark and Thunder
11.1 Introduction to PySpark
11.2 Thunder toolkit overview and installation
11.3 Loading data with Thunder
11.4 Classifying neurons with Thunder
11.5 Summary
About the authors
About the cover



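taipingyang2021 posted on 2022-8-4 04:19:34

Nothing more to say, thanks to the OP for sharing!

17770767379 posted on 2022-8-4 08:25:28

Nothing more to say, thanks to the OP for sharing!

李才哥 posted on 2022-8-4 23:06:02

Nothing more to say, thanks to the OP for sharing!

neun posted on 2022-8-4 23:12:45

Nothing more to say, thanks to the OP for sharing!

hgzhou6 posted on 2022-8-6 10:23:25

Nothing more to say, thanks to the OP for sharing!

kaidada posted on 2022-8-9 23:22:13

Nothing more to say, thanks to the OP for sharing!

mayongz2023 posted on 2023-1-8 10:03:33

Nothing more to say, thanks to the OP for sharing!

kk12345678 posted on 2023-1-11 01:52:31

Nothing more to say, thanks to the OP for sharing!

牛聪聪345 posted on 2024-5-10 10:54:02

Nothing more to say, thanks to the OP for sharing!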