TL Python数据处理 (Data Wrangling with Python) (.pdf)
Preface ..... xiii
Chapter 1. Introduction to Python ..... 1
1.1 Why Python ..... 4
1.2 Getting Started with Python ..... 4
1.2.1 Choosing a Python Version ..... 5
1.2.2 Installing Python ..... 6
1.2.3 Testing Python ..... 9
1.2.4 Installing pip ..... 11
1.2.5 Installing a Code Editor ..... 12
1.2.6 Installing IPython (Optional) ..... 13
1.3 Summary ..... 13
Chapter 2. Python Basics ..... 14
2.1 Basic Data Types ..... 15
2.1.1 Strings ..... 15
2.1.2 Integers and Floats ..... 15
2.2 Data Containers ..... 18
2.2.1 Variables ..... 18
2.2.2 Lists ..... 21
2.2.3 Dictionaries ..... 22
2.3 What the Various Data Types Can Do ..... 23
2.3.1 String Methods: Things Strings Can Do ..... 24
2.3.2 Numerical Methods: Things Numbers Can Do ..... 25
2.3.3 List Methods: Things Lists Can Do ..... 26
2.3.4 Dictionary Methods: Things Dictionaries Can Do ..... 27
viii | Contents
2.4 Helpful Tools: type, dir, and help ..... 28
2.4.1 type ..... 28
2.4.2 dir ..... 28
2.4.3 help ..... 30
2.5 Putting It All Together ..... 31
2.6 What the Code Means ..... 32
2.7 Summary ..... 33
Chapter 3. Data Meant to Be Read by Machines ..... 34
3.1 CSV Data ..... 35
3.1.1 How to Import CSV Data ..... 36
3.1.2 Saving the Code to a File and Running It from the Command Line ..... 39
3.2 JSON Data ..... 41
3.3 XML Data ..... 44
3.4 Summary ..... 56
Chapter 4. Working with Excel Files ..... 58
4.1 Installing Python Packages ..... 58
4.2 Parsing Excel Files ..... 59
4.3 Getting Started with Parsing ..... 60
4.4 Summary ..... 71
Chapter 5. Working with PDF Files, and Problem Solving in Python ..... 73
5.1 Avoid Using PDFs Where Possible ..... 73
5.2 Programmatic Approaches to PDF Parsing ..... 74
5.2.1 Opening and Reading PDFs with the slate Library ..... 75
5.2.2 Converting PDF to Text ..... 77
5.3 Parsing PDFs with pdfminer ..... 78
5.4 Learning How to Solve Problems ..... 92
5.4.1 Exercise: Use Table Extraction, Try a Different Library ..... 94
5.4.2 Exercise: Clean the Data Manually ..... 98
5.4.3 Exercise: Try Another Tool ..... 98
5.5 Uncommon File Types ..... 101
5.6 Summary ..... 101
Chapter 6. Acquiring and Storing Data ..... 103
6.1 Not All Data Is Created Equal ..... 103
6.2 Fact Checking ..... 104
6.3 Data Readability, Cleanliness, and Longevity ..... 105
6.4 Finding Data ..... 105
6.4.1 Using a Telephone ..... 105
6.4.2 US Government Data ..... 106
6.4.3 Government and Civic Open Data Worldwide ..... 107
6.4.4 Organization and NGO Data ..... 109
6.4.5 Education and University Data ..... 109
6.4.6 Medical and Scientific Data ..... 109
6.4.7 Crowdsourced Data and APIs ..... 110
6.5 Case Studies: Example Data Investigations ..... 111
6.5.1 Ebola Crisis ..... 111
6.5.2 Train Safety ..... 111
6.5.3 Football Player Salaries ..... 112
6.5.4 Child Labor ..... 112
6.6 Storing Data ..... 113
6.7 A Brief Introduction to Databases ..... 113
6.7.1 Relational Databases: MySQL and PostgreSQL ..... 114
6.7.2 Non-Relational Databases: NoSQL ..... 116
6.7.3 Creating a Local Database with Python ..... 117
6.8 Using Simple Files ..... 118
6.8.1 Cloud Storage and Python ..... 118
6.8.2 Local Storage and Python ..... 119
6.9 Alternative Data Storage ..... 119
6.10 Summary ..... 119
Chapter 7. Data Cleanup: Investigation, Matching, and Formatting ..... 121
7.1 Why Clean Data ..... 121
7.2 Data Cleanup Basics ..... 122
7.2.1 Identifying Data That Needs Cleanup ..... 123
7.2.2 Formatting Data ..... 131
7.2.3 Finding Outliers and Bad Data ..... 135
7.2.4 Finding Duplicates ..... 140
7.2.5 Fuzzy Matching ..... 143
7.2.6 Regular Expression Matching ..... 146
7.2.7 What to Do with Duplicate Records ..... 150
7.3 Summary ..... 151
Chapter 8. Data Cleanup: Standardizing and Scripting ..... 153
8.1 Normalizing and Standardizing Data ..... 153
8.2 Saving Data ..... 154
8.3 Determining the Right Data Cleanup Method for Your Project ..... 156
8.4 Scripting Your Cleanup ..... 157
8.5 Testing with New Data ..... 170
8.6 Summary ..... 172
Chapter 9. Data Exploration and Analysis ..... 173
9.1 Exploring Data ..... 173
9.1.1 Importing Data ..... 174
9.1.2 Exploring Table Functions ..... 179
9.1.3 Joining Multiple Datasets ..... 182
9.1.4 Identifying Correlations ..... 186
9.1.5 Identifying Outliers ..... 187
9.1.6 Creating Groupings ..... 189
9.1.7 Further Exploration ..... 192
9.2 Analyzing Data ..... 193
9.2.1 Separating and Focusing Your Data ..... 194
9.2.2 What Is Your Data Saying? ..... 196
9.2.3 Drawing Conclusions ..... 196
9.2.4 Documenting Your Conclusions ..... 197
9.3 Summary ..... 197
Chapter 10. Presenting Data ..... 199
10.1 Avoiding Storytelling Pitfalls ..... 199
10.1.1 How to Tell the Story ..... 200
10.1.2 Know Your Audience ..... 200
10.2 Visualizing Data ..... 201
10.2.1 Charts ..... 201
10.2.2 Time-Related Data ..... 207
10.2.3 Maps ..... 208
10.2.4 Interactive Elements ..... 211
10.2.5 Words ..... 212
10.2.6 Images, Video, and Illustrations ..... 212
10.3 Presentation Tools ..... 213
10.4 Publishing Data ..... 213
10.4.1 Using Available Sites ..... 213
10.4.2 Open Source Platforms: Starting a New Site ..... 215
10.4.3 Jupyter (Formerly IPython Notebook) ..... 216
10.5 Summary ..... 219
Chapter 11. Web Scraping: Acquiring and Storing Data from the Web ..... 221
11.1 What to Scrape and How ..... 221
11.2 Analyzing a Web Page ..... 223
11.2.1 Inspection: Markup Structure ..... 224
11.2.2 Network/Timeline: How the Page Loads ..... 230
11.2.3 Console: Interacting with JavaScript ..... 232
11.2.4 In-Depth Analysis of a Page ..... 236
11.3 Getting Pages: How to Make Requests over the Internet ..... 237
11.4 Reading a Web Page with Beautiful Soup ..... 238
11.5 Reading a Web Page with lxml ..... 241
11.6 Summary ..... 249
Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders ..... 251
12.1 Browser-Based Parsing ..... 251
12.1.1 Screen Reading with Selenium ..... 252
12.1.2 Screen Reading with Ghost.py ..... 260
12.2 Spidering the Web ..... 266
12.2.1 Building a Spider with Scrapy ..... 266
12.2.2 Crawling Whole Websites with Scrapy ..... 273
12.3 Networks: How the Internet Works, and Why It Breaks Your Script ..... 281
12.4 The Changing Web (or Why Your Script Broke) ..... 283
12.5 A Few Words of Caution ..... 284
12.6 Summary ..... 284
Chapter 13. Application Programming Interfaces ..... 286
13.1 API Features ..... 287
13.1.1 REST APIs Versus Streaming APIs ..... 287
13.1.2 Rate Limits ..... 287
13.1.3 Tiered Data Volumes ..... 288
13.1.4 API Keys and Tokens ..... 289
13.2 A Simple Data Pull from the Twitter REST API ..... 290
13.3 Advanced Data Collection with the Twitter REST API ..... 292
13.4 Advanced Data Collection with the Twitter Streaming API ..... 295
13.5 Summary ..... 297
Chapter 14. Automation and Scaling ..... 298
14.1 Why Automate ..... 298
14.2 Steps to Automate ..... 299
14.3 What Could Go Wrong ..... 301
14.4 Where to Automate ..... 302
14.5 Special Tools for Automation ..... 303
14.5.1 Using Local Files, Arguments, and Config Files ..... 303
14.5.2 Using the Cloud for Data Processing ..... 308
14.5.3 Using Parallel Processing ..... 310
14.5.4 Using Distributed Processing ..... 312
14.6 Simple Automation ..... 313
14.6.1 CronJobs ..... 314
14.6.2 Web Interfaces ..... 316
14.6.3 Jupyter Notebook ..... 316
14.7 Large-Scale Automation ..... 317
14.7.1 Celery: Queue-Based Automation ..... 317
14.7.2 Ansible: Operations Automation ..... 318
14.8 Monitoring Your Automation ..... 319
14.8.1 Python Logging ..... 320
14.8.2 Adding Automated Messaging ..... 322
14.8.3 Uploading and Other Reporting ..... 326
14.8.4 Logging and Monitoring Services ..... 327
14.9 No System Is Foolproof ..... 328
14.10 Summary ..... 328
Chapter 15. Conclusion ..... 330
15.1 Duties of a Data Wrangler ..... 330
15.2 Beyond Data Wrangling ..... 331
15.2.1 Become a Better Data Analyst ..... 331
15.2.2 Become a Better Developer ..... 331
15.2.3 Become a Better Visual Storyteller ..... 332
15.2.4 Become a Better Systems Architect ..... 332
15.3 What to Do Next ..... 332
Appendix A. Comparison of Programming Languages ..... 334
Appendix B. Python Learning Resources for Beginners ..... 336
Appendix C. Learning the Command Line ..... 338
Appendix D. Advanced Python Setup ..... 349
Appendix E. Python Gotchas ..... 361
Appendix F. IPython Guide ..... 370
Appendix G. Using Amazon Web Services ..... 374
About the Authors ..... 378
About the Cover ..... 378