    Book of the Week: 《Spark与Hadoop大数据分析》 (Big Data Analytics with Spark and Hadoop)

    Source: SOHU  [  Author: 中科院计算所培训中心 (Training Center, Institute of Computing Technology, Chinese Academy of Sciences)  ]  Editor: 吕秀玲

    Original title: Book of the Week: 《Spark与Hadoop大数据分析》

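    《Spark与Hadoop大数据分析》 gives a fairly systematic account of how to do big data analytics with Hadoop, Spark, and the tools in their ecosystems. It covers the fundamentals of Apache Spark and Hadoop, then goes deeper into every Spark component (Spark Core, Spark SQL, DataFrame, Dataset, Spark Streaming, Structured Streaming, MLlib, and GraphX) and the core Hadoop components (HDFS, MapReduce, and YARN). Detailed implementation examples accompany the text, making it a thorough reference for quickly learning big data analytics infrastructure and how to implement it.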

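    The book has 10 chapters. Chapter 1 explains the concepts of big data analytics from a high-level view and introduces the tools and techniques used on the Hadoop and Spark platforms, along with some of the most common use cases. Chapter 2 covers the fundamentals of the Hadoop and Spark platforms. Chapter 3 takes a deep dive into Spark. Chapter 4 focuses on the Data Sources API, the DataFrame API, and the new Dataset API. Chapter 5 explains real-time analytics with Spark Streaming. Chapter 6 introduces the notebooks and dataflows that work with Spark and Hadoop. Chapter 7 covers machine learning techniques on Spark and Hadoop. Chapter 8 shows how to build recommendation systems. Chapter 9 covers graph analytics with GraphX. Chapter 10 introduces SparkR.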

    Table of contents:

    Chapter 1: Big data analytics from a high-level view

    1.1 Big data analytics and the role of Hadoop and Spark

    1.1.1 Life cycle of a typical big data analytics project

    1.1.2 The role of Hadoop and Spark

    1.2 Big data science and the role of Hadoop and Spark

    1.2.1 The fundamental shift from data analytics to data science

    1.2.2 Life cycle of a typical data science project

    1.2.3 The role of Hadoop and Spark

    1.3 Tools and techniques

    1.4 Real-world use cases

    1.5 Summary

    Chapter 2: Getting started with Apache Hadoop and Apache Spark

    2.1 Apache Hadoop overview

    2.1.1 Hadoop Distributed File System

    2.1.2 Features of HDFS

    2.1.3 MapReduce

    2.1.4 Features of MapReduce

    2.1.5 MapReduce v1 versus MapReduce v2

    2.1.6 YARN

    2.1.7 Storage options on Hadoop

    2.2 Apache Spark overview

    2.2.1 History of Spark

    2.2.2 What Apache Spark is

    2.2.3 What Apache Spark is not

    2.2.4 Problems with MapReduce

    2.2.5 Spark architecture

    2.3 Why use Hadoop and Spark together

    2.3.1 Features of Hadoop

    2.3.2 Features of Spark

    2.4 Installing Hadoop and Spark clusters

    2.5 Summary

    Chapter 3: A deep dive into Apache Spark

    3.1 Starting Spark daemons

    3.1.1 Using CDH

    3.1.2 Using HDP, MapR, and Spark pre-built packages

    3.2 Learning the core concepts of Spark

    3.2.1 Ways to work with Spark

    3.2.2 Resilient Distributed Datasets

    3.2.3 Spark context

    3.2.4 Transformations and actions

    3.2.5 Parallelism in RDDs

    3.2.6 Lazy evaluation

    3.2.7 Lineage graph

    3.2.8 Serialization

    3.2.9 Using Hadoop file formats in Spark

    3.2.10 Data locality

    3.2.11 Shared variables

    3.2.12 Key-value pair RDDs

    3.3 Life cycle of a Spark program

    3.3.1 Pipelining

    3.3.2 Summary of Spark execution

    3.4 Spark applications

    3.4.1 Spark Shell and Spark applications

    3.4.2 Creating a Spark context

    3.4.3 SparkConf

    3.4.4 SparkSubmit

    3.4.5 Precedence order of Spark configuration settings

    3.4.6 Important application configurations

    3.5 Persistence and caching

    3.5.1 Storage levels

    3.5.2 Which storage level to choose

    3.6 Spark resource managers: Standalone, YARN, and Mesos

    3.6.1 Local and cluster modes

    3.6.2 Cluster resource managers

    3.7 Summary

    Chapter 4: Big data analytics with Spark SQL, DataFrame, and Dataset

    4.1 History of Spark SQL

    4.2 Architecture of Spark SQL

    4.3 Introducing the four components of Spark SQL

    4.4 Evolution of DataFrame and Dataset

    4.4.1 What is wrong with RDDs

    4.4.2 RDD transformations versus Dataset and DataFrame transformations

    4.5 Why use Dataset and DataFrame

    4.5.1 Optimization

    4.5.2 Speed

    4.5.3 Automatic schema discovery

    4.5.4 Multiple data sources, multiple programming languages

    4.5.5 Interoperability between RDDs and other APIs

    4.5.6 Selecting and reading only the necessary data

    4.6 When to use RDD, Dataset, and DataFrame

    4.7 Analytics with DataFrame

    4.7.1 Creating a SparkSession

    4.7.2 Creating a DataFrame

    4.7.3 Converting a DataFrame to an RDD

    4.7.4 Common Dataset/DataFrame operations

    4.7.5 Caching data

    4.7.6 Performance optimization

    4.8 Analytics with the Dataset API

    4.8.1 Creating a Dataset

    4.8.2 Converting a DataFrame to a Dataset

    4.8.3 Accessing metadata through the data dictionary

    4.9 Data Sources API

    4.9.1 Read and write functions

    4.9.2 Built-in data sources

    4.9.3 External data sources

    4.10 Using Spark SQL as a distributed SQL engine

    4.10.1 Using the Spark SQL Thrift server for JDBC/ODBC access

    4.10.2 Querying data with the beeline client

    4.10.3 Querying data from Hive with the spark-sql CLI

    4.10.4 Integration with BI tools

    4.11 Hive on Spark

    4.12 Summary

    Chapter 5: Real-time analytics with Spark Streaming and Structured Streaming

    5.1 Real-time processing overview

    5.1.1 Pros and cons of Spark Streaming

    5.1.2 History of Spark Streaming

    5.2 Architecture of Spark Streaming

    5.2.1 Spark Streaming application flow

    5.2.2 Stateless and stateful stream processing

    5.3 Spark Streaming transformations and actions

    5.3.1 union

    5.3.2 join

    5.3.3 transform operation

    5.3.4 updateStateByKey

    5.3.5 mapWithState

    5.3.6 Window operations

    5.3.7 Output operations

    5.4 Input sources and output stores

    5.4.1 Basic sources

    5.4.2 Advanced sources

    5.4.3 Custom sources

    5.4.4 Receiver reliability

    5.4.5 Output stores

    5.5 Spark Streaming with Kafka and HBase

    5.5.1 Receiver-based approach

    5.5.2 Direct approach (no receivers)

    5.5.3 Integration with HBase

    5.6 Advanced concepts of Spark Streaming

    5.6.1 Using DataFrame

    5.6.2 MLlib operations

    5.6.3 Caching/persistence

    5.6.4 Fault tolerance in Spark Streaming

    5.6.5 Performance tuning of Spark Streaming applications

    5.7 Monitoring applications

    5.8 Structured Streaming overview

    5.8.1 Workflow of a Structured Streaming application

    5.8.2 Streaming Dataset and streaming DataFrame

    5.8.3 Operations on streaming Dataset and streaming DataFrame

    5.9 Summary

    Chapter 6: Notebooks and dataflows with Spark and Hadoop

    6.1 Web-based notebooks overview

    6.2 Jupyter overview

    6.2.1 Installing Jupyter

    6.2.2 Analytics with Jupyter

    6.3 Apache Zeppelin overview

    6.3.1 Jupyter versus Zeppelin

    6.3.2 Installing Apache Zeppelin

    6.3.3 Analytics with Zeppelin

    6.4 The Livy REST job server and Hue notebooks

    6.4.1 Installing and setting up the Livy server and Hue

    6.4.2 Using the Livy server

    6.4.3 Using Livy with Hue notebooks

    6.4.4 Using Livy with Zeppelin

    6.5 Apache NiFi for dataflows overview

    6.5.1 Installing Apache NiFi

    6.5.2 Using NiFi for dataflows and analytics

    6.6 Summary

    Chapter 7: Machine learning with Spark and Hadoop

    7.1 Machine learning overview

    7.2 Machine learning on Spark and Hadoop

    7.3 Machine learning algorithms

    7.3.1 Supervised learning

    7.3.2 Unsupervised learning

    7.3.3 Recommendation systems

    7.3.4 Feature extraction and transformation

    7.3.5 Optimization

    7.3.6 Spark MLlib data types

    7.4 An example of machine learning algorithms

    7.5 Building machine learning pipelines

    7.5.1 An example of a pipeline workflow

    7.5.2 Building an ML pipeline

    7.5.3 Saving and loading models

    7.6 Machine learning with H2O and Spark

    7.6.1 Why use Sparkling Water

    7.6.2 An application flow on YARN

    7.6.3 Getting started with Sparkling Water

    7.7 Hivemall overview

    7.8 Hivemall for Spark overview

    7.9 Summary

    Chapter 8: Building recommendation systems with Spark and Mahout

    8.1 Building recommendation systems

    8.1.1 Content-based filtering

    8.1.2 Collaborative filtering

    8.2 Limitations of recommendation systems

    8.3 Implementing a recommendation system with MLlib

    8.3.1 Preparing the environment

    8.3.2 Creating RDDs

    8.3.3 Exploring the data with DataFrame

    8.3.4 Creating training and test datasets

    8.3.5 Creating a model

    8.3.6 Making predictions

    8.3.7 Evaluating the model with test data

    8.3.8 Checking the accuracy of the model

    8.3.9 Explicit and implicit feedback

    8.4 Integrating Mahout and Spark

    8.4.1 Installing Mahout

    8.4.2 Exploring the Mahout shell

    8.4.3 Building a general-purpose recommendation system with Mahout and a search tool

    8.5 Summary

    Chapter 9: Graph analytics with GraphX

    9.1 Graph processing overview

    9.1.1 What a graph is

    9.1.2 Graph databases and graph processing systems

    9.1.3 GraphX overview

    9.1.4 Graph algorithms

    9.2 Getting started with GraphX

    9.2.1 Basic operations of GraphX

    9.2.2 Transforming graphs

    9.2.3 GraphX algorithms

    9.3 Analyzing flight data with GraphX

    9.4 GraphFrames overview

    9.4.1 Motif finding

    9.4.2 Loading and saving GraphFrames

    9.5 Summary

    Chapter 10: Interactive analytics with SparkR

    10.1 R and SparkR overview

    10.1.1 What R is

    10.1.2 SparkR overview

    10.1.3 SparkR architecture

    10.2 Getting started with SparkR

    10.2.1 Installing and configuring R

    10.2.2 Using the SparkR shell

    10.2.3 Using SparkR scripts

    10.3 Using DataFrame in SparkR

    10.4 Using SparkR in RStudio

    10.5 Machine learning with SparkR

    10.5.1 Using the naive Bayes model

    10.5.2 Using the K-means model

    10.6 Using SparkR in Zeppelin

    10.7 Summary

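    To get the download link, search for and follow the WeChat official account “中科院计算所培训中心”, then add the training center's teaching assistant “zhongkeyuanjss666”, who will help you join the Chinese Academy of Sciences IT technology sharing group, where the link is shared.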
