Running PySpark on Linux: How to Use PySpark

Tags: running PySpark on Linux

Configuring PySpark in PyCharm

On Windows the following errors occur; on Linux the same code should run normally.

C:\ProgramData\Anaconda3\envs\tensorflow\python.exe E:/github/data-analysis/tf/SparkTest.py

2018-07-19 10:35:41 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)

at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)

at org.apache.hadoop.util.Shell.&lt;clinit&gt;(Shell.java:387)

at org.apache.hadoop.util.StringUtils.&lt;clinit&gt;(StringUtils.java:80)

at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)

at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)

at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)

2018-07-19 10:35:41 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Binarizer output with Threshold = 5.100000

[Stage 0:> (0 + 1) / 1]2018-07-19 10:35:51 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)

java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)

at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)

at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)

at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)

Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessImpl.create(Native Method)

at java.lang.ProcessImpl.&lt;init&gt;(Unknown Source)

at java.lang.ProcessImpl.start(Unknown Source)

... 35 more

2018-07-19 10:35:51 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)

at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)

at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)

at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)

at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessImpl.create(Native Method)

at java.lang.ProcessImpl.&lt;init&gt;(Unknown Source)

at java.lang.ProcessImpl.start(Unknown Source)

... 35 more

2018-07-19 10:35:51 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job

Traceback (most recent call last):

File "E:/github/data-analysis/tf/SparkTest.py", line 21, in

binarizedDataFrame.show()

File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\dataframe.py", line 350, in show

print(self._jdf.showString(n, 20, vertical))

File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__

File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco

return f(*a, **kw)

File "E:\hadoop-common\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o49.showString.

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)

at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)

at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)

at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)

at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessImpl.create(Native Method)

at java.lang.ProcessImpl.&lt;init&gt;(Unknown Source)

at java.lang.ProcessImpl.start(Unknown Source)

... 35 more

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)

at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)

at

Caused by: java.io.IOException: Cannot run program "python": CreateProcess error=2, 系统找不到指定的文件。

at java.lang.ProcessBuilder.start(Unknown Source)

at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)

at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76)

at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)

at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86)

at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

Caused by: java.io.IOException: CreateProcess error=2, 系统找不到指定的文件。

Process finished with exit code

These failures are caused by the Windows operating system: Spark cannot find `winutils.exe` in the Hadoop binary path, and the executor cannot launch the `python` worker process.
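If you do want to run on Windows anyway, the usual workaround is to point Spark at a directory containing `bin\winutils.exe` and to pin the worker interpreter explicitly. A minimal sketch, reusing the paths that appear in the log above (adjust both to your own installation; `winutils.exe` must actually exist under `%HADOOP_HOME%\bin`):

```python
import os

# 1) "Failed to locate the winutils binary": HADOOP_HOME must point at a
#    directory whose bin\ subfolder contains winutils.exe.
os.environ["HADOOP_HOME"] = r"E:\hadoop-common"

# 2) 'Cannot run program "python"': the executor spawns a worker with the
#    bare command "python", which must resolve; pinning the interpreter
#    avoids depending on PATH.
os.environ["PYSPARK_PYTHON"] = r"C:\ProgramData\Anaconda3\envs\tensorflow\python.exe"
os.environ["PYSPARK_DRIVER_PYTHON"] = os.environ["PYSPARK_PYTHON"]

# These assignments must run before the SparkContext/SparkSession is
# created, e.g. at the very top of SparkTest.py.
```

Note the assignments are only read when the JVM and the Python workers start, so they have no effect if set after `SparkSession.builder.getOrCreate()`.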

Using PySpark in a Jupyter Notebook with Anaconda

Open Anaconda Navigator, where the `pyspark` package can be installed. It then runs in local mode without any Hadoop configuration.
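The same setup can be done from the command line instead of Navigator. A sketch, assuming the target conda environment's `pip` is active (a Java runtime must still be installed for Spark to start):

```shell
# Install the standalone pyspark package into the current environment.
# It bundles Spark itself, so no separate Spark/Hadoop download is
# needed for local-mode use.
pip install pyspark

# Start Jupyter; "from pyspark.sql import SparkSession" then works in a cell.
jupyter notebook
```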


Additional references:

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/weixin_33962326/article/details/116823068
