
How to link PyCharm with PySpark?

I'm new to Apache Spark, and I have apparently installed apache-spark with Homebrew on my MacBook:

Last login: Fri Jan  8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

I would like to start playing with it in order to learn more about MLlib. However, I use PyCharm to write scripts in Python. The problem is that when I go to PyCharm and try to import pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

(screenshot: can't link PyCharm with Spark)

Then, following a blog post, I tried this:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")

except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)

I still cannot use PySpark from PyCharm. Any idea how to "link" PyCharm with apache-pyspark?

Update:

Then I looked up the apache-spark and python paths in order to set PyCharm's environment variables:

apache-spark path:

user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb

python path:

user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then with the above information I tried to set the environment variables as follows:

(screenshot: configuration 1)

Any idea how to correctly link PyCharm with pyspark?

When I run a Python script with the above configuration, I get this exception:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark

UPDATE: Then I tried the configurations proposed by @zero323:

Configuration 1:

/usr/local/Cellar/apache-spark/1.5.1/  

(screenshot: configuration 1)

out:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE                libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/

Configuration 2:

/usr/local/Cellar/apache-spark/1.5.1/libexec  

(screenshot: configuration 2)

out:

user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/        bin/      data/     examples/ python/
RELEASE   conf/     ec2/      lib/      sbin/
asked Jan 08 '16 by tumbleweed



2 Answers

With PySpark package (Spark 2.2.0 and later)

With SPARK-1267 merged, you should be able to simplify the process by pip-installing PySpark in the environment you use for PyCharm development; a quick smoke test is sketched after the steps below.

  1. Go to File -> Settings -> Project Interpreter
  2. Click on the install button and search for PySpark


  3. Click on the install package button.
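
A quick smoke test (my own sketch, not part of the original answer): once the pyspark package is installed into the interpreter, a script like the following should run inside PyCharm. The app name is arbitrary.

from pyspark.sql import SparkSession

# Build a local SparkSession; "local[*]" runs Spark inside the IDE process
# using all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pycharm-smoke-test")
         .getOrCreate())

# A trivial job: if this prints 5, PyCharm is correctly linked to PySpark.
print(spark.range(5).count())

spark.stop()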

Manually with a user-provided Spark installation

Create a Run configuration:

  1. Go to Run -> Edit configurations
  2. Add a new Python configuration
  3. Set the Script path so that it points to the script you want to execute
  4. Edit the Environment variables field so that it contains at least:

    • SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version-src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 for Spark 1.5, 0.9 for 1.6, 0.10.3 for 2.0, 0.10.4 for 2.1 and 2.2, 0.10.6 for 2.3, 0.10.7 for 2.4). A script-level sketch of these settings appears after step 5.


  5. Apply the settings
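
For reference, here is a rough script-level equivalent of those two variables (my own sketch, using a hypothetical Homebrew path; point SPARK_HOME at your own installation). Setting them in the Run configuration is cleaner, but this shows what the settings do.

import glob
import os
import sys

# Same role as the SPARK_HOME variable in the Run configuration.
os.environ["SPARK_HOME"] = "/usr/local/Cellar/apache-spark/1.5.1/libexec"

# Same role as PYTHONPATH: expose Spark's Python sources and the bundled Py4J.
spark_python = os.path.join(os.environ["SPARK_HOME"], "python")
py4j = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))[0]
sys.path[:0] = [spark_python, py4j]

from pyspark import SparkContext  # should now resolve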

Add PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open the settings for the interpreter you want to use with Spark
  3. Edit the interpreter paths so that they include the path to $SPARK_HOME/python (and Py4J if required)
  4. Save the settings

Optionally

  1. Install, or add to the path, type annotations matching the installed Spark version to get better completion and static error detection (disclaimer: I am an author of the project).

Finally

Use the newly created configuration to run your script.
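
For example, a minimal script (my own example, any equivalent script works) that the new configuration can point to:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pycharm-example").setMaster("local[2]")
sc = SparkContext(conf=conf)

# Trivial job: count the even numbers in 0..99; expected output is 50.
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())

sc.stop()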

answered by zero323


Here's how I solved this on Mac OS X.

  1. brew install apache-spark
  2. Add this to ~/.bash_profile

    export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
    export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
  3. Add pyspark and py4j to the content root (use the correct Spark version); a quick verification sketch follows these steps:

    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
    /usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
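
As a rough verification (my own sketch, assuming the exports above were picked up by the shell that launched PyCharm), the following should print the Spark home and finish a trivial job:

import os
import sys

# Confirm the .bash_profile exports reached this process.
print(os.environ.get("SPARK_HOME"))
print([p for p in sys.path if "spark" in p or "py4j" in p])

from pyspark import SparkContext

sc = SparkContext("local", "content-root-check")
print(sc.parallelize([1, 2, 3]).count())  # expected: 3
sc.stop()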


answered by sthomps