In my current setup I use a Jupyter notebook server that has a pyspark profile to use Spark. This all works great. However, I'm working on a pretty big project, and the notebook environment is lacking a bit for me. I found out that PyCharm allows you to run notebooks inside the IDE, giving you more of the advantages of a full IDE than Jupyter does.
In the best-case scenario I would run PyCharm locally rather than over remote desktop on the gateway, but using the gateway would be an acceptable alternative.
I'm trying to get it to work on the gateway first. With my (Spark) Jupyter server running and the IP address correctly set to 127.0.0.1:8888, when I create an .ipynb file, enter a line, and press enter (not running it, just adding a newline), I get the following error in the terminal I started PyCharm from:
ERROR - pplication.impl.LaterInvocator - Not a stub type: Py:IPNB_TARGET in class org.jetbrains.plugins.ipnb.psi.IpnbPyTargetExpression
Googling doesn't get me anywhere.
Below is a table of differences between Jupyter and PyCharm.

S.No. | Jupyter | PyCharm
1 | Jupyter Notebook is a web-based interactive computing platform. | PyCharm is a smart code editor.
2 | The notebook combines live code, equations, narrative text, visualizations, interactive dashboards and other media. |
To start working with Jupyter notebooks in PyCharm:
1. Create a new Python project, specify a virtual environment, and install the jupyter package (a minimal install command is sketched after this list).
2. Open or create an .ipynb file.
3. Add and edit source cells.
4. Execute any of the code cells to launch the Jupyter server.
5. Analyze execution results in the Preview pane.
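For step 1, the install itself is a one-liner; a minimal sketch, assuming pip in the PyCharm terminal points at the project's virtual environment:

pip install jupyter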
How to set up Spark for PyCharm?
1. Navigate to Project Structure, click on "Add Content Root", go to the folder where Spark is set up, and select the python folder.
2. Click on "Add Content Root" again, go to the Spark folder, expand python, expand lib, and select py4j-0.9-src.zip.
3. Apply the changes and wait for the indexing to be done.
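The same effect can be approximated in code by putting Spark's Python sources on sys.path yourself; a minimal sketch, assuming Spark is installed at /usr/local/spark (the path and the py4j version are assumptions; adjust both to match your installation):

import os
import sys

SPARK_HOME = "/usr/local/spark"  # assumed install location; adjust as needed
sys.path.append(os.path.join(SPARK_HOME, "python"))
sys.path.append(os.path.join(SPARK_HOME, "python", "lib", "py4j-0.9-src.zip"))

import pyspark  # should now resolve without PyCharm's content-root changes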
Now, we can directly launch a Jupyter Notebook instance by running the pyspark command in the terminal. Important note: always make sure to refresh the terminal environment; otherwise, the newly added environment variables will not be recognized. Then visit the provided URL, and you are ready to interact with Spark via the Jupyter Notebook.
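This works because Spark lets you swap Jupyter in as the PySpark driver via two environment variables; a minimal sketch of the shell side, assuming a bash shell and a Spark install at /usr/local/spark (both are assumptions; adjust to your setup):

# Hypothetical additions to ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Refresh the current terminal so the new variables are picked up, then launch
source ~/.bashrc
pyspark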
I was able to get all three working by installing Spark via the terminal on OS X. Then I added the following packages to the PyCharm project interpreter: findspark and pyspark. Tested it out with:
import findspark
findspark.init()  # locate the Spark installation and put it on sys.path

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

# A point drawn uniformly from the unit square falls inside the unit
# quarter-circle with probability pi/4.
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples  # scale the inside fraction back up to pi
print(pi)
sc.stop()
outputting: 3.14160028