How to Connect Python to Spark Session and Keep RDDs Alive

How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?

I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.

I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?

Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
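For example, a minimal throwaway script like the one below (the file path and names are placeholders, not from my actual project) builds its own SparkContext, and that context, along with every RDD created on it, disappears as soon as the Python process exits:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("ThrowawayJob")
sc = SparkContext(conf=conf)

# any RDD built here lives only inside this driver process
lines = sc.textFile("data/sample.txt")
print(lines.count())

sc.stop()  # even without an explicit stop, the context dies when the script ends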

So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf = conf)

I started Pyspark. In a separate script in Visual Studio, I used this code for the SparkContext. I loaded a text file into an RDD named RDDFromFilename, but I couldn't access that RDD in the Pyspark shell once the script had run.
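As far as I can tell, matching the shell's defaults doesn't attach my script to the shell's JVM at all; it just starts a second, independent local driver. Roughly what the two sides looked like (the file path is a placeholder):

# the script run from Visual Studio -- this creates a *new* local SparkContext,
# even though master and appName match the running Pyspark shell
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
RDDFromFilename = sc.textFile("C:/data/somefile.txt")  # placeholder path
print(RDDFromFilename.count())
# when this process exits, its context and its RDDs are gone

# meanwhile, in the still-running Pyspark shell:
# >>> RDDFromFilename
# NameError: name 'RDDFromFilename' is not defined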

How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?

asked Nov 06 '16 by Joseph Pride

2 Answers

There is no built-in solution for this in Spark itself. You may consider:

  • To keep persistent RDDs:

    • Apache Ignite
  • To keep persistent shared context:

    • spark-jobserver
    • livy - https://github.com/cloudera/livy (see the sketch after this list)
    • mist - https://github.com/Hydrospheredata/mist
  • To share context with notebooks:

    • Apache Zeppelin

I think that out of these only Zeppelin officially supports Windows.
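To illustrate the shared-context approach, here is a minimal sketch of talking to a Livy server over its REST API from Python with the requests library. The host/port, file path, and variable names are assumptions; Livy must already be running against your Spark installation:

import json
import time
import requests

LIVY = "http://localhost:8998"  # assumed default Livy host and port
HEADERS = {"Content-Type": "application/json"}

# start one long-lived PySpark session that outlives any single client script
resp = requests.post(LIVY + "/sessions", headers=HEADERS,
                     data=json.dumps({"kind": "pyspark"}))
session = LIVY + "/sessions/" + str(resp.json()["id"])

# wait for the session to come up, then run a statement inside it
while requests.get(session, headers=HEADERS).json()["state"] != "idle":
    time.sleep(1)

code = "rdd = sc.textFile('data/sample.txt')\nprint(rdd.count())"  # hypothetical path
resp = requests.post(session + "/statements", headers=HEADERS,
                     data=json.dumps({"code": code}))
statement = session + "/statements/" + str(resp.json()["id"])
print(requests.get(statement, headers=HEADERS).json())

Because rdd lives inside the Livy session rather than in the client process, a second script can POST another statement to the same session URL and keep working with it.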

answered Oct 16 '22 by 8cf0a2f5

For those who may follow: I've recently discovered SnappyData.

SnappyData is still fairly young and there's a bit of a learning curve, but what it promises is persistent, mutable SQL tables that can be shared between multiple Spark jobs and accessed natively as RDDs and DataFrames. It also has a job server that you can submit concurrent jobs to.

It's essentially a GemFire in-memory database combined with Spark executors colocated in the same JVM, so (once I get decent at managing it) I can do large tasks without a single-machine bottleneck for piping data in and out of Spark, and I can even do live data manipulation while another Spark program is running on the same data.
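A minimal sketch of what that might look like in Python, assuming SnappyData's bundled Spark distribution; the SnappySession import path, table name, and schema here are illustrative assumptions, not anything from my actual project:

from pyspark import SparkConf, SparkContext
from pyspark.sql.snappy import SnappySession  # assumed import path in SnappyData's Spark build

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("SnappyJob"))
snappy = SnappySession(sc)

# a persistent, mutable SQL table that other Spark jobs can see and modify
snappy.sql("CREATE TABLE IF NOT EXISTS events (id INT, payload STRING) USING column")
snappy.sql("INSERT INTO events VALUES (1, 'hello')")

# a later job (or a concurrent one) reads the same table back as a DataFrame
df = snappy.table("events")
df.show()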

I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.

answered Oct 16 '22 by Joseph Pride