How to save/insert each DStream into a permanent table

Question

I've been facing a problem with Spark Streaming: inserting the output DStream into a permanent SQL table. I'd like to insert every output DStream (coming from a single batch that Spark processes) into a unique table. I'm using Python with Spark version 1.6.2.

At this point in my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table without losing any result for each processed batch.

rr = feature_and_label.join(result_zipped)\
                      .map(lambda x: (x[1][0][0], x[1][1]) )

Each DStream here is represented, for instance, by a tuple like this: (4.0, 0). I can't use SparkSQL because of the way Spark treats the 'table', that is, as a temporary table, so the result is lost at every batch.

This is an example of output:

Time: 2016-09-23 00:57:00

(0.0, 2)

Time: 2016-09-23 00:57:01

(4.0, 0)

Time: 2016-09-23 00:57:02

(4.0, 0)

...

As shown above, each batch consists of only one DStream. As I said before, I'd like to permanently store these results in a table saved somewhere, and possibly query it at a later time. So my question is: is there a way to do it?
I'd appreciate it if somebody could help me out with this, and especially tell me whether it is possible or not. Thank you.

plamb · Accepted Answer

Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results in a permanent table and query those results later is to use one of the various databases in the Spark database ecosystem. There are pros and cons to each, and your use case matters. I'll provide something close to a master list. The entries are grouped by:

Type of data management, form the data is stored in, connection to Spark

Database, SQL, Integrated

SnappyData

Database, SQL, Connector

MemSQL
Hana
Kudu
FiloDB
DB2
SQLServer (JDBC)
Oracle (JDBC)
MySQL (JDBC)

Database, NoSQL, Connector

Cassandra
HBase
Druid
Ampool
Riak
Aerospike
Cloudant

Database, Document, Connector

MongoDB
Couchbase

Database, Graph, Connector

Neo4j
OrientDB

Search, Document, Connector

Elasticsearch
Solr

Data grid, SQL, Connector

Ignite

Data grid, NoSQL, Connector

Infinispan
Hazelcast
Redis

File System, Files, Integrated

HDFS

File System, Files, Connector

S3
Alluxio

Datawarehouse, SQL, Connector

Redshift
Snowflake
BigQuery
Aster
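
As a rough illustration of the approach described above (push each processed batch into an external store, then query it later), here is a minimal sketch. It makes several assumptions that are not in the original post: SQLite from the Python standard library stands in for whichever database from the list you pick, the file name stream_results.db, the table name results and the column names label and result are placeholders, and each batch is small enough to collect() on the driver. The only Spark calls it relies on are DStream.foreachRDD and RDD.collect.

```python
import sqlite3

DB_PATH = "stream_results.db"  # hypothetical location of the permanent table


def save_batch(batch_time, rows, db_path=DB_PATH):
    """Append one batch of 2-tuples such as (4.0, 0) to a permanent table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS results ("
                "batch_time TEXT, label REAL, result INTEGER)"
            )
            conn.executemany(
                "INSERT INTO results VALUES (?, ?, ?)",
                [(batch_time, first, second) for first, second in rows],
            )
    finally:
        conn.close()


def save_rdd(batch_time, rdd, db_path=DB_PATH):
    """Callback for DStream.foreachRDD; it runs on the driver once per batch."""
    rows = rdd.collect()  # assumption: batches are tiny, as in the question
    if rows:
        save_batch(str(batch_time), rows, db_path)


# In the streaming job (not executed here), register the callback on the
# DStream from the question before starting the StreamingContext:
# rr.foreachRDD(save_rdd)
```

Because every batch is appended with its batch time, nothing is overwritten between batches, and the table can be queried later with plain SQL, for example SELECT * FROM results ORDER BY batch_time. For larger batches or a real cluster, a server database reached through one of the connectors listed above would be a better fit than a local SQLite file.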

Tags: apache-spark, apache-spark-sql, pyspark, spark-streaming

Davide Nardone
