
How to save/insert each DStream into a permanent table

I'm having a problem with Spark Streaming: inserting the output DStream into a permanent SQL table. I'd like to insert the output of every batch that Spark processes into a single table. I'm using Python with Spark 1.6.2.

At this point in my code I have a DStream made of one or more RDDs that I'd like to permanently insert/store into a SQL table, without losing any result for any processed batch.

rr = feature_and_label.join(result_zipped)\
                      .map(lambda x: (x[1][0][0], x[1][1]) )

Each DStream here is represented by, for instance, a tuple like this: (4.0, 0). I can't use SparkSQL because of the way Spark treats the 'table', that is, as a temporary table, so the result is lost at every batch.

This is an example of output:


Time: 2016-09-23 00:57:00

(0.0, 2)


Time: 2016-09-23 00:57:01

(4.0, 0)


Time: 2016-09-23 00:57:02

(4.0, 0)

...

As shown above, each batch produces only one DStream. As I said before, I'd like to permanently store these results in a table saved somewhere, and possibly query it at a later time. So my question is: is there a way to do it?
I'd appreciate it if somebody could help me out with this, and especially tell me whether it is possible or not. Thank you.

Davide Nardone asked Mar 12 '23 01:03



1 Answer

Vanilla Spark does not provide a way to persist data unless you've downloaded the version packaged with HDFS (although they appear to be playing with the idea in Spark 2.0). One way to store the results in a permanent table and query them later is to use one of the various databases in the Spark database ecosystem. Each has pros and cons, and your use case matters. I'll provide something close to a master list. The entries are segmented by:

Type of data management, form the data is stored in, connection to Spark

Database, SQL, Integrated

  • SnappyData

Database, SQL, Connector

  • MemSQL
  • Hana
  • Kudu
  • FiloDB
  • DB2
  • SQLServer (JDBC)
  • Oracle (JDBC)
  • MySQL (JDBC)

Database, NoSQL, Connector

  • Cassandra
  • HBase
  • Druid
  • Ampool
  • Riak
  • Aerospike
  • Cloudant

Database, Document, Connector

  • MongoDB
  • Couchbase

Database, Graph, Connector

  • Neo4j
  • OrientDB

Search, Document, Connector

  • Elasticsearch
  • Solr

Data grid, SQL, Connector

  • Ignite

Data grid, NoSQL, Connector

  • Infinispan
  • Hazelcast
  • Redis

File System, Files, Integrated

  • HDFS

File System, Files, Connector

  • S3
  • Alluxio

Data warehouse, SQL, Connector

  • Redshift
  • Snowflake
  • BigQuery
  • Aster
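
Whichever store you pick, the usual pattern for a DStream is to write each batch from inside foreachRDD. The following is only a minimal sketch of that pattern, not a recommendation of a particular database: it uses Python's built-in sqlite3 module as a stand-in so that it runs without any external service, and it assumes Spark runs in local mode on one machine (a file-based SQLite database is not suitable for a cluster). The names DB_PATH, predictions, save_partition and save_batch are made up for illustration; rr is the DStream from the question.

```python
import sqlite3

# Hypothetical location of the permanent table (assumption: local mode,
# so every task sees the same file system).
DB_PATH = "results.db"


def save_partition(rows, db_path=DB_PATH):
    """Append an iterator of (label, prediction) tuples to a permanent table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS predictions "
                "(label REAL, prediction INTEGER)"
            )
            conn.executemany("INSERT INTO predictions VALUES (?, ?)", rows)
    finally:
        conn.close()


def save_batch(time, rdd):
    """Meant to be passed to DStream.foreachRDD; called once per batch."""
    if not rdd.isEmpty():
        # One connection per partition rather than one per record.
        rdd.foreachPartition(save_partition)


# Wiring into the streaming job from the question (not executed here):
# rr.foreachRDD(save_batch)
```

Because rows are appended rather than overwritten, results from earlier batches remain available and can be queried later with ordinary SQL against the same file. With a server database such as MySQL the same foreachRDD structure should carry over, either by opening a client connection in save_partition or by converting each RDD to a DataFrame and appending it over JDBC; the details depend on the connector you choose.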
plamb answered Mar 24 '23 21:03
