Saving/Exporting the results of a Spark SQL Zeppelin query

We're using Apache Zeppelin to analyse our datasets. Some of the queries we'd like to run return a large number of results, but Zeppelin's display is limited to 1000 rows. Is there an easy way to have Zeppelin save all the results of a query, perhaps to an S3 bucket?

asked Sep 07 '16 by vcetinick


1 Answer

I managed to whip up a notebook that effectively does what I want, using the Scala interpreter.

// Load the spark-csv package (needed on Spark 1.x, which has no built-in CSV writer)
z.load("com.databricks:spark-csv_2.10:1.4.0")

val df = sqlContext.sql("""
select * from table
""")

// repartition(1) collapses the result into a single output file on S3
df.repartition(1).write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save("s3://amazon.bucket.com/csv_output/")

It's worth mentioning that the z.load function seemed to work for me one day, but when I tried it again, for some reason I had to declare it in its own paragraph using the %dep interpreter, and then run the remaining code in the standard Scala interpreter.
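For what it's worth, on Spark 2.0 and later the CSV writer is built into DataFrameWriter, so the spark-csv package and the z.load/%dep step shouldn't be needed at all. A minimal sketch, assuming a Spark 2.x Zeppelin interpreter (the table name and bucket path are placeholders, not from the original setup):

```scala
// Sketch for Spark 2.0+: CSV support is native, no external package required.
// "table" and the S3 path below are placeholders; adjust to your environment.
val df = spark.sql("select * from table")

df.coalesce(1)                      // single output file; drop this for very large results
  .write
  .option("header", "true")
  .csv("s3://amazon.bucket.com/csv_output/")
```

Note that coalesce(1) (or repartition(1)) forces all rows through a single task, which can be slow or run out of memory for very large results; leaving the data in multiple part files and combining them afterwards scales better.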

answered Oct 19 '22 by vcetinick