How to specify the path where saveAsTable saves files to?

I am trying to save a DataFrame to S3 from PySpark on Spark 1.4, using DataFrameWriter:

import pyspark

df = sqlContext.read.format("json").load("s3a://somefile")
df_writer = pyspark.sql.DataFrameWriter(df)
df_writer.partitionBy('col1')\
         .saveAsTable('test_table', format='parquet', mode='overwrite')

The Parquet files ended up in "/tmp/hive/warehouse/....", which is a local tmp directory on my driver.

I did set hive.metastore.warehouse.dir in hive-site.xml to an "s3a://...." location, but Spark doesn't seem to respect my Hive warehouse setting.
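For reference, the hive-site.xml entry looks roughly like this (the s3a path below is a placeholder, not my actual location):

<property>
  <name>hive.metastore.warehouse.dir</name>
  <!-- placeholder bucket/prefix -->
  <value>s3a://my-bucket/warehouse</value>
</property>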

asked Jun 16 '15 by ChromeHearts

People also ask

How do you write a DataFrame to a local file system?

Write a single file using Spark coalesce() or repartition(): when you are ready to write the DataFrame, first use repartition() or coalesce() to merge the data from all partitions into a single partition, then save it to a file (see the sketch below).
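A minimal sketch of that approach, assuming df is an existing DataFrame; the output path is a placeholder:

# Merge all partitions into one so the output is a single file,
# then write it out. 'file:///tmp/output' is a placeholder path.
df.coalesce(1).write.mode('overwrite').json('file:///tmp/output')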

How do I load data into Spark?

The LOAD DATA statement loads data into a table from a user-specified directory or file. If a directory is specified, all files in that directory are loaded; if a file is specified, only that single file is loaded. LOAD DATA also takes an optional partition specification (see the sketch below).
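A sketch of issuing LOAD DATA through Spark SQL; the input path and table name are placeholders, and in Spark 1.4 this requires a Hive-enabled context (HiveContext):

# Assumes a Hive-enabled context (HiveContext in Spark 1.4).
# '/data/input.json' and 'test_table' are placeholder names.
sqlContext.sql("LOAD DATA INPATH '/data/input.json' INTO TABLE test_table")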


1 Answer

Use the path keyword argument:

# path= sends the table's files to S3 instead of the warehouse dir
df_writer.partitionBy('col1')\
         .saveAsTable('test_table', format='parquet', mode='overwrite',
                      path='s3a://bucket/foo')
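With an explicit path, saveAsTable still registers test_table in the metastore, but the data files land at the given S3 location rather than under hive.metastore.warehouse.dir, i.e. the table behaves as an external (unmanaged) table.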
answered Sep 20 '22 by ChromeHearts