 

How to read parquet data from S3 to spark dataframe Python?


I am new to Spark and have not been able to find an answer to this. I have a lot of Parquet files uploaded to S3 at this location:

s3://a-dps/d-l/sco/alpha/20160930/parquet/ 

The total size of this folder is 20+ GB. How do I chunk the data and read it into a DataFrame? How do I load all these files into a single DataFrame?

The memory allocated to the Spark cluster is 6 GB.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas

    # SparkConf().set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error:

    Py4JJavaError: An error occurred while calling o33.parquet.
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
asked Jun 19 '17 by Viv


People also ask

How do I read a Parquet file from S3 with Spark?

Spark Read Parquet file from Amazon S3 into DataFrame: similar to write, DataFrameReader provides a parquet() function (spark.read.parquet) to read Parquet files from an Amazon S3 bucket and create a Spark DataFrame. In this example snippet, we read data from an Apache Parquet file written earlier.
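A minimal sketch of that call, with a hypothetical bucket and prefix (the s3a:// scheme assumes the hadoop-aws connector is on the classpath, as covered further down):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet").getOrCreate()

    # Reads every Parquet file under the prefix into a single DataFrame
    df = spark.read.parquet("s3a://my-bucket/path/to/parquet/")
    df.printSchema()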

How do I extract data from a Parquet file in Python?

With the query results stored in a DataFrame, we can use petl to extract, transform, and load the Parquet data. In this example, we extract the Parquet data, sort it by the Column1 column, and load it into a CSV file.
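A minimal sketch of that flow, assuming pyarrow is installed and the file has a Column1 column (the file names are placeholders):

    import pandas as pd
    import petl as etl

    # Load the Parquet data into a pandas DataFrame (requires pyarrow or fastparquet)
    df = pd.read_parquet("data.parquet")

    # Wrap the DataFrame as a petl table, sort by Column1, and write the result to CSV
    table = etl.fromdataframe(df)
    etl.tocsv(etl.sort(table, "Column1"), "output.csv")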

How does Spark read data from S3 on AWS?

Accessing an S3 bucket through Spark: it is not as simple as adding the Spark core dependencies to your project and calling spark.read to read data from an S3 bucket. You also need to add the aws-java-sdk and hadoop-aws packages to your spark-shell, as shown in the command below.
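For example (the hadoop-aws version here is illustrative; it must match the Hadoop version your Spark build ships with, and --packages pulls in aws-java-sdk as a transitive dependency):

    # Scala shell
    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3

    # PySpark equivalent
    pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3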


1 Answer

Since Spark 2.0, you should use SparkSession instead of SQLContext:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master("local") \
        .appName("app name") \
        .config("spark.some.config.option", True) \
        .getOrCreate()

    df = spark.read.parquet("s3://path/to/parquet/file.parquet")
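Note that the "No FileSystem for scheme: s3" error in the question means the S3 connector itself is missing from the classpath. A sketch of one common fix, assuming the hadoop-aws package (the version shown is illustrative and must match your Hadoop build) and the s3a:// scheme:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("app name") \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3") \
        .config("spark.hadoop.fs.s3a.access.key", "A") \
        .config("spark.hadoop.fs.s3a.secret.key", "s") \
        .getOrCreate()

    df = spark.read.parquet("s3a://path/to/parquet/")

Also, Spark reads Parquet in partitions, so the 20+ GB dataset does not have to fit into the cluster's 6 GB of memory at once unless you cache or collect it.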
answered Oct 14 '22 by Artem