I am new to Spark and I have not been able to find an answer to this: I have a lot of parquet files uploaded into S3 at this location:
s3://a-dps/d-l/sco/alpha/20160930/parquet/
The total size of this folder is 20+ GB. How do I chunk and load all of these files into a dataframe? The memory allocated to the Spark cluster is 6 GB.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
import pandas

# SparkConf().set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')
sqlContext = SQLContext(sc)

df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")
Error:

Py4JJavaError: An error occurred while calling o33.parquet.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:344)
Spark can read Parquet files from Amazon S3 straight into a DataFrame: similar to write, DataFrameReader provides a parquet() function (spark.read.parquet) that reads the Parquet files from an Amazon S3 bucket and creates a Spark DataFrame.
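A minimal sketch of that call (the bucket path is a placeholder, and it assumes the S3A connector and credentials discussed below are already configured). Pointing the reader at the folder is enough; there is no need to chunk the 20+ GB manually, because Spark splits the scan across partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-from-s3").getOrCreate()

# Reads every Parquet part file under the folder as one DataFrame;
# the data is processed partition by partition, not loaded at once.
df = spark.read.parquet("s3a://your-bucket/path/to/parquet/")  # placeholder path

df.printSchema()
print(df.count())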
Accessing an S3 bucket through Spark: it is not enough to add only the Spark core dependencies to your project and call spark.read to read your data from an S3 bucket. You also have to add the AWS Java SDK along with the Hadoop-AWS package to your spark-shell or pyspark session, as shown in the command below.
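A sketch of that setup. The versions here are assumptions: hadoop-aws must match the Hadoop version bundled with your Spark build, and aws-java-sdk must match that hadoop-aws release.

# From the shell, for example:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4
#
# Or request the same packages from Python, before any SparkSession exists:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-setup")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")  # versions are assumptions
    .getOrCreate()
)

The hadoop-aws package supplies the S3A connector, so the paths should use the s3a:// scheme; a plain s3:// URI has no filesystem registered in a stock Apache Hadoop build, which is exactly the "No FileSystem for scheme: s3" error above.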
You have to use SparkSession instead of SQLContext since Spark 2.0:
spark = (
    SparkSession.builder
    .master("local")
    .appName("app name")
    .config("spark.some.config.option", True)
    .getOrCreate()
)

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
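That snippet alone will still hit the "No FileSystem for scheme: s3" error from the question unless an S3 connector is on the classpath. An end-to-end sketch, assuming hadoop-aws and aws-java-sdk were added as shown above, with placeholder credentials and path:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-parquet-from-s3")
    # assumes hadoop-aws/aws-java-sdk are on the classpath (--packages or spark.jars.packages)
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder
    .getOrCreate()
)

# Read every Parquet part file under the folder; Spark plans the scan
# across partitions, so the data does not have to fit in 6 GB at once.
df = spark.read.parquet("s3a://your-bucket/path/to/parquet/")      # placeholder path
df.show(5)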