I am new to Spark and I have not been able to find an answer to this: I have a lot of parquet files uploaded into S3 at this location:
s3://a-dps/d-l/sco/alpha/20160930/parquet/
The total size of this folder is 20+ GB. How do I chunk and load all of these files into a dataframe? The memory allocated to the Spark cluster is 6 GB.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark.sql import SparkSession
import pandas

# SparkConf().set("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')
sqlContext = SQLContext(sc)

df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")
Error:

Py4JJavaError: An error occurred while calling o33.parquet.
: java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:344)
Spark can read Parquet files from Amazon S3 straight into a DataFrame: similar to write, DataFrameReader provides a parquet() function (spark.read.parquet) that reads the Parquet files from an Amazon S3 bucket and creates a Spark DataFrame.
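A minimal sketch of that call (the bucket path is a placeholder, and it assumes the S3A connector and credentials discussed below are already configured). Pointing the reader at the folder is enough; there is no need to chunk the 20+ GB manually, because Spark splits the scan across partitions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-from-s3").getOrCreate()

# Reads every Parquet part file under the folder as one DataFrame;
# the data is processed partition by partition, not loaded at once.
df = spark.read.parquet("s3a://your-bucket/path/to/parquet/")  # placeholder path

df.printSchema()
print(df.count())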
Accessing an S3 bucket through Spark: it is not enough to add only the Spark core dependencies to your project and call spark.read to read your data from an S3 bucket. You also have to add the AWS Java SDK along with the Hadoop-AWS package to your spark-shell or pyspark session, as shown in the command below.
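A sketch of that setup. The versions here are assumptions: hadoop-aws must match the Hadoop version bundled with your Spark build, and aws-java-sdk must match that hadoop-aws release.

# From the shell, for example:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4
#
# Or request the same packages from Python, before any SparkSession exists:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-setup")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")  # versions are assumptions
    .getOrCreate()
)

The hadoop-aws package supplies the S3A connector, so the paths should use the s3a:// scheme; a plain s3:// URI has no filesystem registered in a stock Apache Hadoop build, which is exactly the "No FileSystem for scheme: s3" error above.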
You have to use SparkSession instead of SQLContext since Spark 2.0:
spark = (
    SparkSession.builder
    .master("local")
    .appName("app name")
    .config("spark.some.config.option", True)
    .getOrCreate()
)

df = spark.read.parquet("s3://path/to/parquet/file.parquet")
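That snippet alone will still hit the "No FileSystem for scheme: s3" error from the question unless an S3 connector is on the classpath. An end-to-end sketch, assuming hadoop-aws and aws-java-sdk were added as shown above, with placeholder credentials and path:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-parquet-from-s3")
    # assumes hadoop-aws/aws-java-sdk are on the classpath (--packages or spark.jars.packages)
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder
    .getOrCreate()
)

# Read every Parquet part file under the folder; Spark plans the scan
# across partitions, so the data does not have to fit in 6 GB at once.
df = spark.read.parquet("s3a://your-bucket/path/to/parquet/")      # placeholder path
df.show(5)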