
Reading parquet files in AWS Glue

I'm an AWS Glue newbie trying to read some Parquet objects I have in S3, but I'm failing with a ClassNotFoundException. This is my attempt so far, based on the minimal Glue documentation:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession

val gc: GlueContext = new GlueContext(sc)
val spark_session: SparkSession = gc.getSparkSession

val source = gc.getSource("s3", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))
val parquetSource = source.withFormat("parquet")
parquetSource.getDynamicFrame().show(1)

And the exception:

18/06/11 13:39:11 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 266, ip-172-31-8-179.eu-west-1.compute.internal, executor 16): java.lang.ClassNotFoundException: Failed to load format with name parquet
    at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:28)
    at com.amazonaws.services.glue.util.ClassUtils$.getClassByName(ClassUtils.scala:43)
    at com.amazonaws.services.glue.util.ClassUtils$.newInstanceByName(ClassUtils.scala:54)
    at com.amazonaws.services.glue.readers.DynamicRecordStreamReader$.apply(DynamicRecordReader.scala:187)
    ...
Caused by: java.lang.ClassNotFoundException: parquet
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:82)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at com.amazonaws.services.glue.util.ClassUtils$$anonfun$1.apply(ClassUtils.scala:25)
    at com.amazonaws.services.glue.util.ClassUtils$$anonfun$1.apply(ClassUtils.scala:25)
    at scala.util.Try$.apply(Try.scala:192)
    at com.amazonaws.services.glue.util.ClassUtils$.loadByFullName(ClassUtils.scala:25)
    ... 28 more
asked Jun 11 '18 by selle



1 Answer

I solved the issue: I had passed the wrong connectionType to getSource. It should be "parquet", not "s3":

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.SparkSession

val gc: GlueContext = new GlueContext(sc)
val spark_session: SparkSession = gc.getSparkSession

val source = gc.getSource("parquet", JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))))
source.getDynamicFrame().show(1)

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-parquet
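For reference, the same read can also be expressed with GlueContext's getSourceWithFormat, which takes the connection type ("s3") and the data format ("parquet") as separate arguments, so the two are harder to mix up. This is a sketch based on my reading of the Glue Scala API docs (it needs a Glue job environment to run, and the S3 path is the same placeholder as above):

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions

// Sketch: connection type and format are separate parameters here,
// so "s3" stays the connection type and "parquet" stays the format.
val gc: GlueContext = new GlueContext(sc)

val frame = gc.getSourceWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("paths" -> Set("s3://path-to-parquet"))),
  format = "parquet"
).getDynamicFrame()

frame.show(1)
```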

Hopefully this helps somebody!

answered Sep 22 '22 by selle