Here is a code snippet for reading CSV files (Scala):
val input = glueContext
.getCatalogSource(database = "my_database", tableName = "my_table")
.getDynamicFrame()
It failed with an unclear error:
com.amazonaws.services.glue.util.FatalException: Unable to parse file: my_file_20170101.csv.gz
at com.amazonaws.services.glue.readers.JacksonReader.hasNextFailSafe(JacksonReader.scala:91)
at com.amazonaws.services.glue.readers.JacksonReader.hasNext(JacksonReader.scala:36)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:63)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The code works for other CSV files, but this one uses ANSI
encoding. Is there a way to tell Glue (or maybe the Spark internals) to read the file with a different encoding?
From the AWS Glue documentation: compressed CSV, JSON, ORC, and Parquet files are supported, but CSV and JSON files must include the compression codec as the file extension. If you are importing a folder, all files in the folder must be of the same file type.
Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: Amazon S3 connection.
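As an illustration, reading CSV files from S3 with these options might look like the sketch below (the S3 path, separator, and header flag are placeholders, not values from the question):

# Hypothetical sketch: read CSV files from S3 into a DynamicFrame.
# The "paths" value, separator, and header flag are placeholder values.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","})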
AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), or Amazon Kinesis message stream. We expect streams to present data in a consistent format, so they are read in as DataFrames.
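For streaming sources, a job typically pulls the stream in as a DataFrame via the Data Catalog; a minimal sketch, where the database and table names are made up for illustration:

# Hypothetical sketch: read a streaming source as a DataFrame.
# The database and table names are placeholders.
streaming_df = glueContext.create_data_frame.from_catalog(
    database="my_streaming_db",
    table_name="my_kafka_table",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})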
You can use the underlying Spark functionality to import a Spark DataFrame from a non-UTF-8 file (I used Python, as below):
# imports
from pyspark.context import SparkContext
from awsglue.context import GlueContext
...
# set contexts
glueContext = GlueContext(SparkContext.getOrCreate())
....
# import file
df = glueContext.read.load(my_file,
                           format="csv",
                           sep="|",
                           header="true",
                           encoding="my_encoding")
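If the rest of your job expects a DynamicFrame rather than a plain DataFrame, you can convert it afterwards (the frame name "df_converted" here is arbitrary):

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "df_converted")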