
How to read a table with non-UTF-8 encoding in AWS Glue?

Here is a code snippet for reading CSV files (Scala):

val input = glueContext
  .getCatalogSource(database = "my_database", tableName = "my_table")
  .getDynamicFrame()

This fails with an unclear error:

com.amazonaws.services.glue.util.FatalException: Unable to parse file: my_file_20170101.csv.gz
at com.amazonaws.services.glue.readers.JacksonReader.hasNextFailSafe(JacksonReader.scala:91)
at com.amazonaws.services.glue.readers.JacksonReader.hasNext(JacksonReader.scala:36)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:63)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The code works for other CSV files, but this one has ANSI encoding. Is there a way to tell Glue (or maybe the Spark internals) to read the file with a different encoding?
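The failure itself is an encoding mismatch: cp1252 ("ANSI") bytes for accented characters are not valid UTF-8, so a reader that assumes UTF-8 trips over them. A minimal Python sketch (the sample string is illustrative, not from the actual file):

```python
# cp1252 ("ANSI") bytes for accented characters are not valid UTF-8
data = "Münich|café".encode("cp1252")

try:
    data.decode("utf-8")          # roughly what a UTF-8-only reader attempts
except UnicodeDecodeError:
    print("not valid UTF-8")      # so the parse fails, as in the stack trace above

print(data.decode("cp1252"))      # decoding with the right codec works
```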

Cherry asked Jan 16 '18

People also ask

What data formats does AWS Glue support?

Compressed CSV, JSON, ORC, and Parquet files are supported, but CSV and JSON files must include the compression codec as the file extension. If you are importing a folder, all files in the folder must be of the same file type.

How do I read a csv file in AWS Glue?

Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify your S3 path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: Amazon S3 connection.
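The options described above can be sketched as a small helper that builds the two dicts `glueContext.create_dynamic_frame.from_options` expects (a sketch; the helper name and the `s3://my-bucket/...` path are hypothetical, not from the original):

```python
def build_csv_options(paths, separator=",", with_header=True):
    """Build the connection_options / format_options dicts for reading
    CSV from S3 with glueContext.create_dynamic_frame.from_options."""
    connection_options = {"paths": paths}                       # list of S3 prefixes
    format_options = {"separator": separator, "withHeader": with_header}
    return connection_options, format_options

conn_opts, fmt_opts = build_csv_options(["s3://my-bucket/my-prefix/"])

# Inside a Glue job (glueContext setup omitted) you would then call:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=conn_opts,
#     format="csv",
#     format_options=fmt_opts,
# )
```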

Which formats does AWS Glue accept as a target in the current version?

AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka, or Amazon Kinesis message stream. Streams are expected to present data in a consistent format, so they are read in as DataFrames.


1 Answer

You can use the underlying Spark functionality to load a Spark DataFrame from a non-UTF-8 file (I used Python, as below):

# imports
from pyspark.context import SparkContext
from awsglue.context import GlueContext

...

# set contexts
glueContext = GlueContext(SparkContext.getOrCreate())

...

# import file
df = glueContext.read.load(my_file,
                           format="csv",
                           sep="|",
                           header="true",
                           encoding='my_encoding')
fmcmac answered Sep 29 '22