Here is a code snippet for reading CSV files (Scala):
val input = glueContext
.getCatalogSource(database = "my_database", tableName = "my_table")
.getDynamicFrame()
It failed with an unclear error:
com.amazonaws.services.glue.util.FatalException: Unable to parse file: my_file_20170101.csv.gz
at com.amazonaws.services.glue.readers.JacksonReader.hasNextFailSafe(JacksonReader.scala:91)
at com.amazonaws.services.glue.readers.JacksonReader.hasNext(JacksonReader.scala:36)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.nextKeyValue(TapeHadoopRecordReader.scala:63)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The code works for other CSV files, but this one uses ANSI
encoding. Is there a way to tell Glue (or maybe the Spark internals) to read the file with a different encoding?
From the AWS Glue documentation: compressed CSV, JSON, ORC, and Parquet files are supported, but CSV and JSON files must include the compression codec as the file extension. If you are importing a folder, all files in the folder must be of the same file type.
Configuration: In your function options, specify format="csv". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in connection_options. For details, see Connection types and options for ETL in AWS Glue: Amazon S3 connection.
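As an illustration, reading CSV files from S3 with these options might look like the sketch below (the S3 path, separator, and header flag are placeholders, not values from the question):

# Hypothetical sketch: read CSV files from S3 into a DynamicFrame.
# The "paths" value, separator, and header flag are placeholder values.
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","})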
AWS Glue can recognize and interpret this data format from an Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), or Amazon Kinesis message stream. We expect streams to present data in a consistent format, so they are read in as DataFrames.
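For streaming sources, a job typically pulls the stream in as a DataFrame via the Data Catalog; a minimal sketch, where the database and table names are made up for illustration:

# Hypothetical sketch: read a streaming source as a DataFrame.
# The database and table names are placeholders.
streaming_df = glueContext.create_data_frame.from_catalog(
    database="my_streaming_db",
    table_name="my_kafka_table",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})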
You can use the underlying Spark functionality to import a Spark DataFrame from a non-UTF-8 file (I used Python, as below):
# imports
from pyspark.context import SparkContext
from awsglue.context import GlueContext
...
# set contexts
glueContext = GlueContext(SparkContext.getOrCreate())
....
# import file
df = glueContext.read.load(my_file,
                           format="csv",
                           sep="|",
                           header="true",
                           encoding="my_encoding")
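If the rest of your job expects a DynamicFrame rather than a plain DataFrame, you can convert it afterwards (the frame name "df_converted" here is arbitrary):

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back into a Glue DynamicFrame
dyf = DynamicFrame.fromDF(df, glueContext, "df_converted")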