I'm setting up a AWS GLUE job for my customers. Their files are excel with xls/xlsx extension and have multiple sheets and they don't want to do any convert job before uploading. How do I extract data from xls/xlsx file directly or can GLUE convert xls/xlsx file to csv file?
xls/xlsx is not supported in GLUE as of now. The easiest way is to write a python shell job in glue to convert excel to csv and then run crawler over it. Below is sample in python
import xlrd
import csv
def csv_from_excel():
wb = xlrd.open_workbook('your_workbook.xls')
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open('your_csv_file.csv', 'wb')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in xrange(sh.nrows):
wr.writerow(sh.row_values(rownum))
your_csv_file.close()
Excel read is possible in glue version 2 not in glue version 3.
Steps:
code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()
job.commit()
Save script
Goto Action - Edit Job - Select Glue version2 and set key value under security configuration
key : --additional-python-modules
value : pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4
Save and run the job
It will resolve your error and you will be able to read the excel file
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With