AWS GLUE import xls/xlsx file

Question

I'm setting up a AWS GLUE job for my customers. Their files are excel with xls/xlsx extension and have multiple sheets and they don't want to do any convert job before uploading. How do I extract data from xls/xlsx file directly or can GLUE convert xls/xlsx file to csv file?

Krunal Barot · Accepted Answer

xls/xlsx is not supported in GLUE as of now. The easiest way is to write a python shell job in glue to convert excel to csv and then run crawler over it. Below is sample in python

 import xlrd
 import csv

 def csv_from_excel():

    wb = xlrd.open_workbook('your_workbook.xls')
    sh = wb.sheet_by_name('Sheet1')
    your_csv_file = open('your_csv_file.csv', 'wb')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)

    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))

    your_csv_file.close()

SparkDataEng · Answer

Excel read is possible in glue version 2 not in glue version 3.

Steps:

Go to glue job and edit script with below code

code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()

job.commit()

Save script
Goto Action - Edit Job - Select Glue version2 and set key value under security configuration

key : --additional-python-modules
value : pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4
Save and run the job

It will resolve your error and you will be able to read the excel file

AWS GLUE import xls/xlsx file

Tags:

amazon-web-services

aws-glue

Lawrence

2 Answers

Krunal Barot

SparkDataEng

Recent Activity

Donate For Us

AWS GLUE import xls/xlsx file

Tags:

amazon-web-services

aws-glue

Lawrence

2 Answers

Krunal Barot

SparkDataEng

Related questions

Recent Activity

Donate For Us