Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS GLUE import xls/xlsx file

I'm setting up a AWS GLUE job for my customers. Their files are excel with xls/xlsx extension and have multiple sheets and they don't want to do any convert job before uploading. How do I extract data from xls/xlsx file directly or can GLUE convert xls/xlsx file to csv file?

like image 545
Lawrence Avatar asked Mar 03 '23 14:03

Lawrence


2 Answers

xls/xlsx is not supported in GLUE as of now. The easiest way is to write a python shell job in glue to convert excel to csv and then run crawler over it. Below is sample in python

 import xlrd
 import csv

 def csv_from_excel():

    wb = xlrd.open_workbook('your_workbook.xls')
    sh = wb.sheet_by_name('Sheet1')
    your_csv_file = open('your_csv_file.csv', 'wb')
    wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)

    for rownum in xrange(sh.nrows):
        wr.writerow(sh.row_values(rownum))

    your_csv_file.close()
like image 96
Krunal Barot Avatar answered Mar 11 '23 20:03

Krunal Barot


Excel read is possible in glue version 2 not in glue version 3.

Steps:

  1. Go to glue job and edit script with below code

code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


excel_path= r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path,sheet_name = "Sheet1")
df=df_xl_op.applymap(str)
input_df = spark.createDataFrame(df)
input_df.printSchema()

job.commit()
  1. Save script

  2. Goto Action - Edit Job - Select Glue version2 and set key value under security configuration

    key : --additional-python-modules
    value : pandas==1.2.4,xlrd==1.2.0,numpy==1.20.1,fsspec==0.7.4

  3. Save and run the job

It will resolve your error and you will be able to read the excel file

like image 40
SparkDataEng Avatar answered Mar 11 '23 18:03

SparkDataEng