 

How to read files with .xlsx and .xls extension in Azure data factory?

I am trying to read an Excel file with the .xlsx extension from Azure Blob Storage in my Azure Data Factory dataset. It throws the following error:

Error found when processing 'Csv/Tsv Format Text' source 'Filename.xlsx' with row number 3: found more columns than expected column count: 1.

What are the correct column and row delimiters for Excel files to be read in Azure Data Factory?

asked Sep 26 '18 by vikas shivakumar

People also ask

Can Azure Data Factory read Excel file?

The service supports both ".xls" and ".xlsx". Excel format is supported for the following connectors: Amazon S3, Amazon S3 Compatible Storage, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure Files, File System, FTP, Google Cloud Storage, HDFS, HTTP, Oracle Cloud Storage and SFTP.

How does Azure read Excel files?

In Azure SQL Database, you cannot import directly from Excel. You must first export the data to a text (CSV) file. Before you can run a distributed query, you have to enable the ad hoc distributed queries server configuration option.

How do I view XLSX files?

An XLSX file is a Microsoft Excel Open XML Format Spreadsheet file. Open one with Excel, Excel Viewer, Google Sheets, or another spreadsheet program.


1 Answer

Update March 2022: ADF now has better support for Excel via Mapping Data Flows:

https://docs.microsoft.com/en-us/azure/data-factory/format-excel
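For reference, an Excel dataset definition in ADF now looks roughly like the sketch below, based on the format-excel docs. The linked service name, container, file name and sheet name are placeholders for your own values:

```json
{
    "name": "ExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",
                "fileName": "Filename.xlsx"
            },
            "sheetName": "Sheet1",
            "firstRowAsHeader": true
        }
    }
}
```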

Excel files have a proprietary format and are not simple delimited files. As indicated here, Azure Data Factory does not have a direct option to import Excel files, eg you cannot create a Linked Service to an Excel file and read it easily. Your options are:

  1. Export or convert the data to flat files before transfer to the cloud, eg .csv, tab-delimited or pipe-delimited, which are far easier to read than Excel files. This is your simplest option, although it obviously requires a change in process.
  2. Try shredding the XML - create a custom task to open the Excel file as XML and extract your data as suggested here.
  3. SSIS packages are now supported in Azure Data Factory (with the Execute SSIS Package activity) and have better support for Excel files, eg a Connection Manager. So it may be an option to create an SSIS package to deal with the Excel file and host it in ADFv2. Warning: I have not tested this; I am only speculating that it is possible. There is also the overhead of creating an Integration Runtime (IR) for running SSIS in ADFv2.
  4. Try some other custom activity, eg there is a custom U-SQL Extractor for shredding XML on github here.
  5. Try reading the Excel file using Databricks; some examples here, although spinning up a Spark cluster to read a few Excel files does seem somewhat overkill. This might be a good option if Spark is already in your architecture.
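To illustrate option 2 (shredding the XML): a .xlsx file is just a ZIP archive of XML parts, so sheet data can be pulled out with nothing but standard libraries. A minimal Python sketch follows. Note the file written here contains only a bare sheet part with inline strings so the example is self-contained; real workbooks also carry a [Content_Types].xml, a workbook part and usually a shared-strings table, so in practice cell values may live in xl/sharedStrings.xml rather than inline:

```python
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

# A bare-bones worksheet part using inline strings (t="inlineStr").
SHEET_XML = """<?xml version="1.0"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="inlineStr"><is><t>name</t></is></c>
      <c r="B1" t="inlineStr"><is><t>city</t></is></c>
    </row>
    <row r="2">
      <c r="A2" t="inlineStr"><is><t>Alice</t></is></c>
      <c r="B2" t="inlineStr"><is><t>Leeds</t></is></c>
    </row>
  </sheetData>
</worksheet>"""

def write_minimal_xlsx(path):
    # An .xlsx is a ZIP container; write just the first sheet part.
    with zipfile.ZipFile(path, "w") as z:
        z.writestr("xl/worksheets/sheet1.xml", SHEET_XML)

def read_rows(path):
    # Open the .xlsx as a ZIP and extract cell text from the sheet XML.
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("xl/worksheets/sheet1.xml"))
    return [[t.text for t in row.iter(NS + "t")]
            for row in root.iter(NS + "row")]

write_minimal_xlsx("demo.xlsx")
print(read_rows("demo.xlsx"))  # [['name', 'city'], ['Alice', 'Leeds']]
```

A custom activity built on this idea would read the real sheet parts (and resolve shared strings) before emitting flat rows for ADF to consume.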

Let us know how you get on.

answered Nov 14 '22 by wBob