Handling embedded new lines when creating/selecting External Tables in SQL Data Warehouse

In SQL Data Warehouse (editors, please don't change this; it is the actual name, see: here) I have an external table, JobCandidate_ext, that looks like this:

CREATE EXTERNAL TABLE [HumanResources].[JobCandidate_ext](
    [JobCandidateID] int,
    [BusinessEntityID] int,
    [Resume] Varchar(8000),
    [ModifiedDate] Datetime
)
WITH (
    LOCATION='/[HumanResources].[JobCandidate]/data.txt',
    DATA_SOURCE=AzureStorage,
    FILE_FORMAT=TextFile)
GO

The column [Resume] was an XML type in SQL Server but in SQL Data Warehouse XML types should be converted to varchar(8000) as described here.

I export the data to a flat file, data.txt, upload it to a blob, and then create an external table over it.

The [Resume] column has carriage returns in it (as expected from an XML file), and so when you run a SELECT * FROM [HumanResources].[JobCandidate_ext] you get an error. In this case:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 2 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 0, Expected data type: INT, Offending value: some text .... (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

I know that I cannot configure a row delimiter when creating external tables as described here.

The row delimiter must be UTF-8 and supported by Hadoop’s LineRecordReader. The row delimiter must be either '\r', '\n', or '\r\n'. These are not user-configurable.
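
To illustrate why the row gets rejected, here is a minimal Python sketch of what a purely line-based reader does with an embedded line break. The pipe field delimiter and the sample data are assumptions made for the illustration; this is not PolyBase's actual implementation.

```python
# Hypothetical two-row export: '|' as the field delimiter (an assumption),
# '\n' as the row delimiter. The first row's second field contains an
# embedded newline, as an XML resume typically would.
raw = "1|<Resume>\nsome text</Resume>\n2|<Resume>ok</Resume>\n"


def split_like_line_reader(text):
    """Split on line ends only, ignoring field boundaries, then split fields."""
    return [line.split("|") for line in text.splitlines()]


records = split_like_line_reader(raw)
for fields in records:
    print(fields)
```

The two logical rows come out as three records, and the second record starts with the tail of the resume text rather than an integer, which matches the "Expected data type: INT" conversion error on column ordinal 0.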

If you instead wrap each column field in quotes, selecting rows from the external table fails with a different error, No closing string delimiter:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 2, Expected data type: VARCHAR(8000) collate SQL_Latin1_General_CP1_CI_AS, Offending value: 'ShaiBassli (Tokenization failed), Error: No closing string delimiter.

Is there a way to get around this issue?

santiaago asked Mar 21 '16 16:03


1 Answer

Today, PolyBase does not allow row or field delimiters inside fields; that is, it gives you no way to escape these characters. As Greg pointed out, you can vote for this functionality here: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-line-ends-within-qualified-text-f

To work around this limitation, you can either pre-process the data (using sed or tr, for example) to replace unwanted characters before reading it with PolyBase, or switch to one of the other PolyBase-supported file formats (RCFile, ORC, or Parquet) to avoid dealing with row and field delimiters altogether.
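
As a rough sketch of the pre-processing route, the Python below replaces line breaks inside fields with a space so that every record sits on one line. It assumes the export was written with double-quoted fields and a pipe delimiter (both assumptions, since the TextFile format definition is not shown), which lets the standard `csv` module tell embedded line breaks from real row ends. The function name `flatten_embedded_newlines` is hypothetical.

```python
import csv
import io


def flatten_embedded_newlines(src, dst, delimiter="|", replacement=" "):
    """Copy quoted, delimited rows from src to dst, replacing line breaks
    inside fields so that each record occupies exactly one line."""
    reader = csv.reader(src, delimiter=delimiter, quotechar='"')
    writer = csv.writer(dst, delimiter=delimiter, quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\r", replacement)
                 .replace("\n", replacement)
            for field in row
        ]
        writer.writerow(cleaned)


# Usage with in-memory sample data (hypothetical rows, not the real export).
sample = (
    '1|7|"<Resume>\nsome text</Resume>"|2016-03-21\n'
    '2|8|"<Resume>ok</Resume>"|2016-03-21\n'
)
out = io.StringIO()
flatten_embedded_newlines(io.StringIO(sample), out)
print(out.getvalue())
```

For real files, open both with `newline=""` as the `csv` documentation recommends. Note that the writer still quotes any field that contains the delimiter or a quote character, so check that against the string delimiter settings of your external file format. Replacing line breaks with a space also changes the stored XML text slightly, which may or may not matter for your data.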

Sahaj Saini - MSFT answered Sep 23 '22 02:09