Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

first data row skipped in import

I'm using a XML format file to import a CSV file, and the first data row is getting skipped. I can't figure out why.

Format file

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' />
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='\n' />
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="COLUMN1" xsi:type="SQLVARYCHAR" />
  <COLUMN SOURCE="2" NAME="COLUMN2" xsi:type="SQLVARYCHAR" />
 </ROW>
</BCPFORMAT>

CSV

COLUMN1,COLUMN2
"ABC","ABC123456"
"TNT","TNT123456"

Query

SELECT *
FROM OPENROWSET(BULK 'C:\sample.csv',
FORMATFILE='C:\sample.xml',
FIRSTROW = 2) AS a

Result

COLUMN1 COLUMN2
------- ----------
"TNT    TNT123456"

(1 row(s) affected)

If FIRSTROW is changed to 1, the result becomes:

COLUMN1                COLUMN2
---------------------  ----------
COLUMN1,COLUMN2  "ABC  ABC123456" 
"TNT                   TNT123456"

If the header row is removed from the CSV and FIRSTROW is changed to 1, the result returns as expected:

COLUMN1 COLUMN2
------- ----------
"ABC    ABC123456" 
"TNT    TNT123456"

Since this is an automated report that is delivered with headers, are there any other options to remedy this?

like image 611
Kermit Avatar asked Sep 11 '13 20:09

Kermit


2 Answers

There are a couple of problems here:

  1. I suspect there isn't a valid \n on the first line. Otherwise SQL Server wouldn't munge the first two rows when you change to FIRSTROW = 1.

  2. Using "," as the column delimiter works great for all of the columns except the first and the last column. This leaves a leading " on the first column, and a trailing " on the last column. You can deal with the latter by changing your ROWTERMINATOR to "\n, but that will only work if you can also add a trailing " to the header row (during the process of ensuring that there is a valid \n there). At that point you may as well make sure that the header row matches the data rows in all aspects, so:

    "COLUMN1","COLUMN2"
    -------------------^ this character has to be \n
    

In all honesty, I think you could spend a week fighting with all of these nitty-gritty BCP and BULK INSERT issues, and still not have a perfect solution that doesn't require post-op actions (such as trimming leading/trailing " characters from certain columns). My recommendation: spend 20 minutes and write a parser in C# that will automatically correct these files - removing the header row, ensuring the right delimiters are in place, removing all the stupid ", etc. before SQL Server ever sees the file. Cleaning the file up will be a lot less hassle than the hoops you're jumping through now. I'm sure there are solutions to this but IIRC you've been wrestling with it for quite some time...

like image 150
Aaron Bertrand Avatar answered Nov 20 '22 17:11

Aaron Bertrand


The terminator of the first field should be ',' only, not '","'.

Replace with the following line, it's gonna work:

<FIELD ID="1" xsi:type="CharTerm" TERMINATOR=',' />

Here's what happens with your original file format...

The first columns is terminating with : ","... That means, SQL server parses the first line, then reads the seconds line and get the first field:

COLUMN1,COLUMN2  "ABC

The it continues to read and get the second field (remember, we are still on the second line of the file) :

ABC123456"

It now has the first row...

It then reads the next row:

"TNT                   TNT123456"

So when you skip first row, it indeeds skips the first row because your first line is not using quotes...

Hope that helps..

like image 2
Baral Avatar answered Nov 20 '22 18:11

Baral