Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server Bulk insert of CSV file with inconsistent quotes

Is it possible to BULK INSERT (SQL Server) a CSV file in which the fields are only OCCASSIONALLY surrounded by quotes? Specifically, quotes only surround those fields that contain a ",".

In other words, I have data that looks like this (the first row contain headers):

id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253

Because the quotes aren't consistent, I can't use '","' as a delimiter, and I don't know how to create a format file that accounts for this.

I tried using ',' as a delimter and loading it into a temporary table where every column is a varchar, then using some kludgy processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.

Unfortunately, I don't have the ability to manipulate the CSV file beforehand.

Is this hopeless?

Many thanks in advance for any advice.

By the way, i saw this post SQL bulk import from csv, but in that case, EVERY field was consistently wrapped in quotes. So, in that case, he could use ',' as a delimiter, then strip out the quotes afterwards.

like image 873
mattstuehler Avatar asked Apr 23 '09 15:04

mattstuehler


4 Answers

It isn't possible to do a bulk insert for this file, from MSDN:

To be usable as a data file for bulk import, a CSV file must comply with the following restrictions:

  • Data fields never contain the field terminator.
  • Either none or all of the values in a data field are enclosed in quotation marks ("").

(http://msdn.microsoft.com/en-us/library/ms188609.aspx)

Some simple text processing should be all that's required to get the file ready for import. Alternatively your users could be required to either format the file according to the se guidelines or use something other than a comma as a delimiter (e.g |)

like image 167
Macros Avatar answered Nov 05 '22 08:11

Macros


You are going to need to preprocess the file, period.

If you really really need to do this, here is the code. I wrote this because I absolutely had no choice. It is utility code and I'm not proud of it, but it works. The approach is not to get SQL to understand quoted fields, but instead manipulate the file to use an entirely different delimiter.

EDIT: Here is the code in a github repo. It's been improved and now comes with unit tests! https://github.com/chrisclark/Redelim-it

This function takes an input file and will replace all field-delimiting commas (NOT commas inside quoted-text fields, just the actual delimiting ones) with a new delimiter. You can then tell sql server to use the new field delimiter instead of a comma. In the version of the function here, the placeholder is <TMP> (I feel confident this will not appear in the original csv - if it does, brace for explosions).

Therefore after running this function you import in sql by doing something like:

BULK INSERT MyTable
FROM 'C:\FileCreatedFromThisFunction.csv'
WITH
(
FIELDTERMINATOR = '<*TMP*>',
ROWTERMINATOR = '\n'
)

And without further ado, the terrible, awful function that I apologize in advance for inflicting on you (edit - I've posted a working program that does this instead of just the function on my blog here):

Private Function CsvToOtherDelimiter(ByVal InputFile As String, ByVal OutputFile As String) As Integer

        Dim PH1 As String = "<*TMP*>"

        Dim objReader As StreamReader = Nothing
        Dim count As Integer = 0 'This will also serve as a primary key'
        Dim sb As New System.Text.StringBuilder

        Try
            objReader = New StreamReader(File.OpenRead(InputFile), System.Text.Encoding.Default)
        Catch ex As Exception
            UpdateStatus(ex.Message)
        End Try

        If objReader Is Nothing Then
            UpdateStatus("Invalid file: " & InputFile)
            count = -1
            Exit Function
        End If

        'grab the first line
    Dim line = reader.ReadLine()
    'and advance to the next line b/c the first line is column headings
    If hasHeaders Then
        line = Trim(reader.ReadLine)
    End If

    While Not String.IsNullOrEmpty(line) 'loop through each line

        count += 1

        'Replace commas with our custom-made delimiter
        line = line.Replace(",", ph1)

        'Find a quoted part of the line, which could legitimately contain commas.
        'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
        Dim starti = line.IndexOf(ph1 & """", 0)
        If line.IndexOf("""",0) = 0 then starti=0

        While starti > -1 'loop through quoted fields

            Dim FieldTerminatorFound As Boolean = False

            'Find end quote token (originally  a ",)
            Dim endi As Integer = line.IndexOf("""" & ph1, starti)

            If endi < 0 Then
                FieldTerminatorFound = True
                If endi < 0 Then endi = line.Length - 1
            End If

            While Not FieldTerminatorFound

                'Find any more quotes that are part of that sequence, if any
                Dim backChar As String = """" 'thats one quote
                Dim quoteCount = 0
                While backChar = """"
                    quoteCount += 1
                    backChar = line.Chars(endi - quoteCount)
                End While

                If quoteCount Mod 2 = 1 Then 'odd number of quotes. real field terminator
                    FieldTerminatorFound = True
                Else 'keep looking
                    endi = line.IndexOf("""" & ph1, endi + 1)
                End If
            End While

            'Grab the quoted field from the line, now that we have the start and ending indices
            Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)

            'And swap the commas back in
            line = line.Replace(source, source.Replace(ph1, ","))

            'Find the next quoted field
            '                If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
            starti = line.IndexOf(ph1 & """", starti + ph1.Length)

        End While

            line = objReader.ReadLine

        End While

        objReader.Close()

        SaveTextToFile(sb.ToString, OutputFile)

        Return count

    End Function
like image 34
5 revs Avatar answered Nov 05 '22 09:11

5 revs


I found the answer by Chris very helpful, but I wanted to run it from within SQL Server using T-SQL (and not using CLR), so I converted his code to T-SQL code. But then I took it one step further by wrapping everything up in a stored procedure that did the following:

  1. use bulk insert to initially import the CSV file
  2. clean up the lines using Chris's code
  3. return the results in a table format

For my needs, I further cleaned up the lines by removing quotes around values and converting two double quotes to one double quote (I think that's the correct method).

CREATE PROCEDURE SSP_CSVToTable

-- Add the parameters for the stored procedure here
@InputFile nvarchar(4000)
, @FirstLine int

AS

BEGIN

-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;

--convert the CSV file to a table
--clean up the lines so that commas are handles correctly

DECLARE @sql nvarchar(4000)
DECLARE @PH1 nvarchar(50)
DECLARE @LINECOUNT int -- This will also serve as a primary key
DECLARE @CURLINE int
DECLARE @Line nvarchar(4000)
DECLARE @starti int
DECLARE @endi int
DECLARE @FieldTerminatorFound bit
DECLARE @backChar nvarchar(4000)
DECLARE @quoteCount int
DECLARE @source nvarchar(4000)
DECLARE @COLCOUNT int
DECLARE @CURCOL int
DECLARE @ColVal nvarchar(4000)

-- new delimiter
SET @PH1 = '†'

-- create single column table to hold each line of file
CREATE TABLE [#CSVLine]([line] nvarchar(4000))

-- bulk insert into temp table
-- cannot use variable path with bulk insert
-- so we must run using dynamic sql
SET @Sql = 'BULK INSERT #CSVLine
FROM ''' + @InputFile + '''
WITH
(
FIRSTROW=' + CAST(@FirstLine as varchar) + ',
FIELDTERMINATOR = ''\n'',
ROWTERMINATOR = ''\n''
)'

-- run dynamic statement to populate temp table
EXEC(@sql)

-- get number of lines in table
SET @LINECOUNT = @@ROWCOUNT

-- add identity column to table so that we can loop through it
ALTER TABLE [#CSVLine] ADD [RowId] [int] IDENTITY(1,1) NOT NULL

IF @LINECOUNT > 0
BEGIN
    -- cycle through each line, cleaning each line
    SET @CURLINE = 1
    WHILE @CURLINE <= @LINECOUNT
    BEGIN
        -- get current line
        SELECT @line = line
          FROM #CSVLine
         WHERE [RowId] = @CURLINE

        -- Replace commas with our custom-made delimiter
        SET @Line = REPLACE(@Line, ',', @PH1)

        -- Find a quoted part of the line, which could legitimately contain commas.
        -- In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
        SET @starti = CHARINDEX(@PH1 + '"' ,@Line, 0)
        If CHARINDEX('"', @Line, 0) = 0 SET @starti = 0

        -- loop through quoted fields
        WHILE @starti > 0 
        BEGIN
            SET @FieldTerminatorFound = 0

            -- Find end quote token (originally  a ",)
            SET @endi = CHARINDEX('"' + @PH1, @Line, @starti)  -- sLine.IndexOf("""" & PH1, starti)

            IF @endi < 1
            BEGIN
                SET @FieldTerminatorFound = 1
                If @endi < 1 SET @endi = LEN(@Line) - 1
            END

            WHILE @FieldTerminatorFound = 0
            BEGIN
                -- Find any more quotes that are part of that sequence, if any
                SET @backChar = '"' -- thats one quote
                SET @quoteCount = 0

                WHILE @backChar = '"'
                BEGIN
                    SET @quoteCount = @quoteCount + 1
                    SET @backChar = SUBSTRING(@Line, @endi-@quoteCount, 1) -- sLine.Chars(endi - quoteCount)
                END

                IF (@quoteCount % 2) = 1
                BEGIN
                    -- odd number of quotes. real field terminator
                    SET @FieldTerminatorFound = 1
                END
                ELSE 
                BEGIN
                    -- keep looking
                    SET @endi = CHARINDEX('"' + @PH1, @Line, @endi + 1) -- sLine.IndexOf("""" & PH1, endi + 1)
                END

            END

            -- Grab the quoted field from the line, now that we have the start and ending indices
            SET @source = SUBSTRING(@Line, @starti + LEN(@PH1), @endi - @starti - LEN(@PH1) + 1) 
            -- sLine.Substring(starti + PH1.Length, endi - starti - PH1.Length + 1)

            -- And swap the commas back in
            SET @Line = REPLACE(@Line, @source, REPLACE(@source, @PH1, ','))
            --sLine.Replace(source, source.Replace(PH1, ","))

            -- Find the next quoted field
            -- If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
            SET @starti = CHARINDEX(@PH1 + '"', @Line, @starti + LEN(@PH1))
            --sLine.IndexOf(PH1 & """", starti + PH1.Length)

        END

        -- get table based on current line
        IF OBJECT_ID('tempdb..#Line') IS NOT NULL
            DROP TABLE #Line

        -- converts a delimited list into a table
        SELECT *
        INTO #Line
        FROM dbo.iter_charlist_to_table(@Line,@PH1)

        -- get number of columns in line
        SET @COLCOUNT = @@ROWCOUNT

        -- dynamically create CSV temp table to hold CSV columns and lines
        -- only need to create once
        IF OBJECT_ID('tempdb..#CSV') IS NULL
        BEGIN
            -- create initial structure of CSV table
            CREATE TABLE [#CSV]([Col1] nvarchar(100))

            -- dynamically add a column for each column found in the first line
            SET @CURCOL = 1
            WHILE @CURCOL <= @COLCOUNT
            BEGIN
                -- first column already exists, don't need to add
                IF @CURCOL > 1 
                BEGIN
                    -- add field
                    SET @sql = 'ALTER TABLE [#CSV] ADD [Col' + Cast(@CURCOL as varchar) + '] nvarchar(100)'

                    --print @sql

                    -- this adds the fields to the temp table
                    EXEC(@sql)
                END

                -- go to next column
                SET @CURCOL = @CURCOL + 1
            END
        END

        -- build dynamic sql to insert current line into CSV table
        SET @sql = 'INSERT INTO [#CSV] VALUES('

        -- loop through line table, dynamically adding each column value
        SET @CURCOL = 1
        WHILE @CURCOL <= @COLCOUNT
        BEGIN
            -- get current column
            Select @ColVal = str 
              From #Line 
             Where listpos = @CURCOL

            IF LEN(@ColVal) > 0
            BEGIN
                -- remove quotes from beginning if exist
                IF LEFT(@ColVal,1) = '"'
                    SET @ColVal = RIGHT(@ColVal, LEN(@ColVal) - 1)

                -- remove quotes from end if exist
                IF RIGHT(@ColVal,1) = '"'
                    SET @ColVal = LEFT(@ColVal, LEN(@ColVal) - 1)
            END

            -- write column value
            -- make value sql safe by replacing single quotes with two single quotes
            -- also, replace two double quotes with a single double quote
            SET @sql = @sql + '''' + REPLACE(REPLACE(@ColVal, '''',''''''), '""', '"') + ''''

            -- add comma separater except for the last record
            IF @CURCOL <> @COLCOUNT
                SET @sql = @sql + ','

            -- go to next column
            SET @CURCOL = @CURCOL + 1
        END

        -- close sql statement
        SET @sql = @sql + ')'

        --print @sql

        -- run sql to add line to table
        EXEC(@sql)

        -- move to next line
        SET @CURLINE = @CURLINE + 1

    END

END

-- return CSV table
SELECT * FROM [#CSV]

END

GO

The stored procedure makes use of this helper function that parses a string into a table (thanks Erland Sommarskog!):

CREATE FUNCTION [dbo].[iter_charlist_to_table]
                (@list      ntext,
                 @delimiter nchar(1) = N',')
     RETURNS @tbl TABLE (listpos int IDENTITY(1, 1) NOT NULL,
                         str     varchar(4000),
                         nstr    nvarchar(2000)) AS

BEGIN
  DECLARE @pos      int,
          @textpos  int,
          @chunklen smallint,
          @tmpstr   nvarchar(4000),
          @leftover nvarchar(4000),
          @tmpval   nvarchar(4000)

  SET @textpos = 1
  SET @leftover = ''
  WHILE @textpos <= datalength(@list) / 2
  BEGIN
     SET @chunklen = 4000 - datalength(@leftover) / 2
     SET @tmpstr = @leftover + substring(@list, @textpos, @chunklen)
     SET @textpos = @textpos + @chunklen

     SET @pos = charindex(@delimiter, @tmpstr)

     WHILE @pos > 0
     BEGIN
        SET @tmpval = ltrim(rtrim(left(@tmpstr, @pos - 1)))
        INSERT @tbl (str, nstr) VALUES(@tmpval, @tmpval)
        SET @tmpstr = substring(@tmpstr, @pos + 1, len(@tmpstr))
        SET @pos = charindex(@delimiter, @tmpstr)
     END

     SET @leftover = @tmpstr
  END

  INSERT @tbl(str, nstr) VALUES (ltrim(rtrim(@leftover)), ltrim(rtrim(@leftover)))

RETURN

END

Here's how I call it from T-SQL. In this case, I'm inserting the results into a temp table, so I create the temp table first:

    -- create temp table for file import
CREATE TABLE #temp
(
    CustomerCode nvarchar(100) NULL,
    Name nvarchar(100) NULL,
    [Address] nvarchar(100) NULL,
    City nvarchar(100) NULL,
    [State] nvarchar(100) NULL,
    Zip nvarchar(100) NULL,
    OrderNumber nvarchar(100) NULL,
    TimeWindow nvarchar(100) NULL,
    OrderType nvarchar(100) NULL,
    Duration nvarchar(100) NULL,
    [Weight] nvarchar(100) NULL,
    Volume nvarchar(100) NULL
)

-- convert the CSV file into a table
INSERT #temp
EXEC [dbo].[SSP_CSVToTable]
     @InputFile = @FileLocation
    ,@FirstLine = @FirstImportRow

I haven't tested the performance much, but it works well for what I need - importing CSV files with less than 1000 rows. However, it might choke on really large files.

Hopefully someone else also finds it useful.

Cheers!

like image 8
murmy Avatar answered Nov 05 '22 07:11

murmy


I have also created a function to convert a CSV to a usable format for Bulk Insert. I used the answered post by Chris Clark as a starting point to create the following C# function.

I ended up using a regular expression to find the fields. I then recreated the file line by line, writing it to a new file as I went, thus avoiding having the entire file loaded into memory.

private void CsvToOtherDelimiter(string CSVFile, System.Data.Linq.Mapping.MetaTable tbl)
{
    char PH1 = '|';
    StringBuilder ln;

    //Confirm file exists. Else, throw exception
    if (File.Exists(CSVFile))
    {
        using (TextReader tr = new StreamReader(CSVFile))
        {
            //Use a temp file to store our conversion
            using (TextWriter tw = new StreamWriter(CSVFile + ".tmp"))
            {
                string line = tr.ReadLine();
                //If we have already converted, no need to reconvert.
                //NOTE: We make the assumption here that the input header file 
                //      doesn't have a PH1 value unless it's already been converted.
                if (line.IndexOf(PH1) >= 0)
                {
                    tw.Close();
                    tr.Close();
                    File.Delete(CSVFile + ".tmp");
                    return;
                }
                //Loop through input file
                while (!string.IsNullOrEmpty(line))
                {
                    ln = new StringBuilder();

                    //1. Use Regex expression to find comma separated values 
                    //using quotes as optional text qualifiers 
                    //(what MS EXCEL does when you import a csv file)
                    //2. Remove text qualifier quotes from data
                    //3. Replace any values of PH1 found in column data 
                    //with an equivalent character
                    //Regex:  \A[^,]*(?=,)|(?:[^",]*"[^"]*"[^",]*)+|[^",]*"[^"]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z
                    List<string> fieldList = Regex.Matches(line, @"\A[^,]*(?=,)|(?:[^"",]*""[^""]*""[^"",]*)+|[^"",]*""[^""]*\Z|(?<=,)[^,]*(?=,)|(?<=,)[^,]*\Z|\A[^,]*\Z")
                            .Cast<Match>()
                            .Select(m => RemoveCSVQuotes(m.Value).Replace(PH1, '¦'))
                            .ToList<string>();

                    //Add the list of fields to ln, separated by PH1
                    fieldList.ToList().ForEach(m => ln.Append(m + PH1));

                    //Write to file. Don't include trailing PH1 value.
                    tw.WriteLine(ln.ToString().Substring(0, ln.ToString().LastIndexOf(PH1)));

                    line = tr.ReadLine();
                }


                tw.Close();
            }
            tr.Close();

            //Optional:  replace input file with output file
            File.Delete(CSVFile);
            File.Move(CSVFile + ".tmp", CSVFile);
        }
    }
    else
    {
        throw new ArgumentException(string.Format("Source file {0} not found", CSVFile));
    }
}
//The output file no longer needs quotes as a text qualifier, so remove them
private string RemoveCSVQuotes(string value)
{
    //if is empty string, then remove double quotes
    if (value == @"""""") value = "";
    //remove any double quotes, then any quotes on ends
    value = value.Replace(@"""""", @"""");
    if (value.Length >= 2)
        if (value.Substring(0, 1) == @"""")
            value = value.Substring(1, value.Length - 2);
    return value;
}
like image 5
VenerableAgents Avatar answered Nov 05 '22 07:11

VenerableAgents