Insert rows with Unicode characters using BCP

I'm using BCP to bulk upload data from a CSV file to SQL Azure (because BULK INSERT is not supported). This command runs and uploads the rows:

bcp [resource].dbo.TableName in C:\data.csv -t "," -r "0x0a" -c -U bcpuser@resource -S tcp:resource.database.windows.net

But data.csv is UTF8 encoded and contains non-ASCII strings. These get corrupted. I've tried changing the -c option to -w:

bcp [resource].dbo.TableName in C:\data.csv -t "," -r "0x0a" -w -U bcpuser@resource -S tcp:resource.database.windows.net

But then I get '0 rows copied'.

What am I doing wrong and how do I bulk insert Unicode characters using BCP?

asked Jan 06 '17 by mtmacdonald



2 Answers

But data.csv is UTF8 encoded

The UTF-8 encoding is the primary issue. Using -w won't help because in Microsoft-land, the term "Unicode" nearly always refers to UTF-16 Little Endian.

The solution depends on which version of BCP you are using, because support for UTF-8 (code page 65001) was only added in version 13.0 (SQL Server 2016):

  • If you are using BCP that came with SQL Server prior to SQL Server 2016 (version 13.0) then you need to convert the csv file to UTF-16 Little Endian (LE) as that is what Windows / SQL Server / .NET use for all strings. And use the -w switch.

    I got this to work encoding a file as "UCS-2 LE BOM" in Notepad++, whereas that same import file failed using the -c switch.

  • If you are using BCP that came with SQL Server 2016 (version 13.0) or newer, then you can simply add -c -C 65001 to the command line. -C is for "code page", and 65001 is the code page for UTF-8.
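If you are stuck on an older bcp, the UTF-8 to UTF-16 LE conversion can also be scripted instead of done by hand in Notepad++. A minimal Python sketch (the function name and file paths are just examples):

```python
# A minimal sketch: convert a UTF-8 CSV to UTF-16 LE with a BOM so that
# pre-2016 versions of bcp can load it via the -w switch.
def convert_to_utf16le(src_path, dst_path):
    with open(src_path, "r", encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(b"\xff\xfe")               # UTF-16 LE byte-order mark (BOM)
        dst.write(text.encode("utf-16-le"))  # payload, 2 bytes per ASCII char
```

Writing the BOM explicitly matters: it is what tools like Notepad++ (and SQL Server) use to detect the byte order, matching the "UCS-2 LE BOM" encoding mentioned above.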

The MSDN page for bcp Utility states (in the explanation of the -C switch):

Versions prior to version 13 (SQL Server 2016) do not support code page 65001 (UTF-8 encoding). Versions beginning with 13 can import UTF-8 encoding to earlier versions of SQL Server.

UPDATE

Support for UTF-8 / code page 65001 was added to SQL Server 2014 via SP2, as noted in this Microsoft KB article:

UTF-8 encoding support for the BCP utility and BULK INSERT Transact-SQL command in SQL Server 2014 SP2

answered Oct 07 '22 by Solomon Rutzky

The answer from Solomon helped me in my struggle with Unicode and SQL Server 2014. I'd like to share my experience here, in the hope that it helps the next person who runs into Unicode problems with BCP.

I had a hard time figuring out UTF-8 and Unicode handling in SQL Server 2014. I am using PowerShell to upload to a SQL Server 2014 SP2 database with BCP. My files are in Dutch, UTF-8 without BOM. I used PowerShell to convert the files into Microsoft's "Unicode" (UTF-16 LE):

Get-ChildItem "C:\Documents\ProjectA" -filter *.CSV |
ForEach-Object {
    # -Encoding Unicode writes UTF-16 LE with a BOM
    $path = $_.BaseName + '.unicode.CSV'
    Get-Content $_ | Set-Content -Encoding Unicode -Path $path
}

Then I used BCP without format file:

Get-ChildItem "C:\Documents\ProjectA" -filter *.unicode.CSV |
 ForEach-Object { 
   try { $output = bcp ProjectA.dbo.auditlog in $_.FullName -w "-t," -T -F2 
            if ($LASTEXITCODE)
            {  throw $output
            }
    catch
    { $Output >> C:\Documents\ProjectA\BCPCommandFailed$(get-date -f yyyy-MM-dd).log
    }
}

The conversion into Unicode causes file sizes to double, e.g. from 11,630 KB to 23,259 KB. A format file, whether XML or non-XML, did not work for me.
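The doubling is expected: UTF-16 LE stores every ASCII character in two bytes, so a mostly ASCII file is roughly twice as large as its UTF-8 version. A quick illustrative check in Python:

```python
# UTF-16 LE encodes each ASCII character as 2 bytes, so an ASCII-only
# line is exactly twice as large as in UTF-8.
line = "id,name,city\n"
utf8_size = len(line.encode("utf-8"))
utf16_size = len(line.encode("utf-16-le"))
print(utf8_size, utf16_size)  # → 13 26
```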

answered Oct 07 '22 by Pho