Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle import of file with UTF-8 encoding, codepage = 65001, into SQL server

In Norway we have 3 highly annoying characters, æøå, that create all sorts of problems. Since sql server 2008, Microsoft decided to not support codepage 65001. I have found a manageable solution to the problem of importing a UTF-8 file into sql server with OPENROWSET(BULK) and keep the æøå tokens.

I created a powershell script that uses StreamReader and StreamWriter to convert the file from UTF-8 to default encoding, ANSI.

$filename = "C:\Test\UTF8_file.txt"
$outfile = "C:\Test\ANSI_file.txt"
$reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001))
$stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)

I strip the file of the first line, the header row, in the same process.

$i=1
while(($line = $reader.ReadLine()) -ne $null) {
    if($i -gt 1) {
        $stream.WriteLine($line)
    }
    $i++
}
$reader.Close()
$stream.Close()

Then I am able to use OPENROWSET to import the ANSI file into sql server and manipulating data while doing so. Using codepage 1252, which equals danish_norwegian collation.

    insert into SomeDatabase.dbo.SomeTable
SELECT [companynumber]
, case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else rolle end as 'role'
, case [representant] when 'Y' then '1' else '0' end as 'representant'
, left((RIGHT('0000'+ CONVERT(VARCHAR,postnr),5)),4) end as 'postnr' 
, income*1000 as income
, null as person2id 
FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt', 
FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml'
, CODEPAGE =1252
, ROWS_PER_BATCH = 50000    
) as v 

This method secured that norwegian tokens were displayed correctly. The format file looks like this:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' />
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' />
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' />
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' />
    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' />
    <FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' />
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/>
    <COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/>
    <COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/>
  </ROW>
</BCPFORMAT>

Hope this is helpful to someone else, because I spent quite a lot of time googleing before I found a way to solve this issue.

like image 453
Magnhild Hagen Avatar asked Feb 28 '13 14:02

Magnhild Hagen


People also ask

What character encoding does SQL Server use?

Microsoft™ SQL Server For NCHAR / NVARCHAR , the character encoding is UCS-2 or UTF-16, depending if the database collation is using the SC option. The length N in N[VAR]CHAR(N) defines a number of byte-pairs, not bytes.

What is 1252 code page in SSIS?

If it's a csv file, you can still use code page 1252 to process it. When you open the flat file connection manager it shows you the code page for the file, but you don't need to save that setting. If you have other changes to make in the connection manager, change the code page back to 1252 before you save the changes.

Does SQL Server support UTF-8?

SQL Server 2019 (15. x) introduces an additional option for UTF-8 encoding.


1 Answers

Convert into UTF16 instead. That is SQL Server's native NCHAR format, and allows full representation of Unicode values.

To make this work you will have to specify SQLNCHAR or SQLNVARCHAR in your format file, and also be aware of the caveat:

For a format file to work with a Unicode character data file, all the input fields must be Unicode text strings (that is, either fixed-size or character-terminated Unicode strings).

  • http://msdn.microsoft.com/en-us/library/ms178129.aspx

An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (which is UTF-16) and then to the desired codepage as VARCHAR.

like image 135
Ben Avatar answered Oct 02 '22 09:10

Ben