How to handle import of a file with UTF-8 encoding, codepage = 65001, into SQL Server

Tags: sql-server, powershell-2.0

In Norway we have three highly annoying characters, æøå, that create all sorts of problems. Since SQL Server 2008, Microsoft has not supported code page 65001. I have found a manageable way to import a UTF-8 file into SQL Server with OPENROWSET(BULK) and keep the æøå characters.

I created a PowerShell script that uses StreamReader and StreamWriter to convert the file from UTF-8 to the default encoding, ANSI.

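$filename = "C:\Test\UTF8_file.txt"
$outfile = "C:\Test\ANSI_file.txt"
# Read the input as UTF-8 (code page 65001).
$reader = New-Object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001))
# Write with the system default ANSI code page; $false means overwrite rather than append.
$stream = New-Object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)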

I strip the first line, the header row, from the file in the same pass.

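$i = 1
while ($null -ne ($line = $reader.ReadLine())) {
    # Skip line 1 (the header row) and copy every other line.
    if ($i -gt 1) {
        $stream.WriteLine($line)
    }
    $i++
}
# Close both files so the output is flushed to disk.
$reader.Close()
$stream.Close()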

I can then use OPENROWSET to import the ANSI file into SQL Server and manipulate the data while doing so. I use code page 1252, which is the code page of the Danish_Norwegian collations.

INSERT INTO SomeDatabase.dbo.SomeTable
SELECT [companynumber]
, CASE [role] WHEN 'Styreformann' THEN 'Styreleder' WHEN 'Styrets leder' THEN 'Styreleder' ELSE [role] END AS 'role'
, CASE [representant] WHEN 'Y' THEN '1' ELSE '0' END AS 'representant'
, LEFT((RIGHT('0000' + CONVERT(VARCHAR, postnr), 5)), 4) AS 'postnr'
, income * 1000 AS income
, NULL AS person2id
FROM OPENROWSET(BULK 'C:\Test\ANSI_file.txt'
, FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml'
, CODEPAGE = 1252
, ROWS_PER_BATCH = 50000
) AS v

This method ensured that the Norwegian characters were displayed correctly. The format file looks like this:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' />
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' />
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' />
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' />
    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' />
    <FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' />
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/>
    <COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/>
    <COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/>
  </ROW>
</BCPFORMAT>

I hope this is helpful to someone else, because I spent quite a lot of time googling before I found a way to solve this issue.

asked Feb 28 '13 14:02

Magnhild Hagen

1 Answer

Convert the file to UTF-16 instead. That is SQL Server's native NCHAR format, and it allows full representation of Unicode values.

To make this work you will have to specify SQLNCHAR or SQLNVARCHAR in your format file, and also be aware of the caveat:

For a format file to work with a Unicode character data file, all the input fields must be Unicode text strings (that is, either fixed-size or character-terminated Unicode strings).

http://msdn.microsoft.com/en-us/library/ms178129.aspx
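
As a rough illustration of the conversion step, the StreamReader/StreamWriter approach from the question could be adapted to write UTF-16 instead of ANSI. This is only a sketch: the function name Convert-Utf8ToUtf16 and its parameters are hypothetical and do not come from the original post or from any library. [System.Text.Encoding]::Unicode is .NET's UTF-16 little-endian encoding.

```powershell
# Hypothetical helper (not from the original post): re-encode a UTF-8 text file
# as UTF-16 little-endian, optionally dropping leading header lines.
# Pass full paths: .NET resolves relative paths against the process working
# directory, which may differ from the current PowerShell location.
function Convert-Utf8ToUtf16 {
    param(
        [string]$Path,         # existing UTF-8 input file
        [string]$Destination,  # output file, overwritten if it exists
        [int]$SkipLines = 0    # number of leading lines to drop (1 = header row)
    )
    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($Path, [System.Text.Encoding]::UTF8)
        # $false = overwrite; StreamWriter emits the UTF-16 LE byte order mark (FF FE).
        $writer = New-Object System.IO.StreamWriter($Destination, $false, [System.Text.Encoding]::Unicode)
        $lineNumber = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            $lineNumber++
            if ($lineNumber -gt $SkipLines) {
                $writer.WriteLine($line)
            }
        }
    }
    finally {
        # Close both files even if reading or writing fails part-way through.
        if ($reader) { $reader.Close() }
        if ($writer) { $writer.Close() }
    }
}
```

It could be called as Convert-Utf8ToUtf16 -Path 'C:\Test\UTF8_file.txt' -Destination 'C:\Test\UTF16_file.txt' -SkipLines 1, where the output file name is an assumption. The format file would then have to describe every input field as Unicode text, per the caveat quoted above.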

An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (which is UTF-16) and then to the desired codepage as VARCHAR.

answered Oct 02 '22 09:10

Ben
