In Norway we have three highly annoying characters, æøå, that create all sorts of problems. As of SQL Server 2008, Microsoft does not support codepage 65001 for bulk import. I have found a manageable solution for importing a UTF-8 file into SQL Server with OPENROWSET(BULK) while keeping the æøå characters intact.
I created a PowerShell script that uses StreamReader and StreamWriter to convert the file from UTF-8 to the default (ANSI) encoding.
$filename = "C:\Test\UTF8_file.txt"
$outfile = "C:\Test\ANSI_file.txt"
$reader = new-object System.IO.StreamReader($filename, [System.Text.Encoding]::GetEncoding(65001))
$stream = new-object System.IO.StreamWriter($outfile, $false, [System.Text.Encoding]::Default)
I strip the file of the first line, the header row, in the same process.
$i = 1
while (($line = $reader.ReadLine()) -ne $null) {
    # Skip the header row (line 1) and write everything else
    if ($i -gt 1) {
        $stream.WriteLine($line)
    }
    $i++
}
$reader.Close()
$stream.Close()
Then I am able to use OPENROWSET to import the ANSI file into SQL Server and manipulate the data while doing so, using codepage 1252, which corresponds to the Danish_Norwegian collation (see the COLLATIONPROPERTY check after the query below).
insert into SomeDatabase.dbo.SomeTable
SELECT [companynumber]
, case [role] when 'Styreformann' then 'Styreleder' when 'Styrets leder' then 'Styreleder' else [role] end as 'role'
, case [representant] when 'Y' then '1' else '0' end as 'representant'
, left(RIGHT('0000' + CONVERT(VARCHAR, postnr), 5), 4) as 'postnr'
, income*1000 as income
, null as person2id
FROM OPENROWSET( BULK 'C:\Test\ANSI_file.txt',
FORMATFILE = 'C:\Test\FormatBulkInsert_file.xml'
, CODEPAGE =1252
, ROWS_PER_BATCH = 50000
) as v
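As a quick sanity check of that codepage/collation mapping, COLLATIONPROPERTY reports the code page behind a collation:
-- Returns 1252 for the Danish/Norwegian collations
SELECT COLLATIONPROPERTY('Danish_Norwegian_CI_AS', 'CodePage') AS code_page;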
This method ensured that the Norwegian characters were displayed correctly. The format file looks like this:
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=';"' />
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='";"' />
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='";"' />
  <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='";' />
  <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=';' />
  <FIELD ID="6" xsi:type="CharTerm" TERMINATOR='\n' />
</RECORD>
<ROW>
  <COLUMN SOURCE="1" NAME="companynumber" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="role" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="3" NAME="representant" xsi:type="SQLBIT"/>
  <COLUMN SOURCE="4" NAME="postnr" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="5" NAME="income" xsi:type="SQLDECIMAL"/>
  <COLUMN SOURCE="6" NAME="person2id" xsi:type="SQLINT"/>
</ROW>
</BCPFORMAT>
Hope this is helpful to someone else, because I spent quite a lot of time googling before I found a way to solve this issue.
In Microsoft SQL Server, the character encoding for NCHAR / NVARCHAR is UCS-2 or UTF-16, depending on whether the database collation uses the SC option. The length N in N[VAR]CHAR(N) defines a number of byte-pairs, not bytes.
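A quick way to see the byte-pair behaviour (the variable name and test string below are just illustrative, not from the original import):
-- æ, ø and å are each one UTF-16 byte-pair, so N counts byte-pairs, not bytes;
-- a longer value assigned to the variable is silently truncated to N byte-pairs
DECLARE @s NVARCHAR(10) = N'Blåbærsyltetøy';   -- 14 characters, truncated to 10

SELECT @s              AS truncated_value,     -- 'Blåbærsylt'
       LEN(@s)         AS character_count,     -- 10
       DATALENGTH(@s)  AS byte_count;          -- 20 (10 byte-pairs)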
If it's a CSV file, you can still use code page 1252 to process it. When you open the flat file connection manager (in SSIS) it shows you the code page for the file, but you don't need to save that setting. If you have other changes to make in the connection manager, change the code page back to 1252 before you save them.
SQL Server 2019 (15.x) introduces an additional option for UTF-8 encoding.
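For example, a VARCHAR column with one of the _UTF8 collations introduced in SQL Server 2019 stores æøå as UTF-8 directly; the table name and sample values here are made up for illustration:
-- Assumes SQL Server 2019 or later
CREATE TABLE dbo.Utf8Demo
(
    [role] VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8   -- UTF-8 stored in a plain VARCHAR
);

INSERT INTO dbo.Utf8Demo ([role])
VALUES ('Styreleder'), ('Vararepresentant for Blåbærsyltetøy AS');

SELECT [role], DATALENGTH([role]) AS byte_count   -- æ, ø and å take two bytes each in UTF-8
FROM dbo.Utf8Demo;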
Convert the file into UTF-16 instead. That is SQL Server's native NCHAR format and allows full representation of Unicode values.
To make this work you will have to specify SQLNCHAR or SQLNVARCHAR in your format file, and also be aware of the caveat:
For a format file to work with a Unicode character data file, all the input fields must be Unicode text strings (that is, either fixed-size or character-terminated Unicode strings).
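If you drop the format file entirely, BULK INSERT has an equivalent switch for Unicode files; this is only a sketch, assuming the file has been re-saved as UTF-16 and that a staging table matching the row layout exists (dbo.SomeStagingTable is a placeholder name):
-- Assumes a UTF-16 file and a staging table matching the columns in the file
BULK INSERT dbo.SomeStagingTable
FROM 'C:\Test\UTF16_file.txt'
WITH (
    DATAFILETYPE    = 'widechar',   -- read the file as Unicode character data
    FIELDTERMINATOR = ';',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2             -- skip the header row, like the PowerShell script does
);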
An alternative is to load it as binary data and use the CONVERT function to convert it from VARBINARY to NVARCHAR (which is UTF-16) and then to the desired codepage as VARCHAR.
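A minimal sketch of that binary route, assuming the file bytes are UTF-16 LE (a leading byte-order mark, if present, shows up as an extra first character) and using placeholder file and variable names:
-- Read the whole file as one binary value
DECLARE @raw VARBINARY(MAX);

SELECT @raw = BulkColumn
FROM OPENROWSET(BULK 'C:\Test\UTF16_file.txt', SINGLE_BLOB) AS b;

-- Reinterpret the UTF-16 bytes as NVARCHAR, then convert to VARCHAR
-- using the code page (1252) of a Danish/Norwegian collation
DECLARE @unicodeText NVARCHAR(MAX) = CONVERT(NVARCHAR(MAX), @raw);
DECLARE @ansiText    VARCHAR(MAX)  = CONVERT(VARCHAR(MAX), @unicodeText COLLATE Danish_Norwegian_CI_AS);

SELECT LEFT(@ansiText, 200) AS preview;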