Find non-ASCII characters in varchar columns using SQL Server

How can rows containing non-ASCII characters be returned using SQL Server?
If you can show how to do it for one column, that would be great.

I am doing something like this now, but it is not working:

select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^!-~ ]%'

For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:

  • The identity field for that record. (This will allow the whole record to be reviewed with another query.)
  • The column name
  • The text with the invalid character
 Id | FieldName | InvalidText       |
----+-----------+-------------------+
 25 | LastName  | Solís             |
 56 | FirstName | François          |
100 | Address1  | 123 Ümlaut street |

Invalid characters would be any outside the range of SPACE (32 decimal) through ~ (127 decimal).

asked Oct 08 '10 by Gerhard Weiss


2 Answers

Here is a solution for the single-column search using PATINDEX.
It also displays the position, the invalid character, and its ASCII code.

-- The binary collation makes the [^ !-~] range match by code point,
-- so anything outside ASCII 32-126 is flagged.
select line,
  patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) as [Position],
  substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1) as [InvalidCharacter],
  ascii(substring(line, patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line), 1)) as [ASCIICode]
from staging.APARMRE1
where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, Line) > 0
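The extra-credit output in the question also wants the identity value so the whole record can be reviewed with another query. A rough sketch of that follow-up, assuming the table has an identity column named Id (the real column name is not given in the question):

-- Sketch only: Id is an assumed identity column on Staging.APARMRE1.
select ar.*
from Staging.APARMRE1 as ar
where ar.Id in (
    select bad.Id
    from Staging.APARMRE1 as bad
    where patindex('%[^ !-~]%' COLLATE Latin1_General_BIN, bad.Line) > 0
)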
answered Oct 19 '22 by Gerhard Weiss


I've been running this bit of code with success:

declare @UnicodeData table (
     data nvarchar(500)
)
insert into 
    @UnicodeData
values 
    (N'Horse♥')  -- '♥' is a stand-in here; any character outside the varchar code page works
    ,(N'Dog')
    ,(N'Cat')

select
    data
from
    @UnicodeData 
where
    data collate LATIN1_GENERAL_BIN != cast(data as varchar(max))

This works well for known columns.
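The reason the comparison works: casting the nvarchar value to varchar pushes it through the code page of the column's collation, and any character with no mapping there comes back as a question mark, so the binary-collation comparison fails for exactly those rows. A quick way to see the substitution (the literal is illustrative, not from the answer):

-- 'Horse♥' becomes 'Horse?' after the cast under a Latin1 code page,
-- which is why the != comparison above flags the row.
select cast(N'Horse♥' as varchar(max)) as RoundTripped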

For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.

declare 
    @sql    varchar(max)    = ''
    ,@table sysname         = 'mytable' -- enter your table here

;with ColumnData as (
    select
        RowId               = row_number() over (order by c.COLUMN_NAME)
        ,c.COLUMN_NAME
        ,ColumnName         = '[' + c.COLUMN_NAME + ']'
        ,TableName          = '[' + c.TABLE_SCHEMA + '].[' + c.TABLE_NAME + ']' 
    from
        INFORMATION_SCHEMA.COLUMNS c
    where
        c.DATA_TYPE         = 'nvarchar'
        and c.TABLE_NAME    = @table
)
select
    @sql = @sql
        + 'select FieldName = ''' + c.ColumnName + ''', InvalidCharacter = [' + c.COLUMN_NAME + '] '
        + 'from ' + c.TableName + ' '
        + 'where ' + c.ColumnName + ' collate LATIN1_GENERAL_BIN != cast(' + c.ColumnName + ' as varchar(max)) '
        + case when c.RowId <> (select max(RowId) from ColumnData) then ' union all ' else '' end
        + char(13)
from
    ColumnData c

-- inspect the generated SQL before running it, if desired
-- print @sql
exec (@sql)
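As a usage note: with @table set to a hypothetical [dbo].[Customer] table holding nvarchar columns FirstName and LastName (all names here are illustrative, not from the answer), printing @sql would show something close to:

select FieldName = '[FirstName]', InvalidCharacter = [FirstName] from [dbo].[Customer] where [FirstName] collate LATIN1_GENERAL_BIN != cast([FirstName] as varchar(max))  union all 
select FieldName = '[LastName]', InvalidCharacter = [LastName] from [dbo].[Customer] where [LastName] collate LATIN1_GENERAL_BIN != cast([LastName] as varchar(max)) 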

I'm not a fan of dynamic SQL, but it does have its uses for exploratory queries like this.

answered Oct 19 '22 by Vash