I need to filter out (remove) extended ASCII characters from a SELECT statement in T-SQL.
I'm using a stored procedure to do so.
Expected input:
ËËËËeeeeËËËË
Expected output:
eeee
All that I've found is for MySQL.
I'm using :
Microsoft SQL Server Management Studio 11.0.2100.60
Microsoft .NET Framework 4.0.30319.17929
SQL Server TRIM() Function The TRIM() function removes the space character OR other specified characters from the start or end of a string. By default, the TRIM() function removes leading and trailing spaces from a string. Note: Also look at the LTRIM() and RTRIM() functions.
In some situations, we may not want a space or another special character, so we can include the special character that we want to exclude in our not regular expression, such as [^A-Z0-9 ] to exclude a space.
Use braces to escape a string of characters or symbols. Everything within a set of braces in considered part of the escape sequence. When you use braces to escape a single character, the escaped character becomes a separate token in the query. Use the backslash character to escape a single character or symbol.
OK, give this a try. It seems the same issue they have. Anyway you need to modify it based on your requirements.
CREATE FUNCTION RemoveNonASCII
(
@nstring nvarchar(255)
)
RETURNS varchar(255)
AS
BEGIN
DECLARE @Result varchar(255)
SET @Result = ''
DECLARE @nchar nvarchar(1)
DECLARE @position int
SET @position = 1
WHILE @position <= LEN(@nstring)
BEGIN
SET @nchar = SUBSTRING(@nstring, @position, 1)
--Unicode & ASCII are the same from 1 to 255.
--Only Unicode goes beyond 255
--0 to 31 are non-printable characters
IF UNICODE(@nchar) between 32 and 255
SET @Result = @Result + @nchar
SET @position = @position + 1
END
RETURN @Result
END
GO
Check it out at SqlServerCentral
The accepted answer is using a loop which should be avoided...
My solution is completely inlineable, it's easy to create an UDF (or maybe even better: an inline TVF) from this.
The idea: Create a set of running numbers (here it's limited with the count of objects in sys.objects, but there are tons of example how to create a numbers tally on the fly). In the second CTE the strings are splitted to single characters. The final select comes back with the cleaned string.
DECLARE @tbl TABLE(ID INT IDENTITY, EvilString NVARCHAR(100));
INSERT INTO @tbl(EvilString) VALUES('ËËËËeeeeËËËË'),('ËaËËbËeeeeËËËcË');
WITH RunningNumbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM sys.objects
)
,SingleChars AS
(
SELECT tbl.ID,rn.Nmbr,SUBSTRING(tbl.EvilString,rn.Nmbr,1) AS Chr
FROM @tbl AS tbl
CROSS APPLY (SELECT TOP(LEN(tbl.EvilString)) Nmbr FROM RunningNumbers) AS rn
)
SELECT ID,EvilString
,(
SELECT '' + Chr
FROM SingleChars AS sc
WHERE sc.ID=tbl.ID AND ASCII(Chr)<128
ORDER BY sc.Nmbr
FOR XML PATH('')
) AS GoodString
FROM @tbl As tbl
The result
1 ËËËËeeeeËËËË eeee
2 ËaËËbËeeeeËËËcË abeeeec
Here is another answer from me where this approach is used to replace all special characters with secure characters to get plain latin
Just correcting above code (it was cutting DOTs)
CREATE FUNCTION [dbo].[RemoveNonASCII]
(
@nstring nvarchar(255)
)
RETURNS nvarchar(255)
AS
BEGIN
DECLARE @Result nvarchar(255)
SET @Result = ''
DECLARE @nchar nvarchar(1)
DECLARE @position int
SET @position = 1
WHILE @position <= LEN(@nstring)
BEGIN
SET @nchar = SUBSTRING(@nstring, @position, 1)
--Unicode & ASCII are the same from 1 to 255.
--Only Unicode goes beyond 255
--0 to 31 are non-printable characters
IF (UNICODE(@nchar) between 192 and 198) or (UNICODE(@nchar) between 225 and 230) -- letter A or a with accents
SET @nchar = 'a'
IF (UNICODE(@nchar) between 200 and 203) or (UNICODE(@nchar) between 232 and 235) -- letter E or e with accents
SET @nchar = 'e'
IF (UNICODE(@nchar) between 204 and 207) or (UNICODE(@nchar) between 236 and 239) -- letter I or i with accents
SET @nchar = 'i'
IF (UNICODE(@nchar) between 210 and 214) or (UNICODE(@nchar) between 242 and 246) or (UNICODE(@nchar)=240) -- letter O or o with accents
SET @nchar = 'o'
IF (UNICODE(@nchar) between 217 and 220) or (UNICODE(@nchar) between 249 and 252) -- letter U or u with accents
SET @nchar = 'u'
IF (UNICODE(@nchar)=199) or (UNICODE(@nchar)=231) -- letter Ç or ç
SET @nchar = 'c'
IF (UNICODE(@nchar)=209) or (UNICODE(@nchar)=241) -- letter Ñ or ñ
SET @nchar = 'n'
IF (UNICODE(@nchar) between 45 and 46) or (UNICODE(@nchar) between 48 and 57) or (UNICODE(@nchar) between 64 and 90) or (UNICODE(@nchar) = 95) or (UNICODE(@nchar) between 97 and 122)
SET @Result = @Result + @nchar
SET @position = @position + 1
END
set @Result = lower(@Result) -- e-mails in lower case
RETURN @Result
END
Thank you for sharing your code.
I needed something like it, not just for cleaning up e-mail addresses but for general purpose, filtering user web site input before it reaches SAP ERP via integration modules.
Running in Brazil, it must obey language accents...
Here goes the resulting code.
Maybe it can help someone someday, like it did to me.
IF EXISTS
(
SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].[fnRemoveNonASCII]')
AND type IN (N'FN')
)
DROP FUNCTION dbo.fnRemoveNonASCII
GO
CREATE FUNCTION [dbo].[fnRemoveNonASCII]
(
@nstring nvarchar(MAX)
)
RETURNS nvarchar(MAX)
AS
BEGIN
DECLARE @nchar nvarchar(1) -- individual char in string
DECLARE @nUnicode nvarchar(3) -- ASCII for individual char in string
DECLARE @position int -- subscript to control loop in the string
DECLARE @Result nvarchar(MAX) -- return valus
SET @Result = ''
SET @position = 1
WHILE @position <= LEN(@nstring)
BEGIN
--Unicode & ASCII are the same from 1 to 255.
--Only Unicode goes beyond 255
--0 to 31 are non-printable characters
SET @nchar = SUBSTRING(@nstring, @position, 1)
SET @nUnicode = UNICODE(@nChar)
IF @nUnicode = 10
OR @nUnicode = 13
OR @nUnicode BETWEEN 32 AND 126
OR @nUnicode = 160
OR @nUnicode BETWEEN 192 AND 207
OR @nUnicode BETWEEN 210 AND 213
OR @nUnicode BETWEEN 217 AND 219
OR @nUnicode BETWEEN 224 AND 227
OR @nUnicode BETWEEN 231 AND 234
OR @nUnicode = 236
OR @nUnicode = 237
OR @nUnicode BETWEEN 242 AND 245
OR @nUnicode = 247
OR @nUnicode = 249
OR @nUnicode = 250
SET @Result = @Result + @nchar
ELSE IF @nUnicode = 9 -- TAB
SET @Result = @Result + ' '
ELSE
SET @Result = @Result + ' '
SET @position = @position + 1
END
RETURN @Result
END
/*
---------------------------------------------------------------------------------------------------------------
-- Tabela dos caracteres Unicode/ASCII exportáveis
	 | 	 | %9 = TAB
| 
 | %a = 0A Line Feed
| 
 | %d = 0D Carriage Return
  |   | %20 = <space>
! | ! | %21 = !
" | " | %22 = "
# | # | %23 = #
$ | $ | %24 = $
% | % | %25 = %
& | & | %26 = &
' | ' | %27 = '
( | ( | %28 = (
) | ) | %29 = )
* | * | %2a = *
+ | + | %2b = +
, | , | %2c = ,
- | - | %2d = -
. | . | %2e = .
/ | / | %2f = /
0 | 0 | %30 = 0
1 | 1 | %31 = 1
2 | 2 | %32 = 2
3 | 3 | %33 = 3
4 | 4 | %34 = 4
5 | 5 | %35 = 5
6 | 6 | %36 = 6
7 | 7 | %37 = 7
8 | 8 | %38 = 8
9 | 9 | %39 = 9
: | : | %3a = :
; | ; | %3b = ;
< | < | %3c = <
= | = | %3d = =
> | > | %3e = >
? | ? | %3f = ?
@ | @ | %40 = @
A | A | %41 = A
B | B | %42 = B
C | C | %43 = C
D | D | %44 = D
E | E | %45 = E
F | F | %46 = F
G | G | %47 = G
H | H | %48 = H
I | I | %49 = I
J | J | %4a = J
K | K | %4b = K
L | L | %4c = L
M | M | %4d = M
N | N | %4e = N
O | O | %4f = O
P | P | %50 = P
Q | Q | %51 = Q
R | R | %52 = R
S | S | %53 = S
T | T | %54 = T
U | U | %55 = U
V | V | %56 = V
W | W | %57 = W
X | X | %58 = X
Y | Y | %59 = Y
Z | Z | %5a = Z
[ | [ | %5b = [
\ | \ | %5c = \
] | ] | %5d = ]
^ | ^ | %5e = ^
_ | _ | %5f = _
` | ` | %60 = `
a | a | %61 = a
b | b | %62 = b
c | c | %63 = c
d | d | %64 = d
e | e | %65 = e
f | f | %66 = f
g | g | %67 = g
h | h | %68 = h
i | i | %69 = i
j | j | %6a = j
k | k | %6b = k
l | l | %6c = l
m | m | %6d = m
n | n | %6e = n
o | o | %6f = o
p | p | %70 = p
q | q | %71 = q
r | r | %72 = r
s | s | %73 = s
t | t | %74 = t
u | u | %75 = u
v | v | %76 = v
w | w | %77 = w
x | x | %78 = x
y | y | %79 = y
z | z | %7a = z
{ | { | %7b = {
| | | | %7c = |
} | } | %7d = }
~ | ~ | %7e = ~
  |   | %a0 = <nbsp>
À | À | %c0 = À
Á | Á | %c1 = Á
 |  | %c2 = Â
à | à | %c3 = Ã
Ä | Ä | %c4 = Ä
Å | Å | %c5 = Å
Æ | Æ | %c6 = Æ
Ç | Ç | %c7 = Ç
È | È | %c8 = È
É | É | %c9 = É
Ê | Ê | %ca = Ê
Ë | Ë | %cb = Ë
Ì | Ì | %cc = Ì
Í | Í | %cd = Í
Î | Î | %ce = Î
Ï | Ï | %cf = Ï
Ò | Ò | %d2 = Ò
Ó | Ó | %d3 = Ó
Ô | Ô | %d4 = Ô
Õ | Õ | %d5 = Õ
Ù | Ù | %d9 = Ù
Ú | Ú | %da = Ú
Û | Û | %db = Û
à | à | %e0 = à
á | á | %e1 = á
â | â | %e2 = â
ã | ã | %e3 = ã
ç | ç | %e7 = ç
è | è | %e8 = è
é | é | %e9 = é
ê | ê | %ea = ê
ì | ì | %ec = ì
í | í | %ed = í
ò | ò | %f2 = ò
ó | ó | %f3 = ó
ô | ô | %f4 = ô
õ | õ | %f5 = õ
÷ | ÷ | %f7 = ÷
ù | ù | %f9 = ù
ú | ú | %fa = ú
*/
GO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With