Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove weird characters ( A with hat) from SQL Server varchar column

Some weird characters are getting stored in one of the table. They seem to be coming from .csv feeds so I don't have much control over that.

Hello Kitty Essential Accessory Kit

How can I clean it and remove these characters. I am ok doing it at db level or in C#.

EDIT

As per the suggestions received in comments. I am also looking into what I can do to correct it at feed level. Here's more info on it.

  1. Feeds are from third party.
  2. I opened feed in notepad++ and checked the encoding menu I see dot in front of 'encode in ansi' so I believe that's the encoding of the file
  3. And that's how it appears in notepad++ "Hello Kitty Essential Accessory Kit"
  4. One strange thing though. when I search that row in powershel from csv file. and it comes up with the row. I don't see these weird characters there..
like image 535
Ankit Avatar asked Dec 11 '22 22:12

Ankit


2 Answers

You can use .net regular expression functions. For example, using Regex.Replace:

Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

As there is no support for regular expressions in SQL Server you need to create a SQL CLR function. More information about the .net integration in SQL Server can be found here:

  • String Utility Functions Sample - full working examples
  • Stairway to SQLCLR - still in progress
  • Introduction to SQL Server CLR Integration - official documentation

In your case:

  1. Open Visual Studio and create Class Library Project:

    enter image description here

  2. Then rename the class to StackOverflow and paste the following code in its file:

    using Microsoft.SqlServer.Server;
    using System;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    
    public class StackOverflow
    {
        [SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, Name = "RegexReplace")]
        public static SqlString Replace(SqlString sqlInput, SqlString sqlPattern, SqlString sqlReplacement)
        {
            string input = (sqlInput.IsNull) ? string.Empty : sqlInput.Value;
            string pattern = (sqlPattern.IsNull) ? string.Empty : sqlPattern.Value;
            string replacement = (sqlReplacement.IsNull) ? string.Empty : sqlReplacement.Value;
            return new SqlString(Regex.Replace(input, pattern, replacement));
        }
    }
    
  3. Now, build the project. Open the SQL Server Management Studio. Select your database and replace the path value of the following FROM clause to match your StackOverflow.dll:

    CREATE ASSEMBLY [StackOverflow] FROM 'C:\Users\gotqn\Desktop\StackOverflow\StackOverflow\bin\Debug\StackOverflow.dll';
    
  4. Finally, create the SQL CLR function:

    CREATE FUNCTION [dbo].[StackOverflowRegexReplace] (@input NVARCHAR(MAX),@pattern NVARCHAR(MAX), @replacement NVARCHAR(MAX))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME [StackOverflow].[StackOverflow].[Replace]
    GO
    

You are ready to use RegexReplace .net function directly in your T-SQL statements:

    SELECT [dbo].[StackOverflowRegexReplace] ('Hello Kitty Essential Accessory Kit', '[^\u0000-\u007F]', '')

    //Hello Kitty Essential Accessory Kit
like image 105
gotqn Avatar answered Mar 30 '23 00:03

gotqn


if you are looking for alphabets and numbers only in a string, than this can help you out.

In this, Regex is used to replace all characters other than alphabets and numbers.

like image 21
Tirthak Shah Avatar answered Mar 30 '23 00:03

Tirthak Shah