How do I remove emoji characters from a string?

Tags:

I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text 🍫🌐 text

Simply put, I want the output text to be

Text text

I'm trying to just remove all such emojis from the text with rejex.. except, I'm not sure how to convert that emoji into it's unicode sequence.. How do I do that?

edit:

I'm trying to save the user input into mysql. It looks like mysql UTF8 doesn't really support unicode characters and the right way to do it would be by changing the schema but I don't think that is an option for me. So I'm trying to just remove all the emoji characters before saving it in the database.

This is my schema for the relevant column:

enter image description here

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text 🍫🌐 text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoding bytes and it doesn't really work..

enter image description here

239

asked Jan 19 '15 11:01

LocustHorde

1 Answers

Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result); // 2
    }
}

Here "Cs" is the Unicode category for "surrogate".

It appears that Regex works based on UTF-16 code units rather than Unicode code points, otherwise you'd need a different approach.

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.

190

answered Oct 11 '22 23:10

Jon Skeet

Related questions
                            
                                Int32 vs. Int64 vs. Int in C# [duplicate]
                            
                                Convert Time to decimal in C#
                            
                                Class vs. Interface
                            
                                Changing a horizontal bar chart to a vertical one
                            
                                Using generics with XmlSerializer
                            
                                How to do a NotEqual to in NHibernate
                            
                                C# Big-endian ulong from 4 bytes
                            
                                Entity Framework 4 Code Only Error "Multiple objects sets per type are not supported"
                            
                                How can anonymous types be created using LINQ with lambda syntax?
                            
                                How do I use Selenium in C#?
                            
                                linq query to select top 10 entries with most comments from the table
                            
                                Comparing two dictionaries for equal data in c#?
                            
                                Get the latest Value of table via EntityFramework
                            
                                How to release Image from Image Source in WPF
                            
                                Should I store telephone numbers as strings or integers? [duplicate]
                            
                                Get all ModelState Errors on view
                            
                                How to make sql query result to xml file
                            
                                Adding style to asp.net label
                            
                                Unreadable content in Excel file generated with EPPlus
                            
                                The configuration section 'system.web.webPages.razor' cannot be read because it is missing a section declaration

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do I remove emoji characters from a string?

Tags:

c#

mysql

unicode

emoji

LocustHorde

People also ask

1 Answers

Jon Skeet

Recent Activity

Donate For Us