Find non-ASCII characters in a text file and convert them to their Unicode equivalent

Q: What is an example of a non-ASCII character?

An example of a non-ASCII character is Ñ. A URL cannot contain non-ASCII characters or spaces unless they are percent-encoded. Such characters usually slip in through encoding mistakes, for example when text is read or written with the wrong character encoding.

Tags:

character-encoding

unicode

I am importing a .txt file from a remote server and saving it to a database, using a .NET script. I sometimes notice garbled words or characters (for example, Ullerهkersvنgen) in the files, which causes problems when saving to the database.

I want to filter all such characters and convert them to Unicode before saving to the database.

Note: I have been through many similar posts but had no luck.

Your help in this context will be highly appreciated.

Thanks.


asked May 23 '13 11:05

Mehboob

1 Answer

Assuming your script knows the correct encoding of your text snippet, this regular expression should find all non-ASCII characters:

[^\x00-\x7F]+

See https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
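As a rough illustration of how the pattern above might be used, here is a minimal Python sketch that locates each run of non-ASCII characters and rewrites them as Unicode escapes. The function names `find_non_ascii` and `escape_non_ascii` are hypothetical, and the sketch assumes the text has already been decoded into a string with the correct encoding.

```python
import re

# Same pattern as in the answer: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text):
    """Return (start_index, run) for every run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]

def escape_non_ascii(text):
    """Replace every non-ASCII character with a backslash-u style escape."""
    def to_escape(match):
        parts = []
        for ch in match.group():
            code = ord(ch)
            # Code points above U+FFFF need the 8-digit form.
            parts.append(f"\\U{code:08X}" if code > 0xFFFF else f"\\u{code:04X}")
        return "".join(parts)
    return NON_ASCII.sub(to_escape, text)

# The garbled word from the question, written with escapes for clarity.
sample = "Uller\u0647kersv\u0646gen"
print(find_non_ascii(sample))
print(escape_non_ascii(sample))
```

The same pattern should also work unchanged with .NET's `System.Text.RegularExpressions.Regex` class. Note that in the question's sample the matches are Arabic letters inside an otherwise Latin word, which hints that the file may have been decoded with the wrong code page upstream; if so, reading the file with the right encoding would be the real fix, and the regex only helps locate the damage.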

Also, the base-R tools package provides two functions to detect non-ASCII characters:

tools::showNonASCII()
tools::showNonASCIIfile()
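Outside R, a similar file-level check is easy to approximate. The following is a minimal Python sketch, not a port of the R functions: the name `show_non_ascii_lines` is hypothetical, and the file is assumed to be UTF-8 unless another encoding is passed in.

```python
import os
import re
import tempfile

NON_ASCII = re.compile(r"[^\x00-\x7F]")

def show_non_ascii_lines(path, encoding="utf-8"):
    """Return (line_number, escaped_line) for each line containing non-ASCII text."""
    hits = []
    with open(path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if NON_ASCII.search(line):
                # Render the offending characters as visible backslash escapes.
                escaped = line.encode("ascii", "backslashreplace").decode("ascii")
                hits.append((number, escaped))
    return hits

# Usage sketch with a throwaway file containing the word from the question.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("plain line\nUller\u0647kersv\u0646gen\n")
report = show_non_ascii_lines(tmp.name)
os.remove(tmp.name)
print(report)
```

Reporting line numbers alongside the escaped text makes it easier to inspect the affected records before deciding whether to repair, replace, or reject them.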

answered Oct 07 '22 14:10

petermeissner
