Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database, using a .NET script. I sometimes notice garbled words or characters (Ullerهkersvنgen) inside the files, which cause problems when saving to the database.

I want to find all such characters and convert them to Unicode before saving to the database.

Note: I have been through many similar posts but had no luck.

Your help in this context will be highly appreciated.

Thanks.

Mehboob asked May 23 '13 11:05


People also ask

How do you find non-ASCII characters in Python?

You can check whether each character's code point is between 0 and 127:

    for c in someString:
        if 0 <= ord(c) <= 127:
            pass  # this is an ASCII character
        else:
            pass  # this is a non-ASCII character

What is a non-ASCII characters example?

An example of a non-ASCII character is Ñ. A URL, for instance, cannot contain raw non-ASCII characters or spaces; they have to be percent-encoded. Such characters usually end up there through coding mistakes or unfamiliarity with encoding rules.


1 Answer

Assuming your script knows the correct encoding of the text, this regular expression should find all non-ASCII characters:

[^\x00-\x7F]+

See https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
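As a rough illustration of how that pattern could be applied, here is a minimal sketch in Python (the asker's script is .NET, where the same pattern should work with the platform's regex class). The helper names `find_non_ascii` and `describe_code_points` are made up for this example; it only locates the characters and reports their code points, and does not decide what they should have been.

```python
import re

# Same pattern as above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text):
    """Return (start_index, matched_run) for each run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]

def describe_code_points(text):
    """Map each distinct non-ASCII character to its U+XXXX code point."""
    return {ch: f"U+{ord(ch):04X}" for run in NON_ASCII.findall(text) for ch in run}

# The garbled word from the question, written with escapes to avoid
# right-to-left display issues in editors.
sample = "Uller\u0647kersv\u0646gen"
print(find_non_ascii(sample))
print(describe_code_points(sample))
```

Knowing the exact code points is usually the first step toward working out which encoding was misapplied.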

Also, the base-R tools package provides two functions to detect non-ASCII characters:

tools::showNonASCII()
tools::showNonASCIIfile()
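
Detection alone will not repair the text. The garbled word in the question looks like Swedish text whose bytes were decoded with an Arabic code page. The sketch below, again in Python, assumes the file was really Windows-1252 (or Latin-1) and was wrongly read as Windows-1256; both codec names are guesses based on that one word, and `redecode` is a hypothetical helper, not part of any library.

```python
def redecode(garbled, wrong="cp1256", right="cp1252"):
    """Undo a wrong decode: recover the original bytes, then decode them properly.

    Both codec names are assumptions for this example, not known facts
    about the asker's files.
    """
    return garbled.encode(wrong).decode(right)

# The garbled word from the question, written with escapes.
garbled = "Uller\u0647kersv\u0646gen"
print(redecode(garbled))
```

Under those assumptions the word comes back as Ulleråkersvägen. The more robust fix is to read the file with the correct encoding in the first place, so that no repair step is needed.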

petermeissner answered Oct 07 '22 14:10