I am importing .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice a garbled word/characters (Ullerهkersvنgen
) inside the files, which makes a problem while saving to the database.
I want to filter all such characters and convert them to unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
You can check the if the character value is between 0 - 127. for c in someString: if 0 <= ord(c) <= 127: # this is a ascii character. else: # this is a non-ascii character.
An example of a non-ASCII character is the Ñ. The URL can't contain any non-ASCII character or even a space. This issue commonly arises from developers misusing symbols or making coding mistakes — it could arise from a lack of knowledge or even negligence.
Assuming your script does know the correct encoding of your text snippet than that should be the regular expression to find all Non-ASCII charactres:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With