
Removing non-UTF-8 characters from large txt file

I'm working on a 1 GB JSON text file that I'm trying to parse in Java. However, the parser fails when it reaches the character 'ñ' and throws this exception:

Exception Invalid UTF-8 start byte 0x96

I've tried to remove the character using sed and perl, but it seems that they cannot read the character and thus the file remains unchanged. I'd like to remove the character from the whole file or replace it with any other character or string so that the parsing works.

asked Jun 19 '12 16:06 by user1261046


1 Answer

Your file is not encoded in UTF-8.

You should find out which encoding the file actually uses and read it with an InputStreamReader configured for that encoding. Then, if needed, save it as UTF-8 (for example with an OutputStreamWriter).
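A minimal sketch of that conversion, streaming the data so a 1 GB file never has to fit in memory. The class name `Transcoder`, the method `toUtf8`, and the choice of ISO-8859-1 in `main` are illustrative assumptions; substitute the encoding your file really uses.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
final class Transcoder {

    // Reads `in` using the `source` charset and writes the same text to `out` as UTF-8.
    static void toUtf8(Path in, Path out, Charset source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(in), source));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(out), StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int n;
            // Copy in fixed-size chunks so memory use stays constant.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: ISO-8859-1 is only a placeholder for the real source encoding.
        toUtf8(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.ISO_8859_1);
    }
}
```

Writing to a separate output file rather than overwriting the input keeps the original intact in case the guessed encoding turns out to be wrong.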

If you don't know the encoding, I suggest you test a few probable encodings (see Charsets).
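One way to try candidates is to decode the file strictly and see which charsets reject it. The sketch below is only an illustration: `EncodingProbe`, `decodesCleanly`, and the candidate list are assumptions. Byte 0x96 is 'ñ' in Mac OS Roman and an en dash in windows-1252, so both are plausible here. Note that single-byte encodings such as ISO-8859-1 accept every byte, so a clean decode only shortlists a candidate; you still have to look at the decoded text to confirm it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of any library.
final class EncodingProbe {

    // Returns true if every byte sequence in `file` is valid in `candidate`.
    static boolean decodesCleanly(Path file, Charset candidate) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = candidate.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(file), decoder))) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Decoded text is discarded; only decoding errors matter here.
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // Assumed candidate list; adjust to where the file came from.
        String[] candidates = {"UTF-8", "windows-1252", "x-MacRoman", "ISO-8859-1"};
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            System.out.println(name + ": " + decodesCleanly(file, Charset.forName(name)));
        }
    }
}
```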

answered Sep 25 '22 04:09 by Denys Séguret