
PHP Remove garbage from string

Tags: string, php, utf-8

I am stuck on a problem. I am using a very basic RTE to get user input, and I trim the garbage from the string when it is posted, using the functions provided with the RTE. I am using http://premiumsoftware.net/cleditor

After the user submits the data, I parse it with PHP and remove the unwanted content. Most of the users are on Linux or Mac, and they usually copy content from emails or Word documents and paste it into the RTE, which brings in a lot of garbage.

We also need to allow all UTF-8 characters from any language.

With all that said, please check this image:

[image]

As you can see, no special character is visible in the color notes, and if I copy this value from MySQL and paste it anywhere, there is no garbage. But if I convert the values to hex, you can see a strange character there, highlighted in yellow.

Is there any way to filter out these kinds of issues? They are causing my PDF generation script to stop working.

Asked by Aamir Mahmood, Dec 17 '13 13:12




1 Answer

That is not "garbage"; it is the Line Separator character, U+2028, encoded in UTF-8. It only looks like garbage if you interpret it as ASCII/Latin-1, the way everything looks like garbage when interpreted with the wrong character set. There is nothing to remove as such. If you decide that you want to remove certain superfluous characters, feel free to do so. But they are part of the original content and they are not "wrong" per se, so there is no general advice to give here.
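To see this for yourself, a minimal sketch along these lines may help. It assumes PHP 7.0 or later (for the `"\u{2028}"` escape), and the sample text `red` / `blue` is made up for illustration, not taken from the question's data. It dumps the bytes with `bin2hex()` and checks that the string is still valid UTF-8.

```php
<?php
// U+2028 LINE SEPARATOR; the "\u{...}" escape needs PHP 7.0 or later.
$lineSeparator = "\u{2028}";

// Hypothetical sample standing in for a pasted note.
$note = "red" . $lineSeparator . "blue";

// In UTF-8 the character occupies three bytes: e2 80 a8.
$separatorHex = bin2hex($lineSeparator);
$noteHex = bin2hex($note);

// preg_match() with the /u modifier returns 1 only if the subject is
// valid UTF-8, so this doubles as a cheap validity check.
$isValidUtf8 = preg_match('//u', $note) === 1;

echo $separatorHex, PHP_EOL;
echo $noteHex, PHP_EOL;
var_dump($isValidUtf8);
```

If the hex dump of your stored value contains `e280a8` and the validity check passes, the data is well-formed UTF-8 that happens to include a line separator, not corrupted text.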

If your PDF generator chokes on it, figure out why. Maybe it's just generally not handling Unicode correctly, in which case you need to fix that if you want to support Unicode with it. If it does have specific characters it chokes on (which would be weird), then you need to figure out what these characters are exactly and strip them.
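If it does turn out that specific characters are the problem, stripping them could look roughly like the sketch below. The helper name `stripLineSeparators` is hypothetical, and the character list (U+2028 and U+2029) is only an assumption based on the character identified above; adjust it to whatever your PDF generator actually rejects. It assumes PHP 7.0 or later.

```php
<?php
/**
 * Remove or replace U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR).
 * Hypothetical helper: the character list is an assumption, not a known
 * list of characters that any particular PDF library rejects.
 */
function stripLineSeparators(string $text, string $replacement = ''): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8, so
    // \x{2028} matches the whole character rather than individual bytes.
    $result = preg_replace('/[\x{2028}\x{2029}]/u', $replacement, $text);

    // preg_replace() returns null on error, for example when the subject
    // is not valid UTF-8; in that case the input is returned unchanged.
    return $result ?? $text;
}

// Hypothetical input containing a line separator.
$dirty = "red\u{2028}blue";

$stripped = stripLineSeparators($dirty);        // character removed
$converted = stripLineSeparators($dirty, "\n"); // turned into a newline

echo bin2hex($stripped), PHP_EOL;
echo bin2hex($converted), PHP_EOL;
```

Replacing with `"\n"` instead of deleting keeps the line break the user intended, which may matter for pasted content. All other Unicode characters pass through untouched.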

Answered by deceze, Sep 28 '22 04:09