
How to remove all non printable characters in a string?

Tags:

php

ascii

utf-8

I imagine I need to remove chars 0-31 and 127.

Is there a function or piece of code to do this efficiently?

Stewart Robinson asked Jul 24 '09 10:07




1 Answer

7 bit ASCII?

If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:

$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string); 

It matches anything in range 0-31, 127-255 and removes it.
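As a quick illustration, here is a minimal sketch of that pattern wrapped in a function. The helper name strip_to_printable_ascii and the sample string are made up for this example; only the regex comes from the line above.

```php
<?php
// Hypothetical helper name; keeps only bytes 0x20-0x7E (printable 7-bit ASCII).
function strip_to_printable_ascii(string $input): string
{
    // Remove control bytes 0-31 and every byte from 127 up to 255.
    return preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $input);
}

// Sample input: a tab, a NUL byte, DEL, a UTF-8 encoded "é" (0xC3 0xA9) and a newline.
$raw   = "Hello\tWorld\x00!\x7F caf\xC3\xA9\n";
$clean = strip_to_printable_ascii($raw);

echo $clean, PHP_EOL; // HelloWorld! caf
```

Note that both bytes of the "é" are dropped, so accented text is lost entirely with this variant.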

8 bit extended ASCII?

You fell into a Hot Tub Time Machine, and you're back in the eighties. If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment: just match 0-31 and 127 only.

$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string); 
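To see the difference from the 7 bit version, here is a small sketch that assumes the input is ISO-8859-1 (any single-byte encoding would behave the same way). The helper name strip_control_bytes and the sample string are made up for this example.

```php
<?php
// Hypothetical helper name; removes only control bytes 0-31 and 127.
function strip_control_bytes(string $input): string
{
    return preg_replace('/[\x00-\x1F\x7F]/', '', $input);
}

// "café" in ISO-8859-1 (the é is the single byte 0xE9), then BEL, CR and LF.
$latin1 = "caf\xE9\x07\r\n";
$clean  = strip_control_bytes($latin1);

var_dump($clean === "caf\xE9"); // bool(true): the high byte survives
var_dump(strlen($clean));       // int(4)
```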

UTF-8?

Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:

$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string); 

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0).

In a UTF-8 string, this would be encoded as the two bytes 0xC2 0xA0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string); 
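Here is a rough sketch of why the /u modifier matters once \xA0 is in the class. The helper name strip_controls_utf8 and the sample strings are made up for this example. It also shows one thing to watch for: with /u, preg_replace returns null when the subject is not valid UTF-8, so the sketch declares a nullable return type.

```php
<?php
// Hypothetical helper name. Returns null if $input is not valid UTF-8,
// because preg_replace() fails on malformed subjects when /u is set.
function strip_controls_utf8(string $input): ?string
{
    return preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $input);
}

// "voilà", a NO-BREAK SPACE, "!" and a newline, all UTF-8 encoded.
$text = "voil\u{00E0}\u{00A0}!\n";

var_dump(strip_controls_utf8($text) === "voil\u{00E0}!"); // bool(true)

// Without /u the class matches the raw byte 0xA0, which is also the second
// byte of "à" (0xC3 0xA0), so the string is left with orphaned lead bytes.
$bytewise = preg_replace('/[\x00-\x1F\x7F\xA0]/', '', $text);
var_dump($bytewise === "voil\xC3\xC2!"); // bool(true)

// Malformed UTF-8 makes the /u version return null rather than a string.
var_dump(strip_controls_utf8("bad\xFF")); // NULL
```

If the input might not be valid UTF-8, check for a null result (or call preg_last_error()) before using the output.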

Addendum: What about str_replace?

preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of chars you want to remove, and use str_replace as noted by mgutt below, e.g.

// build an array we can re-use across several operations
$badchar = array(
    // control characters
    chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
    chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
    chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
    chr(31),
    // non-printing characters
    chr(127)
);

// replace the unwanted chars
$str2 = str_replace($badchar, '', $str);
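If typing out every chr() call feels tedious, the same array can probably be built programmatically. This is just a sketch of one way to do it with range() and array_map(); the sample string is made up.

```php
<?php
// Build the same list of unwanted bytes programmatically: 0-31 plus 127.
$badchar = array_map('chr', array_merge(range(0, 31), [127]));

// Sample input containing CR, LF, a tab and DEL.
$str  = "line one\r\nline\ttwo\x7F";
$str2 = str_replace($badchar, '', $str);

var_dump(count($badchar));             // int(33)
var_dump($str2 === "line onelinetwo"); // bool(true)
```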

Intuitively, this seems like it would be fast, but that's not always the case, so you should definitely benchmark to see if it saves you anything. I ran some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:

     2 chars str_replace     5.3439ms preg_replace     2.9919ms preg_replace is 44.01% faster
     4 chars str_replace     6.0701ms preg_replace     1.4119ms preg_replace is 76.74% faster
     8 chars str_replace     5.8119ms preg_replace     2.0721ms preg_replace is 64.35% faster
    16 chars str_replace     6.0401ms preg_replace     2.1980ms preg_replace is 63.61% faster
    32 chars str_replace     6.0320ms preg_replace     2.6770ms preg_replace is 55.62% faster
    64 chars str_replace     7.4198ms preg_replace     4.4160ms preg_replace is 40.48% faster
   128 chars str_replace    12.7239ms preg_replace     7.5412ms preg_replace is 40.73% faster
   256 chars str_replace    19.8820ms preg_replace    17.1330ms preg_replace is 13.83% faster
   512 chars str_replace    34.3399ms preg_replace    34.0221ms preg_replace is  0.93% faster
  1024 chars str_replace    57.1141ms preg_replace    67.0300ms str_replace  is 14.79% faster
  2048 chars str_replace    94.7111ms preg_replace   123.3189ms str_replace  is 23.20% faster
  4096 chars str_replace   227.7029ms preg_replace   258.3771ms str_replace  is 11.87% faster
  8192 chars str_replace   506.3410ms preg_replace   555.6269ms str_replace  is  8.87% faster
 16384 chars str_replace  1116.8811ms preg_replace  1098.0589ms preg_replace is  1.69% faster
 32768 chars str_replace  2299.3128ms preg_replace  2222.8632ms preg_replace is  3.32% faster

The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8 KB range, str_replace had a marginal edge. At 16 KB and 32 KB, the two were within a few percent of each other.

I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
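If you want to run that kind of comparison yourself, something along these lines would be a starting point. To be clear, this is not the script that produced the figures above, just a minimal sketch: the time_ms helper, the chosen lengths and the use of random_bytes() as stand-in data are all assumptions for illustration.

```php
<?php
// Hypothetical timing helper: runs $fn $iterations times, returns elapsed milliseconds.
function time_ms(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (microtime(true) - $start) * 1000.0;
}

$badchar    = array_map('chr', array_merge(range(0, 31), [127]));
$iterations = 1000; // the figures above used 10000; lowered here to keep the run short

foreach ([16, 1024, 16384] as $length) {
    // Random bytes as stand-in data; substitute a sample of your real input.
    $data = random_bytes($length);

    $strMs = time_ms(function () use ($data, $badchar) {
        str_replace($badchar, '', $data);
    }, $iterations);

    $pregMs = time_ms(function () use ($data) {
        preg_replace('/[\x00-\x1F\x7F]/', '', $data);
    }, $iterations);

    printf("%6d chars  str_replace %10.4fms  preg_replace %10.4fms\n", $length, $strMs, $pregMs);
}
```

Random bytes contain far more control characters than typical text, so results on real input may look quite different.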

Paul Dixon answered Sep 23 '22 12:09