
RegEx for removing non ASCII characters from both ends

Tags: python, regex

I have to loop multiple times with this code. Is there a better way?

item = '!@#$abc-123-4;5.def)(*&^;\n'

or

'!@#$abc-123-4;5.def)(*&^;\n_'

or

'!@#$abc-123-4;5.def)_(*&^;\n_'

My attempt below did not work:

item = re.sub('^\W|\W$', '', item)

Expected output:

abc-123-4;5.def

The final goal is to remove anything not in [a-zA-Z0-9] from both ends while keeping every character in between. The first and last characters of the result should be in the class [a-zA-Z0-9].

Gang asked May 10 '19 00:05


2 Answers

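You can accomplish this by using the caret (^) at the beginning of a character class to negate its contents. [^a-zA-Z0-9] matches any character that isn't an ASCII letter or digit. Adding + matches a whole run of such characters, and anchoring one run with ^ and another with $ targets only the two ends, so replacing the matches with an empty string strips them in one pass: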

^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$
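For reference, here is a minimal Python sketch of how that pattern could be applied with re.sub to the three inputs from the question. The names EDGE_JUNK and strip_edges are made up for illustration. Note that the original attempt uses \W without a + quantifier, so each alternative matches only a single character, and \W does not match the underscore that appears in the second and third inputs.

```python
import re

# One or more non-[a-zA-Z0-9] characters anchored at the start,
# or one or more anchored at the end of the string.
EDGE_JUNK = re.compile(r'^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$')

def strip_edges(item):
    """Remove leading and trailing characters outside [a-zA-Z0-9]."""
    return EDGE_JUNK.sub('', item)

samples = [
    '!@#$abc-123-4;5.def)(*&^;\n',
    '!@#$abc-123-4;5.def)(*&^;\n_',
    '!@#$abc-123-4;5.def)_(*&^;\n_',
]
results = [strip_edges(s) for s in samples]
for r in results:
    print(r)  # abc-123-4;5.def for each sample
```

Because both ends are handled by a single substitution, no loop over the string is needed.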
CAustin answered Nov 11 '22 09:11


This expression is not anchored on the left side, and it might perform faster if all your inputs look like the example in your question. Note that it whitelists the characters allowed in the part you keep (lowercase letters, digits, ;, . and -):

([a-z0-9;.-]+)(.*)

Here, we're assuming that you only want to filter out the special chars on the left and right sides of your input strings.

You can add other chars and boundaries to the expression, and you can modify it into a simpler and faster expression if you wish.
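As one possible variation on the capturing idea (this is a swapped-in sketch, not the expression above), you could capture everything from the first to the last [a-zA-Z0-9] character instead of whitelisting the characters in between. The names CORE and extract_core are hypothetical, and the sketch assumes an empty string is an acceptable result when the input has no alphanumeric character at all.

```python
import re

# Match from the first [a-zA-Z0-9] character to the last one.
# re.DOTALL lets "." cross a newline if one sits between them.
CORE = re.compile(r'[a-zA-Z0-9](?:.*[a-zA-Z0-9])?', re.DOTALL)

def extract_core(item):
    """Return the span between the first and last alphanumeric chars."""
    m = CORE.search(item)
    # Assumption: return '' when nothing alphanumeric is present.
    return m.group(0) if m else ''

print(extract_core('!@#$abc-123-4;5.def)_(*&^;\n_'))  # abc-123-4;5.def
```

The greedy .* runs to the end of the string and backtracks to the last alphanumeric character, so nothing in the middle has to be listed explicitly.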

[Image: RegEx descriptive graph visualizing how the expression works]

If you wish to add a boundary on the right side, you can anchor the expression with $:

([a-z0-9;.-]+)(.*)$

You can even list your special chars on both the left and the right of the capturing group.

JavaScript Test

const regex = /([a-z0-9;.-]+)(.*)$/gm;
const str = `!@#\$abc-123-4;5.def)(*&^;\\n`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Performance Test

This JavaScript snippet shows the performance of that expression using a simple loop.

const repeat = 1000000;
const start = Date.now();
let match;

// Run the replacement exactly `repeat` times.
for (let i = 0; i < repeat; i++) {
	const string = '!@#\$abc-123-4;5.def)(*&^;\\n';
	const regex = /([!@#$)(*&^;]+)([a-z0-9;.-]+)(.*)$/gm;
	match = string.replace(regex, "$2");
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Python Test

import re

regex = r"([a-z0-9;.-]+)(.*)$"
test_str = "!@#$abc-123-4;5.def)(*&^;\\n"
print(re.findall(regex, test_str))

Output

[('abc-123-4;5.def', ')(*&^;\\n')]
Emma answered Nov 11 '22 09:11