
How to account for accent characters for regex in Python?

I currently use re.findall to find and isolate the words that follow the '#' character (hashtags) in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1) 

It searches str1 and finds all the hashtags. This works; however, it doesn't account for accented characters such as these: áéíóúñü¿.

If one of these letters is in str1, the hashtag is saved only up to the letter before it. So, for example, #yogenfrüz would be saved as #yogenfr.
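The truncation described above can be reproduced with a short sketch. The sample string is made up for illustration; only the pattern comes from the question. Note that the capture group excludes the '#', so the returned items are the bare words.

```python
import re

# Hypothetical sample text; only the pattern below is from the question.
str1 = "Trying #yogenfrüz and #python_3 today"

# The character class lists ASCII letters, digits and underscore only,
# so matching stops at the first character outside that set (here 'ü').
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
print(hashtags)  # ['yogenfr', 'python_3']
```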

I need to account for all accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

deadlock asked Sep 06 '13 17:09



1 Answer

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE) 
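As a minimal sketch of how this might be used, assuming Python 3 and made-up sample text: for str patterns in Python 3, \w already matches Unicode word characters, so re.UNICODE is redundant there but harmless; in Python 2 the flag matters and the input must be a unicode object rather than a byte string. The helper name find_hashtags and the NFC normalization step are additions for illustration, not part of the answer. Normalizing first is an assumption that may help when an accent arrives as a separate combining mark, which \w would otherwise not match.

```python
import re
import unicodedata

def find_hashtags(text):
    """Return the words that follow '#' in text (hypothetical helper)."""
    # Assumption: compose base letter + combining accent into one code point,
    # so that e.g. 'u' + U+0308 becomes 'ü' before matching.
    text = unicodedata.normalize('NFC', text)
    # \w matches Unicode letters, digits and underscore.
    return re.findall(r'#(\w+)', text, re.UNICODE)

sample = "Probando #yogenfrüz, #café y #niño_2013"
print(find_hashtags(sample))  # ['yogenfrüz', 'café', 'niño_2013']
```

Keep in mind that \w is broader than the four languages mentioned in the question: it accepts letters and digits from any script.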

EDIT: Check the useful comment below from Martijn Pieters.

Ibrahim Najjar answered Oct 13 '22 01:10