Is There a Way to Match Any Unicode Alphabetic Character?

Tags: regex, unicode, perl, character-properties

I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up with lots of random Unicode punctuation where the converter messed up (e.g. ellipses). They also, correctly, contain a bunch of non-English but still alphabetic characters, like é, Russian characters, and so on.

Is there any way to make a regex that will match any Unicode alphabetic character (from the alphabet of any language)? Or one that will match only non-alphabetic characters? Either one would be really helpful and awesome. I'm using Perl, if that changes anything. Thanks!


asked May 14 '11 23:05

Eli

2 Answers

Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably

\p{L}

which will match any letter or ideograph. You may also want to include letters followed by combining marks (for example, an é stored as e plus a combining acute accent), so you could do

\p{L}\p{M}*

In any case, all the different types of character properties are detailed in the first link.
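Since the question mentions Perl, here is a minimal sketch of how these two patterns might be used there. The sample string and the variable names are made up for illustration; the escapes stand for é, an ellipsis, a Cyrillic word, a bullet, and an i followed by a combining diaeresis.

```perl
use strict;
use warnings;

# Hypothetical OCR-like sample, written with escapes so the file encoding
# does not matter: "cafe" with e-acute, an ellipsis, a Cyrillic word,
# a bullet, and "naive" with a decomposed i + combining diaeresis.
my $text = "caf\x{e9} \x{2026} \x{41f}\x{440}\x{438}\x{432}\x{435}\x{442} \x{2022} nai\x{308}ve";

# \p{L} matches one letter; \p{M}* also takes any combining marks after it.
my @words = $text =~ /((?:\p{L}\p{M}*)+)/g;

# \P{L} (capital P) is the negation: any single character that is not a letter.
my $non_letters = () = $text =~ /\P{L}/g;

# Keep letters, marks and whitespace; strip everything else.
(my $clean = $text) =~ s/[^\p{L}\p{M}\s]+//g;

binmode(STDOUT, ':encoding(UTF-8)');
print join('|', @words), "\n";
print "non-letter characters: $non_letters\n";
print "$clean\n";
```

Note that the combining diaeresis counts as a non-letter for `\P{L}`, which is why the cleanup pattern lists `\p{M}` explicitly.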

Edit: You may also want to look at the Stack Overflow question "Does \w match all alphanumeric characters defined in the Unicode standard?", which discusses whether \w matches Unicode characters. The answers there suggest that you could also use \p{Word} or \p{Alnum}.
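To see how these classes differ from \p{L}, a small Perl sketch along these lines may help. The helper name `classes_matching` and the test characters are hypothetical choices for illustration, not anything from the linked question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the names of the classes (sorted, comma
# separated) that match a single character.
sub classes_matching {
    my ($char) = @_;
    my %class = (
        'L'     => qr/\A\p{L}\z/,
        'Alnum' => qr/\A\p{Alnum}\z/,
        'Word'  => qr/\A\p{Word}\z/,
        'w'     => qr/\A\w\z/,
    );
    return join ',', grep { $char =~ $class{$_} } sort keys %class;
}

# A Cyrillic letter, a digit, an underscore, and an ellipsis.
for my $char ("\x{436}", "7", "_", "\x{2026}") {
    printf "U+%04X: %s\n", ord($char), classes_matching($char) || '(none)';
}
```

In this sketch the digit and the underscore match \w and \p{Word} but not \p{L}, so \p{L} is the stricter choice when only letters are wanted.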

answered Oct 25 '22 11:10

mpdaugherty

Depending on which language you're using, the regular expression engine may or may not be Unicode aware. If it is, it may or may not know the \p{} property tokens. If it does, your answer is in Unicode Characters and Properties in Jan Goyvaerts' regex tutorial.

You can use \p{Latin}, if supported, to detect everything that is (or isn't, of course) written in the Latin script, which covers the characters in the various Unicode Latin blocks.
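As a rough illustration in Perl, which does support script properties, something like the following separates Latin-script runs from Cyrillic ones. The sample string and variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sample: "cafe" with e-acute, a Cyrillic word, and "naive"
# with i-diaeresis, written with escapes to avoid source-encoding issues.
my $text = "caf\x{e9} \x{41f}\x{440}\x{438}\x{432}\x{435}\x{442} na\x{ef}ve";

# Runs of characters belonging to each script.
my @latin    = $text =~ /(\p{Latin}+)/g;
my @cyrillic = $text =~ /(\p{Cyrillic}+)/g;

# \P{Latin} is the negation; combine it with \p{L} to find letters
# that come from any other script.
my @other_letters = $text =~ /((?:(?!\p{Latin})\p{L})+)/g;

printf "Latin runs: %d, Cyrillic runs: %d, non-Latin letter runs: %d\n",
    scalar @latin, scalar @cyrillic, scalar @other_letters;
```

Accented characters such as é still count as Latin here, so this only helps when the goal is to single out one script rather than letters in general.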

answered Oct 25 '22 11:10

Mike 'Pomax' Kamermans
