 

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).

Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.

And I want to put all of those separated components into a list.

Some examples would probably make this clear:

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only string:

>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

In this case I can turn the string into a list of Chinese characters. But the list displays Unicode escape sequences rather than the characters:

[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

How do I get it to display the actual characters instead of the Unicode escapes? Something like:

['我', '爱', '蟒', '蛇']

Case 3: A mix of English & Chinese:

I want to turn an input string such as

"我爱Python"

into a list like this:

['我', '爱', 'Python']

Is it possible to do something like that?

asked Sep 27 '10 at 06:09 by Continuation



2 Answers

I thought I'd show the regex approach, too. It doesn't feel right to me, mostly because the language-specific i18n oddities I've seen make me worried that a regular expression might not be flexible enough for all of them, but you may well not need any of that. (In other words: overdesign.)

# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []

    # Match a whole word:
    regex += [ur'\w+']

    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']

    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']

    regex = "|".join(regex)
    r = re.compile(regex)

    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")

In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
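
A minimal sketch of that compile-once idea, ported to Python 3 as an assumption (the code above is Python 2: `ur''` literals and the `print` statement). In Python 3, `\w` matches CJK characters as well, so this version tries the CJK branch first and excludes that range from the word branch. The name `_TOKEN_RE` is made up for illustration, and the character range is the same rough one used above, not a complete definition of CJK.

```python
import re

# Compiled once at import time instead of on every call.
# Alternatives are tried left to right at each position.
_TOKEN_RE = re.compile(
    r"[\u4e00-\ufaff]"        # a single character in the (rough) CJK range
    r"|[^\W\u4e00-\ufaff]+"   # a run of word characters outside that range
    r"|[^\s]"                 # any other single non-space character
)


def group_words(s):
    """Return words, single CJK characters, and punctuation as a list."""
    return _TOKEN_RE.findall(s)


mixed = group_words("Testing English text我爱蟒蛇")
# mixed == ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
```

Because the word branch excludes the CJK range, a word that runs directly into Chinese text (such as "text我") still ends at the boundary.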

answered Oct 07 '22 at 09:10 by Glenn Maynard


In Python 3, this also splits out numbers, if you need that.

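import re

def spliteKeyWord(text):
    # One CJK character, or a run of digits, or an ASCII word
    # (optionally followed by apostrophes and lowercase letters, e.g. "don't").
    regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
    matches = re.findall(regex, text, re.UNICODE)
    return matches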

print(spliteKeyWord("Testing English text我爱Python123"))

=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
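
On the display part of the question: as far as I can tell, the `\uXXXX` escapes come from how Python 2 shows a list (it uses `repr()` on each element), not from the strings themselves. A small sketch, assuming Python 3, where `repr()` keeps printable non-ASCII characters as they are and `ascii()` gives the escaped form back:

```python
chars = list("我爱蟒蛇")

# A list is displayed using repr() of each element. In Python 3 that
# keeps printable non-ASCII characters, so the characters show directly.
shown = repr(chars)
# shown == "['我', '爱', '蟒', '蛇']"

# ascii() escapes non-ASCII characters, which looks like the
# Python 2 output in the question (minus the u prefix).
escaped = ascii(chars)
print(escaped)
```

In Python 2, printing the elements one at a time (or a joined string) rather than the list itself should show the characters, provided the terminal encoding supports them.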

answered Oct 07 '22 at 08:10 by Winter Lin