Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).

Given such a string, I want to separate the English/French/etc part into words using whitespace as separator, and to separate the Chinese/Japanese/Korean part into individual characters.

I then want to put all of those separated components into a single list.

Some examples would probably make this clear:

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only string:

>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

In this case I can turn the string into a list of Chinese characters, but the list is displayed as Unicode escape sequences:

[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

How do I get it to display the actual characters instead of the escapes? Something like:

['我', '爱', '蟒', '蛇']
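
For context, the escapes appear because Python 2 displays a list using the repr() of each element, and the repr() of a unicode string escapes non-ASCII characters; the strings themselves are intact. A minimal sketch, assuming Python 3 (where repr() keeps printable characters) and a terminal that can display UTF-8:

```python
chars = list("我爱蟒蛇")

# Python 3: repr() of a str keeps printable non-ASCII characters,
# so the list displays the characters themselves.
print(chars)             # ['我', '爱', '蟒', '蛇']

# Printing the joined elements (rather than the list) bypasses repr().
print(", ".join(chars))  # 我, 爱, 蟒, 蛇
```

Under Python 2, printing the joined string (or each element individually) rather than the list should likewise show the characters.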

Case 3: A mix of English & Chinese:

I want to turn an input string such as

"我爱Python"

into a list like this:

['我', '爱', 'Python']

Is it possible to do something like that?

asked Sep 27 '10 06:09

Continuation

2 Answers

I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worry that a regular expression might not be flexible enough for all of them. You may well not need any of that, though. (In other words: overdesign.)

# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []

    # Match a whole word:
    regex += [ur'\w+']

    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']

    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']

    regex = "|".join(regex)
    r = re.compile(regex)

    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")

In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
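
A sketch of that compile-once idea, ported to Python 3 (the code above uses Python 2 syntax such as ur'' literals and the print statement). The port is an assumption, not the answerer's code: in Python 3, \w matches Unicode word characters, CJK ideographs included, so here the CJK alternative is tried first and that range is excluded from the word pattern. The name _TOKEN_RE is illustrative.

```python
import re

# Compiled once at import time instead of on every call.
_TOKEN_RE = re.compile(
    r"[\u4e00-\ufaff]"         # a single character in the CJK range used above
    r"|[^\W\u4e00-\ufaff]+"    # a run of word characters outside that range
    r"|\S"                     # any other single non-space character
)

def group_words(s):
    """Return words, single CJK characters, and stray symbols as a list."""
    return _TOKEN_RE.findall(s)

if __name__ == "__main__":
    print(group_words("Testing English text"))
    print(group_words("我爱蟒蛇"))
    print(group_words("Testing English text我爱蟒蛇"))
```

One behavioral difference from the Python 2 version: because \w is Unicode-aware here, accented letters (as in French) stay inside their word instead of being split off as separate tokens.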

answered Oct 07 '22 09:10

Glenn Maynard

In Python 3, this version also splits out numbers, if you need that.

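import re

def split_keywords(text):
    # One CJK character, a run of digits, or an ASCII word
    # (optionally containing an apostrophe, e.g. "don't").
    regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
    # re.UNICODE is already the default for str patterns in Python 3.
    return re.findall(regex, text, re.UNICODE)

print(split_keywords("Testing English text我爱Python123"))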

=> ['Testing', 'English', 'text', '我', '爱', 'Python', '123']
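
Note that the range [\u4e00-\ufaff] used in both answers covers CJK ideographs and Hangul syllables but not Japanese kana (U+3040 to U+30FF). A minimal sketch of widening it, assuming per-character splitting is also what you want for kana; the names CJK_CHAR, TOKEN_RE, and split_mixed are illustrative, not from the answers:

```python
import re

# Assumed ranges (an extension, not part of the original answers):
#   U+3040-U+30FF  Hiragana and Katakana
#   U+4E00-U+FAFF  the range used in the answers above
CJK_CHAR = r"[\u3040-\u30ff\u4e00-\ufaff]"

# Same alternatives as the answer above, with the wider character class.
TOKEN_RE = re.compile(CJK_CHAR + r"|[0-9]+|[a-zA-Z]+'*[a-z]*")

def split_mixed(text):
    """Split into single CJK/kana characters, digit runs, and ASCII words."""
    return TOKEN_RE.findall(text)

print(split_mixed("I love 寿司 and すし and 김치"))
# ['I', 'love', '寿', '司', 'and', 'す', 'し', 'and', '김', '치']
```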

answered Oct 07 '22 08:10

Winter Lin

Tags: python, string, unicode, multilingual, cjk
