How to split Unicode strings character by character in Python?

My website supports a number of Indian languages, and the user can change the language dynamically. When a user enters a string, I have to split it into its individual characters, so I'm looking for a common function that works for English and a select set of Indian languages. I have searched across sites, but there appears to be no common way to handle this requirement. There are language-specific implementations (for example, the Open-Tamil package implements get_letters for Tamil), but I could not find a common way to split or iterate through the characters of a Unicode string that takes graphemes into account.

One of the many methods that I've tried:

name = u'தமிழ்'
print name
for i in list(name):
  print i

#expected output
தமிழ்
த
மி
ழ்

#actual output
தமிழ்
த
ம
ி
ழ
்

# Here is another example, using another Indian language
name = u'हिंदी'
print name
for i in list(name):
  print i

#expected output
हिंदी
हिं
दी

#actual output
हिंदी
ह
ि  
ं 
द
ी

user1928896 asked Oct 11 '15 18:10



3 Answers

One way to solve this is to group each "L" (letter) category character with the "M" (mark) category characters that follow it:

>>> import regex  # third-party module: pip install regex
>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
...   print c
... 
த
மி
ழ்

This uses the third-party regex module (pip install regex), not the standard library's re.
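
The same grouping idea can be sketched in Python 3 with only the standard library's unicodedata module. This is not the code from the answer above: split_base_and_marks is a hypothetical helper name, and unlike the \p{L}\p{M}* pattern it keeps non-letter characters (digits, spaces, punctuation) as clusters of their own instead of skipping them.

```python
import unicodedata


def split_base_and_marks(text):
    """Split text into clusters: one base character plus any marks that follow it.

    Hypothetical helper for illustration; it approximates the L + M* grouping
    and is not a full grapheme cluster implementation.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me all start with 'M' (combining marks).
        is_mark = unicodedata.category(ch).startswith('M')
        if is_mark and clusters:
            # Attach the mark to the cluster opened by the previous base character.
            clusters[-1] += ch
        else:
            # Any other character (or a stray leading mark) starts a new cluster.
            clusters.append(ch)
    return clusters


for name in ['தமிழ்', 'हिंदी']:
    print(split_base_and_marks(name))
```

For the two strings in the question this yields three clusters for the Tamil word and two for the Hindi word. Because a virama is itself a mark, a conjunct made of consonant + virama + consonant is still split into two clusters by this approach.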

Ignacio Vazquez-Abrams answered Nov 15 '22 14:11


To get "user-perceived" characters whatever the language, use the \X (eXtended grapheme cluster) regular expression:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex # $ pip install regex

for text in [u'தமிழ்', u'हिंदी']:
    print("\n".join(regex.findall(r'\X', text, regex.U)))

Output

த
மி
ழ்
हिं
दी

jfs answered Nov 15 '22 15:11


uniseg works really well for this, and the docs are OK. The other answer to this question works for international Unicode characters, but falls flat if users enter Emoji. The solution below will work:

>>> emoji = u'😀😃😄😁'
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for c in list(grapheme_clusters(emoji)):
...     print c
...
😀
😃
😄
😁

This uses uniseg 0.7.1, installed with pip install uniseg==0.7.1.
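
One reason a letter-plus-marks pattern such as \p{L}\p{M}* skips emoji is that emoji are not in the "L" category at all. A minimal sketch (Python 3, standard library only; general_categories is a hypothetical helper name) that checks the general category of each code point:

```python
import unicodedata


def general_categories(text):
    """Return (character, general category) pairs, one per code point."""
    return [(ch, unicodedata.category(ch)) for ch in text]


emoji = '😀😃😄😁'  # the four faces from the example above
for ch, cat in general_categories(emoji):
    # Each of these is reported as 'So' (Symbol, other), not a letter category.
    print(ch, cat)
```

Since none of these characters is a letter, a pattern that requires a leading \p{L} never matches them, which is why a grapheme-cluster-aware approach is needed for such input.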

Aidan Fitzpatrick answered Nov 15 '22 15:11