Returning the first N characters of a unicode string

Tags:

python

unicode

python-2.x

I have a string in unicode and I need to return the first N characters. I am doing this:

result = unistring[:5]

but of course the length of a unicode string != the number of characters. Any ideas? Is the only solution to use re?

Edit: More info

unistring = "Μεταλλικα" #Metallica written in Greek letters
result = unistring[:1]

returns-> ?

I think that unicode strings use two bytes per character, which is why this happens. If I do:

result = unistring[:2]

I get

M

which is correct. So should I always slice*2, or should I convert to something?

asked Jan 28 '10 10:01

Jon Romero

2 Answers

Unfortunately, for historical reasons, Python versions before 3.0 have two string types: byte strings (str) and Unicode strings (unicode).

Before the unification in Python 3.0, there are two ways to declare a string literal: unistring = "Μεταλλικα", which is a byte string, and unistring = u"Μεταλλικα", which is a unicode string.

The reason you see ? when you do result = unistring[:1] is that the byte string holds the encoded bytes of your text rather than its characters. In UTF-8 each Greek letter takes two bytes, so a one-byte slice is only half a character and cannot be displayed correctly. You have probably seen this kind of problem if you ever used a really old email client and received emails from friends in countries like Greece.

So in Python 2.x if you need to handle Unicode you have to do it explicitly. Take a look at this introduction to dealing with Unicode in Python: Unicode HOWTO
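To make the byte/character difference concrete, here is a minimal sketch. It assumes the text is encoded as UTF-8, and it builds the byte string with .encode() so the same code runs on Python 2.6+ and Python 3.3+; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u"Μεταλλικα"          # unicode string: one item per character
raw = text.encode("utf-8")    # byte string: what a plain "..." literal holds
                              # in Python 2 (assuming a UTF-8 source file)

char_count = len(text)        # number of characters
byte_count = len(raw)         # number of bytes; two per Greek letter in UTF-8

# One byte is only half of the first letter, so decoding it leniently
# produces the U+FFFD replacement character instead of a real letter.
half = raw[:1].decode("utf-8", "replace")

# Two bytes happen to cover exactly the first letter.
whole = raw[:2].decode("utf-8")
```

Doubling the slice length only works here because every letter in this word takes two bytes. UTF-8 characters take between one and four bytes, so slice*2 breaks as soon as the text contains, say, an ASCII character.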

answered Oct 19 '22 20:10

Tendayi Mawushe

When you say:

unistring = "Μεταλλικα" #Metallica written in Greek letters

You do not have a unicode string. You have a bytestring in (presumably) UTF-8. That is not the same thing. A unicode string is a separate datatype in Python. You get unicode by decoding bytestrings using the right encoding:

unistring = "Μεταλλικα".decode('utf-8')

or by using the unicode literal in a source file with the right encoding declaration:

# coding: UTF-8
unistring = u"Μεταλλικα"

The unicode string will do what you want when you do unistring[:5].
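Putting the two steps together (decode first, then slice) could look like the sketch below. first_n_chars is a hypothetical helper, not something from the standard library, and the utf-8 default is an assumption about how the incoming bytes were encoded. It is written to run on Python 2.6+ as well as Python 3.3+.

```python
# -*- coding: utf-8 -*-

def first_n_chars(value, n, encoding="utf-8"):
    """Return the first n characters of value.

    Hypothetical helper: `encoding` is an assumption about how any
    byte string passed in was encoded.
    """
    if isinstance(value, bytes):   # `bytes` is an alias of `str` on Python 2
        value = value.decode(encoding)
    return value[:n]


raw = u"Μεταλλικα".encode("utf-8")   # stands in for a Python 2 "..." literal
result = first_n_chars(raw, 5)       # slices characters, not bytes
```

Decoding at the point where bytes enter the program, and working with unicode strings from then on, avoids having to think about byte widths when slicing.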

answered Oct 19 '22 21:10

Thomas Wouters
