str.isdigit() behaviour when handling strings

Tags:

python

python-3.x

Assuming the following:

>>> square = '²'  # Superscript Two (Unicode U+00B2)
>>> cube = '³'    # Superscript Three (Unicode U+00B3)

Curiously:

>>> square.isdigit()
True
>>> cube.isdigit()
True

OK, let's convert those "digits" to integer:

>>> int(square)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'

Oops!

Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

707

asked Sep 22 '21 00:09

Lacobus

1 Answer

str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: whether each character is a decimal character or a digit of some sort. From the documentation:

str.isdigit()

Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
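To make the Numeric_Type distinction concrete, here is a small sketch that compares the three str predicates with the lookups in the standard unicodedata module. The helper name describe and the choice of sample characters are mine, purely for illustration.

```python
import unicodedata


def describe(ch):
    """Collect the str predicates and Unicode numeric lookups for one character."""
    return {
        "isdecimal": ch.isdecimal(),  # Numeric_Type=Decimal only
        "isdigit": ch.isdigit(),      # Numeric_Type=Decimal or Digit
        "isnumeric": ch.isnumeric(),  # Decimal, Digit or Numeric
        # The second argument is the default returned when the
        # character lacks that property (otherwise ValueError is raised).
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


# ASCII seven, superscript two, vulgar fraction one half, Arabic-Indic three
for ch in ("7", "²", "½", "٣"):
    print(repr(ch), describe(ch))
```

Superscript two has a digit value of 2 but no decimal value, which is why isdigit() says True while isdecimal() says False. Of the three predicates, isdecimal() is the one that tracks the characters int() accepts as digits, although it still knows nothing about signs, whitespace or underscores.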

In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct way to check whether a given string is a legal integer is to call int on it and catch the ValueError if it isn't. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
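A minimal sketch of that approach follows. The function name parse_int and the decision to return None on failure are illustrative choices, not anything the standard library prescribes; it assumes the argument is a str.

```python
def parse_int(text):
    """Return int(text), or None when int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


print(parse_int("42"))    # 42
print(parse_int("²"))     # None, even though "²".isdigit() is True
print(parse_int(" -7 "))  # -7, even though " -7 ".isdigit() is False
print(parse_int("٣"))     # 3, int() accepts any Unicode decimal digit
```

Note that isdigit() is wrong in both directions here: it accepts '²', which int() rejects, and it rejects ' -7 ', which int() parses happily.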

Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding. (Under the hood, CPython stores it encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible at the Python layer outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes.) The characters you're talking about are simple Unicode characters above the ASCII range; they're not specifically UTF-8.
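As a rough illustration of the difference between the text and its encodings, the sketch below encodes the same one-character str three ways. The helper name encodings_of is hypothetical, and the three codecs are just examples I picked.

```python
def encodings_of(text):
    """Encode the same str with a few codecs to show how the bytes differ."""
    return {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}


square = "²"  # a single code point, U+00B2

print(len(square))       # 1
print(hex(ord(square)))  # 0xb2
print(encodings_of(square))
# {'utf-8': b'\xc2\xb2', 'latin-1': b'\xb2', 'utf-16-le': b'\xb2\x00'}
```

The str is one character no matter what; the bytes only come into existence once you pick an encoding, and UTF-8 is just one of several that can represent it.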

106

answered Oct 11 '22 08:10

ShadowRanger
