
str.isdigit() behaviour when handling strings

Assuming the following:

>>> square = '²'      # Superscript Two (Unicode U+00B2)
>>> cube  = '³'       # Superscript Three (Unicode U+00B3)

Curiously:

>>> square.isdigit()
True
>>> cube.isdigit()
True

OK, let's convert those "digits" to integers:

>>> int(square)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'
>>> int(cube)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '³'

Oooops!

Could someone please explain what behavior I should expect from the str.isdigit() method when handling strings?

Asked Sep 22 '21 at 00:09 by Lacobus



1 Answer

str.isdigit doesn't claim to be related to parsability as an int. It reports a simple Unicode property: whether every character is a decimal character or a digit of some sort. From the documentation:

str.isdigit()

Return True if all characters in the string are digits and there is at least one character, False otherwise. Digits include decimal characters and digits that need special handling, such as the compatibility superscript digits. This covers digits which cannot be used to form numbers in base 10, like the Kharosthi numbers. Formally, a digit is a character that has the property value Numeric_Type=Digit or Numeric_Type=Decimal.
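To make the distinction concrete, here is a minimal sketch comparing str.isdecimal, str.isdigit, and str.isnumeric on a few characters. The helper name classify and the sample characters are illustrative choices, not something from the question.

```python
import unicodedata

def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a one-character string."""
    return ch.isdecimal(), ch.isdigit(), ch.isnumeric()

# Illustrative samples: an ASCII digit, a superscript digit,
# a vulgar fraction, and an Arabic-Indic digit.
samples = ['7', '²', '½', '٣']

for ch in samples:
    # unicodedata.decimal() returns the default (None here) when the
    # character has no decimal value, instead of raising ValueError.
    print(repr(ch), classify(ch), unicodedata.decimal(ch, None))
```

Only characters for which isdecimal() is true are accepted by int(); '²' passes isdigit() but fails isdecimal(), which matches the ValueError in the question.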

In short, str.isdigit is thoroughly useless for detecting valid numbers. The correct solution to checking if a given string is a legal integer is to call int on it, and catch the ValueError if it's not a legal integer. Anything else you do will be (badly) reinventing the same tests the actual parsing code in int() performs, so why not let it do the work in the first place?
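A minimal sketch of that approach follows. The function name parse_int and the choice to return None on failure are assumptions made for illustration; raising or returning a default would work just as well.

```python
def parse_int(text):
    """Return int(text), or None if text is not a valid base-10 integer.

    parse_int is a hypothetical helper; it simply delegates to int().
    """
    try:
        return int(text)
    except ValueError:
        # int() rejected the string, so it is not a legal integer literal.
        return None

print(parse_int('42'))    # 42
print(parse_int('²'))     # None
print(parse_int(' -7 '))  # -7
```

Note that ' -7 ' parses fine even though ' -7 '.isdigit() is False, so isdigit is wrong in both directions: it accepts strings int() rejects and rejects strings int() accepts.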

Side-note: You're using the term "utf-8" incorrectly. UTF-8 is a specific way of encoding Unicode, and only applies to raw binary data. Python's str is an "idealized" Unicode text type; it has no encoding at the Python layer. Under the hood, it's stored encoded as one of ASCII, latin-1, UCS-2, UCS-4, and possibly also UTF-8, but none of that is visible outside of indirect measurements like sys.getsizeof, which only hints at the underlying encoding by letting you see how much memory the string consumes. The characters you're talking about are simple Unicode characters above the ASCII range; they're not specifically UTF-8.
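A short sketch of the difference between the text and its encodings, reusing the square character from the question. The variable names utf8 and latin1 are just for illustration.

```python
square = '²'  # one Unicode code point, U+00B2

# The str itself is a sequence of code points, not bytes.
print(len(square), hex(ord(square)))  # 1 0xb2

# An encoding only appears once the text is turned into bytes.
utf8 = square.encode('utf-8')      # b'\xc2\xb2' (two bytes)
latin1 = square.encode('latin-1')  # b'\xb2' (one byte)
print(utf8, latin1)

# Both byte sequences decode back to the same str.
print(utf8.decode('utf-8') == latin1.decode('latin-1') == square)  # True
```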

Answered Oct 11 '22 at 08:10 by ShadowRanger