I was playing with the Python interpreter (Python 3.2.3) and tried the following:
>>> dir(1)
This gave me all the attributes and methods of the int object. Next I tried:
>>> 1.__class__
However this threw an exception:
File "<stdin>", line 1
1.__class__
^
SyntaxError: invalid syntax
When I tried out the same with a float I got what I expected:
>>> 2.0.__class__
<class 'float'>
Why do int and float literals behave differently?
It's probably a consequence of the parsing algorithm used. A simple mental model is that the tokenizer attempts to match all the token patterns there are, and recognizes the longest match it finds. At a lower level, the tokenizer works character by character, and makes each decision based only on the current state and the input character – there shouldn't be any backtracking or re-reading of input.
After joining patterns with common prefixes – in this case, the pattern for int literals and the integral part of the pattern of float literals – what happens in the tokenizer is that it:
1. Reads the 1, and enters the state that indicates "reading either a float or an int literal".
2. Reads the ., and enters the state "reading a float literal".
3. Reads the _, which cannot be part of a float literal. The tokenizer emits 1. as a float literal token.
4. Starts over at the _, and eventually emits __class__ as an identifier token.

Aside: This tokenizing approach is also the reason why common languages have the syntax restrictions they have. E.g. identifiers contain letters, digits, and underscores, but cannot start with a digit. If that were allowed, 123abc could be intended as either an identifier, or the integer 123 followed by the identifier abc. A lex-like tokenizer would recognize this as the former since it leads to the longest single token, but nobody likes having to keep details like this in their head when trying to read code. Or when trying to write and debug the tokenizer, for that matter.
The parser then tries to process the token stream:
<FloatLiteral: '1.'> <Identifier: '__class__'>
In Python, a literal directly followed by an identifier – without an operator between the tokens – makes no sense, so the parser bails. This also means that the reason why Python would complain about 123abc being invalid syntax isn't the tokenizer error "the character a isn't valid in an integer literal", but the parser error "the identifier abc cannot directly follow the integer literal 123".
The reason why the tokenizer can't recognize the 1 as an int literal is that the character that makes it leave the float-or-int state determines what it just read. If it's ., it was the start of a float literal, which might continue afterwards. If it's something else, it was a complete int literal token.
It's not possible for the tokenizer to "go back" and re-read the previous input as something else. In fact, the tokenizer is at too low a level to care about what an "attribute access" is and handle such ambiguities.
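To make the state machine described above concrete, here is a minimal sketch of a single-pass tokenizer for a tiny subset of Python. It is not CPython's actual tokenizer: the function name toy_tokenize and the token names are made up for illustration, and exponents, complex suffixes and multi-character operators are deliberately left out.

```python
def toy_tokenize(source):
    """Single-pass tokenizer sketch: one character at a time, no backtracking."""
    tokens = []
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens, it is never emitted
        elif ch.isdigit():
            start = i
            kind = "IntLiteral"  # state: "reading either a float or an int"
            while i < n and source[i].isdigit():
                i += 1
            if i < n and source[i] == ".":
                kind = "FloatLiteral"  # the '.' commits us to a float
                i += 1
                while i < n and source[i].isdigit():
                    i += 1
            tokens.append((kind, source[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("Identifier", source[start:i]))
        else:
            tokens.append(("Operator", ch))  # any other single character
            i += 1
    return tokens


for text in ["1.__class__", "2.0.__class__"]:
    print(text, "->", toy_tokenize(text))
```

Note how the first . after the digits is swallowed into the number without ever asking what comes next; by the time the _ shows up, the float token is already decided.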
Now, your second example is valid because the tokenizer knows a float literal can only have one . in it. More precisely: the first . makes it transition from the float-or-int state to the float state. In this state, it only expects digits (or an E for scientific/engineering notation, a j for complex numbers…) to continue the float literal. The first character that's not a digit etc. (i.e. the second .) is definitely no longer part of the float literal and the tokenizer can emit the finished token. The token stream for your second example will thus be:
<FloatLiteral: '2.0'> <Operator: '.'> <Identifier: '__class__'>
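If you want to see what the real tokenizer produces for the float example, the standard library's tokenize module can show it. This is just a small sketch; the helper name significant_tokens is my own, and it filters out the NEWLINE and ENDMARKER bookkeeping tokens so that only the interesting ones remain. The module reports NUMBER, OP and NAME rather than the illustrative token names used in this answer.

```python
import io
import tokenize


def significant_tokens(source):
    """Return (token type name, text) pairs, without bookkeeping tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in skip
    ]


print(significant_tokens("2.0.__class__"))
```

The number, the attribute-access dot and the identifier come out as three separate tokens, which is the stream the parser is happy with.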
Which, of course, the parser then recognizes as valid Python. We now also know enough to see why the suggested workarounds help. In Python, separating tokens with whitespace is optional – unlike, say, in Lisp. Conversely, whitespace does separate tokens. (That is, no tokens except string literals may contain whitespace; it's merely skipped between tokens.) So the code:
1 .__class__
is always tokenized as
<IntLiteral: '1'> <Operator: '.'> <Identifier: '__class__'>
And since a closing parenthesis cannot appear in an int literal, this:
(1).__class__
gets read as this:
<Operator: '('> <IntLiteral: '1'> <Operator: ')'> <Operator: '.'> <Identifier: '__class__'>
The above implies that, amusingly, the following is also valid:
1..__class__ # => <type 'float'>
The fractional digits of a float literal are optional, and the second . that is read makes the tokenizer recognize the preceding 1. as a complete float literal.
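The spellings discussed so far can be checked side by side by compiling each one from a string. This is only a quick sketch: the helper name class_via is made up, and the failing case is reported as a plain string instead of a traceback.

```python
def class_via(expression):
    """Evaluate a source string; return the result, or 'SyntaxError' on failure."""
    try:
        return eval(expression)
    except SyntaxError:
        return "SyntaxError"


spellings = [
    "1.__class__",    # the '.' is swallowed by the number: rejected
    "1 .__class__",   # whitespace ends the int literal first
    "(1).__class__",  # ')' cannot be part of an int literal
    "1..__class__",   # '1.' is a float, the second '.' is attribute access
]
for expression in spellings:
    print(expression, "->", class_via(expression))
```

Only the first spelling is rejected; the next two give int and the last one gives float.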
It is a tokenization issue... the . is parsed as the beginning of the fractional part of a floating point number.
You can use (1).__class__ to avoid the problem.
Because if there's a . after a number, Python thinks you're creating a float. When it then encounters something that isn't a digit, it will throw an error.
However, in a float, Python doesn't expect another . to be a part of the value, hence the result! It works. :)
How do we get the attributes, then?
You can easily wrap it in parentheses. For example, see this console session:
>>> (1).__class__
<type 'int'>
Now, Python knows that you're not trying to make a float, but to refer to the int itself.
Bonus: putting a blank space after the number works as well.
>>> 1 .__class__
<type 'int'>
Also, if you only want to get the __class__, type(1) will do it for you.
Hope this helps!
Or you can even do this:
>>> getattr(1, '__class__')
<type 'int'>
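For completeness, a short sketch showing that these alternatives all return the same class object. The variable names are just for illustration. Binding the number to a name first also sidesteps the problem, since a name followed by . is never mistaken for a number.

```python
value = 1  # a name, not a literal, so value.__class__ tokenizes fine

via_parentheses = (1).__class__
via_name = value.__class__
via_type = type(value)
via_getattr = getattr(value, "__class__")

# All four spellings hand back the very same class object.
print(via_parentheses is via_name is via_type is via_getattr is int)
```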