
Getting the attributes of integers and floats [duplicate]

Tags:

python

I was playing with the Python interpreter (Python 3.2.3) and tried the following:

>>> dir(1)

This gave me all the attributes and methods of the int object. Next I tried:

>>> 1.__class__

However this threw an exception:

File "<stdin>", line 1
1.__class__
          ^
SyntaxError: invalid syntax

When I tried out the same with a float I got what I expected:

>>> 2.0.__class__
<class 'float'> 

Why do int and float literals behave differently?

ajay asked Nov 28 '13 15:11


4 Answers

It's probably a consequence of the parsing algorithm used. A simple mental model is that the tokenizer attempts to match all the token patterns there are, and recognizes the longest match it finds. At a lower level, the tokenizer works character by character, and makes a decision based only on the current state and input character; there shouldn't be any backtracking or re-reading of input.

After joining patterns with common prefixes – in this case, the pattern for int literals and the integral part of the pattern of float literals – what happens in the tokenizer is that it:

  1. Reads the 1, and enters the state that indicates "reading either a float or an int literal"
  2. Reads the ., and enters the state "reading a float literal"
  3. Reads the _, which cannot be part of a float literal. The tokenizer emits 1. as a float literal token.
  4. Carries on tokenizing starting with the _, and eventually emits __class__ as an identifier token.
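
The four steps above can be sketched as a toy single-pass tokenizer. This is only an illustration of the idea, not CPython's actual tokenizer: the function name tokenize_sketch and the token kind names are made up for this example, and exponents, complex suffixes and string literals are deliberately left out.

```python
def tokenize_sketch(source):
    """Toy tokenizer: one pass, no backtracking, longest match wins."""
    tokens = []
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif ch.isdigit():
            start = i
            kind = "IntLiteral"  # state: "reading either a float or an int"
            while i < n and source[i].isdigit():
                i += 1
            if i < n and source[i] == ".":
                kind = "FloatLiteral"  # state: "reading a float literal"
                i += 1
                while i < n and source[i].isdigit():
                    i += 1
            tokens.append((kind, source[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("Identifier", source[start:i]))
        else:
            tokens.append(("Operator", ch))  # any other single character
            i += 1
    return tokens


for kind, text in tokenize_sketch("1.__class__"):
    print(kind, repr(text))
```

Once the . has been consumed as part of the number, nothing in the loop ever hands it back, which is the "no backtracking" point made above.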

Aside: This tokenizing approach is also the reason why common languages have the syntax restrictions they have. E.g. identifiers contain letters, digits, and underscores, but cannot start with a digit. If that were allowed, 123abc could be intended as either an identifier, or the integer 123 followed by the identifier abc.

A lex-like tokenizer would recognize this as the former since it leads to the longest single token, but nobody likes having to keep details like this in their head when trying to read code. Or when trying to write and debug the tokenizer for that matter.
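
To make the aside concrete, here is a rough sketch of a longest-match rule using the re module. The pattern tables usual and relaxed and the helper longest_match are hypothetical names for this illustration; real lex-style tools build a combined automaton rather than trying patterns one by one.

```python
import re


def longest_match(patterns, text):
    """Return the (kind, lexeme) pair with the longest match at the start of text."""
    best = None
    for kind, pattern in patterns:
        m = re.match(pattern, text)
        # On a tie, the pattern listed first keeps priority.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


# The usual rule: identifiers may not start with a digit.
usual = [("Int", r"[0-9]+"), ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*")]
# A hypothetical language where identifiers may start with a digit.
relaxed = [("Int", r"[0-9]+"), ("Identifier", r"[A-Za-z0-9_]+")]

print(longest_match(usual, "123abc"))
print(longest_match(relaxed, "123abc"))
```

Under the usual rule the first token is the integer 123; under the relaxed rule the whole of 123abc becomes a single identifier, which is the ambiguity described above.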


The parser then tries to process the token stream:

<FloatLiteral: '1.'> <Identifier: '__class__'>

In Python, a literal directly followed by an identifier, without an operator between the tokens, makes no sense, so the parser bails. This also means that the reason why Python would complain about 123abc being invalid syntax isn't the tokenizer error "the character a isn't valid in an integer literal", but the parser error "the identifier abc cannot directly follow the integer literal 123".
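
A quick way to see which snippets are rejected is to hand them to the built-in compile(). The helper is_valid_expression is a made-up name for this sketch; it only reports whether a SyntaxError was raised, since the exact error message and the stage that reports it can differ between Python versions.

```python
def is_valid_expression(source):
    """Return True if source compiles as a single Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


checks = {
    src: is_valid_expression(src)
    for src in ["1.__class__", "123abc", "2.0.__class__"]
}
for src, ok in checks.items():
    print(src, "->", "valid" if ok else "SyntaxError")
```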


The reason why the tokenizer can't recognize the 1 as an int literal is that the character that makes it leave the float-or-int state determines what it just read. If it's ., it was the start of a float literal, which might continue afterwards. If it's something else, it was a complete int literal token.

It's not possible for the tokenizer to "go back" and re-read the previous input as something else. In fact, the tokenizer is at too low a level to care about what an "attribute access" is and handle such ambiguities.


Now, your second example is valid because the tokenizer knows a float literal can only have one . in it. More precisely: the first . makes it transition from the float-or-int state to the float state. In this state, it only expects digits (or an E for scientific/engineering notation, a j for complex numbers…) to continue the float literal. The first character that is none of these (here, the second .) is definitely no longer part of the float literal, so the tokenizer can emit the finished token. The token stream for your second example will thus be:

<FloatLiteral: '2.0'> <Operator: '.'> <Identifier: '__class__'>
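
The same stream can be inspected with the standard library's tokenize module, which uses the names NUMBER, OP and NAME instead of the labels in this answer. The helper significant_tokens below is a hypothetical convenience wrapper that drops the NEWLINE and ENDMARKER bookkeeping tokens.

```python
import io
import tokenize


def significant_tokens(source):
    """Return (type name, text) pairs for number, operator and name tokens."""
    keep = {tokenize.NUMBER, tokenize.OP, tokenize.NAME}
    readline = io.StringIO(source + "\n").readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type in keep
    ]


tokens = significant_tokens("2.0.__class__")
for name, text in tokens:
    print(name, repr(text))
```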

Which, of course, the parser then recognizes as valid Python. We now know enough to see why the suggested workarounds help. In Python, separating tokens with whitespace is optional, unlike, say, in Lisp. Conversely, whitespace does separate tokens. (That is, no token except a string literal may contain whitespace; it is merely skipped between tokens.) So the code:

1 .__class__

is always tokenized as

<IntLiteral: '1'> <Operator: '.'> <Identifier: '__class__'>

And since a closing parenthesis cannot appear in an int literal, this:

(1).__class__

gets read as this:

<Operator: '('> <IntLiteral: '1'> <Operator: ')'> <Operator: '.'> <Identifier: '__class__'>
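
One step further down the pipeline, the ast module can confirm that both workarounds parse to the same thing: an attribute access on the integer 1. The helper describe_attribute_access is a name invented for this sketch.

```python
import ast


def describe_attribute_access(source):
    """Parse source and return (literal value, attribute name)."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Attribute):
        raise ValueError("not an attribute access: %r" % source)
    # literal_eval also accepts an already parsed literal node.
    return ast.literal_eval(node.value), node.attr


spaced = describe_attribute_access("1 .__class__")
parenthesized = describe_attribute_access("(1).__class__")
print(spaced, parenthesized)
```

The parentheses and the space leave no trace in the syntax tree; they only change how the characters are split into tokens.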

The above implies that, amusingly, the following is also valid:

1..__class__ # => <class 'float'>

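The fractional part of a float literal is optional, so 1. is already a complete float literal; the second . cannot belong to it, which makes the tokenizer emit 1. as a float and read the second . as the attribute-access operator.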
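
This is easy to check directly in a script; the variable names here are arbitrary.

```python
x = 1.  # the fractional part is optional: this is the float 1.0
cls = 1..__class__  # float literal "1." followed by attribute access

print(type(x).__name__, x == 1.0)
print(cls is float)
```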

millimoose answered Oct 22 '22 14:10


It is a tokenization issue... the . is parsed as the beginning of the fractional part of a floating point number.

You can use

(1).__class__

to avoid the problem.

6502 answered Oct 22 '22 14:10


Because if there's a . after a number, Python thinks you're creating a float. When the next character is something that can't continue a float (here, the underscore), it throws a syntax error.

However, in a float, Python doesn't expect another . to be a part of the value, so the second . is treated as attribute access. It works. :)

How do we get the attributes, then?

You can easily wrap it in parentheses. For example, see this console session:

>>> (1).__class__
<type 'int'>

Now, Python knows that you're not trying to make a float, but to refer to the int itself.

Bonus: putting a blank space after the number works as well.

>>> 1 .__class__
<type 'int'>

Also, if you only want to get the __class__, type(1) will do it for you.
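
As a small sanity check, assuming Python 3, the spellings mentioned so far all give back the very same class object; the dictionary forms is just a convenient way to list them.

```python
forms = {
    "(1).__class__": (1).__class__,
    "1 .__class__": 1 .__class__,
    "type(1)": type(1),
}
all_int = all(value is int for value in forms.values())

# dir(1) from the question works because ")" cannot continue a number.
int_attributes = dir(1)

print(all_int, "__class__" in int_attributes)
```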

Hope this helps!

aIKid answered Oct 22 '22 16:10


Or you can even do this:

>>> getattr(1 , '__class__')
<type 'int'>

Alexander Zhukov answered Oct 22 '22 14:10