Python

Question

I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:

...
junk_block = "".join(open("foo.txt","rb").read().split())
...

I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:

f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)

I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?

Alex Martelli · Accepted Answer

As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially-negligible cost (at module top level, assigning to the "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-).

Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:

junk_block = "".join(open("foo.txt","rb").read().split())

presents another important issues: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sort of other advantages.

Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:

with open("foo.txt","rb") as f:
  junk_block = "".join(f.read().split())

is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.

Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).

Python - Things I shouldn't be doing?

Tags:

Enrico Tuvera Jr

1 Answers

Alex Martelli

Recent Activity

Donate For Us