
What are hourglass imports and why would they be avoided in a codebase?

I saw some commits in a Python code base removing "hourglass imports." I've never seen this term before and I can't find anything about it via the Python documentation or web search.

What are hourglass imports and when would one use or not use them? My best guess is that removing them makes submodules easier to find, but are there other reasons?

An example change removing hourglass imports from one of the linked commits:

diff --git a/tensorflow/contrib/slim/python/slim/nets/vgg.py b/tensorflow/contrib/slim/python/slim/nets/vgg.py
index 3c29767f2..d4eb43cbb 100644
--- a/tensorflow/contrib/slim/python/slim/nets/vgg.py
+++ b/tensorflow/contrib/slim/python/slim/nets/vgg.py
@@ -37,13 +37,20 @@ Usage:
 @@vgg_16
 @@vgg_19
 """
+
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function

-import tensorflow as tf
-
-slim = tf.contrib.slim
+from tensorflow.contrib import layers
+from tensorflow.contrib.framework.python.ops import arg_scope
+from tensorflow.contrib.layers.python.layers import layers as layers_lib
+from tensorflow.contrib.layers.python.layers import regularizers
+from tensorflow.contrib.layers.python.layers import utils
+from tensorflow.python.ops import array_ops
+from tensorflow.python.ops import init_ops
+from tensorflow.python.ops import nn_ops
+from tensorflow.python.ops import variable_scope


 def vgg_arg_scope(weight_decay=0.0005):

The top-level tensorflow __init__.py re-exports the symbols from the submodules:

# tensorflow/python/__init__.py
...
from tensorflow.python.ops.standard_ops import *
...

# tensorflow/python/ops/standard_ops.py
...
from tensorflow.python.ops.array_ops import *
from tensorflow.python.ops.check_ops import *
from tensorflow.python.ops.clip_ops import *
...
cyang asked Apr 05 '17 04:04


1 Answer

TensorFlow contributor here :wave:. We use the term hourglass import to refer to modules that import a bunch of things from other modules and re-export them. You’ve provided a good example in your question.

The reason that we care about this and the reason that we call it an hourglass both have to do with the shape of the build graph. The whole point of the hourglass module is that lots of users will depend on it as a convenient entry point. And it itself depends on lots of internal symbols. So your dependency graph has a lot of edges going through this one node, funnelled as through the center of an hourglass:

[Diagram: a simple build graph with three end-user binaries depending on :standard_ops, and :standard_ops depending on three internal targets.]

In a real-world context, the hourglass will be both wider and deeper than this, on both sides. End-users may define libraries that depend on :standard_ops and binaries that depend on those libraries, and the internal ops may themselves have layers of dependency.

The problem with this is that it makes it hard to cheaply and correctly re-build in response to changes. If we change part of :check_ops, then it looks like :standard_ops needs to be re-built, because one of its dependencies has changed. And because :standard_ops has been re-built, so too must everything that depends on it. But now we’ve re-built all the end-user programs, even those that never used the functionality provided by :check_ops at all. We say that the build graph overapproximates the actual dependency graph. Overapproximation is sound (the builds will still be correct), but it can be wasteful.
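To make the overapproximation concrete, here is a minimal sketch of how a build system might compute the set of targets affected by a change. The binary names :bin_a, :bin_b, :bin_c and the helper affected_by are hypothetical, invented for illustration; this is not how TensorFlow's build tooling is actually implemented.

```python
from collections import deque

# Hypothetical hourglass build graph: target -> targets it depends on.
# :bin_a, :bin_b and :bin_c are made-up end-user binaries.
HOURGLASS_DEPS = {
    ":bin_a": [":standard_ops"],
    ":bin_b": [":standard_ops"],
    ":bin_c": [":standard_ops"],
    ":standard_ops": [":array_ops", ":check_ops", ":clip_ops"],
    ":array_ops": [],
    ":check_ops": [],
    ":clip_ops": [],
}


def affected_by(deps, changed):
    """Return every target that transitively depends on `changed`."""
    # Invert the edges so we can walk from a target to its dependents.
    dependents = {target: [] for target in deps}
    for target, target_deps in deps.items():
        for dep in target_deps:
            dependents[dep].append(target)

    # Breadth-first walk over the inverted graph.
    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


hourglass_affected = affected_by(HOURGLASS_DEPS, ":check_ops")
print(sorted(hourglass_affected))
```

In this toy graph, a change to :check_ops marks :standard_ops and all three binaries as affected, whether or not a given binary actually uses anything from :check_ops.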

This is a problem on large codebases like TensorFlow, where we have many thousands of tests, we run all affected tests when you change any code, and the tests can be expensive. If your estimate of “which tests are affected by this change?” is a vast overapproximation due to an hourglass dependency, you’re wasting a lot of compute power on tests, and your developers also have to wait longer to merge their changes.

The patch in your original question shows how we might remove an hourglass dependency and rewrite the clients to point directly to those parts of the build graph that they actually use:

[Diagram: a more precise build graph, with edges from end-user programs to just those targets that they actually need.]
This way, if :check_ops is changed, we can see that we only need to re-build and re-test one client.
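As a rough sketch of the precise graph, the snippet below gives each client edges only to the internal targets it uses and then asks which targets a change to :check_ops touches. The binaries :bin_a, :bin_b, :bin_c, their dependency lists, and the helper functions are all hypothetical, and the recursion assumes the graph has no cycles.

```python
# Hypothetical precise build graph: each made-up binary depends only on
# the internal targets it actually uses.
DIRECT_DEPS = {
    ":bin_a": [":array_ops"],
    ":bin_b": [":check_ops", ":clip_ops"],
    ":bin_c": [":clip_ops"],
    ":array_ops": [],
    ":check_ops": [],
    ":clip_ops": [],
}


def depends_on(deps, target, changed):
    """Return True if `target` transitively depends on `changed`."""
    # Assumes an acyclic graph, as build graphs normally are.
    return any(
        dep == changed or depends_on(deps, dep, changed)
        for dep in deps[target]
    )


def affected_by(deps, changed):
    """Return every target that must be re-built when `changed` changes."""
    return {target for target in deps if depends_on(deps, target, changed)}


direct_affected = affected_by(DIRECT_DEPS, ":check_ops")
print(sorted(direct_affected))
```

Here only :bin_b is affected, so only its tests would need to run.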

There are benefits and drawbacks to this. For real end users, having to directly import lots of internals is annoying. That’s not a nice API, not nearly as nice as import numpy as np or import tensorflow as tf. Furthermore, it exposes implementation details, making it harder for us to move around those modules. So, for these reasons, we do still provide an hourglass import to users, both publicly and within Google.

However, we try not to use hourglass imports within our own codebase. Breaking changes aren’t an issue within our own repository, since if we want to rename something we can just rename all its clients at the same time. And we have tools for working with our build graphs and are comfortable doing so, which is something that most Python programmers don’t want to have to worry about.

The tools are pretty nice, though: in addition to generating nice visual graphs (as above) for your real codebase, they underlie a powerful query engine, where you can ask the system questions like “what targets that transitively depend on :foo are still running on Python 2 and belong to my team?”. This is more powerful when your build graph is more precise.
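The kind of question quoted above can be imitated on a toy graph. The sketch below is only an illustration: the target names, the "python" and "team" metadata fields, and the query helper are all invented, and the real query engine is not shown in this answer.

```python
# Hypothetical target metadata; names, fields and values are invented.
TARGETS = {
    ":foo": {"deps": [], "python": 3, "team": "core"},
    ":lib_x": {"deps": [":foo"], "python": 2, "team": "mine"},
    ":bin_y": {"deps": [":lib_x"], "python": 2, "team": "mine"},
    ":bin_z": {"deps": [":foo"], "python": 3, "team": "mine"},
    ":bin_w": {"deps": [], "python": 2, "team": "mine"},
}


def transitive_deps(targets, name):
    """Return all targets that `name` depends on, directly or indirectly."""
    seen = set()
    stack = list(targets[name]["deps"])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(targets[dep]["deps"])
    return seen


def query(targets, depends_on, python, team):
    """Targets depending on `depends_on` with the given version and team."""
    return sorted(
        name
        for name, info in targets.items()
        if depends_on in transitive_deps(targets, name)
        and info["python"] == python
        and info["team"] == team
    )


result = query(TARGETS, depends_on=":foo", python=2, team="mine")
print(result)
```

If every target reached :foo through a single hourglass node, a query like this would match far more targets than really use :foo, which is one reason a precise graph makes the answers more useful.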

TL;DR: An hourglass module is one that bundles up imports from many submodules and exposes them to many client modules. We avoid them within our own codebase because they make the build graph overapproximate the actual dependencies, which makes it more expensive to run tests and harder to analyze the code.

wchargin answered Oct 16 '22 17:10