 

Why does NPM's policy of duplicated dependencies work?

Tags:

node.js

npm

When I use NPM to manage a package depending on foo and bar, both of which depend on corelib, NPM will by default install corelib twice (once for foo, and once for bar). The two copies might even be different versions.

Now, let's suppose that corelib defines some data structure (e.g. a URL object) which is passed between foo, bar and the main application. What I would expect is that if there were ever a backwards-incompatible change to this object (e.g. one of the field names changed), and foo depended on corelib-1.0 while bar depended on corelib-2.0, I'd be a very sad panda: bar's copy of corelib-2.0 might see a data structure created by the old corelib-1.0 and things would not work very well.
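To make the feared scenario concrete, here is a hypothetical sketch; the parse/printHost functions and the host/hostname field names are made up for illustration, not real corelib APIs:

// Hypothetical sketch: two copies of "corelib" disagree about a field name.
const corelibV1 = { parse: href => ({ host: new URL(href).host }) }                 // stands in for corelib-1.0
const corelibV2 = { printHost: url => console.log(url.hostname.toUpperCase()) }     // stands in for corelib-2.0

const url = corelibV1.parse("https://example.com") // what foo would hand over
corelibV2.printHost(url)                           // what bar would do: url.hostname is undefined -> TypeError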

I was really surprised to discover that this situation basically never happens. (I trawled Google, Stack Overflow, etc., looking for examples of people whose applications had stopped working, but who could have fixed it by running dedupe.) So my question is: why is this the case? Is it because node.js libraries never define data structures that are shared outside of the package? Is it because node.js developers never break backwards compatibility of their data structures? I'd really like to know!

Asked Aug 12 '14 by Edward Z. Yang




3 Answers

this situation basically never happens

Yes, my experience is indeed that this is not a problem in the Node/JS ecosystem. And I think that is, in part, thanks to the robustness principle.

Below is my view on why and how.

Primitives, the early days

I think the first and foremost reason is that the language provides a common basis for primitive types (Number, String, Boolean, Null, Undefined) and some basic compound types (Object, Array, RegExp, etc.).

So if I receive a String from one of the libs' APIs I use, and pass it to another, it cannot go wrong because there is just a single String type.

This is what used to happen, and still happens to some extent to this day: Library authors try to rely on the built-ins as much as possible and only diverge when there is sufficient reason to, and with sufficient care and thought.

Not so in Haskell. Before I started using stack, I ran into the following situation quite a few times with Text and ByteString:

Couldn't match type ‘T.Text’ with ‘Text’
NB: ‘T.Text’ is defined in ‘Data.Text.Internal’ in package ‘text-1.2.2.1’
    ‘Text’ is defined in ‘Data.Text.Internal’ in package ‘text-1.2.2.0’
Expected type: String -> Text
  Actual type: String -> T.Text

This is quite frustrating, because in the above example only the patch version is different. The two data types may only be different nominally, and the ADT definition and the underlying memory representation may be completely identical.

As an example, it could have been a minor bugfix to the intersperse function that warranted the release of 1.2.2.1, which is completely irrelevant to me if all I care about, in this hypothetical example, is concatenating some Texts and comparing their lengths.

Compound types, objects

Sometimes there is sufficient reason to diverge in JS from the built in data types: Take Promises as an example. It's such a useful abstraction over async computations compared to callbacks that many APIs started using them. What now? How come we don't run into many incompatibilities when different versions of these {then(), fail(), ...} objects are being passed up, down and around the dependency tree?

I think it's thanks to the robustness principle.

Be conservative in what you send, be liberal in what you accept.

So if I am authoring a JS library which I know returns promises and takes promises as part of its API, I'll be very careful how I interact with the received objects. E.g. I won't be calling fancy .success(), .finally(), ['catch']() methods on them, since I want to be as compatible as possible with different users, with different implementations of Promises. So, very conservatively, I may just use .then(done, fail), and nothing more. At this point, it doesn't matter if the user uses the promises that my lib returns, or Bluebird's, or even if they hand-write their own, so long as those adhere to the most basic Promise 'laws' -- the most basic API contracts.
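A minimal sketch of that conservative style (fetchLength and the wrapper object are illustrative names, not any real library's API): the function only ever calls .then(done, fail) on whatever thenable it is handed, and only exposes a bare thenable in return.

// Hypothetical library function: accepts any "thenable", returns a bare thenable.
function fetchLength(textPromise) {
  return {
    then(done, fail) {
      // The only assumption about the input is a working .then(onDone, onFail).
      textPromise.then(text => done(text.length), fail)
    }
  }
}

// Works the same with a native Promise, a Bluebird promise,
// or a hand-rolled object that merely provides a compatible .then.
fetchLength(Promise.resolve("hello")).then(n => console.log(n)) // 5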

Can this still lead to breakage at runtime? Yes, it can. If even the most basic API contract is not fulfilled, you may get an exception saying "Uncaught TypeError: promise.then is not a function". I think the trick here is that library authors are explicit about what their API needs: e.g. a .then method on the supplied object. And then it's up to whoever is building on top of that API to make damn sure that that method is available on the object they pass in.

I'd like to point out here that this is also the case for Haskell, isn't it? Should I be so foolish as to write an instance for a typeclass that still type-checks without following its laws, I'll get runtime errors, won't I?

Where do we go from here?

Having thought through all this just now, I think we might be able to have the benefits of the robustness principle even in Haskell, with much less (or even no(?)) risk of runtime exceptions/errors compared to JavaScript: we just need the type system to be granular enough that it can distinguish what we want to do with the data we manipulate, and determine whether that is still safe. E.g. the hypothetical Text example above is, I would wager, still safe. The compiler should only complain if I try to use intersperse, and ask me to qualify it, e.g. as T.intersperse, so it can be sure which one I want to use.

How do we do this in practice? Do we need extra support, e.g. language extension flags from GHC? We might not.

Just recently I found bookkeeper, which is a compile-time type-checked anonymous records implementation.

Please note: The following is conjecture on my part, I haven't taken much time to try and experiment with Bookkeeper. But I intend to in my Haskell projects to see if what I write about below could really be achieved with an approach such as this.

With Bookkeeper I could define an API like so:

emptyBook & #then =: id & #fail =: const
  :: Bookkeeper.Internal.Book'
       '["fail" 'Data.Type.Map.:-> (a -> b -> a),
         "then" 'Data.Type.Map.:-> (a1 -> a1)]

This works since functions are also first-class values. And whichever API takes this Book as an argument can be very specific about what it demands from it: namely the #then function, and that it has to match a certain type signature. It cares not for any other function that may or may not be present, with whatever signature. All of this is checked at compile time.

Prelude Bookkeeper > let f o = (o ?: #foo) "a" "b" in f $ emptyBook & #foo =: (++) "ab" 

Conclusion

Maybe Bookkeeper or something similar will turn out to be useful in my experiments. Maybe Backpack will rush to the rescue with its common interface definitions. Or some other solution will come along. But either way, I hope we can move towards being able to take advantage of the robustness principle. And that Haskell's dependency management can also "just work" most of the time and fail with type errors only when it is truly warranted.

Does the above make sense? Anything unclear? Does it answer your question? I'd be curious to hear.


Further possibly relevant discussion may be found in this /r/haskell reddit thread, where this topic came up not long ago, and I thought I'd post this answer in both places.

Answered Sep 22 '22 by Wizek


If I understand correctly, the supposed problem might be:

  • Module A

    module.exports = require("c") // depends on c v0.1

  • Module B

    console.log(require("a"))
    console.log(require("c")) // depends on c v0.2

  • Module C

    • v0.1

      module.exports = "hello";

    • v0.2

      module.exports = "world";

By copying C v0.2 into node_modules and C v0.1 into node_modules/a/node_modules, and creating dummy package.json files, I think I created the case you're talking about.

Will B have two different, conflicting versions of C's data?

Short answer: it does. So Node does not handle conflicting versions.
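For illustration, here is a sketch of that layout and what running B prints; the file contents are the ones from the example above, and the paths are what a nested install produces in this scenario:

// node_modules layout for the example above:
//   node_modules/c/index.js                 -> module.exports = "world"  (v0.2)
//   node_modules/a/node_modules/c/index.js  -> module.exports = "hello"  (v0.1)
//   node_modules/a/index.js                 -> module.exports = require("c")

// b.js
console.log(require("a")) // "hello" -- a resolves its own nested c v0.1
console.log(require("c")) // "world" -- b resolves the top-level c v0.2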

The reason you don't see it on the internet is, as gustavohenke explained, that Node naturally does not encourage you to pollute the global scope or pass structures along a chain of modules.

In other words, it's not often that you'll see a module export another module's structure.

Answered Sep 21 '22 by sdumetz


I don't have first-hand experience with this kind of situation in a large JS program, but I would guess that it has to do with the OO style of bundling data together with the functions that act on that data into a single object. Effectively the "ABI" of an object is to pull public methods by name out of a dictionary, and then invoke them by passing the object as the first argument. (Or perhaps the dictionary contains closures that are already partially applied to the object itself; it doesn't really matter.)
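A tiny sketch of that object "ABI" (the names are illustrative): the method is just an entry in the object's property dictionary, and a call site only needs that name to be present, not any particular library version.

// The method lives in the object's property dictionary.
const obj = {
  data: 42,
  consume(x) { return this.data + x }
}

// A normal method call is, in effect, a lookup by name followed by a call
// with the object itself as the receiver:
console.log(obj.consume(1))              // 43
console.log(obj["consume"].call(obj, 1)) // 43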

In Haskell we do encapsulation at a module level. For example, take a module that defines a type T and a bunch of functions, and exports the type constructor T (but not its definition) and some of the functions. The normal way to use such a module (and the only way that the type system will permit) is to use one exported function create to create a value of type T, and another exported function consume to consume the value of type T: consume (create a b c) x y z.

If I had two different versions of the module with different definitions of T and I was able to use the create from version 1 together with the consume from version 2 then I'd likely get a crash or wrong answer. Note that this is possible even if the public API and externally observable behavior of the two versions is identical; perhaps version 2 has a different representation of T that allows for a more efficient implementation of consume. Of course, GHC's type system stops you from doing this, but there are no such safeguards in a dynamic language.

You can translate this style of programming directly into a language like JavaScript or Python:

import M
result = M.consume(M.create(a, b, c), x, y, z)

and it would have exactly the same kind of problem that you are talking about.

However, it's far more common to use the OO style:

import M
result = M.create(a, b, c).consume(x, y, z)

Note that only create is imported from the module. consume is in a sense imported from the object we got back from create. In your foo/bar/corelib example, let's say that foo (which depends on corelib-1.0) calls create and passes the result to bar (which depends on corelib-2.0) which will call consume on it. Actually, while foo needs a dependency on corelib to call create, bar does not need a dependency on corelib to call consume at all. It's only using the base language notions to invoke consume (what we could spell getattr in Python). In this situation, bar will end up invoking the version of consume from corelib-1.0 regardless of what version of corelib bar "depends on".
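A minimal, self-contained JS sketch of that situation; corelib, foo, and bar are the hypothetical packages from the question, and makeThing/useThing are invented names used purely for illustration:

// Stand-ins for the hypothetical packages (no real corelib involved):
const corelib1 = {                          // what foo's nested corelib-1.0 would provide
  create: (a, b, c) => ({
    a, b, c,
    consume(x, y, z) { return [this.a, this.b, this.c, x, y, z] }
  })
}

const foo = { makeThing: (a, b, c) => corelib1.create(a, b, c) }   // foo uses corelib-1.0
const bar = { useThing: (t, x, y, z) => t.consume(x, y, z) }       // bar never touches corelib

console.log(bar.useThing(foo.makeThing(1, 2, 3), 4, 5, 6))
// -> [1, 2, 3, 4, 5, 6]; the consume that ran came from corelib-1.0,
//    regardless of which corelib version bar declares as a dependency.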

Of course, for this to work the public API of corelib must not have changed too much between corelib-1.0 and corelib-2.0. If bar wants to use a method fancyconsume which is new in corelib-2.0, then it won't be present on an object created by corelib-1.0. Still, this situation is much better than what we had in the original Haskell version, where even changes that do not affect the public API at all can cause breakage. And perhaps bar depends on corelib-2.0 features for the objects it creates and consumes itself, but only uses the API of corelib-1.0 to consume objects it receives externally.

To achieve something similar in Haskell, you could use this translation. Rather than directly using the underlying implementation

data TImpl = TImpl ...     -- private
create_ :: A -> B -> C -> TImpl
consume_ :: TImpl -> X -> Y -> Z -> R
...

we wrap up the consumer interface with an existential in an API package corelib-api:

module TInterface where

data T = forall a. T { impl :: a,
                       _consume :: a -> X -> Y -> Z -> R,
                       ... }   -- Or use a type class if preferred.
consume :: T -> X -> Y -> Z -> R
consume t = (_consume t) (impl t)

and then the implementation in a separate package corelib:

module T where

import TInterface

data TImpl = TImpl ...     -- private
create_ :: A -> B -> C -> TImpl
consume_ :: TImpl -> X -> Y -> Z -> R
...

create :: A -> B -> C -> T
create a b c = T { impl = create_ a b c,
                   _consume = consume_ }

Now foo uses corelib-1.0 to call create, but bar only needs corelib-api to call consume. The type T lives in corelib-api, so if the public API version does not change, then foo and bar can interoperate even if bar is linked against a different version of corelib.

(I know Backpack has a lot to say about this kind of thing; I'm offering this translation as a way to explain what is happening in the OO programs, not as a style one should seriously adopt.)

Answered Sep 20 '22 by Reid Barton