
Organizing multiple interrelated Scala projects with sbt & git - best practice suggestions

With Scala, using sbt for builds and git for version control, what is a good way to organize a team's code once it outgrows a single project? At some point you start thinking about splitting the code into separate libraries or projects and importing between them as necessary. How would you organize things for that? Or would you avoid the temptation and keep all packages under a single sbt build and a single git repository?

Points of interest being: (feel free to change)

  • Avoiding inventing new "headaches" that over-engineer imaginary needs.
  • Still being able to easily build everything when you still want to, on a given dev machine or a CI server.
  • Packaging for production: being able to use SbtNativePackager to package your stuff for production without too much pain.
  • Easily controlling which version of each library is used on a given dev machine, and being able to switch between versions seamlessly.
  • Avoiding making git handling any more painful than it typically is.

In addition, would you use some sort of "local sbt/maven team repository", and what would need to be done to set one up? Hopefully this is not necessary, though.

Thanks!

matanster asked Oct 21 '14 12:10


2 Answers

I use the following lines in the sand:

  • Code which ultimately goes in different deployables goes in different folders in the same repository, under an umbrella project - what sbt calls a multi-project build (I use Maven rather than sbt, but the concepts are very similar). Each folder is built and deployed as its own jar.

I try to consider the final deployables when making divisions that make sense. For example, if my system foosys has foosys-frontend and foosys-backend deployables, where foosys-frontend does HTML templating and foosys-backend talks to the database and the two communicate via a REST API, then I'll have those as separate projects, and a foosys-core project for common code. foosys-core isn't allowed to depend on the HTML templating library (because foosys-backend doesn't want that), nor on the ORM library (because foosys-frontend doesn't want that). But I don't worry about separating the code that works with the REST library from the "core domain objects", because both foosys-frontend and foosys-backend use the REST code.

Now suppose I add a new foosys-reports deployable, which accesses the database to do some reports. Then I'll probably create a foosys-database project, depending on foosys-core, to hold shared code used by both foosys-backend and foosys-reports. And since foosys-reports doesn't use the REST library, I should probably also split out foosys-rest from foosys-core. So I end up with a foosys-core library, two more library projects that depend on it (foosys-database and foosys-rest), and the three deployable projects (foosys-reports depending on foosys-database, foosys-frontend depending on foosys-rest, and foosys-backend depending on both).
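Since the question is about sbt, here is a rough sketch of how the layout described above might look as an sbt multi-project build. The project names come from the example; the directory names, organization, and version numbers are assumptions made for illustration, and the JavaAppPackaging lines assume sbt-native-packager has been added in project/plugins.sbt.

```scala
// build.sbt at the repository root: a sketch under the assumptions above,
// not a definitive layout.

ThisBuild / organization := "com.example"      // hypothetical
ThisBuild / scalaVersion := "2.13.12"          // whichever version the team uses
ThisBuild / version      := "0.1.0-SNAPSHOT"

// Shared by all three deployables, so it must not pull in the
// templating, ORM, or REST libraries.
lazy val core = (project in file("foosys-core"))
  .settings(name := "foosys-core")

// Shared by backend and reports: the ORM dependency would go here.
lazy val database = (project in file("foosys-database"))
  .dependsOn(core)
  .settings(name := "foosys-database")

// Shared by frontend and backend: the REST dependency would go here.
lazy val rest = (project in file("foosys-rest"))
  .dependsOn(core)
  .settings(name := "foosys-rest")

// The three deployables. JavaAppPackaging comes from sbt-native-packager.
lazy val frontend = (project in file("foosys-frontend"))
  .dependsOn(rest)
  .enablePlugins(JavaAppPackaging)
  .settings(name := "foosys-frontend")

lazy val backend = (project in file("foosys-backend"))
  .dependsOn(database, rest)
  .enablePlugins(JavaAppPackaging)
  .settings(name := "foosys-backend")

lazy val reports = (project in file("foosys-reports"))
  .dependsOn(database)
  .enablePlugins(JavaAppPackaging)
  .settings(name := "foosys-reports")

// Umbrella project: running a task here runs it in every aggregated project.
lazy val root = (project in file("."))
  .aggregate(core, database, rest, frontend, backend, reports)
  .settings(publish / skip := true)
```

With something like this, `sbt test` at the root builds and tests everything, while `sbt backend/test` works on one deployable and whatever it depends on.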

You'll notice that this means there's one code project for every combination of deployables where that code might be used. Code that goes in all three deployables goes in foosys-core. Code that goes in just one deployable goes in that deployable's project. Code that goes in two of the three deployables goes in foosys-rest or foosys-database. If we wanted to have some code that was part of the foosys-frontend and foosys-reports deployables, but not the foosys-backend deployable, we'd have to create another project for that code. In theory this means an exponential blowup in the number of projects as we add more deployables. In practice I've found it's not too problematic - most theoretically possible combinations don't actually make sense, so as long as we only create new projects when we actually have code to put in them it's ok. And if we end up with a couple of classes in foosys-core that aren't actually used in every single deployable, it's not the end of the world.

Tests are best understood in this view as another kind of deployable. So I would have a separate foosys-test project containing common code that was used for tests for all three deployable projects (depending on foosys-core), and perhaps a foosys-database-test project (depending on foosys-test and foosys-database) for test helper code (e.g. database integration test setup code) that was common between foosys-backend and foosys-reports. Ultimately we might end up with a full parallel hierarchy of -test projects.
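In sbt terms, the test-helper projects described above could be sketched roughly as follows. The project names follow the example; the directory names and the ScalaTest dependency are assumptions, and the non-test projects are reduced to bare definitions to keep the fragment self-contained.

```scala
// build.sbt fragment: a sketch of shared test-helper projects.

lazy val core     = project in file("foosys-core")
lazy val database = (project in file("foosys-database")).dependsOn(core)
lazy val rest     = (project in file("foosys-rest")).dependsOn(core)

// Common test helpers. They live in src/main of this project so that
// other projects can depend on them like any other library code.
lazy val testkit = (project in file("foosys-test"))
  .dependsOn(core)
  .settings(
    name := "foosys-test",
    // assumed test framework; substitute whatever the team uses
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.17"
  )

// Database integration test setup shared by backend and reports.
lazy val databaseTestkit = (project in file("foosys-database-test"))
  .dependsOn(testkit, database)
  .settings(name := "foosys-database-test")

// `% "test"` puts the helpers on the test classpath only,
// so they never end up in the packaged deployable.
lazy val backend = (project in file("foosys-backend"))
  .dependsOn(database, rest, databaseTestkit % "test")

lazy val reports = (project in file("foosys-reports"))
  .dependsOn(database, databaseTestkit % "test")
```

sbt also allows `dependsOn(core % "test->test")` to reuse another project's test sources directly, which avoids the extra projects at the cost of a less explicit structure.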

  • Only move projects into separate git repositories (and, at the same time, separate overall builds) once they have different release lifecycles.

Code in different repositories is necessarily versioned independently, so in some sense this is a vacuous definition. But I think you should move to separate git repositories only when you have to (by analogy with this post: you should only use Hadoop when your data is too big to use anything friendlier). Once your code is in multiple git repositories, you have to manually update the dependencies between them (on a dev machine you can use -SNAPSHOT dependencies and IDE support to work as though the versions were still in sync, but you have to manually update this every time you resync with master, so it adds friction to development).

Since you're doing releases and updating the dependency asynchronously, you have to adopt and enforce something like semantic versioning, so that people know when it's safe to update the dependency on foocorp-utils and when it isn't. You have to publish changelogs, have an early-warning CI build, and run a more thorough code review process. All this is because the feedback cycle is a lot longer: if you break something in a downstream project, you won't know about it until they update their dependency on foocorp-utils, months or even years later (yes, years - I have witnessed this, and in an 80-person startup, not a megacorp). So you need process to prevent that, and everything becomes correspondingly less agile.
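To make the friction concrete, this is roughly what the cross-repository dependency might look like in a downstream sbt build. The organization and version numbers are hypothetical; only the foocorp-utils name comes from the text above.

```scala
// build.sbt of a downstream repository: a sketch with made-up coordinates.

// Normal case: pin a released version and bump it deliberately,
// guided by the changelog and semantic versioning.
libraryDependencies += "com.foocorp" %% "foocorp-utils" % "1.4.2"

// Dev-machine case: after running `sbt publishLocal` in a local checkout of
// foocorp-utils, temporarily switch to the snapshot it published. This has to
// be redone by hand whenever that checkout is resynced with master.
// libraryDependencies += "com.foocorp" %% "foocorp-utils" % "1.5.0-SNAPSHOT"
```

Inside a single multi-project build neither line is needed, because `dependsOn` always compiles against the current source.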

Valid reasons to do this include:

  • A full build of your project is taking too long, slowing down integration on the code you're working on - though try to speed it up first.
  • Deploying all your deployables is taking too long - though again, try to automate this and speed it up. There's a real advantage to keeping everything in sync, and you don't want to give it up until you absolutely have to.
  • Separate teams need to work on the code. If you're not in constant communication with each other then you'll need the process overhead (semantic versioning etc.) anyway, so you may as well get the faster build times. (To be clear, I think every git repository should have a single team that owns and is responsible for it, and when teams split they should split repositories. I have further thoughts on release processes and responsibilities, but this answer is already pretty long).

I would use a team Maven repository, probably Nexus. Actually I'd recommend this even before you get to the multi-project stage. It's very easy to run (it's just a Java app), and you can proxy your external dependencies through it, meaning you have a reliable source for your dependency jars and your builds stay reproducible even if one of your upstream dependencies disappears.
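For an sbt build, pointing at such a repository might look like the sketch below. The host name, repository paths, and credentials file location are all assumptions; an actual Nexus installation will have its own URLs.

```scala
// build.sbt fragment: a sketch of using a team Nexus from sbt.

// Hypothetical base URL of the team repository manager.
val nexus = "https://nexus.example.com/repository/"

// Resolve dependencies through the team proxy.
ThisBuild / resolvers += "team-proxy" at nexus + "maven-public/"

// Settings to add to each project that should be published.
lazy val publishSettings = Seq(
  publishTo := {
    if (isSnapshot.value) Some("team-snapshots" at nexus + "maven-snapshots/")
    else Some("team-releases" at nexus + "maven-releases/")
  },
  // Keep the username and password out of the build definition.
  credentials += Credentials(Path.userHome / ".sbt" / ".credentials")
)

// Usage, for a hypothetical library project:
lazy val utils = (project in file("foocorp-utils"))
  .settings(publishSettings)
```

Adding a resolver only makes the proxy one of several sources. To force everything through it, sbt can also be started with `-Dsbt.override.build.repos=true` together with a `repositories` file listing the proxy.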

I intend to write up my ways of team working as a blog post, but in the meantime I'm happy to answer any further questions.

lmm answered Nov 15 '22 15:11


I'm a little late here, but my 2 cents.

Most Scala projects, and most other projects I've worked on in past jobs, have ended up with a very similar structure, usually reached by consensus with other team members (which helps to validate the decision). The only main philosophical difference has been whether to separate projects by technical infrastructure layer or by business module. Examples below:

Common Projects

  • App.Utils : Shared utility code used by all other projects (minimal to zero dependencies)
  • App.Core : Shared business code (models, core helpers, interfaces, types)

Option 1: Module separation

  • App.Inventory: The inventory module with services, database code, helpers
  • App.Orders : The order management module with services, database, helpers

This can be very convenient and easy to manage by business area, and you can then deploy single modules as needed. You can also later decide to split the modules out into separate APIs if needed (with the shared code base still in utils and core). The disadvantage is that this approach can make the number of projects swell.
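As an sbt build, option 1 could be sketched like this. The directory names are made up, and whether the modules depend on each other (say, orders on inventory) is left open here.

```scala
// build.sbt: a sketch of option 1, one project per business module.

lazy val utils = project in file("app-utils")

lazy val core = (project in file("app-core"))
  .dependsOn(utils)

// Each module keeps its own services, database code, and helpers together.
lazy val inventory = (project in file("app-inventory"))
  .dependsOn(core)

lazy val orders = (project in file("app-orders"))
  .dependsOn(core)

lazy val root = (project in file("."))
  .aggregate(utils, core, inventory, orders)
```

Each new business area adds one more `lazy val` here, which is where the swelling project count comes from.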

Option 2: Tech layer separation

  • App.Database: Database access functions
  • App.Services : Core implementations of business services

In this approach all the logic and services for all areas live in the services project, and likewise all database code lives in the database project. So the code for, say, inventory is split between the database and services projects. This separates the code by traditional technical tiers, and can be much faster to set up for smaller projects.
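The equivalent sketch for option 2, again with made-up directory names. Having services depend on database is an assumption about the layering, not something stated above.

```scala
// build.sbt: a sketch of option 2, one project per technical tier.

lazy val utils = project in file("app-utils")

lazy val core = (project in file("app-core"))
  .dependsOn(utils)

// Database access for every business area (inventory, orders, ...).
lazy val database = (project in file("app-database"))
  .dependsOn(core)

// Business services for every area, built on top of the database tier.
lazy val services = (project in file("app-services"))
  .dependsOn(core, database)

lazy val root = (project in file("."))
  .aggregate(utils, core, database, services)
```

The project count stays fixed as business areas are added, but a change to one area usually touches both the database and services projects.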

Personally, I prefer the more modular separation in option 1. It's more scalable and generally feels simpler when making code changes.

-K

Kishore Reddy answered Nov 15 '22 17:11