Why You Should (Not) Prefer Monorepos For Git

Tags: #DevOps #GitHub #Git

Published: June 23, 2023 Reading Time: 4 min

It’s a question I hear frequently, and it usually starts with someone asking about best practices for managing and maintaining a monorepo.

What are the best practices for adopting a monorepo approach to source code management?

They mention a decision to try to move in that direction, and they’d like to understand how I might recommend they proceed. In most cases, I recommend they don’t.

I understand the many reasons why teams consider a monorepo. In many cases, it simplifies complex dependency management for their code. In other cases, it makes it easier to ensure that related code is updated and managed together. Testing and pull requests can be coordinated very easily.

Invariably, however, the challenges appear. Developers complain about being overwhelmed with notifications. Merge queues grow longer, delaying releases. Minor changes to production require larger, slower deployments. Managing CODEOWNERS and pull requests gets more complicated, with approvals blocking entire releases.

So how do large companies like Google make this work when so many others fail?

But Google and Facebook do it…

Frequently, I hear the objection – “but Google does it! If it works at their scale, it can work for me!” Unfortunately, this often misses the deeper details of the implementation. In 2016, Google published the details of their approach in Why Google Stores Billions of Lines of Code in a Single Repository (ACM, 2016). In short, their solution predates Git. It represents an evolution from a legacy centralized version control system. To support the number of files, they had to build custom file system drivers and other OS-level tooling. The core idea was to simplify dealing with complex C-based dependencies by managing them within a single repository. In the end, they built a customized source management system which requires dedicated teams and a budget for ongoing maintenance. Today, it would take millions of dollars of time and effort to migrate.

I don’t think this is what companies have in mind when they say they want to develop like Google.

Since then, Google has learned. All new work is done in Git, and it generally follows the single responsibility principle. Projects are split into multiple repositories, with various solutions utilized to bring together a version of the dependencies. For example, Android has components across more than 800 repositories. Kubernetes moved to GitHub from a monorepo, diversifying into nearly 48,000 related repositories. If you’re interested, you can read a discussion that started in 2016 to understand the problems that led to this decision. In 2023, they presented a discussion of the path away from the monorepo, Mission accomplished! Kubernetes is not a monorepo. They detailed the challenges of the change, but also some of the improvements it made possible. They still have a very large repo, but they continue to work at breaking it apart (after YEARS of effort).

Facebook had a similar journey. They started by migrating to Mercurial, then customized it to meet their needs. Ultimately, they created a custom solution and a virtual file system. Components of this were released in 2022 as Sapling (see their blog announcement). Again, supporting a monorepo at scale required a heavily customized version control system.

Most companies don’t really want to be like Google and Facebook. More importantly, most lack the budget required to support that decision.

What does Linus say about Git?

Linus Torvalds was asked about the ability of Git to support very large repositories. If you can trust his opinion, he argued that:

Git fundamnetally (sic) never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.
So git scales really badly if you force it to look at everything as one huge repository. I don’t think that part is really fixable, although we can probably improve on it.
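For readers who haven’t used the kinds of limits Linus mentions, here is a minimal sketch of what “check out just a portion” and “have the history go back just a bit” can look like with current Git options. The repository URL, directory names, and the helper function are hypothetical and exist only for illustration. The Git flags themselves (`--depth`, `--filter=blob:none`, `--sparse`, and `sparse-checkout set`) are real, but how much they help depends heavily on the repository and the hosting setup.

```python
from typing import List


def limited_clone_commands(repo_url: str, target_dir: str, subdir: str) -> List[List[str]]:
    """Build the git commands for a shallow, sparse clone of one subdirectory.

    Hypothetical helper: it only assembles the command lines and does not run them.
    """
    return [
        # --depth 1 keeps only the most recent commit (history goes back "just a bit").
        # --filter=blob:none defers downloading file contents until they are needed.
        # --sparse starts with only the top-level files in the working tree.
        ["git", "clone", "--depth", "1", "--filter=blob:none", "--sparse", repo_url, target_dir],
        # Limit the working tree to a single directory ("check out just a portion").
        ["git", "-C", target_dir, "sparse-checkout", "set", subdir],
    ]


if __name__ == "__main__":
    # Placeholder URL and paths; substitute a real repository to try the commands.
    for cmd in limited_clone_commands(
        "https://example.com/big/monorepo.git", "monorepo", "services/billing"
    ):
        print(" ".join(cmd))
```

These options reduce what lands on a developer’s disk. They don’t change the point of the quote: the repository is still one unit for branching, merging, and history.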

As further evidence, Linux itself doesn’t truly use a monorepo (as explained in this article). Linus is very particular about what is included in the core repository. Everything else must be developed and managed externally.

All that glitters

Teams often think that a monorepo is an easy answer to their current problems. They often cite major companies that use monorepos as an inspiration for their decision. Without a full understanding of the context, they mistake it for a paved path.

While there can be use cases where a monorepo makes sense, it often comes with additional overhead and considerations. Unless the full set of challenges and limitations is considered, the path often leads to expensive fixes, future migrations (if things go well!), and reduced development velocity.

In short, just because it glitters doesn’t mean it’s gold. A monorepo is not an instant solution for all problems. In truth, it should be a tool of last resort (and only with enough understanding of the tradeoffs). I’ll leave those discussions (and alternatives to adopting monorepos) for another time. 😄