Have you ever needed to clone a very large repository multiple times? Ever wondered how to make this process efficient? Or why there aren’t a lot of proxy servers for Git like there are for package managers?
What if I told you that you can download a repository once and reuse it across multiple systems?
The challenge of frequent cloning
Consider this situation. You have a 5 GB repository with extensive Git history and Git LFS assets. Your build infrastructure consists of 10 servers, each running multiple builds per day. With a traditional approach, each server downloads the entire 5 GB for every build. That’s 50 GB of network traffic just to get started, and it happens repeatedly throughout the day.
The math gets worse when you consider:
- **Network bandwidth:** Saturating your connection downloading the same data repeatedly
- **Build time:** Time spent transferring data for the clone before any actual work begins
- **Reliability:** More network operations mean more opportunities for transient failures
- **Rate limiting:** Hitting API limits on services like GitHub when many servers clone simultaneously and frequently
This scenario isn’t limited to build farms. You’ll encounter similar challenges when:
- Trying to quickly set up development environments across multiple machines
- Working with large repositories across geographically distributed offices, where one or more sites have limited bandwidth
- Running parallel builds or analysis on a batch of high-performance computing (HPC) nodes
The --reference solution
Git provides the `--reference` option to solve this problem elegantly. Instead of fetching all objects from the remote server, Git can borrow objects from a local reference repository. Think of it as creating a shared library of Git objects that you can reference when cloning. These objects can include commits, trees, blobs, and tags.
Here’s the basic approach:
- Create a shared reference repository that is accessible to every system that will need to perform the clone
- Keep this reference repository updated with the latest changes
- When each build server needs to clone, it references the shared repository as part of that process.
Let’s see this in action. First, create and maintain your reference repository:
```bash
# Initial setup: clone a bare repository to use as the reference
git clone --bare https://github.com/owner/repo.git /shared/repo-reference.git

# Periodically update the reference repository (can be a cron job)
cd /shared/repo-reference.git
git fetch --all
```
When you need a fresh copy, just add a `--reference` to the existing repository when you clone:
```bash
git clone --reference /shared/repo-reference.git https://github.com/owner/repo.git build-workspace
```
This command creates a new clone in the `build-workspace` directory, but instead of downloading all the objects from GitHub, Git creates a link to the shared reference repository. Under the hood, Git creates a file called `.git/objects/info/alternates` that points to the reference repository’s object store. When Git needs an object, it first checks the new repository’s `.git/objects` directory, and if it doesn’t find it there, it looks in the reference repository.
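You can see the link for yourself. As a quick sanity check (assuming the `build-workspace` and `/shared/repo-reference.git` paths from the example above), the alternates file is plain text containing the path to the reference’s object store:

```bash
# Inspect the link created by --reference (paths assume the example above)
cat build-workspace/.git/objects/info/alternates
# Expected output, a single line:
# /shared/repo-reference.git/objects
```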
The result? Your clone completes in seconds instead of minutes. The new clone only downloads objects that don’t exist in the reference repository – typically just the latest commits since the reference was last updated.
Handling updates and missing objects
You might be wondering: what happens when someone pushes new commits to the remote repository, but your reference repository hasn’t been updated yet? This is a common scenario – developers are constantly pushing changes. What if your reference repository only updates every hour or once per day?
The good news is that Git handles this gracefully. When you clone with `--reference`, Git will use any available objects from the reference repository. It will then fetch any missing objects from the remote server.
Let’s walk through a specific example. Suppose you have a team on the other side of the world that updates a reference repository once per day. Their reference repository was last updated yesterday and contains commits up to `abc1234`. Since then, developers have pushed 50 new commits, and the latest commit is `def5678`. A new employee has joined and needs to clone the repository. When that developer runs `git clone --reference`, Git will check the reference repository and find commits up to `abc1234`. The developer will get a reference to those commits without downloading them again. Next, Git will contact the remote server and discover that `def5678` is the current HEAD. It will download the 50 new commits and any associated objects that are missing.
The clone completes successfully, but instead of downloading 5 GB, it only downloads the new commits – perhaps 50 MB. That’s still a massive time and bandwidth savings compared to cloning from scratch. Because SHAs are consistent and immutable, any repository can borrow objects from any other clone that has them. That means you can even use a developer’s local clone as a reference if it’s accessible (or a copy on a USB drive).
All of this means that your reference repository doesn’t need to always be up-to-date. Even a reference that’s a few hours or days old can provide significant benefits.
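If you want to confirm how little a reference-assisted clone actually transferred, one quick check (our suggestion, not part of the original workflow) is `git count-objects`, which reports only the clone’s own object store and ignores anything borrowed through alternates:

```bash
# Report the size of the clone's own object store; objects borrowed from
# the reference repository via alternates are not counted here.
git -C build-workspace count-objects -v -H
```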
Keeping the reference current
To maximize efficiency, you’ll want to keep your reference repository reasonably current. The more up-to-date it is, the fewer objects need to be downloaded. There are two common strategies.
Scheduled updates
Use a cron job to fetch updates regularly:
```bash
# Update reference repository every hour
0 * * * * cd /shared/repo-reference.git && git fetch --all
```
Pre-build updates
Update the reference before starting a batch of builds. When the reference serves as a shared cache for multiple servers (like a Jenkins farm or an HPC cluster), consider having a single process update it periodically or just before the other systems need it, as in the sketch below. This minimizes the number of remote fetches required to update all of the instances.
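Here’s a minimal sketch of such an updater, assuming the `/shared/repo-reference.git` path from earlier; the `flock`-based locking is an addition to keep overlapping runs from fetching into the reference at the same time:

```bash
#!/usr/bin/env bash
# Pre-build updater: run by a single coordinator before a build batch.
set -euo pipefail

REFERENCE=/shared/repo-reference.git

(
  # Hold an exclusive lock so concurrent invocations wait their turn
  # instead of fetching into the reference simultaneously.
  flock -x 9
  git -C "$REFERENCE" fetch --all --prune
) 9>"$REFERENCE/update.lock"
```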
Breaking up isn’t hard to do
Using `--reference` is great for speeding up the initial clone, but there’s an important consideration: your new repository remains dependent on the reference repository. If you move or delete the original repository, your new clone will break because it can’t find the objects it’s borrowing. In addition, if the reference repository is on a network share or slow drive, you may incur the performance cost of network access every time Git needs to read objects. Even a little I/O latency can quickly add up when many objects are needed.
This is where the `--dissociate` option comes into play. The `--dissociate` flag tells Git to copy all the borrowed objects from the reference repository into the new clone after the clone operation completes. This makes your new repository completely self-contained.
```bash
git clone --reference /shared/repo-reference.git --dissociate https://github.com/owner/repo.git build-workspace
```
How dissociation works
When you use `--dissociate`, Git performs these steps automatically:
- Completes the initial clone using the reference repository
- Runs `git repack -a -d` to create a new pack file containing all objects (both borrowed and new)
- Removes the `.git/objects/info/alternates` file to break the link to the reference repository
You can also dissociate after cloning without the flag by performing these steps yourself:
```bash
# Clone with reference
git clone --reference /path/to/existing/repo https://github.com/owner/repo.git new-repo

# Later, manually dissociate
cd new-repo
git repack -a -d
rm .git/objects/info/alternates
```
This gives you flexibility to decide later whether you want to maintain the reference relationship or make the repository independent.
When to dissociate your repos
This may seem counterintuitive if the goal is to save time and bandwidth, but there are valid use cases for dissociation. You should consider `--dissociate` when:
- **Independence:** You need the clone to be a fully independent, self-contained copy of the reference repository. This may be for a developer’s machine, or for a build that, for security reasons, must not have access that could modify the reference repository.
- **Distributed systems:** The repository needs to be used on systems that don’t have access to the reference repository, or where concurrent access would lead to read contention from numerous systems trying to reach the content simultaneously.
- **Network performance:** The reference repository is on a network share, and you want to avoid latency or availability issues.
- **Modifications:** You plan to make significant changes to the repository that would alter the reference, including rewriting history or deleting objects.
- **Long-running workspaces:** The workspace will exist for an extended period, and you want to avoid the need to continuously fetch data from the reference repository.
At the same time, you might not want to `--dissociate` in some scenarios. For example:
- **Short-lived builds:** The clone exists only as a read-only copy for the duration of the work before it is deleted
- **Stable reference infrastructure:** Your reference repository is reliable and will remain available throughout the process
- **Disk space constraints:** You need to minimize disk usage and don’t expect contention
- **Maximum performance:** You want to avoid the time the dissociation step (repacking objects) takes
Using references with Git LFS
If your repository uses Git Large File Storage (LFS), the `--reference` option works with LFS objects too! When you clone with a reference, Git LFS checks the alternate repository’s `.git/lfs/objects` directory and copies or links those objects instead of downloading them from the remote server.
You can see this in action by enabling Git tracing:
```bash
GIT_TRACE=1 git clone --reference /path/to/existing/repo https://github.com/owner/repo.git new-repo
```
The trace log will show entries like:
```
trace git-lfs: altMediafile: /path/to/existing/repo/.git/lfs/objects/b5/bb/b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
```
This confirms that Git LFS is using the alternate repository’s LFS objects instead of downloading them from the remote. That said, the process isn’t quite the same as regular Git objects. There are some important nuances to understand.
The challenge with LFS dissociation
Here’s where things get interesting. While `--dissociate` works perfectly for regular Git objects, it doesn’t work for Git LFS objects. Why? The issue lies in the timing of Git’s operations.
When you run `git clone --reference --dissociate`, Git performs these steps in order:
1. Fetches repository objects from the remote
2. Sets up the `.git/objects/info/alternates` file
3. Runs `git repack -a -d` (dissociate step)
4. Removes the alternates file (dissociate step)
5. Checks out the working directory
6. The checkout process restores files to the working directory, triggering the Git LFS filter to fetch the objects (this previous post explains the LFS processes)
Notice the problem? Git LFS doesn’t get involved until step 6 – after the alternates file has already been removed in step 4. By the time Git LFS tries to check for alternate LFS objects, the link to the reference repository is gone. As a result, Git LFS downloads all the objects from the remote server, and the clone only benefits from the reference when acquiring the Git objects themselves.
This is a fundamental limitation based on how Git LFS integrates with Git through filter and checkout operations rather than being part of Git’s core fetch operations.
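For context, Git LFS hooks into Git through the clean/smudge filter mechanism. The snippet below is the standard filter configuration that `git lfs install` writes to your global Git config; the `smudge` and `process` entries are what run during checkout in step 6 above:

```ini
[filter "lfs"]
	clean = git-lfs clean -- %f
	smudge = git-lfs smudge -- %f
	process = git-lfs filter-process
	required = true
```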
So how do we work around this?
Manual LFS dissociation
The key is to let Git LFS use the reference during the initial checkout, then break the link. You want to clone the repository, then use the manual process mentioned above to separate the repositories. This allows the LFS objects to be handled before the alternates file is removed.
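Putting that together, here’s a minimal sketch that reuses the example paths from earlier:

```bash
# Clone WITHOUT --dissociate so the alternates link still exists when the
# Git LFS checkout filter runs and can borrow LFS objects from the reference.
git clone --reference /shared/repo-reference.git https://github.com/owner/repo.git build-workspace
cd build-workspace

# Now break the link manually: fold borrowed Git objects into a local pack,
# then remove the alternates file.
git repack -a -d
rm .git/objects/info/alternates
```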
Because of how Git LFS works, this may not fully dissociate the LFS objects. If LFS has local access (on the same volume) to the reference repository during the checkout, it will use hard links to the files rather than copying them. This means that it continues to have a dependency on the reference repository for those objects. If this is a problem, you can remove the LFS folder and manually copy the objects to the file system.
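If you need to break those hard links, one way (a hypothetical sequence, run from inside the new clone) is to replace the LFS object store with real copies, since `cp` writes new files rather than preserving the links:

```bash
# Replace hard-linked LFS objects with independent copies.
mv .git/lfs/objects .git/lfs/objects.linked
cp -R .git/lfs/objects.linked .git/lfs/objects
rm -rf .git/lfs/objects.linked
```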
Alternatively, LFS objects are automatically dissociated if the reference repository lives on a different file system, such as a network share or a different physical drive. The OS can’t create hard links across file systems, so LFS copies the files instead, which effectively dissociates them.
Getting to know Git
Whether you’re managing a cluster of build servers or a development team, understanding how to leverage local references can dramatically reduce latency, network costs, and infrastructure overhead. Used properly (and with appropriate security considerations), reference repositories can make it easier and more practical to work with Git. I encourage you to experiment with this technique in your own processes. You might find this approach transforms how you think about using Git.