Continuing this series on GitHub custom images, let’s tackle a common performance challenge: cloning large repositories. If you’ve ever waited several minutes for a massive monorepo to clone at the start of every workflow run, you know how frustrating this can be. In the last few posts, you learned some tricks you can use with image creation. Today, you’ll put that knowledge to work as you cache repositories directly on your custom images.
Why cache repositories?
Every time a workflow runs, the actions/checkout action clones your repository from scratch. For small repositories, this takes just a few seconds. But for large repositories – especially monorepos with years of history – this can take several minutes. When you’re running dozens or hundreds of workflows per day, that time can add up quickly.
The solution is to pre-cache the repository on your custom image. When the workflow runs, Git uses the cached repository as a reference and downloads only the objects that were added since the image was created. This can reduce clone times from minutes to seconds.
How reference clones work
Git has a built-in feature called reference clones (or “alternate object directories”) that makes this possible. When you clone with the --reference flag, Git looks for objects in the reference repository before downloading them from the remote. If the object already exists locally, Git skips the download entirely. The basic syntax is git clone --reference /path/to/cached-repo https://github.com/owner/repo.git.
Using references means you only download the commits, trees, and blobs that were created after your cached copy was made. For example, suppose you have a 10 GB repository, and the commits made since the cache was built added 50 MB. Cloning with a reference means you download only that 50 MB of new data instead of the full 10 GB, making the clone operation much faster.
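To see the mechanism in isolation, here is a minimal sketch that runs entirely in a temporary directory. The directory names (origin, cache.git, clone) and the demo identity are assumptions for illustration, and a local file:// URL stands in for the GitHub remote.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
# Identity used only for the throwaway commits in this sandbox
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# 1. A stand-in for the remote repository
git init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/main
echo "v1" > "$work/origin/file.txt"
git -C "$work/origin" add file.txt
git -C "$work/origin" commit -q -m "initial history"

# 2. The cached copy, as it would exist on the image
git clone -q --mirror "file://$work/origin" "$work/cache.git"

# 3. New work lands on the remote after the cache was built
echo "v2" > "$work/origin/file.txt"
git -C "$work/origin" commit -q -am "newer commit"

# 4. Clone using the cache as a reference
git clone -q --reference "$work/cache.git" "file://$work/origin" "$work/clone"

# The alternates file records where Git looks for already-known objects
alternates="$work/clone/.git/objects/info/alternates"
cat "$alternates"
git -C "$work/clone" count-objects -v
```

The alternates file in the new clone lists the cache's objects directory, which is how Git knows to look there before asking the remote. The working tree still reflects the newest commit, even though the cache predates it.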
Security considerations
Before you implement this pattern, there’s an important security consideration: anyone who can use a runner with your custom image will have read access to the cached repository. If you have strict access controls on the repository, caching it on a shared image may not be appropriate. Confirm that every potential user of the image should have access to the repository before you cache it.
Setting up the GitHub App
To clone a private repository during image creation, you’ll need authentication. The best approach is to use a GitHub App, which provides fine-grained permissions and short-lived tokens. Here’s how to set it up:
- Create a GitHub App in your organization with contents: read permission for repositories. The app doesn’t need any other permissions. If you’re not sure how to do this, my colleague Josh Johanning has a great guide you can use.
- Install the GitHub App in your organization and grant it access to the specific repository you want to cache.
- Store the app credentials as secrets in the repository where you’ll build your custom image:
  - APP_ID: The GitHub App’s ID
  - APP_PRIVATE_KEY: The GitHub App’s private key
Creating the custom image
Here’s a complete workflow that caches a repository on a custom image:
name: Build Custom Image with Cached Repository
on:
  workflow_dispatch:
  schedule:
    # Rebuild weekly to keep the cache fresh
    - cron: '0 0 * * 0'

jobs:
  build-image:
    runs-on: larger-runner-demo
    snapshot:
      image-name: my-cached-image
      version: ${{ github.run_number }}
    permissions:
      contents: read
    steps:
      - name: Generate GitHub App token
        id: app-token
        uses: actions/create-github-app-token@v2.2.1
        with:
          app-id: ${{ secrets.APP_ID }}
          private-key: ${{ secrets.APP_PRIVATE_KEY }}
          owner: ${{ github.repository_owner }}
          repositories: |
            large-repo

      - name: Create cache directory
        run: mkdir -p /opt/cached-repos

      - name: Clone repository to cache
        env:
          REPO_TOKEN: ${{ steps.app-token.outputs.token }}
        run: |
          git clone --mirror \
            "https://x-access-token:${REPO_TOKEN}@github.com/your-org/large-repo.git" \
            /opt/cached-repos/large-repo

          # Remove the remote to avoid storing a token on disk
          git -C /opt/cached-repos/large-repo remote remove origin

      - name: Configure environment variable
        run: |
          echo "CACHED_LARGE_REPO=/opt/cached-repos/large-repo" | \
            sudo tee -a /etc/environment

A few things to note about this workflow:
- Using --mirror creates a repository that includes all refs. This ensures the cache contains all branches and tags. This option is not available in actions/checkout (which defaults to a shallow copy of a single branch). If you don’t need a complete mirror, you can use the Action instead. Just make sure to set persist-credentials: false and to explicitly set a path for the checkout.
- Removing the remote after cloning prevents the token from being stored in the repository’s config file. The cached repository is only used as a local reference, so it doesn’t need a remote.
- Adding the cache location to /etc/environment makes it available to the workflow without hardcoding paths.
- Running the workflow on a schedule keeps the cache reasonably fresh, reducing the delta that needs to be downloaded.
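The first two notes can be checked locally. The following is a rough sketch, assuming a throwaway repository with a hypothetical feature branch and v1.0 tag, and a file:// URL in place of the tokenized GitHub URL. It lists the refs a mirror clone captures, then confirms that removing the remote clears the stored URL while leaving the branches and tags in place.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
# Identity used only for the throwaway commit in this sandbox
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in remote with two branches and a tag
git init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/main
echo "hello" > "$work/origin/README"
git -C "$work/origin" add README
git -C "$work/origin" commit -q -m "first commit"
git -C "$work/origin" branch feature
git -C "$work/origin" tag v1.0

# Mirror clone: copies every ref, not just the default branch
git clone -q --mirror "file://$work/origin" "$work/cache.git"
refs_before="$(git -C "$work/cache.git" for-each-ref --format='%(refname)')"
echo "$refs_before"

# Drop the remote so no URL (or embedded token) stays in the config
git -C "$work/cache.git" remote remove origin

# git config exits non-zero when the key is absent
if git -C "$work/cache.git" config --get remote.origin.url >/dev/null; then
  url_present=yes
else
  url_present=no
fi
refs_after="$(git -C "$work/cache.git" for-each-ref --format='%(refname)')"
echo "remote URL still stored: $url_present"
```

Git may print a note that branches outside refs/remotes/ were not removed. That is the behavior you want here, since those branches are the cache.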
Using the cached repository in workflows
Once your image is ready, you can use the cached repository in your workflows. Here’s how to clone using the reference:
jobs:
  build:
    runs-on: my-cached-image
    steps:
      - name: Clone with reference
        env:
          # Pass the token explicitly; run steps do not receive it by default
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          git clone --reference "$CACHED_LARGE_REPO" \
            "https://x-access-token:${GITHUB_TOKEN}@github.com/your-org/large-repo.git"
          cd large-repo
          git config --global --add safe.directory "$(pwd)"
          git checkout "${GITHUB_SHA}"

The clone operation will be significantly faster because Git only downloads objects that don’t exist in the cached reference repository. Why am I not using actions/checkout? Currently, it doesn’t support the --reference option, so a manual clone is necessary. Normally, the action also handles the safe directory setting, so you’ll want to add that as well to ensure the current folder is trusted.
After that, you can proceed with checking out the appropriate code. If you’re planning to use a specific branch, then you can optionally add --branch <branch-name> to the clone command instead.
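As a small illustration of the --branch variant, this sketch clones a hypothetical release branch through a reference in a local sandbox. The repository layout, branch name, and file:// URL are assumptions made for the example. In a real workflow the reference would be $CACHED_LARGE_REPO and the URL would be your GitHub remote.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
# Identity used only for the throwaway commits in this sandbox
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in remote with a main branch and a release branch
git init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/main
echo "base" > "$work/origin/app.txt"
git -C "$work/origin" add app.txt
git -C "$work/origin" commit -q -m "base"
git -C "$work/origin" checkout -q -b release
echo "release" > "$work/origin/app.txt"
git -C "$work/origin" commit -q -am "release work"
git -C "$work/origin" checkout -q main

# The cached mirror, as it would exist on the image
git clone -q --mirror "file://$work/origin" "$work/cache.git"

# Clone a specific branch while borrowing objects from the cache
git clone -q --reference "$work/cache.git" --branch release \
  "file://$work/origin" "$work/clone"

current_branch="$(git -C "$work/clone" rev-parse --abbrev-ref HEAD)"
echo "checked out: $current_branch"
```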
Storage considerations
When caching repositories on custom images, keep storage limits in mind. GitHub larger runners provide storage based on the machine size, not a separate storage quota. If your cached repository is very large, you’ll need a larger runner size. For example, if you need 500 GB for your cached repository, you will require at least a 16-core runner (which provides 600 GB of storage).
Remember that this storage is shared between:
- The operating system and pre-installed tools
- Your cached repository
- Any other cached dependencies (npm packages, Maven artifacts, etc.)
- Working directory space for the actual job
Always plan accordingly and leave buffer space for job execution.
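One way to keep an eye on this is a quick check at the start of a job. The sketch below is only an assumption about how you might do it: the BUFFER_KB variable and its 10 GB default are hypothetical, and the script falls back to the current directory when CACHED_LARGE_REPO is not set.

```shell
#!/bin/sh
set -eu

# Hypothetical fallback: use the current directory when the variable
# provided by the image is not set.
repo_path="${CACHED_LARGE_REPO:-.}"
# Hypothetical buffer to keep free for the job itself, in KB (10 GB here)
buffer_kb="${BUFFER_KB:-10485760}"

# Size of the cached repository in KB
repo_kb="$(du -sk "$repo_path" | awk '{print $1}')"
# Free space on the filesystem that holds it, in KB (POSIX output format)
free_kb="$(df -Pk "$repo_path" | awk 'NR==2 {print $4}')"

echo "cached repo: ${repo_kb} KB, free: ${free_kb} KB"

if [ "$free_kb" -lt "$buffer_kb" ]; then
  echo "warning: less than ${buffer_kb} KB free for job execution" >&2
  low_space=yes
else
  low_space=no
fi
```

Running the same check at the end of the image build workflow shows how much room the cache leaves before the image is published.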
Summary
Caching repositories on custom images is a powerful technique for speeding up workflows that work with large repositories. By using Git’s reference clone feature, you can reduce clone times from minutes to seconds, improving developer productivity and reducing CI/CD costs.
