Working with large Git repositories

Git is a fantastic tool for distributed development, though it does sometimes cause annoyances if repositories get very large. While it's possible to split larger repositories into smaller ones and link them using sub-repositories, this approach comes with its own set of challenges, which eventually leads most developers who use sub-repositories back to the so-called monorepo: one repository for the whole project.

A monorepo has one big drawback though: since it contains the whole project, likely including dependencies and static assets, as well as the full project history, it tends to grow very large. Git itself does not have a problem with this, but once we need to transfer this data over the network, things invariably slow down. The initial clone may take ages, and pulls from the remote repository will be slower and often contain unneeded data, e.g. new UI assets that are not relevant to a backend engineer.

For the developer working locally, this is not an insurmountable problem: subsequent commits and pushes only transmit the changed data, so they will be as fast as ever, and disk space is usually plentiful on developer machines.

The big problem lies with build and CI pipelines: these often run on ephemeral virtual machines in the cloud that clone the whole repository, run a test or build script, archive the result and get destroyed again. A large part of the (billable!) lifetime of the VM is spent cloning the monorepo, and tests suddenly become slow and expensive.

Git 2.25 introduces the git sparse-checkout command to fix this.

Sparse clones

A sparse clone of a repository contains only some of the files of the original. The files that are to be included in the sparse clone are specified by a filter, which uses the same syntax as a .gitignore file.
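As an illustration, a hypothetical filter might look like this (the patterns end up in the .git/info/sparse-checkout file and are best managed through the git sparse-checkout command shown below; the folder names here are made up):

# include documentation and sources in the working tree
/docs/
/src/
# but skip the vendored third-party code below src/
!/src/vendor/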

If you are using a Mac and want to try these examples: macOS currently ships with Git 2.23.3, so for the commands below to work, install a more recent version with Homebrew.
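With Homebrew, the upgrade is a one-liner (just make sure the Homebrew-installed git comes first in your PATH):

brew install git
git version   # should now report 2.25 or newer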

To create a sparse checkout, start with a minimal clone (--no-checkout) without any history (--depth 1) or file contents (--filter=blob:none), combined with the --sparse option.

As an example, consider the Kubernetes repository. A full clone takes about 45 seconds on my machine and weighs in at about one gigabyte.

time git clone \
  https://github.com/kubernetes/kubernetes.git

Cloning into 'kubernetes'...
remote: Enumerating objects: 218, done.
Receiving objects: 100% (1173921/1173921), 714.81 MiB | 23.43 MiB/s, done.
Resolving deltas: 100% (843104/843104), done.

40.35s user 16.10s system 125% cpu 44.956 total
du -hs kubernetes

1018M	kubernetes

The empty clone is created in seven seconds, and the metadata contained therein takes up only about a megabyte of disk space.

time git clone \
  --filter=blob:none --no-checkout --depth 1 --sparse \
  https://github.com/kubernetes/kubernetes.git

Cloning into 'kubernetes'...
remote: Enumerating objects: 4784, done.
Receiving objects: 100% (4784/4784), 900.42 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (11/11), done.

0.05s user 0.06s system 1% cpu 7.043 total
du -hs kubernetes

1.2M	kubernetes
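Before checking out any files, we can verify what the minimal clone contains: the history is truncated to a single commit, and the sparse filter only covers the files in the repository root. A quick sanity check (output omitted here):

git -C kubernetes log --oneline          # a single commit, due to --depth 1
git -C kubernetes sparse-checkout list   # the patterns currently in effect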

Now, to work on the Kubernetes tests, we can add the test/ folder to our sparse checkout, fetch it from the remote, and start working.

cd kubernetes
git sparse-checkout add /test/
time git checkout

remote: Enumerating objects: 1510, done.
remote: Counting objects: 100% (1510/1510), done.
remote: Compressing objects: 100% (1346/1346), done.
remote: Total 1510 (delta 367), reused 397 (delta 118), pack-reused 0
Receiving objects: 100% (1510/1510), 2.40 MiB | 6.09 MiB/s, done.
Resolving deltas: 100% (367/367), done.
Updating files: 100% (1564/1564), done.
Your branch is up to date with 'origin/master'.

0.25s user 0.29s system 7% cpu 7.306 total
du -hs test

 13M	test

In this case, the sparse checkout took an additional seven seconds, so the total speedup (45 s / 14 s ≈ 3.2) is not very impressive, but the space saving (1018 MB / 14 MB ≈ 72.7) certainly is. The speedup is usually greater in real-world scenarios, where the repository may be much larger and clone time is completely dominated by the network transfer.
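The sparse checkout can also be adjusted at any point after the initial setup. A sketch of the relevant commands (not part of the session above):

git sparse-checkout add /cmd/    # pull in another folder
git sparse-checkout set /test/   # replace all patterns at once
git sparse-checkout disable      # restore the full working tree; missing file contents are fetched on demand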