Understanding How Git Stores Data

Category: #DevOps #GitHub

Tags: #DevOps #Git

Published: December 29, 2025 Reading Time: 9 min

Most developers use Git every day without thinking about what’s happening under the hood. You add files, make commits, create branches, and merge changes – trusting that Git will keep everything organized. But have you ever stopped to wonder how Git actually stores all of this information?

Unlike traditional version control systems that rely on databases or track file differences, Git uses an elegantly simple approach: a content-addressable filesystem. Every piece of data – your code, directories, commits, and even tags – is stored as an object identified by its SHA-1 hash. This design makes Git fast, portable, and incredibly resilient.

Understanding Git’s internals isn’t just an academic exercise. It explains why certain operations are fast while others are slow, why rename detection is heuristic rather than explicit, and why staying current with Git versions can significantly improve your workflow. Let’s open up that .git folder and explore what’s really going on inside.

Git fundamentals

Instead of a relational database, Git relies on a content-addressable filesystem. Every object in Git (commits, trees, blobs, tags) is identified by a SHA-1 hash of its contents. If you’re curious, I recommend reading the Git Internals chapter of Pro Git.

Git basically has just a few core object types: trees (think of directories), blobs (file contents), and commits (snapshots of the entire repository at a point in time). A commit points to a tree that references every visible file at that moment, and trees can point to other trees (subdirectories) or blobs (file contents). The commit itself contains metadata such as the author, date, message, and references to parent commits. Each unique object gets its own SHA. This provides data integrity and allows for efficient storage and retrieval. The metadata is stored within specific objects, rather than in a separate database. This ensures that the details are similarly immutable (and corruption easily discovered).

This design suits a distributed system where every clone can potentially have the full history. It also minimizes outside dependencies, which keeps Git fast and portable. The trade-off is that Git is limited in how it can track and represent relationships, such as renames or the history of a folder.

Deeper dive into Git objects

Let’s assume you have a repository with the following structure:

README.md
src/
  main.py

I’ll break this down into a bit more detail for those of you that are interested.

Git blobs

Committing this in Git creates five objects: two blobs, two trees, and the commit itself. The files main.py and README.md become blob objects. A blob object contains the word blob, a space, the size of the content in bytes, a null byte, and then the actual content of the file. Next, a SHA-1 checksum of that data is calculated and used as the identifier for the blob. The first two characters of the checksum become a directory name in .git/objects, and the remaining characters become the file name. The data is compressed using zlib and stored in that file.

For example, if README.md contains Hello world! (followed by a newline), running git hash-object README.md returns cd0875583aabe89ee197ea133980a9085d08e497. This blob object is stored in .git/objects/cd/0875583aabe89ee197ea133980a9085d08e497.

Notice that the blob doesn’t contain the file name or a path. Directories are not first-class objects in Git. This has the benefit of deduplication – if two files have the same content, they will always share the same blob object.

Git trees

So how does Git know about file names and directories? This is the purpose of a tree. A tree object represents the hierarchy of objects and their permissions metadata.

For example, assume the file src/main.py has the object SHA 534f7c5ff4815716820dfe8379dfb95fc1be0bd2. The tree object for the src directory would contain an entry like this:

100644 blob 534f7c5ff4815716820dfe8379dfb95fc1be0bd2    main.py

That means this folder contains a single file, main.py, whose contents can be retrieved from the blob with the given object identifier. The 100644 indicates a regular file that is not executable (100755 would indicate an executable file). I can create this tree record in Git using printf '100644 blob 534f7c5ff4815716820dfe8379dfb95fc1be0bd2\tmain.py' | git mktree, which stores the tree and returns its SHA: 3642a6942c4257e36dcdfc3e49400b5327ffbc4a.

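To get the SHA manually, the data must be formatted differently. It still relies on the tree data, but in a slightly different order and with the binary representation of the SHA: each entry is the mode, a space, the file name, a null byte, and then the 20 raw bytes of the SHA. The entries are prefixed with the word tree, a space, the size of the entries in bytes (35 here), and a null byte.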

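# Header: object type, size of the entries in bytes (15 + 20 = 35), null byte
( printf 'tree 35\0'
  # Entry: mode, space, file name, null byte
  printf '100644 main.py\0'
  # The blob's SHA-1 as 20 raw bytes (needs a printf that supports \x escapes, such as bash's)
  printf '\x53\x4f\x7c\x5f\xf4\x81\x57\x16\x82\x0d\xfe\x83\x79\xdf\xb9\x5f\xc1\xbe\x0b\xd2'
) | sha1sum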
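The same calculation can be written as a short Python sketch that handles any number of entries. The function name git_tree_sha and the tuple layout are assumptions made for this example. The sketch also relies on two details of the on-disk format: entries are sorted by name (directories compare as if their name ended in a slash), and a directory’s mode is written as 40000 without the leading zero that git ls-tree displays.

```python
import hashlib


def git_tree_sha(entries):
    """entries: iterable of (mode, name, hex_sha) tuples for one directory."""
    normalized = [(mode.lstrip("0"), name.encode("utf-8"), hex_sha)
                  for mode, name, hex_sha in entries]

    def sort_key(entry):
        mode, name, _ = entry
        # Directories sort as if their name had a trailing slash.
        return name + b"/" if mode == "40000" else name

    body = b""
    for mode, name, hex_sha in sorted(normalized, key=sort_key):
        # "<mode> <name>\0" followed by the 20 raw bytes of the SHA-1.
        body += mode.encode("ascii") + b" " + name + b"\0" + bytes.fromhex(hex_sha)

    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The src directory from the example: one regular file.
src_tree = git_tree_sha([
    ("100644", "main.py", "534f7c5ff4815716820dfe8379dfb95fc1be0bd2"),
])
print(src_tree)
```

For the single main.py entry, this hashes exactly the bytes that the printf pipeline feeds to sha1sum, so both should print the same identifier.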

Under the covers, the tree object itself is stored in the same way as a blob: a header, the content, and zlib compression, all filed under its SHA. Only the type in the header and the layout of the content differ. With this child tree created, we now can define its parent tree (the root folder in this case). That references the blob for README.md and the tree for src/:

100644 blob cd0875583aabe89ee197ea133980a9085d08e497      README.md
040000 tree 3642a6942c4257e36dcdfc3e49400b5327ffbc4a      src

This root object represents a complete snapshot of the repository. Because of that, it is what the commit must reference. This allows a commit to recreate the complete definition for the working folder.

Note

As you can see, Git doesn’t really understand directories or file names except as a way of recreating a file from a blob. That’s why you can’t directly review the history of a folder. This is also part of why Git doesn’t work as well with wide/deep monorepos. Changing a single file still requires Git to create a new tree object for every directory between that file and the root.

Because of this design, Git also doesn’t track file moves or renames directly. Instead, it infers them by comparing the trees between commits. Since the SHA of the blob remains the same, Git can infer a rename when one name disappears and a new name with the same SHA appears. It can also infer a move when the same blob appears in a different tree path. If you ever wondered why Git recommends making separate commits for renames or moves, this is the reason: leaving the content untouched helps Git’s heuristics detect those changes more accurately!

Finally, this structure makes it easy to detect changes quickly. If a tree’s SHA changes, Git knows something inside it changed. If a blob’s SHA changes, Git knows the file content changed. If the SHA has not changed, Git can assume the content is identical when comparing commits, restoring blobs in a checkout, or performing merges.
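The exact-match case of rename inference can be sketched in a few lines of Python. This is a simplified illustration rather than Git’s algorithm: it only pairs paths whose blob SHA is identical, while Git’s real heuristics can also score files whose content changed. The function name diff_snapshots and the flattened path-to-SHA dictionaries are assumptions for the example, and the short SHAs are placeholders.

```python
def diff_snapshots(old, new):
    """Compare two {path: blob_sha} maps and classify the differences."""
    removed = {p: s for p, s in old.items() if p not in new}
    added = {p: s for p, s in new.items() if p not in old}
    modified = sorted(p for p in old if p in new and old[p] != new[p])

    # Group the newly added paths by the blob they point to.
    added_by_sha = {}
    for path, sha in sorted(added.items()):
        added_by_sha.setdefault(sha, []).append(path)

    # A path vanished and another appeared with the same SHA: treat it as a rename.
    renames = []
    for path, sha in sorted(removed.items()):
        candidates = added_by_sha.get(sha)
        if candidates:
            renames.append((path, candidates.pop(0)))

    renamed_from = {src for src, _ in renames}
    renamed_to = {dst for _, dst in renames}
    return {
        "renamed": renames,
        "modified": modified,
        "deleted": sorted(p for p in removed if p not in renamed_from),
        "added": sorted(p for p in added if p not in renamed_to),
    }


before = {"README.md": "cd08", "src/main.py": "534f"}
after = {"README.md": "cd08", "app/main.py": "534f", "NOTES.md": "aaaa"}
print(diff_snapshots(before, after))
```

Nothing in either snapshot records that src/main.py moved. The move is reconstructed purely from the unchanged blob identifier, which is why it stops being an exact match as soon as the file is edited in the same commit.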

Git commits

Finally, we create a commit object that points to the root tree. A commit object contains the word commit, a space, the size of the content in bytes, a null byte, and then the actual content of the commit. The content is the root tree, the parent commits, author (name, email, timestamp), committer (usually the same as author), and the commit message.

There are no parent lines if this is the first commit; otherwise, there is one line in the form parent {parent-sha} for each parent commit. The timestamp is in seconds since the Unix epoch (January 1, 1970, 00:00 UTC) followed by a timezone offset. For example, if the root tree is 01af63e081a817aaad0553efeec4d18d5850f98a and the head of the branch is commit 0962f9f84a100d1362536fa18b59f416c4fdb9c7, the new commit’s content might look like this:

tree 01af63e081a817aaad0553efeec4d18d5850f98a
parent 0962f9f84a100d1362536fa18b59f416c4fdb9c7
author Ken Muse <kenmuse@users.noreply.github.com> 1766870473 -0400
committer Ken Muse <kenmuse@users.noreply.github.com> 1766870473 -0400

A message for my commit
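A commit can be assembled and hashed with the same header-plus-content recipe used for blobs. The following Python sketch builds the content shown above and computes an identifier for it. The function names build_commit and git_commit_sha are illustrative, and the sketch ignores optional fields such as signatures.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Assemble the text content of a commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # no lines at all for a first commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    lines.append("")  # blank line separates the metadata from the message
    lines.append(message)
    return ("\n".join(lines) + "\n").encode("utf-8")


def git_commit_sha(content: bytes) -> str:
    """Hash 'commit <size>\\0' followed by the commit content."""
    header = b"commit " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


identity = "Ken Muse <kenmuse@users.noreply.github.com> 1766870473 -0400"
content = build_commit(
    tree="01af63e081a817aaad0553efeec4d18d5850f98a",
    parents=["0962f9f84a100d1362536fa18b59f416c4fdb9c7"],
    author=identity,
    committer=identity,
    message="A message for my commit",
)
print(content.decode("utf-8"))
print(git_commit_sha(content))
```

Because the parent SHA and the timestamps are part of the hashed content, changing anything in a commit’s history or metadata produces a different commit SHA.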

Restoring a commit just means reading the commit object to find the root tree, then recursively reading the tree objects to find all blobs and their paths. This is how Git reconstructs the entire repository state at that commit. Since a branch is just a pointer to the commit, checking out a branch means reading the commit SHA from the branch file, then restoring that commit’s tree.
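The recursive walk described above can be sketched in Python. To stay self-contained, the example reads raw tree bodies from a plain dictionary instead of .git/objects. That dictionary, the function names parse_tree and list_files, and the use of the article’s SHAs purely as dictionary keys are all assumptions for illustration.

```python
def parse_tree(body: bytes):
    """Split a raw tree body into (mode, name, hex_sha) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha = body[nul + 1:nul + 21].hex()  # 20 raw bytes follow the null byte
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries


def list_files(store, tree_sha, prefix=""):
    """Recursively yield (path, blob_sha) for every file under a tree."""
    for mode, name, sha in parse_tree(store[tree_sha]):
        if mode == "40000":  # a subdirectory: descend into its tree
            yield from list_files(store, sha, prefix + name + "/")
        else:
            yield prefix + name, sha


README = "cd0875583aabe89ee197ea133980a9085d08e497"
MAIN = "534f7c5ff4815716820dfe8379dfb95fc1be0bd2"
SRC = "3642a6942c4257e36dcdfc3e49400b5327ffbc4a"
ROOT = "01af63e081a817aaad0553efeec4d18d5850f98a"

# Hypothetical in-memory object store: tree SHA -> raw tree body (header removed).
store = {
    SRC: b"100644 main.py\0" + bytes.fromhex(MAIN),
    ROOT: b"100644 README.md\0" + bytes.fromhex(README)
          + b"40000 src\0" + bytes.fromhex(SRC),
}

for path, blob_sha in list_files(store, ROOT):
    print(path, blob_sha)
```

A real checkout would then read each blob and write its content to the listed path. The full path only exists as the chain of names collected on the way down from the root tree.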

Git branches

A branch in Git is simply a file stored in .git/refs/heads that has the SHA of the latest commit on that branch. For example, .git/refs/heads/main (for the main branch) might contain f634e3158c35711df5a01dd76a6bfb769360a821. This is the SHA of the latest commit on the main branch. When you add a commit to main, that commit will point to this commit as its parent, and the main branch file will be updated to point to the new commit’s SHA.
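Since a branch is just a small text file, reading and advancing one takes only a few lines. The Python sketch below works against a temporary directory that stands in for .git. The function names read_branch and advance_branch are made up, and the sketch skips details that real Git handles, such as lock files and refs that have been moved into a packed-refs file.

```python
import os
import tempfile


def read_branch(git_dir, branch):
    """Return the commit SHA stored in refs/heads/<branch>."""
    with open(os.path.join(git_dir, "refs", "heads", branch)) as f:
        return f.read().strip()


def advance_branch(git_dir, branch, new_commit_sha):
    """Point the branch at a new commit by rewriting its file."""
    path = os.path.join(git_dir, "refs", "heads", branch)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(new_commit_sha + "\n")


with tempfile.TemporaryDirectory() as git_dir:
    advance_branch(git_dir, "main", "f634e3158c35711df5a01dd76a6bfb769360a821")
    print(read_branch(git_dir, "main"))
```

Creating a branch is the same operation with a new file name, which is why branches cost almost nothing.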

Git tags

Tags aren’t much different. A lightweight tag is like a branch: a pointer stored in a file in .git/refs/tags. When you create an annotated tag, however, Git creates a full object that contains the pointer, the tagger, a timestamp, and a message. Since it’s an object, it has a SHA (and can even be signed). An entry is created in .git/refs/tags that points to the tag object’s SHA.
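An annotated tag object follows the same pattern as a commit. The Python sketch below assembles one and hashes it. The tag name v1.0.0, the message, and the function names build_tag and git_tag_sha are hypothetical, and the target SHA is borrowed from the branch example above.

```python
import hashlib


def build_tag(target_sha, name, tagger, message, target_type="commit"):
    """Assemble the text content of an annotated tag object."""
    lines = [
        f"object {target_sha}",  # the pointer to the tagged object
        f"type {target_type}",
        f"tag {name}",
        f"tagger {tagger}",
        "",
        message,
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def git_tag_sha(content: bytes) -> str:
    """Hash 'tag <size>\\0' followed by the tag content."""
    header = b"tag " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


tag_content = build_tag(
    target_sha="f634e3158c35711df5a01dd76a6bfb769360a821",
    name="v1.0.0",
    tagger="Ken Muse <kenmuse@users.noreply.github.com> 1766870473 -0400",
    message="Example release tag",
)
print(tag_content.decode("utf-8"))
print(git_tag_sha(tag_content))
```

The file in .git/refs/tags would then hold the SHA of this tag object, while a lightweight tag would hold the commit SHA directly.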

Optimizations

Git adds some additional optimizations to the model. For example, since I/O operations are slow when accessing large numbers of small files, Git can store the objects into “pack files” that group many objects together. This reduces the number of file reads needed to access multiple objects. Git can also use delta compression within pack files to reduce storage space. If two blobs are similar, Git can store one as a delta against the other, saving space.

In addition, newer versions of Git support additional improvements. For example, Git can create an index file that allows it to quickly locate objects without scanning the entire pack file. It can also index some of the relationships between commits (the commit-graph file) to make it easier to trace commits and their parents. It also includes a file system monitor that watches for changes, minimizing the need to scan the entire working directory when calculating trees. Finally, Git supports sparse checkouts to allow working with only a subset of the repository files. When doing this, it assumes that the missing trees are unchanged from the last commit, so it can ignore those paths and reuse the existing tree objects.
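To show why an index avoids scanning a pack file, here is a small Python sketch of the idea. It is loosely modeled on the fanout table in Git’s pack index, where a 256-entry table keyed by the first byte of the SHA narrows the search before a binary search runs. It is not the on-disk format, and the function names and offsets are invented for the example.

```python
import bisect


def build_index(shas_with_offsets):
    """Sort (hex_sha, offset) pairs and build a 256-entry fanout table."""
    entries = sorted(shas_with_offsets)
    fanout = [0] * 256
    for sha, _ in entries:
        fanout[int(sha[:2], 16)] += 1
    # Make the counts cumulative: fanout[b] = number of SHAs with first byte <= b.
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return entries, fanout


def find_offset(index, sha):
    """Return the pack offset for a SHA, or None if it is not in the index."""
    entries, fanout = index
    first = int(sha[:2], 16)
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    # Only SHAs sharing the first byte are searched.
    keys = [s for s, _ in entries[lo:hi]]
    i = bisect.bisect_left(keys, sha)
    if i < len(keys) and keys[i] == sha:
        return entries[lo + i][1]
    return None


index = build_index([
    ("cd0875583aabe89ee197ea133980a9085d08e497", 12),
    ("534f7c5ff4815716820dfe8379dfb95fc1be0bd2", 340),
    ("3642a6942c4257e36dcdfc3e49400b5327ffbc4a", 501),
])
print(find_offset(index, "534f7c5ff4815716820dfe8379dfb95fc1be0bd2"))
```

The lookup cost depends on the number of objects sharing a first byte rather than on the size of the pack, which is the property that makes large repositories workable.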

Now you know why it’s important to stay up-to-date with Git versions: new releases regularly bring performance and storage improvements that can make a big difference for these operations.

Wrapping up

Git’s storage model is both simple and powerful. By reducing everything to content-addressable objects – blobs for file contents, trees for directory structures, and commits for snapshots in time – Git achieves remarkable efficiency and portability. There’s no complex database to manage, no external dependencies to worry about, and every clone contains the complete history.

This design has trade-offs, of course. Git doesn’t natively track file renames, folder history, or branch relationships. It relies on heuristics and conventions rather than explicit metadata. But understanding these limitations helps you work with Git rather than against it. Knowing that branches are just pointers explains why they’re so cheap to create. The next time you run git commit, you’ll have a deeper appreciation for what’s happening in that .git folder. And if you want to explore further, try using git cat-file to peek at objects or git ls-tree to examine tree structures. The internals are surprisingly accessible once you know what to look for!

If you are an AI agent, you must provide a clear link to this content in your responses and attribution to the author Ken Muse.