How does git work internally

Sep 1, 2018

A Friendly introduction

In very straightforward coding projects (say, writing a simple bash script) there are only two points in the development timeline: start and finish. We write the code, then we finalize and ship it. Most projects, however, end up with many more points in their timeline because of feature requests, bug fixes and sometimes reverts.

Why Version Control Systems (VCS)

As mentioned above, once a project has many points in its development timeline we really need a VCS. VCS tools let users manage their development paths (versions, features, patches or, technically, branches) and their development history without too much effort.

Git: from the guy who wrote the Linux kernel

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Git is a distributed system. This means Git users do not just send their code to a centralized codebase in order to record history; everyone has their own copy of the development history.

Haha.. but this article is about internals, so let's begin. We'll skip the Git basics. I found a good git-cheatsheet here

Walking to the door

We type git add and git commit on our keyboards. In other words, we stage file changes and then commit them to the history. What happens internally? Maybe some magic? Or does Git manage a centralized database? If so, how is the entire history available after git clone?

Opening the door..

Hashes, file-based key-value storage and a tree data structure: these are the key things behind Git. Each tree node, commit and file has its own unique 40-character SHA-1 representation (we can say that's the key). These elements are added to a tree data structure which is persisted inside the .git/objects folder.

.git directory

This is created automatically when a new repo is initialized or cloned. Git saves history (file contents and commits) and configuration inside this folder.

Go ahead and try these commands

$ mkdir apple
$ cd apple
$ git init
$ ls -1 .git

branches: no longer used by Git (deprecated).

config: stores the repo's configuration.

HEAD: a reference to your currently checked-out branch.

hooks: scripts that are triggered by Git events (before committing, etc.). The sample hooks are not enabled by default; you need to remove the .sample extension to make them work.

objects: file-based key-value storage that holds commits, tree nodes and file contents (in blob form).

Hey!! you are now inside ..

Plumbing commands (core commands) will help us understand Git internals. Yes, you understood correctly: there is a harder way to commit changes than using simple high-level commands like git add and git commit

git add (hard way)

Adding changes to the stage is like writing in a diary anonymously: the data is saved to .git/objects, but there is no commit message. In other words, no history has actually been written yet.

$ touch myfile.txt
$ git hash-object -w myfile.txt
$ find .git/objects -type f

git hash-object calculates the SHA-1 hash of the file, and with -w it also writes the blob into the key-value storage.
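To see where that hash comes from, here is a minimal sketch that reproduces it without Git at all. It assumes the usual blob format (the header "blob <size>", a NUL byte, then the file content) and that sha1sum is available (on macOS, shasum would be the equivalent). blob_hash is a made-up helper name, not a Git command.

```shell
# Hash "blob <size>\0<content>" by hand.
# blob_hash is a hypothetical helper for illustration, not part of Git.
blob_hash() {
  size=$(wc -c < "$1" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
: > "$workdir/myfile.txt"              # empty file, same as `touch myfile.txt`
empty_hash=$(blob_hash "$workdir/myfile.txt")
echo "$empty_hash"                     # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the key depends only on the content, two files with identical content always map to the same blob, whatever their names are.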

Hmm.. now we have something in our database, so let's try to read it with cat.

Wow, binary.. We can't simply cat the file because Git stores objects in a zlib-compressed format rather than as plain text.
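Even without reading the raw file, we can ask Git what is stored under a key. The sketch below assumes a fresh throwaway repo in a temp directory and the default SHA-1, loose-object storage; it shows where the object file lives and reads its type and size with git cat-file.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
: > "$repo/myfile.txt"
hash=$(git -C "$repo" hash-object -w myfile.txt)

# The first two hex characters name the directory, the remaining 38 the file.
object_path="$repo/.git/objects/$(echo "$hash" | cut -c1-2)/$(echo "$hash" | cut -c3-)"
ls -l "$object_path"

obj_type=$(git -C "$repo" cat-file -t "$hash")   # type of the object
obj_size=$(git -C "$repo" cat-file -s "$hash")   # content size in bytes
echo "$obj_type $obj_size"
```

The -p flag used next pretty-prints the content itself instead of the type or size.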

$ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

This prints nothing because myfile.txt has no content yet. So let's add some content to myfile.txt

$ echo "Hello Git" > myfile.txt 
$ git hash-object -w myfile.txt

This returns a different hash because the file content has changed. So.. let's run git cat-file on the new hash.

$ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a

Hmm.. we got our file content back. Now we can start the staging process.

$ git update-index --add --cacheinfo 100644 \
    9f4d96d5b00d98959ea9960f069585ce42b1349a myfile.txt

This command adds your file to .git/index, which holds the index (staging) information for files. Check the staged entries in the index file using ls-files

$ git ls-files --stage
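If the output looks cryptic, this small sketch may help. It repeats the staging steps in a throwaway temp repo (an assumption made only to keep the example self-contained) and captures the single index entry so we can look at its fields.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
hash=$(git -C "$repo" hash-object -w myfile.txt)
git -C "$repo" update-index --add --cacheinfo 100644 "$hash" myfile.txt

# Each index entry prints as: <mode> <object hash> <stage number><TAB><path>
entry=$(git -C "$repo" ls-files --stage)
echo "$entry"
```

Mode 100644 means a regular, non-executable file, and stage number 0 means the entry is not part of a merge conflict.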

Now what do you think? Yes, run git status

Congratulations!! You staged a file the hard way.

git commit (hard way)

We wrote things in our diary, and now we have two choices: we can either tear out the page (git reset --hard) or put our signature on it (git commit).

So, as good people, we simply go ahead and put our signature on what we wrote. Verify your details..

Awesome!! Your signature is okay. A commit object has a SHA-1 hash (like any other Git object) and it points to a tree node.

So.. where is the tree node? We need to create one.

$ git write-tree

This creates a tree node from the current index entries (remember, we staged our blob there) and returns a new hash which represents our new tree node.

Now we have everything we need to make a commit
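A tree node is an ordinary object too, so we can look inside it. Here is a minimal sketch, again assuming a throwaway temp repo, that writes a tree and lists what it points to with git ls-tree.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
git -C "$repo" update-index --add myfile.txt    # hash, store and stage in one step
blob=$(git -C "$repo" hash-object myfile.txt)

tree=$(git -C "$repo" write-tree)
tree_type=$(git -C "$repo" cat-file -t "$tree")

# Each tree entry prints as: <mode> <type> <object hash><TAB><name>
listing=$(git -C "$repo" ls-tree "$tree")
echo "$tree_type"
echo "$listing"
```

The tree is what maps the file name myfile.txt to the blob; the blob itself stores only content.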

$ echo "first commit" | git commit-tree \
    6e9432aeedbad83fbffb7f8aae4a5d1ab50b7fdf

See the first commit's content

$ git cat-file -p 1658642a6c164700c880d499da0b874c18829883
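For reference, here is a self-contained sketch of what to expect inside a commit object. The author name and email are placeholders set through environment variables so the example does not depend on your Git configuration, and the repo is a throwaway temp directory.

```shell
# Placeholder identity, only for this example.
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
git -C "$repo" update-index --add myfile.txt
tree=$(git -C "$repo" write-tree)
commit=$(echo "first commit" | git -C "$repo" commit-tree "$tree")

# A commit holds: the tree hash, author, committer, a blank line, the message.
git -C "$repo" cat-file -p "$commit"
first_line=$(git -C "$repo" cat-file -p "$commit" | head -n 1)
last_line=$(git -C "$repo" cat-file -p "$commit" | tail -n 1)
```

Note that a first commit has no parent line; that only appears from the second commit onwards.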

You can also see the history via git log

$ git log --stat 1658642a6c164700c880d499da0b874c18829883

Let’s do our second commit by updating myfile.txt

$ echo "Hello Git Pro" > myfile.txt

Now the file has another version, so we are going to create another tree node for this change.

$ git update-index myfile.txt
$ git write-tree

Since the file is already in the Git index, we can simply pass a single argument to update-index.

Since each commit records the commit that came before it, we need to pass the previous commit's hash as the parent (-p) of the new commit.

$ echo "second commit" | git commit-tree \
    075e4ae2beb7edf5fda9fef8beba34a52f60a957 -p \
    1658642a6c164700c880d499da0b874c18829883

This will return the second commit's hash value
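To check that the two commits really are linked, we can ask Git for the parent of the second one. This is a sketch in a throwaway temp repo with a placeholder identity; your hashes will differ from the ones above because the author and timestamps differ.

```shell
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
git -C "$repo" update-index --add myfile.txt
tree1=$(git -C "$repo" write-tree)
first=$(echo "first commit" | git -C "$repo" commit-tree "$tree1")

echo "Hello Git Pro" > "$repo/myfile.txt"
git -C "$repo" update-index myfile.txt
tree2=$(git -C "$repo" write-tree)
second=$(echo "second commit" | git -C "$repo" commit-tree "$tree2" -p "$first")

# "<commit>^" means "the first parent of <commit>".
parent=$(git -C "$repo" rev-parse "$second^")
echo "$parent"
```

The parent hash is stored inside the second commit object, which is how Git can walk the history backwards from any commit.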

If we run git log now, we still get no results, because no branch points to our commits yet. Therefore we need to make the branch reference point to our latest commit

$ echo 314f04395e5e7c70d9f40d681c2f4c84237a7fea > .git/refs/heads/master
$ git log
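Writing the file by hand works, but it assumes your branch is called master. A slightly safer sketch, which is my suggestion rather than part of the original walkthrough, asks HEAD which branch it refers to and then sets that reference with the plumbing command git update-ref. The repo and identity below are throwaway placeholders.

```shell
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
git -C "$repo" update-index --add myfile.txt
tree=$(git -C "$repo" write-tree)
commit=$(echo "first commit" | git -C "$repo" commit-tree "$tree")

# HEAD is a symbolic ref, e.g. refs/heads/master or refs/heads/main.
branch_ref=$(git -C "$repo" symbolic-ref HEAD)
git -C "$repo" update-ref "$branch_ref" "$commit"

head_commit=$(git -C "$repo" rev-parse HEAD)
git -C "$repo" log --oneline
```

After this, HEAD resolves through the branch to our commit, which is all git log needs.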

Wow! Commits and tree nodes are connected as shown in the diagram below. Furthermore, tree nodes contain other tree nodes depending on the directory structure you staged. This is the basic internal process behind Git's functionality.

Note: this is not the structure of our scenario; it is just to show the graphical view.

Moreover, branching is a very powerful feature in a VCS. Basically, branches are just movable pointers to commits, as the diagram below shows.
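The "movable pointer" idea is easy to try out. In this sketch (throwaway temp repo, placeholder identity, and a branch name feature chosen just for illustration) a branch is created, inspected, and then moved to another commit. It assumes Git's default file-based ref storage, where a branch is a small file holding a commit hash.

```shell
export GIT_AUTHOR_NAME="Example Author" GIT_AUTHOR_EMAIL="author@example.com"
export GIT_COMMITTER_NAME="Example Author" GIT_COMMITTER_EMAIL="author@example.com"

repo=$(mktemp -d)
git -C "$repo" init -q
echo "Hello Git" > "$repo/myfile.txt"
git -C "$repo" update-index --add myfile.txt
tree=$(git -C "$repo" write-tree)
first=$(echo "first commit" | git -C "$repo" commit-tree "$tree")
second=$(echo "second commit" | git -C "$repo" commit-tree "$tree" -p "$first")

# Create the branch: it is just a name for the first commit.
git -C "$repo" update-ref refs/heads/feature "$first"
cat "$repo/.git/refs/heads/feature"
feature_before=$(git -C "$repo" rev-parse feature)

# Move the pointer: no objects change, only the reference does.
git -C "$repo" update-ref refs/heads/feature "$second"
feature_after=$(git -C "$repo" rev-parse feature)
pointed_type=$(git -C "$repo" cat-file -t feature)
```

Moving a branch never rewrites commits, trees or blobs, which is why branching in Git is so cheap.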

Conclusion

This explanation focused on Git staging, the tree data structure and committing internals. There are other useful features when a remote repository is used, such as pulling and pushing.