How does git work internally
A friendly introduction
When we work on very straightforward code projects (say, writing a simple bash script), there are only two points in our development timeline: start and finish. We write the code first, then we finalize and ship it. Obviously, many projects get more than two points in their development timeline because of feature requests, bug fixes and sometimes reverts.
Why Version Control Systems (VCS)
As mentioned above, if we have many points in our development timeline, we really need a VCS. VCS tools let users manage their development paths (versions, features, patches or, technically, branches) and development histories without too much effort.
Git: from the guy who wrote the Linux kernel
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
Git is a distributed system. This means Git users do not just send their code to a centralized codebase to record the history. Everyone has their own copy of the development history.
Haha.. this article is about internals, so let’s begin. We’ll skip the Git basics. I found a good git-cheatsheet here
Walking to the door
We hit git add and git commit with our keyboards. In other words, we stage file changes and then commit them to the history. What happens internally? Maybe some magic? Or does Git manage a centralized database? Then how is the entire history available after git clone?
Opening the door..
Hashes, file-based key-value storage and a tree data structure: these are the key things behind Git. Each tree node, commit and file has its own unique 40-character SHA-1 hash (we can say that’s the key). These elements are added to a tree data structure that is persisted inside the .git/objects folder.
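To make the "hash as key" idea concrete, here is a small sketch of how a blob’s key can be reproduced by hand. It assumes the sha1sum tool from GNU coreutils is installed and that Git is using its default SHA-1 object format; the sample text is only an illustration.

```shell
# Sample content (illustrative only); any bytes would work.
content='Hello Git'

# Git hashes a small header, "blob <size>" plus a NUL byte, followed by the raw bytes.
# ${#content} counts characters, which equals bytes for this ASCII sample;
# the +1 accounts for the trailing newline added below.
size=$(( ${#content} + 1 ))
manual=$(printf 'blob %d\000%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

# Ask Git to hash the same bytes. Without -w nothing is written to disk.
official=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "manual:   $manual"
echo "official: $official"
```

Both lines should print the same 40-character value, which suggests the key depends only on the content (and its length), not on the file name.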
.git directory
This is created automatically when a new repo is created or cloned. Git saves the history (file contents and commits) and configuration inside this folder.
Go ahead and put your fingers to work on these commands
$ mkdir apple
$ cd apple
$ git init
$ ls -1 .git
branches: Git no longer uses this folder (deprecated).
config: Stores the repo’s configuration.
HEAD: Reference to your current working branch.
hooks: Scripts that are triggered by a Git event (before committing, etc.). These hooks are not enabled by default. You need to remove the .sample extension to make them work.
objects: File-based key-value storage that holds commits, tree nodes and file contents (in blob form).
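If you are curious what the HEAD entry looks like, the sketch below creates a throwaway repo in a temporary directory and prints it. The temporary path is just a scratch location, and the branch name depends on your Git configuration, so it is read back instead of assumed.

```shell
# Create a scratch repository in a temporary directory.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# HEAD is a small text file holding a symbolic reference to the current branch.
head_line=$(cat .git/HEAD)
echo "$head_line"

# The same information through a plumbing command.
branch=$(git symbolic-ref --short HEAD)
echo "current branch: $branch"

# No commits yet, so no branch file exists under refs/heads.
ls .git/refs/heads
```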
Hey!! you are now inside ..
Plumbing commands (core commands) help us understand Git internals. Yes, you understood! There is a harder way to commit changes than using simple high-level commands like git add and git commit.
git add (hard way)
Adding changes to the stage is just like writing in a diary anonymously. The data is saved to .git/objects but there is no commit message. In other words, no history has actually been written yet.
$ touch myfile.txt
$ git hash-object -w myfile.txt
$ find .git/objects -type f
git hash-object calculates the SHA-1 hash and, because of the -w flag, puts the blob file into the key-value storage.
mm.. now we have something in our database. So let’s try to read it with cat.
Wow, binary.. we can’t simply cat it because Git stores objects in a zlib-compressed internal format rather than as plain text.
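Instead of decompressing the file ourselves, we can ask Git what is inside. The sketch below (again in a scratch repo, with illustrative file content) shows where the object lands on disk and reads its type and size back with git cat-file.

```shell
# Scratch repository with one stored blob.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'Hello Git\n' > myfile.txt
hash=$(git hash-object -w myfile.txt)

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38 chars>.
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)
path=".git/objects/$dir/$file"
ls -l "$path"

# The compressed file starts with a header holding the type and size.
type=$(git cat-file -t "$hash")   # object type
size=$(git cat-file -s "$hash")   # content size in bytes
echo "$type $size"
```

For this sample the last line should read "blob 10": nine characters plus the newline.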
$ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
This returns empty content since myfile.txt has no content. So add some content to myfile.txt
$ echo "Hello Git" > myfile.txt
$ git hash-object -w myfile.txt
This returns another hash because the file content has changed. So.. cat-file the new hash.
$ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a
mm.. We got our file content. Now we can start the staging process.
$ git update-index --add --cacheinfo 100644 9f4d96d5b00d98959ea9960f069585ce42b1349a myfile.txt
This command adds your file to .git/index, which holds the index (staging) information of files. Check the staged entries in the index using ls-files
$ git ls-files --stage
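Here is a hedged end-to-end sketch of the staging steps so far, run in a scratch repo so it does not touch your real work. It captures the hash in a variable instead of pasting it, which is a convenience of this sketch and not something the steps above require.

```shell
# Scratch repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Store the blob and remember its key.
printf 'Hello Git\n' > myfile.txt
hash=$(git hash-object -w myfile.txt)

# Register the blob in the index as a regular file (mode 100644).
git update-index --add --cacheinfo 100644 "$hash" myfile.txt

# Each index entry prints as: <mode> <object id> <stage number><TAB><path>
entry=$(git ls-files --stage)
echo "$entry"
```

The stage number is 0 for a normal entry; other values only show up during merge conflicts.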
Now what do you think? Yes, hit git status
Congratulations!! You staged a file the hard way.
git commit (hard way)
We wrote things in our diary, and now we have two choices. We can either tear out the page (git reset --hard) or put our signature on it (git commit).
So as good people we simply go ahead and put our signature on what we wrote. Verify your details..
Awesome!! Your signature is okay. A commit object has a SHA-1 hash (like any other Git object) and it points to a tree node.
So.. where is the tree node? We need to create one.
$ git write-tree
This creates a tree node from the current index entries (remember, we staged our blob there) and returns a new hash that represents our new tree node.
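To peek inside the new tree node, something like the following should work. It is a sketch in a scratch repo, and it uses git update-index --add with a file path (which stores the blob and stages it in one go) purely to keep the setup short.

```shell
# Scratch repository with one staged file.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'Hello Git\n' > myfile.txt
git update-index --add myfile.txt
blob=$(git hash-object myfile.txt)

# Snapshot the index as a tree object.
tree=$(git write-tree)
tree_type=$(git cat-file -t "$tree")
echo "$tree_type"

# A tree lists its children as: <mode> <type> <object id><TAB><name>
listing=$(git cat-file -p "$tree")
echo "$listing"
```

So the tree is what ties a file name to a blob; the blob itself knows nothing about its name.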
Now we have everything we need to make a commit
$ echo "first commit" | git commit-tree 6e9432aeedbad83fbffb7f8aae4a5d1ab50b7fdf
See first commit’s content
$ git cat-file -p 1658642a6c164700c880d499da0b874c18829883
You can also see the history via git log
$ git log --stat 1658642a6c164700c880d499da0b874c18829883
Let’s do our second commit by updating myfile.txt
$ echo "Hello Git Pro" > myfile.txt
Now the file has another version, so we are going to create another tree node for this change.
$ git update-index myfile.txt
$ git write-tree
Since the file is already in the Git index, we can simply pass one argument to update-index.
Since commits happen linearly over time, we need to pass the previous commit’s hash as the parent (-p) of the new commit.
$ echo "second commit" | git commit-tree 075e4ae2beb7edf5fda9fef8beba34a52f60a957 -p 1658642a6c164700c880d499da0b874c18829883
This returns the second commit’s hash value
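The whole two-commit dance can be sketched in one go as below. The author and committer identity is a made-up placeholder set through environment variables so that commit-tree does not depend on your global Git config, and everything happens in a scratch repo.

```shell
# Placeholder identity (hypothetical values) so commit-tree can run anywhere.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# First commit: stage, snapshot the index, wrap the tree in a commit.
printf 'Hello Git\n' > myfile.txt
git update-index --add myfile.txt
tree1=$(git write-tree)
commit1=$(echo "first commit" | git commit-tree "$tree1")

# Second commit: same steps, plus -p to name the parent.
printf 'Hello Git Pro\n' > myfile.txt
git update-index myfile.txt
tree2=$(git write-tree)
commit2=$(echo "second commit" | git commit-tree "$tree2" -p "$commit1")

# The commit object records its tree and its parent as plain text lines.
git cat-file -p "$commit2"
parent=$(git cat-file -p "$commit2" | sed -n 's/^parent //p')
tree_of_commit2=$(git cat-file -p "$commit2" | sed -n 's/^tree //p')
```

The parent line is the only thing linking the second commit to the first, which is why the previous hash has to be passed explicitly.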
If we run git log now, we still get no results, because no branch points to our commits yet. Therefore we need to set a reference to our latest commit
$ echo 314f04395e5e7c70d9f40d681c2f4c84237a7fea > .git/refs/heads/master
$ git log
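Writing the ref file with echo works, but Git also ships a plumbing command for this job, git update-ref. Below is a sketch of that alternative in a scratch repo; the identity values are placeholders, and the branch name is read from HEAD because the default differs between Git setups.

```shell
# Placeholder identity (hypothetical values) for commit-tree.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'Hello Git\n' > myfile.txt
git update-index --add myfile.txt
commit=$(echo "first commit" | git commit-tree "$(git write-tree)")

# The commit exists, but the current branch has no ref yet, so git log fails.
git log --oneline 2>/dev/null || echo "no history reachable from HEAD yet"

# Point the current branch at the commit.
branch=$(git symbolic-ref --short HEAD)
git update-ref "refs/heads/$branch" "$commit"

# Now HEAD resolves and the history shows up.
git log --oneline
tip=$(git rev-parse HEAD)
```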
Wow! Commits and tree nodes are now connected: each commit points to a tree node and to its parent commit. Tree nodes can in turn contain other tree nodes, depending on the directory structure you staged. This is the basic internal process behind Git’s functionality.
Moreover, branching is a very powerful feature in a VCS. In Git, branches are just movable pointers to commits.
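A quick sketch of that pointer idea: creating a branch amounts to writing one more ref that holds a commit hash. The branch name feature and the identity values are made up for this example, and it assumes Git’s default file-based ref storage so the ref can be read with cat.

```shell
# Placeholder identity (hypothetical values) for commit-tree.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'Hello Git\n' > myfile.txt
git update-index --add myfile.txt
commit=$(echo "first commit" | git commit-tree "$(git write-tree)")
branch=$(git symbolic-ref --short HEAD)
git update-ref "refs/heads/$branch" "$commit"

# "Create a branch": a second ref holding the same commit hash.
git update-ref refs/heads/feature "$commit"
feature_ref=$(cat .git/refs/heads/feature)
echo "$feature_ref"

# "Switch branches": repoint HEAD at the other ref.
git symbolic-ref HEAD refs/heads/feature
now_on=$(git symbolic-ref --short HEAD)
echo "now on: $now_on"
```

No file content was copied anywhere; both branches simply name the same commit until one of them moves.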
Conclusion
This explanation focused on the internals of Git staging, the tree data structure and committing. There are other useful features when a remote repository is used, such as pulling and pushing.
References
https://git-scm.com/book/en/v1/Git-Internals
Useful links
- Commands list — https://github.com/git/git/blob/master/command-list.txt
- git add source — https://github.com/git/git/blob/master/builtin/add.c
- git commit source — https://github.com/git/git/blob/master/builtin/commit.c
Happy version controlling!!!