banner image

How git works

If you use git, you're probably aware of git commit hashes.

But you might not know that a commit hash really is the hash of a file located somewhere in your .git directory (compressed with zlib, and possibly bundled into a packfile).

git cat-file lets you see this file. For example, this commit is secretly just a text file:

$ git cat-file commit b2778fc94b79f756cccbe2fd2e2a4f43a8425be0
tree 17a8a0b26deb632c8cb8ad13d2b0fd7d105aec3d
parent 63028ee5cb83d7fc5805494115943b0dc84d4e21
author user <user@lorbanery> 1466259574 +0100
committer user <user@lorbanery> 1466259574 +0100

use logging module

The file contains the commit message (use logging module).

It contains the parent commit(s) (parent 6302...).
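Because the commit is plain text, it is easy to pick apart yourself. Here is a minimal Python sketch that splits the output above into headers and message. The function name parse_commit is my own invention, and the sketch ignores multi-line headers (such as signatures) that real commits can contain.

```python
# The commit text exactly as printed by `git cat-file commit ...` above.
COMMIT_TEXT = """\
tree 17a8a0b26deb632c8cb8ad13d2b0fd7d105aec3d
parent 63028ee5cb83d7fc5805494115943b0dc84d4e21
author user <user@lorbanery> 1466259574 +0100
committer user <user@lorbanery> 1466259574 +0100

use logging module
"""


def parse_commit(text):
    """Split a commit object into (headers, message).

    Headers are collected into lists because a merge commit
    has more than one `parent` line.
    """
    # The first blank line separates the headers from the message.
    head, _, message = text.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        headers.setdefault(key, []).append(value)
    return headers, message.strip()


headers, message = parse_commit(COMMIT_TEXT)
print(headers["tree"][0])   # the tree hash
print(headers["parent"])    # a list with one parent hash
print(message)              # the commit message
```

A commit with no parents (the first one in a repo) simply has no parent key, so use headers.get("parent", []) if you want to handle that case.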

But also in this file is:

tree 17a8a0b26deb632c8cb8ad13d2b0fd7d105aec3d

Which is the hash of this file:

$ git cat-file -p 17a8a0b26deb632c8cb8ad13d2b0fd7d105aec3d
100644 blob e6847d2ea2199889aa8f893241b8c3bedd2739ac    .gitignore
100644 blob e9aae548d6e479f6419e21ade292b7011c110ca3    README.md
040000 tree e874308f11046f6d9339a819e629f2018582f860    iss_time

That's a list of all the files in the root of the project, and their hashes.

(I used -p here because git cat-file tree ... prints the raw binary format.)

There's another tree hash for the iss_time subdirectory, so we can view that:

$ git cat-file -p e874308f11046f6d9339a819e629f2018582f860
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391    __init__.py
100644 blob decd23ba86abf338a49d9eddb9ef2b435db2ee9f    constants.py
100644 blob 5f74fbb9a9b845b2e8ff518acc47490a14a626fe    display.py
100644 blob 091f0d15cefbef803dbf0e046849223a57723243    fetch_and_store.py

If the subdirectory doesn't change, it has the same hash. So git can store these objects once. Efficient!
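The "store it once" idea can be sketched as a toy content-addressed store in Python. This is only an illustration: the names store and put are made up, and it hashes the raw content with plain SHA-1, which is not exactly what git does.

```python
import hashlib

# Toy object store: maps hash -> content.
store = {}


def put(content: bytes) -> str:
    """Store content under its hash and return the hash.

    Storing identical content twice does not create a second entry,
    because the key is derived from the content itself.
    """
    key = hashlib.sha1(content).hexdigest()
    store.setdefault(key, content)
    return key


first = put(b"unchanged subdirectory listing")
second = put(b"unchanged subdirectory listing")
other = put(b"a different listing")

print(first == second)  # True: same content, same hash
print(len(store))       # 2: the duplicate was only stored once
```

Two commits that share an unchanged subdirectory both point at the same tree hash, so nothing new needs to be written for it.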

Of course, you can show the actual files (blobs):

$ git cat-file blob 5f74fbb9a9b845b2e8ff518acc47490a14a626fe
from __future__ import print_function
from __future__ import unicode_literals

import sqlite3

from .constants import ISS_TIME_DB


def get_most_recent():
    conn = sqlite3.connect(ISS_TIME_DB)
    try:
        [row] = conn.execute('SELECT nowutc, lat, lng, tzdata, issnow, description, tzabbr, country FROM datapoints ORDER BY nowutc DESC LIMIT 1;')
        return row
    finally:
        conn.close()


def main():
    _nowutc, _lat, _lng, _tzdata, issnow, _description, _tzabbr, country = get_most_recent()
    print('Time on space station is: {issnow} ({country})'.format(
        issnow=issnow,
        country=country))

if __name__ == '__main__':
    main()

In git, everything is a file described by a hash.

And git cat-file is the key which will show you them all.

Unnecessary detail

I said you can hash objects, but what does that mean?

Sadly, you can't just pipe through sha1sum. But you can pipe through git hash-object:

$ echo "hello" | git hash-object --stdin -t blob
ce013625030ba8dba906f756967f9e9ca394464a

$ git cat-file blob 5f74fbb9a9b845b2e8ff518acc47490a14a626fe | git hash-object --stdin -t blob
5f74fbb9a9b845b2e8ff518acc47490a14a626fe

To see how this differs from plain sha1sum, we can look at what git actually writes in the file.

The file is compressed with zlib after being hashed, so you need to inflate (decompress) it first, e.g. with zlib-flate:

$ cat .git/objects/5f/74fbb9a9b845b2e8ff518acc47490a14a626fe | zlib-flate -uncompress | hd
00000000  62 6c 6f 62 20 36 31 39  00 66 72 6f 6d 20 5f 5f  |blob 619.from __|
...

You can spot the extra \x00-terminated header at the start, containing the type of object (blob) and size of body (619).
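If you don't have zlib-flate installed, Python's zlib module does the same job. Here is a small sketch that decompresses an object and splits off the header. The function name parse_loose_object is my own, and the demo builds a tiny object in memory rather than reading from .git, so it runs anywhere.

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Decompress a loose object and return (type, body)."""
    raw = zlib.decompress(compressed)
    # The header is everything up to the first NUL byte: b"<type> <size>".
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode("ascii"), body


# Demo: build a compressed blob by hand, then parse it back.
demo = zlib.compress(b"blob 6\x00hello\n")
obj_type, body = parse_loose_object(demo)
print(obj_type)  # blob
print(body)      # b'hello\n'
```

To try it on a real object, pass in the bytes of a file under .git/objects, for example open(path, "rb").read() with path pointing at the object shown above.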

Once decompressed, this file has the correct sha1 hash:

$ cat .git/objects/5f/74fbb9a9b845b2e8ff518acc47490a14a626fe | zlib-flate -uncompress | sha1sum
5f74fbb9a9b845b2e8ff518acc47490a14a626fe  -
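So the recipe is: header, then content, then SHA-1 over the lot. As a rough Python sketch (git_blob_hash is a name I made up, and this assumes a sha1 repo):

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Hash data the way git hashes a blob in a sha1 repo."""
    # Header: object type, a space, the body size in decimal, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


# echo "hello" sends "hello" plus a trailing newline.
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

That matches the git hash-object output for "hello" shown earlier.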

Your repo might use sha256 instead, in which case sha1sum won't match.

For more details, you may wish to read the source code of git hash-object and the implementation in object-file.c.

For more fun, take a look at this cool sequential git commit hash hack.

Image credit: Gnome Project (cc-by-sa)