CrabGit — Building Git from Scratch in Rust

I use Git every day, but honestly had no idea how it worked under the hood. So I spent a few weeks building my own version in Rust to figure it out. Turns out, Git is way cooler than I thought.

Look, I'll be honest - I've been using Git for years and had absolutely no clue what was actually happening when I ran these commands:

Bash
git add .
git commit -m "works on my machine"
git log

Sure, it worked. But what the hell was Git actually doing? Where did my files go? What's a blob? Why is everything hashed?

So I did what any curious developer would do - I built my own Git implementation from scratch. Meet CrabGit (yes, Rust's crab mascot demanded naming rights).

It's not meant to replace Git. It's way simpler - local-only, no remotes, no fancy features. Just the core stuff:

  • A content-addressable object store
  • Blobs, trees, and commits
  • Basic branching
  • A tiny .crab_git folder that holds everything

This post is basically my learning journey. I'll show you how CrabGit works, how data flows from add to commit, and why Git's architecture is actually pretty genius.


What CrabGit Can Do

I kept it minimal on purpose. Here's what I managed to implement:

Repository Basics

  • init - Start a new repo
  • status - See what's changed
  • add - Stage files

Version Control

  • commit - Save a snapshot
  • log - View history
  • diff - Compare changes

Branching

  • branch - Create/list/delete branches
  • checkout - Switch branches or commits

That's it. No push, no pull, no merge conflicts to debug at 2 AM. Just the fundamentals.


The CLI

When you run the binary, you get this ASCII art banner (because what's a CLI tool without ASCII art, right?):

Bash
./target/release/CrabGit
[Figure: CrabGit CLI]

How I Structured the Code

I split CrabGit into four layers to keep my sanity:

  1. UI Layer - Command-line parsing with clap
  2. Commands - Each Git operation (init, add, commit, etc.)
  3. Core - Repository logic and the object store
  4. Data Models - Rust structs for blobs, trees, commits

[Figure: Architecture]

The key rule I followed: commands never touch the filesystem directly. They go through the core layer, which handles all the file I/O and hashing. This kept things organized and made debugging way easier.
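To make that rule concrete, here's a rough sketch of what the boundary could look like. This is not the actual CrabGit code: the `ObjectStore` trait, `MemoryStore`, and `cat_object` are hypothetical names I'm using for illustration, and the store here lives in memory instead of on disk.

```rust
use std::collections::HashMap;

/// Hypothetical core-layer interface: the only way commands reach storage.
pub trait ObjectStore {
    /// Store `data` under `key`; returns true if the object was newly written.
    fn put(&mut self, key: &str, data: &[u8]) -> bool;
    /// Fetch the bytes stored under `key`, if any.
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory stand-in for the on-disk store (an assumption for this sketch).
#[derive(Default)]
pub struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for MemoryStore {
    fn put(&mut self, key: &str, data: &[u8]) -> bool {
        if self.objects.contains_key(key) {
            return false; // already stored, nothing to do
        }
        self.objects.insert(key.to_string(), data.to_vec());
        true
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
}

/// A command-layer function: it only sees the trait, never `std::fs`.
pub fn cat_object(store: &dyn ObjectStore, key: &str) -> String {
    match store.get(key) {
        Some(bytes) => String::from_utf8_lossy(&bytes).into_owned(),
        None => format!("object {} not found", key),
    }
}
```

A nice side effect of a boundary like this is that commands can be tested against the in-memory store without creating a repo on disk.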


The Journey from add to commit

This is where it gets interesting. Let me walk you through what happens when you stage and commit a file.

Step 1: Staging (the add command)

You run:

Bash
crab_git add file.txt

Here's what happens behind the scenes:

  1. CrabGit reads file.txt from your working directory
  2. It calculates a SHA256 hash of the contents
  3. It creates a Blob object and saves it to .crab_git/objects/
  4. It updates the staging area (index) with an entry like:
    Text
    "file.txt" → abc123... (blob hash)
    

At this point, the file content is safely stored in the object database, and the index knows it should be included in the next commit.
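The staging steps above boil down to something like the sketch below. It's a simplified, dependency-free version, not the real `add` command: the object store and index are plain in-memory maps, the hash function is passed in as a parameter, and `toy_hash` is an FNV-1a stand-in that I'm using only so the example runs without the sha2 crate. It is not SHA256.

```rust
use std::collections::{BTreeMap, HashMap};

/// Staging area: path -> blob hash (mode and other fields omitted).
pub type Index = BTreeMap<String, String>;
/// Object database: hash -> raw bytes (in memory for this sketch).
pub type Objects = HashMap<String, Vec<u8>>;

/// Stage `content` under `path`: hash it, store the blob, record it in the index.
/// `hash_fn` stands in for SHA256 so the sketch needs no external crates.
pub fn stage(
    path: &str,
    content: &[u8],
    hash_fn: impl Fn(&[u8]) -> String,
    objects: &mut Objects,
    index: &mut Index,
) -> String {
    let hash = hash_fn(content);
    // Identical content maps to the same key, so it is stored only once.
    objects.entry(hash.clone()).or_insert_with(|| content.to_vec());
    index.insert(path.to_string(), hash.clone());
    hash
}

/// Toy hash for demonstration only (NOT SHA256): 64-bit FNV-1a as hex.
pub fn toy_hash(data: &[u8]) -> String {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in data {
        h ^= u64::from(*byte);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    format!("{:016x}", h)
}
```

Staging two files with the same contents leaves two index entries but only one stored blob, which is the whole point of keying objects by hash.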

Step 2: Committing (the commit command)

You run:

Bash
crab_git commit -m "Initial commit" --author "You <you@example.com>"

Now CrabGit does this:

  1. Reads all staged entries from the index
  2. Builds a Tree object that represents your directory structure
  3. Creates a Commit object with:
    • A pointer to the tree (the snapshot)
    • A pointer to the parent commit (if there is one)
    • Your author info, message, and timestamp
  4. Saves the commit to the object store
  5. Updates the current branch (like refs/heads/main) to point to this new commit
  6. Keeps HEAD pointing to that branch

Here's the full flow:

[Figure: Add to Commit Data Flow]
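In code, the commit steps look roughly like this. Treat it as a minimal sketch under a few assumptions: the `Repo` struct and its fields are hypothetical, everything is kept in memory, the timestamp is left out, and the commit hash is handed in by the caller instead of being computed with SHA256.

```rust
use std::collections::HashMap;

/// Minimal commit record (timestamp omitted to keep the sketch short).
#[derive(Debug, Clone, PartialEq)]
pub struct Commit {
    pub tree: String,
    pub parent: Option<String>,
    pub author: String,
    pub message: String,
}

/// Hypothetical in-memory repo state.
pub struct Repo {
    pub head_branch: String,              // the branch HEAD points to, e.g. "main"
    pub refs: HashMap<String, String>,    // branch name -> commit hash
    pub commits: HashMap<String, Commit>, // commit hash -> commit
}

impl Repo {
    pub fn new(branch: &str) -> Self {
        Repo {
            head_branch: branch.to_string(),
            refs: HashMap::new(),
            commits: HashMap::new(),
        }
    }

    /// Record a commit for `tree` and advance the current branch to it.
    /// `commit_hash` is supplied by the caller in this sketch.
    pub fn commit(&mut self, commit_hash: &str, tree: &str, author: &str, message: &str) {
        // The parent is whatever the current branch pointed to (None for the first commit).
        let parent = self.refs.get(&self.head_branch).cloned();
        let commit = Commit {
            tree: tree.to_string(),
            parent,
            author: author.to_string(),
            message: message.to_string(),
        };
        self.commits.insert(commit_hash.to_string(), commit);
        // Move the branch. HEAD is untouched because it names the branch, not the commit.
        self.refs.insert(self.head_branch.clone(), commit_hash.to_string());
    }
}
```

Notice that committing never edits HEAD: only the branch ref moves, and HEAD follows along because it refers to the branch by name.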

Inside the .crab_git Directory

This is where all the magic happens. Here's what the folder structure looks like:

File Tree
.crab_git/
├── objects/              # All your data lives here
│   ├── ab/
│   │   ├── cdef123...   # Blobs, trees, and commits
│   │   └── 987654...
│   ├── de/
│   └── 12/
│       └── 3456789...   # Everything's compressed with zlib
│
├── index                # Staging area (just a JSON file)
├── HEAD                 # Points to current branch
├── refs/
│   └── heads/
│       ├── main         # Branch reference (contains a commit hash)
│       └── feature
└── config               # Repo settings

[Figure: Object Store Overview]
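To show how HEAD and refs/heads/ fit together, here's a small sketch of resolving HEAD to a commit hash. One big assumption: I'm using real Git's convention where HEAD contains `ref: refs/heads/main` when it's on a branch and a bare hash when it's detached. The exact on-disk format isn't shown in this post, so read `parse_head` and `resolve_head` as illustrative helpers, not the repo's code.

```rust
use std::collections::HashMap;

/// Ref path -> commit hash, e.g. "refs/heads/main" -> "abc123...".
pub type Refs = HashMap<String, String>;

/// What HEAD refers to.
#[derive(Debug, PartialEq)]
pub enum Head {
    Branch(String),   // HEAD names a branch ref, e.g. refs/heads/main
    Detached(String), // HEAD holds a commit hash directly
}

/// Parse HEAD file contents, assuming Git's "ref: <path>" convention.
pub fn parse_head(contents: &str) -> Head {
    let text = contents.trim();
    match text.strip_prefix("ref: ") {
        Some(path) => Head::Branch(path.to_string()),
        None => Head::Detached(text.to_string()),
    }
}

/// Resolve HEAD to a commit hash.
pub fn resolve_head(head: &Head, refs: &Refs) -> Option<String> {
    match head {
        // None here means the branch has no commits yet (a brand-new repo).
        Head::Branch(path) => refs.get(path).cloned(),
        Head::Detached(hash) => Some(hash.clone()),
    }
}
```

With this layout, creating a branch is just writing one more small file under refs/heads/, and switching branches is rewriting HEAD.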

Content-Addressable Storage (Fancy Term for Hash-Based Filing)

Every single object follows this pattern:

  1. Take the content
  2. Hash it with SHA256
  3. Use the first two characters as a directory name
  4. Use the rest as the filename

Example:

Text
hash: ab12cd...ef
path: .crab_git/objects/ab/12cd...ef

Why? Because if the content changes, the hash changes, which means it gets stored separately. Unchanged content gets reused automatically. Git's entire history system is built on this simple idea.
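The hash-to-path mapping is small enough to sketch in a few lines. The function name `object_path` and the validation rules are my own choices for this example. It only handles the path math, not the hashing or the file write.

```rust
use std::path::{Path, PathBuf};

/// Map a hex hash to its location under the objects directory:
/// first two characters -> subdirectory, the rest -> filename.
/// Returns None for hashes too short to split or containing non-hex characters.
pub fn object_path(objects_dir: &Path, hash: &str) -> Option<PathBuf> {
    if hash.len() < 3 || !hash.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // Safe to split by byte index: every character is ASCII at this point.
    let (dir, file) = hash.split_at(2);
    Some(objects_dir.join(dir).join(file))
}
```

Rejecting anything that isn't plain hex also means a malformed "hash" like `../etc` can never escape the objects directory.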

Compression

Every object is serialized as JSON first (yeah, I know real Git uses its own format, but JSON made debugging so much easier), then compressed with zlib before it's written to disk to save space. Reading does the reverse: decompress, then parse the JSON.


The Data Models (Git's Secret Sauce)

Here's how I represented Git's core concepts in Rust:

Blob - Raw File Content

JSON
{
  "hash": "abc123...",
  "content": "[file bytes]"
}

Simple. Just the file content and its hash.

Tree - Directory Structure

JSON
{
  "hash": "def456...",
  "entries": {
    "file.txt": "TreeEntry { hash, mode, is_file: true }",
    "src/": "TreeEntry { hash, mode, is_file: false }"
  }
}

Trees map filenames to blobs (files) or other trees (subdirectories).
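Since the index stores flat paths like `src/main.rs` but trees are nested, something has to turn one into the other at commit time. Here's a minimal sketch of that step. The `Node` enum and `build_tree` are hypothetical names for this example, and it skips hashing and storing each subtree as its own object.

```rust
use std::collections::BTreeMap;

/// A tree node: either a file (blob hash) or a subdirectory of further nodes.
#[derive(Debug, PartialEq)]
pub enum Node {
    Blob(String),
    Tree(BTreeMap<String, Node>),
}

/// Build a nested tree from flat index entries of (path, blob hash).
/// Paths use '/' separators, e.g. "src/main.rs".
pub fn build_tree(entries: &[(&str, &str)]) -> BTreeMap<String, Node> {
    let mut root = BTreeMap::new();
    for (path, hash) in entries {
        insert_path(&mut root, path, hash);
    }
    root
}

fn insert_path(tree: &mut BTreeMap<String, Node>, path: &str, hash: &str) {
    match path.split_once('/') {
        // Last path component: record the blob.
        None => {
            tree.insert(path.to_string(), Node::Blob(hash.to_string()));
        }
        // Leading directory: descend into (or create) the subtree.
        Some((dir, rest)) => {
            let child = tree
                .entry(dir.to_string())
                .or_insert_with(|| Node::Tree(BTreeMap::new()));
            // If `dir` was already recorded as a file, the entry is skipped.
            if let Node::Tree(sub) = child {
                insert_path(sub, rest, hash);
            }
        }
    }
}
```

Using a `BTreeMap` keeps entries sorted by name, which matters once trees get hashed: the same set of files should always produce the same tree.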

TreeEntry - Single File or Folder

JSON
{
  "mode": "644",
  "hash": "blob_or_tree_hash",
  "name": "filename",
  "is_file": true
}

Commit - Snapshot in Time

JSON
{
  "hash": "ghi789...",
  "parent": "parent_commit_hash",
  "tree": "def456...",
  "author": "John Doe <john@example.com>",
  "message": "Commit message",
  "timestamp": "2024-11-15T10:30:00Z"
}

Each commit points to a tree (the full snapshot) and its parent commit (the previous snapshot). The first commit has no parent.
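That parent pointer is all `log` needs: start at a commit and keep following `parent` until there isn't one. A minimal sketch, assuming commits are already loaded into a map (the `Commits` alias and `history` function are names I made up for the example):

```rust
use std::collections::HashMap;

/// Only the fields needed to walk history.
pub struct Commit {
    pub parent: Option<String>,
    pub message: String,
}

/// Commit hash -> commit (in memory for this sketch).
pub type Commits = HashMap<String, Commit>;

/// Walk from `start` back through parent pointers, newest first.
pub fn history(commits: &Commits, start: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = Some(start.to_string());
    while let Some(hash) = current {
        match commits.get(&hash) {
            Some(commit) => {
                out.push(format!("{} {}", hash, commit.message));
                current = commit.parent.clone();
            }
            // Unknown hash: stop instead of guessing.
            None => break,
        }
    }
    out
}
```

The walk ends naturally at the first commit, because its `parent` is `None`.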

Index - Staging Area

JSON
{
  "entries": {
    "file.txt": "IndexEntry { hash, mode, path }",
    "src/main.rs": "IndexEntry { hash, mode, path }"
  }
}

This is what add modifies and commit reads from.


How It All Connects

Here's how commits, trees, and blobs relate:

[Figure: Commit–Tree–Index Relationship]

And here's how your actual filesystem maps to Git objects:

[Figure: File System vs Git Objects]

A closer look at how trees can point to both blobs and other trees:

[Figure: Tree and Blob Architecture]

The brilliant part:

  • Every commit is a full snapshot
  • But unchanged files share the same blob
  • So you're not duplicating data - you're reusing hashes

When you change one file, only that file's blob and the trees on the path above it get new hashes. Everything else in the snapshot reuses existing objects. That's how Git can store years of history without eating your entire hard drive.
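You can see the sharing by hashing a small tree bottom-up. In the sketch below, a tree's hash covers the names and hashes of its entries, so editing one file changes that blob, its parent tree, and the root, while untouched subtrees keep their hashes. Assumptions: `toy_hash` is an FNV-1a stand-in (not SHA256), the listing format is invented for this example, and `example` builds a made-up three-entry layout.

```rust
use std::collections::BTreeMap;

pub enum Node {
    Blob(Vec<u8>),
    Tree(BTreeMap<String, Node>),
}

/// Toy hash for demonstration only (NOT SHA256): 64-bit FNV-1a as hex.
pub fn toy_hash(data: &[u8]) -> String {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in data {
        h ^= u64::from(*byte);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    format!("{:016x}", h)
}

/// Hash a node bottom-up: a tree's hash covers its entries' names and hashes.
pub fn hash_node(node: &Node) -> String {
    match node {
        Node::Blob(bytes) => toy_hash(bytes),
        Node::Tree(entries) => {
            let mut listing = String::new();
            for (name, child) in entries {
                listing.push_str(name);
                listing.push(' ');
                listing.push_str(&hash_node(child));
                listing.push('\n');
            }
            toy_hash(listing.as_bytes())
        }
    }
}

/// Hash of the named direct child of a tree, if present.
pub fn child_hash(node: &Node, name: &str) -> Option<String> {
    match node {
        Node::Tree(entries) => entries.get(name).map(hash_node),
        Node::Blob(_) => None,
    }
}

/// Made-up layout: README.md, docs/guide.md, and src/main.rs with the given contents.
pub fn example(main_rs: &str) -> Node {
    let mut src = BTreeMap::new();
    src.insert("main.rs".to_string(), Node::Blob(main_rs.as_bytes().to_vec()));
    let mut docs = BTreeMap::new();
    docs.insert("guide.md".to_string(), Node::Blob(b"guide".to_vec()));
    let mut root = BTreeMap::new();
    root.insert("README.md".to_string(), Node::Blob(b"readme".to_vec()));
    root.insert("docs".to_string(), Node::Tree(docs));
    root.insert("src".to_string(), Node::Tree(src));
    Node::Tree(root)
}
```

Comparing `example("v1")` with `example("v2")`, the root and `src` hashes differ, but `docs` and `README.md` hash to the same values in both snapshots, so a second commit would only need to write three new objects.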


The Codebase Layout

I organized the project like this:

File Tree
CrabGit/
├── src/
│   ├── main.rs              # Entry point
│   ├── lib.rs               # Core types
│   ├── object_store.rs      # Hashing and storage
│   ├── utils.rs             # Repo utilities
│   └── commands/
│       ├── mod.rs
│       ├── init.rs          # Repository setup
│       ├── add.rs           # Staging
│       ├── commit.rs        # Snapshotting
│       ├── status.rs        # Checking changes
│       ├── log.rs           # History
│       ├── branch.rs        # Branch management
│       ├── checkout.rs      # Switching branches
│       └── diff.rs          # Comparing files
├── Cargo.toml
└── README.md

Pretty straightforward. Commands are isolated, core logic is separate, and everything talks through well-defined interfaces.


How Commands Actually Execute

Every command follows the same pipeline:

  1. Parse - clap turns CLI args into a Command enum
  2. Route - A match statement sends it to the right handler
  3. Load - The core finds .crab_git and loads repo state
  4. Execute - The command does its thing (read/write objects, update index)
  5. Save - New objects get written, refs get updated
  6. Output - Results print to the terminal

It's like a mini compiler pipeline, which made the code really easy to reason about.
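Here's a stripped-down sketch of the parse and route steps. CrabGit uses clap for parsing. To keep this example free of external crates I've swapped in a hand-rolled `parse` function, so treat the argument shapes (including `-m` for the commit message) and the handler output strings as assumptions.

```rust
/// A few of the commands from this post, with simplified arguments.
#[derive(Debug, PartialEq)]
pub enum Command {
    Init,
    Add { path: String },
    Commit { message: String },
    Log,
}

/// Hand-rolled stand-in for the parse step (the real project uses clap).
pub fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["init"] => Ok(Command::Init),
        ["add", path] => Ok(Command::Add { path: path.to_string() }),
        ["commit", "-m", message] => Ok(Command::Commit { message: message.to_string() }),
        ["log"] => Ok(Command::Log),
        _ => Err(format!("unrecognized arguments: {:?}", args)),
    }
}

/// Route step: one match arm per handler.
/// The handlers here only describe what they would do.
pub fn run(cmd: &Command) -> String {
    match cmd {
        Command::Init => "initialized empty repository".to_string(),
        Command::Add { path } => format!("staged {}", path),
        Command::Commit { message } => format!("committed: {}", message),
        Command::Log => "showing history".to_string(),
    }
}
```

Because `Command` is an enum, the compiler complains if a new variant is added without a matching arm in `run`, which is a cheap way to make sure no command gets silently dropped.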


Try It Yourself

Want to play with CrabGit? Here's how:

Bash
git clone https://github.com/abhinavkale-dev/CrabGit.git
cd CrabGit
cargo build --release

macOS / Linux

Bash
./target/release/CrabGit init
echo "Hello, CrabGit!" > hello.txt
./target/release/CrabGit add hello.txt
./target/release/CrabGit commit "Initial commit" --author "You <you@example.com>"
./target/release/CrabGit log
./target/release/CrabGit status
./target/release/CrabGit branch feature
./target/release/CrabGit checkout feature

Windows (PowerShell)

Powershell
.\target\release\CrabGit.exe init
echo "Hello, CrabGit!" > hello.txt
.\target\release\CrabGit.exe add hello.txt
.\target\release\CrabGit.exe commit "Initial commit" --author "You <you@example.com>"
.\target\release\CrabGit.exe log
.\target\release\CrabGit.exe status
.\target\release\CrabGit.exe branch feature
.\target\release\CrabGit.exe checkout feature

The Dependencies

Kept it minimal. Just these crates:

  • sha2 - SHA256 hashing
  • serde/serde_json - Serialization (so much easier than binary formats)
  • chrono - Timestamps for commits
  • clap - CLI argument parsing
  • walkdir - Recursive directory traversal
  • flate2 - zlib compression

What I Learned

Building CrabGit was honestly one of the best learning experiences I've had. Here's what clicked for me:

Git isn't magic. It's just a content-addressable filesystem with some clever bookkeeping. Every commit is a snapshot. Branches are just pointers. That's it.

Hashing is powerful. Once I understood that everything is identified by its hash, the whole system made sense. Stored objects never change - edited content just gets a new hash and is saved as a new object.

Rust was perfect for this. The type system forced me to think through the data model properly. No null pointer surprises, no accidental mutations. Just clean, explicit code.

Would I replace Git with this? Hell no. But do I finally understand how Git works? Absolutely.

If you've ever felt like Git is this mystical black box, I really recommend building something like this. Start small, add features one at a time, and suddenly it all clicks.

The code's on GitHub if you want to poke around. PRs welcome if you want to add features (like, I dunno, actual merge support?).