CrabGit — Building Git from Scratch in Rust

Look, I'll be honest - I've been using Git for years and had absolutely no clue what was actually happening when I ran these commands:

Bash

git add .
git commit -m "works on my machine"
git log

Sure, it worked. But what the hell was Git actually doing? Where did my files go? What's a blob? Why is everything hashed?

So I did what any curious developer would do - I built my own Git implementation from scratch. Meet CrabGit (yes, Rust's crab mascot demanded naming rights).

It's not meant to replace Git. It's way simpler - local-only, no remotes, no fancy features. Just the core stuff:

A content-addressable object store
Blobs, trees, and commits
Basic branching
A tiny .crab_git folder that holds everything

This post is basically my learning journey. I'll show you how CrabGit works, how data flows from add to commit, and why Git's architecture is actually pretty genius.

What CrabGit Can Do

I kept it minimal on purpose. Here's what I managed to implement:

Repository Basics

init - Start a new repo
status - See what's changed
add - Stage files

Version Control

commit - Save a snapshot
log - View history
diff - Compare changes

Branching

branch - Create/list/delete branches
checkout - Switch branches or commits

That's it. No push, no pull, no merge conflicts to debug at 2 AM. Just the fundamentals.

The CLI

When you run the binary, you get this ASCII art banner (because what's a CLI tool without ASCII art, right?):

Bash

./target/release/CrabGit

How I Structured the Code

I split CrabGit into four layers to keep my sanity:

UI Layer - Command-line parsing with clap
Commands - Each Git operation (init, add, commit, etc.)
Core - Repository logic and the object store
Data Models - Rust structs for blobs, trees, commits

The key rule I followed: commands never touch the filesystem directly. They go through the core layer, which handles all the file I/O and hashing. This kept things organized and made debugging way easier.
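
One way to picture that rule is a trait boundary between the layers. This is only a sketch under assumptions, not CrabGit's actual code: the `Core` trait, the `MemoryCore` stand-in, and `add_command` are hypothetical names made up for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical core-layer interface: the only door commands get to storage.
trait Core {
    fn read_file(&self, path: &str) -> Option<Vec<u8>>;
    fn stage(&mut self, path: &str, content: Vec<u8>);
    fn staged_paths(&self) -> Vec<String>;
}

/// In-memory stand-in for a filesystem-backed core, handy for tests.
#[derive(Default)]
struct MemoryCore {
    working_dir: HashMap<String, Vec<u8>>,
    staged: BTreeMap<String, Vec<u8>>,
}

impl Core for MemoryCore {
    fn read_file(&self, path: &str) -> Option<Vec<u8>> {
        self.working_dir.get(path).cloned()
    }
    fn stage(&mut self, path: &str, content: Vec<u8>) {
        self.staged.insert(path.to_string(), content);
    }
    fn staged_paths(&self) -> Vec<String> {
        self.staged.keys().cloned().collect()
    }
}

/// Command layer: no std::fs here, only calls through the Core trait.
fn add_command(core: &mut dyn Core, path: &str) -> Result<(), String> {
    let content = core
        .read_file(path)
        .ok_or_else(|| format!("no such file: {path}"))?;
    core.stage(path, content);
    Ok(())
}
```

Because the command only sees the trait, swapping the in-memory core for one that really reads and writes files would not change `add_command` at all.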

The Journey from `add` to `commit`

This is where it gets interesting. Let me walk you through what happens when you stage and commit a file.

Step 1: Staging (the `add` command)

You run:

Bash

crab_git add file.txt

Here's what happens behind the scenes:

CrabGit reads file.txt from your working directory
It calculates a SHA256 hash of the contents
It creates a Blob object and saves it to .crab_git/objects/
It updates the staging area (index) with an entry like:
Text
```
"file.txt" → abc123... (blob hash)
```

At this point, the file content is safely stored in the object database, and the index knows it should be included in the next commit.
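
The staging steps above can be sketched in a few lines. Big caveat: to keep this runnable with only the standard library, the sketch uses `DefaultHasher` as a stand-in for SHA256 (CrabGit uses the `sha2` crate), and it keeps everything in memory instead of writing to `.crab_git/objects/`. The `Repo` struct and `content_id` function are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Stand-in for SHA256 so the sketch needs no external crates.
/// Assumption for illustration only: never use DefaultHasher for real storage.
fn content_id(content: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Hypothetical in-memory repository: an object store plus an index.
#[derive(Default)]
struct Repo {
    objects: HashMap<String, Vec<u8>>, // blob id -> file content
    index: BTreeMap<String, String>,   // path -> blob id
}

impl Repo {
    /// Mirrors the steps above: hash, store the blob, record it in the index.
    fn add(&mut self, path: &str, content: &[u8]) -> String {
        let id = content_id(content);
        // Identical content maps to the same id, so it is stored only once.
        self.objects
            .entry(id.clone())
            .or_insert_with(|| content.to_vec());
        self.index.insert(path.to_string(), id.clone());
        id
    }
}
```

Staging the same bytes under two different paths gives two index entries but only one stored object, which is the whole point of addressing by content.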

Step 2: Committing (the `commit` command)

You run:

Bash

crab_git commit -m "Initial commit" --author "You <you@example.com>"

Now CrabGit does this:

Reads all staged entries from the index
Builds a Tree object that represents your directory structure
Creates a Commit object with:
- A pointer to the tree (the snapshot)
- A pointer to the parent commit (if there is one)
- Your author info, message, and timestamp
Saves the commit to the object store
Updates the current branch (like refs/heads/main) to point to this new commit
Keeps HEAD pointing to that branch
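
Here's a rough sketch of those commit steps, with several simplifications that are my assumptions rather than CrabGit's real code: the tree is flat (no subdirectories), the timestamp is left out, ids come from `DefaultHasher` as a std-only stand-in for SHA256, and all state lives in a hypothetical in-memory `Repo`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{BTreeMap, HashMap};
use std::hash::{Hash, Hasher};

/// Std-only stand-in for SHA256 (assumption for illustration).
fn stand_in_hash<T: Hash>(value: &T) -> String {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

#[derive(Hash, Clone)]
struct Commit {
    tree: String,           // id of the snapshot
    parent: Option<String>, // None for the first commit
    author: String,
    message: String,
}

#[derive(Default)]
struct Repo {
    index: BTreeMap<String, String>,                  // path -> blob id
    trees: HashMap<String, BTreeMap<String, String>>, // tree id -> entries
    commits: HashMap<String, Commit>,                 // commit id -> commit
    refs: HashMap<String, String>,                    // branch ref -> commit id
    head: String,                                     // name of the current branch ref
}

impl Repo {
    fn commit(&mut self, message: &str, author: &str) -> String {
        // Read the staged entries and build a (flat) tree from them.
        let tree = self.index.clone();
        let tree_id = stand_in_hash(&tree);
        self.trees.insert(tree_id.clone(), tree);

        // The parent is whatever the current branch points to, if anything.
        let parent = self.refs.get(&self.head).cloned();
        let commit = Commit {
            tree: tree_id,
            parent,
            author: author.to_string(),
            message: message.to_string(),
        };

        // Save the commit under its id, then move the branch to it.
        let id = stand_in_hash(&commit);
        self.commits.insert(id.clone(), commit);
        self.refs.insert(self.head.clone(), id.clone());
        // HEAD is untouched: it still names the same branch.
        id
    }
}
```

Notice that `head` never changes during a commit. Only the branch ref moves, and HEAD follows along because it points at the branch by name.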

[Figure: the full flow from add to commit]

Inside the `.crab_git` Directory

This is where all the magic happens. Here's what the folder structure looks like:

File Tree

.crab_git/
├── objects/              # All your data lives here
│   ├── ab/
│   │   ├── cdef123...   # Blobs, trees, and commits
│   │   └── 987654...
│   ├── de/
│   └── 12/
│       └── 3456789...   # Everything's compressed with zlib
│
├── index                # Staging area (just a JSON file)
├── HEAD                 # Points to current branch
├── refs/
│   └── heads/
│       ├── main         # Branch reference (contains a commit hash)
│       └── feature
└── config               # Repo settings
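
To make the HEAD and refs relationship concrete, here's a small sketch of resolving HEAD to a commit hash. The on-disk format is an assumption: it borrows real Git's `ref: refs/heads/main` convention for a branch and a bare hash for a checked-out commit, and CrabGit's actual HEAD file may look different. The `Head` enum and both functions are hypothetical.

```rust
use std::collections::HashMap;

/// What HEAD can contain in this sketch.
#[derive(Debug, PartialEq)]
enum Head {
    Branch(String),   // e.g. "refs/heads/main"
    Detached(String), // a commit hash, after checking out a commit directly
}

/// Assumes Git's "ref: <name>" convention for symbolic refs.
fn parse_head(raw: &str) -> Head {
    let raw = raw.trim();
    match raw.strip_prefix("ref: ") {
        Some(name) => Head::Branch(name.to_string()),
        None => Head::Detached(raw.to_string()),
    }
}

/// `refs` plays the role of the files under refs/heads/:
/// each branch name maps to the commit hash stored in its file.
fn resolve_head(raw_head: &str, refs: &HashMap<String, String>) -> Option<String> {
    match parse_head(raw_head) {
        // A branch with no commits yet has no ref file, hence the Option.
        Head::Branch(name) => refs.get(&name).cloned(),
        Head::Detached(hash) => Some(hash),
    }
}
```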

Content-Addressable Storage (Fancy Term for Hash-Based Filing)

Every single object follows this pattern:

Take the content
Hash it with SHA256
Use the first two characters as a directory name
Use the rest as the filename

Example:

Text

hash: ab12cd...ef
path: .crab_git/objects/ab/12cd...ef

Why? Because if the content changes, the hash changes, which means it gets stored separately. Unchanged content gets reused automatically. Git's entire history system is built on this simple idea.
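
The hash-to-path mapping is tiny in code. A minimal sketch, assuming the hash is already a hex string (`object_path` is a hypothetical helper name, not necessarily what CrabGit calls it):

```rust
use std::path::{Path, PathBuf};

/// Maps an object hash to its location: first two characters become the
/// directory, the rest becomes the filename.
fn object_path(repo_dir: &Path, hash: &str) -> Option<PathBuf> {
    // Need at least one character left over for the filename, and
    // ASCII input so that splitting at byte 2 is always valid.
    if hash.len() < 3 || !hash.is_ascii() {
        return None;
    }
    let (dir, file) = hash.split_at(2);
    Some(repo_dir.join("objects").join(dir).join(file))
}
```

Splitting on the first two characters keeps any single directory from filling up with every object in the repo.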

Compression

To save disk space, I compress every object with zlib when writing and decompress it when reading. Before compression, objects are serialized as JSON (yeah, I know real Git uses its own format, but JSON made debugging so much easier).

The Data Models (Git's Secret Sauce)

Here's how I represented Git's core concepts in Rust:

Blob - Raw File Content

JSON

{
  "hash": "abc123...",
  "content": "[file bytes]"
}

Simple. Just the file content and its hash.

Tree - Directory Structure

JSON

{
  "hash": "def456...",
  "entries": {
    "file.txt": "TreeEntry { hash, mode, is_file: true }",
    "src/": "TreeEntry { hash, mode, is_file: false }"
  }
}

Trees map filenames to blobs (files) or other trees (subdirectories).
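
Here's a sketch of how flat index paths like `src/main.rs` could be turned into that nested shape. It's a simplified assumption, not CrabGit's code: the `Node` enum nests children directly in memory, whereas a real object store would hash each subtree and reference it by hash.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory tree node.
#[derive(Debug, PartialEq)]
enum Node {
    Blob(String),                 // a file: holds the blob hash
    Tree(BTreeMap<String, Node>), // a directory: name -> child
}

/// Builds a nested tree from index entries (path -> blob hash).
fn build_tree(index: &BTreeMap<String, String>) -> Node {
    let mut root = BTreeMap::new();
    for (path, hash) in index {
        let parts: Vec<&str> = path.split('/').collect();
        insert(&mut root, &parts, hash);
    }
    Node::Tree(root)
}

fn insert(dir: &mut BTreeMap<String, Node>, parts: &[&str], hash: &str) {
    match parts {
        [] => {}
        // Last path component: this is the file itself.
        [file] => {
            dir.insert(file.to_string(), Node::Blob(hash.to_string()));
        }
        // Otherwise descend into (or create) the subdirectory.
        [first, rest @ ..] => {
            let child = dir
                .entry(first.to_string())
                .or_insert_with(|| Node::Tree(BTreeMap::new()));
            // If a file already uses this name, the entry is skipped.
            if let Node::Tree(sub) = child {
                insert(sub, rest, hash);
            }
        }
    }
}
```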

TreeEntry - Single File or Folder

JSON

{
  "mode": "644",
  "hash": "blob_or_tree_hash",
  "name": "filename",
  "is_file": true
}

Commit - Snapshot in Time

JSON

{
  "hash": "ghi789...",
  "parent": "parent_commit_hash",
  "tree": "def456...",
  "author": "John Doe <john@example.com>",
  "message": "Commit message",
  "timestamp": "2024-11-15T10:30:00Z"
}

Each commit points to a tree (the full snapshot) and its parent commit (the previous snapshot). The first commit has no parent.
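
That parent pointer is all `log` needs. A minimal sketch of walking the chain, assuming the commit fields map onto a Rust struct roughly like this (the struct layout, the `Option<String>` for the missing parent, and the `log` function are my guesses, not CrabGit's source):

```rust
use std::collections::HashMap;

/// Assumed Rust shape of the commit JSON above.
struct Commit {
    parent: Option<String>, // None for the first commit
    tree: String,
    author: String,
    message: String,
    timestamp: String,
}

/// Walks from `tip` back to the first commit, newest first.
fn log<'a>(commits: &'a HashMap<String, Commit>, tip: &str) -> Vec<&'a Commit> {
    let mut history = Vec::new();
    let mut current = commits.get(tip);
    while let Some(commit) = current {
        history.push(commit);
        // Follow the parent pointer; the walk ends when there is none.
        current = commit.parent.as_ref().and_then(|p| commits.get(p));
    }
    history
}
```

History here is just a linked list of snapshots, which is why a local-only, no-merge implementation can get away with a single `parent` field.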

Index - Staging Area

JSON

{
  "entries": {
    "file.txt": "IndexEntry { hash, mode, path }",
    "src/main.rs": "IndexEntry { hash, mode, path }"
  }
}

This is what add modifies and commit reads from.

How It All Connects

[Figure: how commits, trees, and blobs relate]

[Figure: how the actual filesystem maps to Git objects]

[Figure: how trees can point to both blobs and other trees]

The brilliant part:

Every commit is a full snapshot
But unchanged files share the same blob
So you're not duplicating data - you're reusing hashes

When you change one file, only that blob and the trees above it get new hashes. Everything else reuses existing objects. That's how Git can store years of history without eating your entire hard drive.
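
You can see the reuse by comparing two snapshots. This sketch flattens each snapshot to a path-to-blob-hash map, which is an assumption for illustration (`Snapshot`, `changed_paths`, and `shared_blobs` are hypothetical names):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A snapshot flattened to path -> blob hash.
type Snapshot = BTreeMap<String, String>;

/// Paths that were added, modified, or deleted between two snapshots.
fn changed_paths(old: &Snapshot, new: &Snapshot) -> Vec<String> {
    // Added or modified: the path is new, or its hash differs.
    let mut changed: Vec<String> = new
        .iter()
        .filter(|(path, hash)| old.get(*path) != Some(*hash))
        .map(|(path, _)| path.clone())
        .collect();
    // Deleted: present before, gone now.
    changed.extend(old.keys().filter(|p| !new.contains_key(*p)).cloned());
    changed.sort();
    changed
}

/// How many entries in `new` point at a blob that `old` already stored.
fn shared_blobs(old: &Snapshot, new: &Snapshot) -> usize {
    let old_hashes: BTreeSet<&str> = old.values().map(String::as_str).collect();
    new.values()
        .filter(|h| old_hashes.contains(h.as_str()))
        .count()
}
```

Comparing hashes is enough to detect a change, so there's no need to read file contents until you actually want to show a diff.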

The Codebase Layout

I organized the project like this:

File Tree

CrabGit/
├── src/
│   ├── main.rs              # Entry point
│   ├── lib.rs               # Core types
│   ├── object_store.rs      # Hashing and storage
│   ├── utils.rs             # Repo utilities
│   └── commands/
│       ├── mod.rs
│       ├── init.rs          # Repository setup
│       ├── add.rs           # Staging
│       ├── commit.rs        # Snapshotting
│       ├── status.rs        # Checking changes
│       ├── log.rs           # History
│       ├── branch.rs        # Branch management
│       ├── checkout.rs      # Switching branches
│       └── diff.rs          # Comparing files
├── Cargo.toml
└── README.md

Pretty straightforward. Commands are isolated, core logic is separate, and everything talks through well-defined interfaces.

How Commands Actually Execute

Every command follows the same pipeline:

Parse - clap turns CLI args into a Command enum
Route - A match statement sends it to the right handler
Load - The core finds .crab_git and loads repo state
Execute - The command does its thing (read/write objects, update index)
Save - New objects get written, refs get updated
Output - Results print to the terminal

It's like a mini compiler pipeline, which made the code really easy to reason about.
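
The parse and route stages can be sketched like this. To stay dependency-free, a hand-rolled `parse` function stands in for clap here, which is my substitution and not how CrabGit does it. The `Command` variants, `run`, and the output strings are all hypothetical.

```rust
/// Hypothetical subset of the Command enum.
#[derive(Debug, PartialEq)]
enum Command {
    Init,
    Add { path: String },
    Log,
}

/// Parse: turn raw CLI args into a Command (clap does this in CrabGit).
fn parse(args: &[&str]) -> Result<Command, String> {
    match args {
        ["init"] => Ok(Command::Init),
        ["add", path] => Ok(Command::Add { path: path.to_string() }),
        ["log"] => Ok(Command::Log),
        other => Err(format!("unknown command: {other:?}")),
    }
}

/// Route: one match statement sends each command to its handler.
/// The handlers are stubbed out as strings to keep the sketch short.
fn run(command: Command) -> String {
    match command {
        Command::Init => "initialized repository".to_string(),
        Command::Add { path } => format!("staged {path}"),
        Command::Log => "no commits yet".to_string(),
    }
}
```

Since the match has to cover every enum variant, adding a new command without wiring up its handler becomes a compile error instead of a runtime surprise.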

Try It Yourself

Want to play with CrabGit? Here's how:

Bash

git clone https://github.com/abhinavkale-dev/CrabGit.git
cd CrabGit
cargo build --release

macOS / Linux

Bash

./target/release/CrabGit init
echo "Hello, CrabGit!" > hello.txt
./target/release/CrabGit add hello.txt
./target/release/CrabGit commit "Initial commit" --author "You <you@example.com>"
./target/release/CrabGit log
./target/release/CrabGit status
./target/release/CrabGit branch feature
./target/release/CrabGit checkout feature

Windows (PowerShell)

PowerShell

.\target\release\CrabGit.exe init
echo "Hello, CrabGit!" > hello.txt
.\target\release\CrabGit.exe add hello.txt
.\target\release\CrabGit.exe commit "Initial commit" --author "You <you@example.com>"
.\target\release\CrabGit.exe log
.\target\release\CrabGit.exe status
.\target\release\CrabGit.exe branch feature
.\target\release\CrabGit.exe checkout feature

The Dependencies

Kept it minimal. Just these crates:

sha2 - SHA256 hashing
serde/serde_json - Serialization (so much easier than binary formats)
chrono - Timestamps for commits
clap - CLI argument parsing
walkdir - Recursive directory traversal
flate2 - zlib compression

What I Learned

Building CrabGit was honestly one of the best learning experiences I've had. Here's what clicked for me:

Git isn't magic. It's just a content-addressable filesystem with some clever bookkeeping. Every commit is a snapshot. Branches are just pointers. That's it.

Hashing is powerful. Once I understood that everything is identified by its hash, the whole system made sense. Stored objects never change - changed content just gets stored under a new hash.

Rust was perfect for this. The type system forced me to think through the data model properly. No null pointer surprises, no accidental mutations. Just clean, explicit code.

Would I replace Git with this? Hell no. But do I finally understand how Git works? Absolutely.

If you've ever felt like Git is this mystical black box, I really recommend building something like this. Start small, add features one at a time, and suddenly it all clicks.

The code's on GitHub if you want to poke around. PRs welcome if you want to add features (like, I dunno, actual merge support?).
