Text File Compressor

Overview

A CLI utility that compresses and decompresses text files using Huffman coding, implemented from scratch in C with no external libraries. The project was an exercise in algorithm implementation, bit manipulation, and careful binary I/O.

Algorithm

Huffman coding is a lossless data compression algorithm that assigns shorter bit sequences to more frequent symbols and longer ones to less frequent symbols, producing an optimal prefix-free code for a given symbol distribution.

Encoding pipeline:

Frequency analysis — scan the input file and count occurrences of each byte value
Priority queue construction — insert all (symbol, frequency) pairs into a min-heap
Tree construction — repeatedly extract the two lowest-frequency nodes and merge them into a new internal node, until one root remains
Code table generation — traverse the tree, assigning 0 for left branches and 1 for right branches
Header serialization — write the tree structure to the output file so it can be reconstructed during decoding
Payload encoding — re-read the input and write the variable-length codes bit by bit

Custom Binary Header

The header format was a significant design challenge. The decoder needs to reconstruct the Huffman tree exactly to decompress the file — this means the tree topology and leaf values must be stored efficiently.

I used a pre-order traversal serialization:

Internal nodes are marked with a 0 bit
Leaf nodes are marked with a 1 bit followed by the 8-bit symbol value

For a typical English text file, this header is substantially smaller than storing the full frequency table, and the recursive reconstruction during decoding is straightforward.

Bit Manipulation

The core challenge in Huffman encoding is writing variable-length codes to byte-aligned output. I maintained a bit buffer — an 8-bit accumulator and a count of bits written — flushing to the output stream when the buffer filled:

void write_bit(BitWriter *w, int bit) {
    w->buffer = (w->buffer << 1) | (bit & 1);
    w->bit_count++;
    if (w->bit_count == 8) {
        fputc(w->buffer, w->out);
        w->buffer = 0;
        w->bit_count = 0;
    }
}

void flush_bits(BitWriter *w) {
    if (w->bit_count > 0) {
        w->buffer <<= (8 - w->bit_count);
        fputc(w->buffer, w->out);
    }
}

The trailer byte stores how many bits in the final byte are padding, so the decoder knows when to stop.

Dynamic Memory Management

The Huffman tree is allocated dynamically during construction and freed recursively after use. All allocation paths have explicit error checks — malloc returning NULL surfaces as an error rather than a null pointer dereference.

The priority queue (min-heap) uses a dynamically-sized array with doubling reallocation, starting at 256 slots (the maximum number of unique byte values in a byte-level Huffman tree).

Results

On a representative corpus of English prose:

| File type | Original | Compressed | Ratio | |---|---|---|---| | Plain text (English) | 100 KB | ~56 KB | ~44% reduction | | Already-compressed | 100 KB | ~101 KB | slight expansion | | High-entropy (random) | 100 KB | ~101 KB | slight expansion |

Compression performance degrades on already-compressed data (expected — entropy is near maximum). The utility includes a check: if the encoded output would be larger than the input, it copies the input unchanged and flags it in the header.

Lessons Learned

The bit manipulation was less complex than expected once I got the abstraction right. The hard part was the header format — getting the tree serialization and reconstruction to be byte-exact required careful testing with hand-computed examples before I trusted the implementation.

Error handling in C requires discipline: every allocation, every file operation, every argument requires a check. The codebase is more verbose than equivalent Python but the control over memory layout and I/O was instructive.