Overview
A CLI utility that compresses and decompresses text files using Huffman coding, implemented from scratch in C with no external libraries. The project was an exercise in algorithm implementation, bit manipulation, and careful binary I/O.
Algorithm
Huffman coding is a lossless data compression algorithm that assigns shorter bit sequences to more frequent symbols and longer ones to less frequent symbols, producing an optimal prefix-free code for a given symbol distribution.
Encoding pipeline:
- Frequency analysis — scan the input file and count occurrences of each byte value
- Priority queue construction — insert all (symbol, frequency) pairs into a min-heap
- Tree construction — repeatedly extract the two lowest-frequency nodes and merge them into a new internal node, until one root remains
- Code table generation — traverse the tree, assigning
0for left branches and1for right branches - Header serialization — write the tree structure to the output file so it can be reconstructed during decoding
- Payload encoding — re-read the input and write the variable-length codes bit by bit
Custom Binary Header
The header format was a significant design challenge. The decoder needs to reconstruct the Huffman tree exactly to decompress the file — this means the tree topology and leaf values must be stored efficiently.
I used a pre-order traversal serialization:
- Internal nodes are marked with a
0bit - Leaf nodes are marked with a
1bit followed by the 8-bit symbol value
For a typical English text file, this header is substantially smaller than storing the full frequency table, and the recursive reconstruction during decoding is straightforward.
Bit Manipulation
The core challenge in Huffman encoding is writing variable-length codes to byte-aligned output. I maintained a bit buffer — an 8-bit accumulator and a count of bits written — flushing to the output stream when the buffer filled:
void write_bit(BitWriter *w, int bit) {
w->buffer = (w->buffer << 1) | (bit & 1);
w->bit_count++;
if (w->bit_count == 8) {
fputc(w->buffer, w->out);
w->buffer = 0;
w->bit_count = 0;
}
}
void flush_bits(BitWriter *w) {
if (w->bit_count > 0) {
w->buffer <<= (8 - w->bit_count);
fputc(w->buffer, w->out);
}
}
The trailer byte stores how many bits in the final byte are padding, so the decoder knows when to stop.
Dynamic Memory Management
The Huffman tree is allocated dynamically during construction and freed recursively after use. All allocation paths have explicit error checks — malloc returning NULL surfaces as an error rather than a null pointer dereference.
The priority queue (min-heap) uses a dynamically-sized array with doubling reallocation, starting at 256 slots (the maximum number of unique byte values in a byte-level Huffman tree).
Results
On a representative corpus of English prose:
| File type | Original | Compressed | Ratio | |---|---|---|---| | Plain text (English) | 100 KB | ~56 KB | ~44% reduction | | Already-compressed | 100 KB | ~101 KB | slight expansion | | High-entropy (random) | 100 KB | ~101 KB | slight expansion |
Compression performance degrades on already-compressed data (expected — entropy is near maximum). The utility includes a check: if the encoded output would be larger than the input, it copies the input unchanged and flags it in the header.
Lessons Learned
The bit manipulation was less complex than expected once I got the abstraction right. The hard part was the header format — getting the tree serialization and reconstruction to be byte-exact required careful testing with hand-computed examples before I trusted the implementation.
Error handling in C requires discipline: every allocation, every file operation, every argument requires a check. The codebase is more verbose than equivalent Python but the control over memory layout and I/O was instructive.