Huffman codes is a widely used technique for compressing data. Huffman codes is a greedy algorithm to build up an optimal way of representing each character as a binary string. Suppose we have a 10,000 character data file that we wish to store. And we observe a character ‘a’ occurs very frequently.
Now, there are many ways to represent a file. We consider a problem of designing a binary character code wherein each character is represented by a unique binary string.
For instance, we need 3 bits to represent five characters [a=000, b=001, c=010, d=011, e=100]. Now if we use variable length codeword, we could save a lot of space, for instance:
Character | a | b | c | d | e |
Frequency | 100 | 13 | 12 | 15 | 5 |
Fixed-length code | 000 | 001 | 010 | 011 | 100 |
Variable-length code | 1 | 010 | 001 | 011 | 000 |
A data file of 200,000 characters contains only the characters (a..e), with above mentioned frequencies. If we use variable-length code instead of fixed-length code, we could save 80,000 bits.
Huffman Code Algorithm
Huffman invented a greedy algorithm that constructs an optimal prefix code (variable length code) called Huffman Code.
As we could see from the image above, for the given characters {a, b, c} with frequencies {2, 4, 8} respectively, Huffman Codes (bits 0, 1 on edges) generated are:
a: 00 b: 01 c: 1
Pseudocode
This problem could be solved by using a Priority Queue or Heap data structure.
Huffman(C) 1. n = size[C] 2. Q = C 3. for i = 1 to n-1: 4. new node = z 5. left[z] = Extract-Min[Q] 6. right[z] = Extract-Min[Q] 7. f[z] = f[x] + f[y] 8. Insert(Q, z) 9. Extract-Min(Q)
Note: Extract-Min(Q)
operation removes the top (smallest) member of priority-queue Q
and left[z] = Extract-Min[Q]
operation removes the smallest element from priority-queue Q
and assign it to left[z]
.
Code Implementation
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 | // // main.cpp // Huffman Code // // Created by Himanshu on 18/09/21. // #include <cstdio> #include <iostream> #include <vector> #include <algorithm> #define MAX 100 using namespace std; struct node { int freq; char letter = NULL; struct node *left, *right; }; typedef struct node Node; bool cmp( const Node *a, const Node *b) { return (a->freq > b->freq); } Node* newNode( int freq, char ch) { Node *p = new node(); p->freq = freq; p->letter = ch; p->left = NULL; p->right = NULL; return p; } void printArray( int A[], int n) { for ( int i=0; i<n; i++) { cout<<A[i]; } cout<<endl; } Node* BuildHuffmanTree (vector< int > freq) { int n = ( int )freq.size(); vector<Node*> minHeap; for ( int i=0; i<n; i++) { Node *z = newNode(freq[i], ( char )(i + 'a' )); minHeap.push_back(z); push_heap(minHeap.begin(), minHeap.end(), &cmp); } while (minHeap.size() > 1) { //This is a non-alphabet node, hence -1 for ch Node *z = newNode(0, -1); Node *x = minHeap.front(); pop_heap(minHeap.begin(), minHeap.end(), &cmp); minHeap.pop_back(); z->left = x; Node *y = minHeap.front(); pop_heap(minHeap.begin(), minHeap.end(), &cmp); minHeap.pop_back(); z->right = y; z->freq = x->freq + y->freq; minHeap.push_back(z); push_heap(minHeap.begin(), minHeap.end(), &cmp); } return minHeap.front(); } void solve ( int A[], Node *root, int n) { if (root->letter != -1) { cout<<( char )(root->letter)<< ": " ; printArray(A, n); } if (root->left) { A[n] = 0; solve(A, root->left, n+1); } if (root->right) { A[n] = 1; solve(A, root->right, n+1); } } int main() { // char = {a, b, c, d, e, f} vector< int > freq = {5, 20, 15, 13, 30, 45}; int temp[MAX]; Node *node = BuildHuffmanTree(freq); solve(temp, node, 0); return 0; } |
Output
b: 00 e: 01 c: 100 a: 1010 d: 1011 f: 11
Time complexity: O(nlogn) (For calling Heapify operation (which is O(logn)), n times)
Auxiliary Space: O(n) (For creating Heap)
n is the number of alphabet characters