Power of trie-ing
In a bittersweet talk with a friend, immediately after failing a tech job interview, one unreasonable doubt came to my mind: how much memory compression can a trie actually offer? Naturally, since it was the data structure that defined my rejection, I opted to dedicate this post to it.
A trie, also known as prefix tree, is a tree data structure whose nodes store the characters present in a set of words given some alphabet (i.e., dictionary). These words can be easily retrieved by traversing down the branches of the tree.
For our analysis, we assume our alphabet size \(A\) to be \(26\), matching the English alphabet. In a simple and unoptimized manner, we could code our data structure in a few lines using Python.
Since most operations on the data structure depend on the number of nodes and not on the dictionary size (key idea), I wanted to evaluate how much less characters do we need to represent a dictionary if we use a trie.
We proceed by obtaining a dictionary of english valid words and plotting their distribution by length.
After adding each of the words of our dictionary in random order to our trie, I was truly surprised by the following results:
- Total number of chars in dictionary: 183728
- Total number of nodes of Trie to store dictionary: 77421
- General compression for storing all words in dictionary: 42.14 %
My intuition was wrong. Perhaps I was assuming that most of words in the dictionary show a more uniform distribution, causing the trie to efficiently compress short words but obtaining no memory gains on longer words. Moreover, my friend pointed out one aspect I totally overlooked: the English language has some inherent prefix structure. Most common words share a similar etimology. This observation drove my stubbornes into another experiment: random words.
We compare the memory compression rates achieved by using a trie to store the dictionary. For that purpose, we generate a multiple dictionary of random gibberish words of different lengths using the English alphabet. We compare the number of characters in total versus the number of nodes created while adding each word to the trie.
Again, my intuition was wrong :) Whenever you need to store retrieve a set of words, just use trie.