Ticker

6/recent/ticker-posts

The data structure for search. Demystifying inverted index.

Have you every wondered that a normal search on your computer takes more than 5 minutes even with very less amount of data as compared to search engines which have gigabytes of data. Search engines answers queries in some millisecond time duration. Believe me it is not magic, it is all about algorithmic improvements and choosing good data structure. In this post we are going to discuss about the data structure used in search engines.


Search engines uses a data structure called inverted index to give you most relevant answer to queries. Your computer system sequentially searches the name or sometimes content of the file which is very time taking process. Inverted index maps terms to the list of documents hence making queries faster.

Forward index is the mapping of documents to content. This is the simplest data structure used in search, basically in your computer. To search in a forward index you have to iterate over all the documents and search for the words in queries in the documents. For short collection it works nicely but for anything  large scale like search engines it will no be the practical solution.

(A forward index. There are three documents 1, 2 and 3)

Inverted Index

For making an inverted index you have to tokenize the document. After that you have to map the terms to the list of documents in which they occur.
(An inverted index. Terms are mapped to list of documents in which they occur)

In the above image inverted index is described for the 3 documents shown earlier. The right part of inverted index is called postings list. It is the list of documents containing the corresponding term. The left part is called dictionary or lexicon which is a hash table of terms pointing to their corresponding postings. 

The lexicon is commonly kept in main memory to make search faster. Postings are generally stored in secondary memory as postings are in amount of billions.

You may also notice that the word 'get' is indexed instead of 'getting' and 'paint' is indexed instead of 'painted'. This is because for user 'paint' and 'painted' have same meaning and most of the user does not care about grammar. Choping down of word into the smallest word is called Stemming. Google started to use stemming in 2004. If you have searched for 'painted' before 2004 on Google, Google may have not returned documents that have word 'paint'. You may have also noticed that we have omitted the word 'is' as words like 'is', 'am, 'are' etc. don't have any special meaning and occurs in every document. Hence lowering their value.

To process a query you have to first locate and fetch the postings of each word in the query. After then you have to merge them and rank them based on the ranking factor.
Take a look at the following example:-
(Posting list enlarged to depict merging of two postings)

First you locate the words in queries in the dictionary and fetch the postings of each term. After that you maintain a pointer in both of the list and start by pointing the first document number. If both are same you append the document number in the result array and advances both the pointer. If pointer one is less than the other, you keep advancing it until you find a match or a number greater than the pointer one. Keep this process repeating until you approach the end of list.



Finally the results are obtained. You can also rank them based on TF-IDF, BM25 ranking models.
Now we have covered data structure used in search engine, there can be more better algorithmic improvements. We can make results more relevant by using good ranking models.