Main Components of a Search Engine

Web Crawling

The crawler downloads as many web pages as possible from the web and stores them on your servers. Start by crawling and downloading an initial set of pages, then parse each downloaded page, look for additional hyperlinks, and follow those links (e.g., with BFS, while avoiding crawling the same page again).

After a while, when you need to recrawl, you do not simply run BFS again, because by now you have additional information about which pages are more important, such as the ones that many users clicked on. Alternatively, you can do focused crawling, which targets only certain types of information or a particular area of the web. You can therefore design recrawling strategies that assign crawl priorities to focus on important pages or to balance load, as sketched below.
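As a rough sketch of priority-driven recrawling, assuming each URL has an importance score (for example derived from click counts, which is an assumption here), a priority queue keyed by that score can replace the plain FIFO queue:

import heapq

def recrawl(url_scores, budget):
    # url_scores: dict mapping URL -> importance score (e.g., click count)
    # heapq is a min-heap, so negate scores to pop the most important URL first
    frontier = [(-score, url) for url, score in url_scores.items()]
    heapq.heapify(frontier)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        # a real crawler would download and re-index the page here
        crawled.append(url)
    return crawled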

Hardware Setup

Very large data centers run customized servers with large RAM, SSDs, and HDDs, over which the document and index data is partitioned. When a query comes in, it is broadcast to these machines; each machine returns its top-10 results to an accumulator, which combines all of them into the final result list.
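A minimal sketch of that scatter-gather step, under the assumption that each partition holds (doc_id, text) pairs and that documents are ranked by a toy term-count score rather than a real ranking function:

import heapq

def query_partition(partition, query_terms):
    # score every document in this partition and keep its local top-10
    scored = [(sum(text.count(t) for t in query_terms), doc_id)
              for doc_id, text in partition]
    return heapq.nlargest(10, scored)

def accumulate(partitions, query_terms):
    # broadcast the query to all partitions, then merge their top-10 lists
    local_results = [query_partition(p, query_terms) for p in partitions]
    merged = [hit for results in local_results for hit in results]
    return heapq.nlargest(10, merged)   # final top-10 across all machines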

Search engines typically use a combination of data storage solutions, and while they might use relational databases for some tasks, the primary storage and retrieval system for search data is often a distributed system like Google's Bigtable or Apache Hadoop and its ecosystem. These systems are not relational databases in the traditional sense. They are designed to scale out by distributing the data across many servers and often prioritize availability and partition tolerance over immediate consistency.
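As a toy illustration of the scale-out idea, documents can be assigned to servers by hashing their IDs; real systems such as Bigtable instead use range-based partitioning with replication and rebalancing, so this is only a sketch:

import hashlib

def server_for(doc_id, num_servers):
    # hash the document ID and map it onto one of the storage servers
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

# example: spread three documents over 8 servers
for doc in ["doc-1", "doc-2", "doc-3"]:
    print(doc, "-> server", server_for(doc, 8))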

Major search engines are based on scalable clusters of low-cost servers connected by LANs, but much of the code and the internal query processing is proprietary and kept secret.

Web Crawling

Simple Breadth-First Search Crawler

This will eventually download all pages reachable from the start set; the crawler also needs to remember which pages have already been downloaded. In reality, more refined strategies are needed (though still somewhat BFS-like).

insert set of initial URLs into a queue Q and into a set seen
while Q is not empty
    currentURL = dequeue(Q)
    download page from currentURL
    for each hyperlink found in the page
        if hyperlink URL is not in seen
            add hyperlink URL to seen
            enqueue hyperlink URL into Q
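A runnable Python sketch of the same BFS crawler, using crude regex-based link extraction as an assumption; a production crawler would use a proper HTML parser, respect robots.txt, and throttle its requests:

import re
import urllib.request
from collections import deque

def bfs_crawl(start_urls, max_pages=100):
    queue = deque(start_urls)      # Q: frontier of URLs to visit
    seen = set(start_urls)         # remember pages already enqueued
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue               # skip pages that fail to download
        pages[url] = html
        # extract absolute hyperlinks from the page
        for link in re.findall(r'href="(https?://[^"#]+)"', html):
            if link not in seen:   # only enqueue new pages
                seen.add(link)
                queue.append(link)
    return pages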
