The Web and Web Search

How the Web Works

  1. Text View: collection of pages or documents

  2. Graph View: links between pages

  3. Site and Domain Structure: folders, sites, and domains often contain semantically related pages
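
The three views above can be illustrated with a small sketch. It treats two made-up pages as text, extracts their links with Python's standard `html.parser` to obtain the graph view, and groups the URLs by host for the site/domain view. The page contents, the `example.com` URLs, and the names `LinkExtractor` and `build_link_graph` are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def build_link_graph(pages):
    """Map each page URL to the list of URLs it links to."""
    graph = {}
    for url, html in pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        graph[url] = parser.links
    return graph


# Hypothetical pages (text view): URL -> HTML content.
pages = {
    "http://example.com/docs/a.html": '<p>See <a href="b.html">B</a></p>',
    "http://example.com/docs/b.html": '<a href="/index.html">Home</a>',
}

graph = build_link_graph(pages)               # graph view
sites = {urlparse(u).netloc for u in pages}   # site/domain view
print(graph)
print(sites)
```

Both pages live under the same host and folder, which is the kind of structural hint the third view refers to.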

Client-Server Paradigm:

The browser uses HTTP (communication) to ask a web server for an object identified by a URL (naming), and renders this object according to rules defined by HTML (rendering).

  • URL (Uniform Resource Locator): used to identify and locate objects

  • HTTP (Hypertext Transfer Protocol): used to request and transfer objects

  • HTML (Hypertext Markup Language): used to define how an object should be presented to the user
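
To make the division of labor concrete, here is a minimal sketch of how a URL is split into the parts a browser needs and turned into the text of an HTTP GET request. No network connection is made. The URL and the helper name `build_get_request` are hypothetical, and a real browser sends many more headers than shown here.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Build the text of a minimal HTTP/1.1 GET request for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"   # what to fetch (from the URL path)
        f"Host: {parts.netloc}\r\n"  # which server (from the URL host)
        "Connection: close\r\n"
        "\r\n"                       # blank line ends the header section
    )


url = "http://example.com/docs/index.html?lang=en"  # hypothetical URL
request = build_get_request(url)
print(request)
```

The server's response would typically carry an HTML document, which the browser then renders.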

Challenges for Search Engines

  • Complex URLs, result pages computed by the server in arbitrary ways, advertising networks, many file types, automatic redirects, and so on.

  • Things have become more and more complex: many sites are no longer fully crawlable, and many auto-generate their content. Large parts of the web are becoming inaccessible.

Basic Structure of a Search Engine

The crawler tries to download the entire web. The content (the HTML pages) is stored on disk and then analyzed using data mining and web mining. An index is built, and finally people come and search for content; query processing then has to return the best results.

Four Major Components:

Most of the cycles are spent on data analysis and query processing.

  1. Data acquisition / web crawling: to collect pages from all over the web

  2. Data analysis / web mining: link analysis, spam detection, and query log analysis (what users searched for and what they clicked on), using MapReduce or similar specialized platforms to handle the scale of the data.

  3. Index building and maintenance

  4. Query processing / result ranking
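
As a rough sketch of components 3 and 4, the snippet below builds a toy inverted index over a few made-up documents and answers conjunctive (AND) keyword queries against it. The documents and the function names `build_index` and `search` are assumptions for illustration; a real engine adds compression, ranking, and distribution across many machines.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical crawled documents: ID -> extracted text.
docs = {
    1: "web search engines crawl the web",
    2: "search engines build an index",
    3: "crawlers download pages",
}

index = build_index(docs)
print(sorted(search(index, "search engines")))  # documents 1 and 2 match
print(sorted(search(index, "web index")))       # no document has both terms
```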

Ranking

To return the best pages first, you can use past information about queries and what people clicked on. But you cannot compete on quality if you do not have enough query logs or user behavior data. To be competitive, you can use additional features, such as whether the keyword is a person's name.

A problem now is that all these signals, whether term-based, link-based, or click-based, have to be combined into one score. In the 2020s, we also have transformer-based ranking, where transformers are used to compute a semantic match between the query and the documents.
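
One simple way to picture combining signals into one score is a weighted sum. The notes do not say how the signals are actually combined, so the linear form, the weights, and the signal values below are all assumptions chosen for illustration; production systems typically learn the combination from data.

```python
def combined_score(signals, weights):
    """Weighted sum of per-signal scores; missing signals count as 0."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


# Hypothetical weights; in practice these would be learned, not hand-set.
weights = {"term": 0.5, "link": 0.3, "click": 0.2}

# Hypothetical per-document signal values, each already scaled to [0, 1].
candidates = {
    "page_a": {"term": 0.9, "link": 0.2, "click": 0.1},
    "page_b": {"term": 0.6, "link": 0.8, "click": 0.7},
}

ranked = sorted(
    candidates,
    key=lambda doc: combined_score(candidates[doc], weights),
    reverse=True,
)
print(ranked)
```

Here `page_a` has the stronger term match, but `page_b` ranks first once the link and click signals are taken into account.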

Main Challenges for Large Search Engines

  • Coverage: need to crawl and store massive data sets in order to cover large part of the web

  • Good Ranking: need to narrow down the results and pick the top 10, even when the query is broad

  • Freshness: need to recrawl, reprocess, and reanalyze the content that has been collected. Important pages should be updated regularly, but some content never changes.

  • User Load: need to handle a huge query volume (5 billion queries per day).

  • Manipulation: site owners create fake links to their websites in order to be ranked higher than competing websites.

Plus monetization, personalization, and localization.
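
To put the user-load figure of 5 billion queries per day in perspective, the arithmetic below converts it into an average rate per second. This is only an average; peak load is presumably higher, since traffic is not spread evenly over the day.

```python
QUERIES_PER_DAY = 5_000_000_000  # figure quoted in the notes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
print(f"average load: about {avg_qps:,.0f} queries per second")
```

That works out to roughly 58,000 queries per second on average.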

Search tools

  • Specialized Search Engines: use domain-specific knowledge and structure (limited resources)

  • Meta Search Engines: take a query, possibly rewrite it using domain knowledge, and then send it to another engine to answer. Alternatively, they combine and rerank results from several engines - it is not clear how well this works

  • E-commerce Search: the tricky issue with product search is that it is a semi-structured search with keywords plus product attributes (constraints). The relevant dimensions depend on the product type. For a backpack, lighter weight is probably better, but when buying gold, more weight is probably better.

  • Sponsored Search (Computational Advertising): placing the "best" ads next to search results or placing ads on web pages when they are visited. The objective is somewhat different from organic search, but it's also a search problem.

  • Recommender Systems: very close to search, but a little different - "recsys" community

  • Question Answering, Multimedia search, Privacy-Preserving Search
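
The e-commerce item above can be sketched as a keyword match combined with attribute constraints, where the preferred direction of an attribute depends on the product type. The catalog, the `Product` fields, the `HEAVIER_IS_BETTER` table, and the function `product_search` are all hypothetical and serve only to illustrate the backpack-versus-gold example.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    product_type: str
    weight_g: float
    price: float


# Hypothetical rule: whether a heavier item is preferable, per product type.
HEAVIER_IS_BETTER = {"backpack": False, "gold": True}


def product_search(products, keyword, product_type, max_price):
    """Keyword match plus attribute constraints, ordered by weight preference."""
    matches = [
        p for p in products
        if keyword.lower() in p.name.lower()
        and p.product_type == product_type
        and p.price <= max_price
    ]
    return sorted(
        matches,
        key=lambda p: p.weight_g,
        reverse=HEAVIER_IS_BETTER[product_type],
    )


catalog = [
    Product("Trail backpack 40L", "backpack", 1200, 90.0),
    Product("Ultralight backpack 40L", "backpack", 700, 150.0),
    Product("City backpack 20L", "backpack", 900, 300.0),
    Product("Gold bar small", "gold", 10, 800.0),
    Product("Gold bar large", "gold", 100, 7500.0),
]

backpacks = product_search(catalog, "backpack", "backpack", max_price=200)
gold = product_search(catalog, "gold", "gold", max_price=10_000)
print([p.name for p in backpacks])  # lightest backpack within budget first
print([p.name for p in gold])       # heaviest gold bar first
```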
