Basic setups and Techniques

Text Index

A data structure that for supporting IR queries

Most popular for of text index is inverted index structure, like the index of a book. An inverted index allows search engines to quickly find which documents contain a particular word or set of words. It's a foundational component of efficient text-based search systems.

Traditional books have an index at the back where topics or terms are listed alphabetically, and for each topic, there's a list of page numbers where the topic can be found. An inverted index for search engines works similarly.

The name "inverted" comes from the way it inverts the relationship between documents and terms.

  1. Regular Index (like in a book):

    Imagine a book about animals. Each chapter of the book talks about a different animal. At the end of the book, you might have an index that looks like:

    • Chapter 1: Dogs

    • Chapter 2: Cats

    • Chapter 3: Birds

    If you want to read about cats, you'd go to Chapter 2.

  2. Inverted Index (like in a search engine):

    Instead of listing chapters (or documents) and then their content, you list terms (or words) and then the documents where they appear. Using the same book:

    • Dogs: Chapter 1

    • Cats: Chapter 2

    • Birds: Chapter 3

    • Pets: Chapter 1, Chapter 2

    So if you want to find where the term "Pets" appears, the inverted index tells you it's in Chapter 1 and Chapter 2.

In a full-text search engine, practically every word can be indexed. So for each word in the content, there's an entry in the inverted index. For each word in the inverted index, you have a list of documents (or positions within documents) where the word can be found.

An index typically includes more than just the information that a word appears on a certain page, like page #678. It may also indicate the frequency of the word on that page and its specific positions. Furthermore, the index entry can be complex, potentially containing the word's frequency, its locations within the document, and even a precomputed ranking score indicating the document's relevance to that word.

Boolean Querying

  • Cafe AND Brooklyn

  • Cafe OR Brooklyn

  • (Cafe OR restaurant) AND Brooklyn

The problem of boolean queries processing is that it shows too many results. There's no ranking.

Ranking

It returns best documents first.

  • What does that mean? Who decides what is "best"? The assumption is that the user decides what is best to make them happy. However, we assume that there's some basic level of relevance in a query.

  • But there's problems: we do not know what the user wants, does the user know what he/she wants? Relevance versus importance versus quality/veracity and more.

But basically, we need human judgments to evaluate. Traditionally IR researchers collect feedback from paid or unpaid volunteers for labeling.

Traditional IR System (before the web)

Web Search vs. Traditional IR Systems

Last updated