Introduction to Information Retrieval

Information Retrieval (IR)

IR is concerned with the representation, storage, organization of, and access to information items.

IR focuses on automatic processing - indexing, clustering, and search - of unstructured data such as text, images, audio, or poems. We'll mainly focus on text data.

  • Unstructured data is not really unstructured: when you think about novels, poems, or jokes, the data isn't unstructured, it's just structured in such a complex manner that basic data models can't capture it accurately, so we treat it as unstructured. Structured data is data with an obvious or simple structure, and this goes a long way for many business applications. Unstructured data is data whose structure is so difficult that we don't really model it in simple data models.

  • Information retrieval focuses on human users and their needs: a person writes a query, and the system helps that person find the information they want.

  • Databases are optimized to serve as building blocks for higher-level apps: they are building blocks at the lower level so that you can write all kinds of applications on top. A DB should be really fast and should also be standardized.

Evaluation in IR: Precision versus Recall

(Also used for many ML problems)

Recall

The fraction of the relevant documents that are retrieved.

How many of the good documents were actually returned to the user?

$$\text{recall} = \frac{|R_a|}{|R|}$$

Precision

The fraction of retrieved documents (A) that are relevant

How many of the documents returned to the user are actually good?

$$\text{precision} = \frac{|R_a|}{|A|}$$

Example:

Suppose there are 160 documents about kangaroos, and our search tool returns 100 documents. Of these 100 documents, 80 are about kangaroos.

  • Recall: 80/160 = 0.5

  • Precision: 80/100 = 0.8
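
A minimal Python sketch of this computation, using made-up document IDs that mirror the kangaroo numbers:

```python
# Precision and recall from sets of document IDs.
# The IDs are invented to mirror the kangaroo example:
# 160 relevant documents, 100 retrieved, 80 of which are relevant.

relevant = {f"doc{i}" for i in range(160)}                                      # R: all kangaroo documents
retrieved = {f"doc{i}" for i in range(80)} | {f"other{i}" for i in range(20)}   # A: what the search tool returned

relevant_retrieved = relevant & retrieved            # R_a = R ∩ A

recall = len(relevant_retrieved) / len(relevant)     # 80 / 160 = 0.5
precision = len(relevant_retrieved) / len(retrieved) # 80 / 100 = 0.8

print(f"recall = {recall}, precision = {precision}")
```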

How do you know that a document is relevant to a topic or query?

Typically, people get paid to label documents as relevant or not. There is also a trade-off between precision and recall.

If you give me a query and I return no answers, my precision is 100% but my recall is 0%. If I return everything, my recall is 100%, but my precision is going to be close to 0. Somewhere in the middle there's a sweet spot, and where that spot lies may depend on your application.

More explanation on extreme cases

Formulas:

  • Precision = (True Positives) / (True Positives + False Positives)

  • Recall = (True Positives) / (True Positives + False Negatives)

  1. When the system returns no result

If the system returns no result, that means there are no positive results given by the system. So, True Positives = 0 and False Positives = 0.

Precision = 0/0. In many contexts, we define 0/0 (an indeterminate form) in precision as 1 (100%), because we are saying that out of all the positive results the system produced (which is zero), all of them are correct. That's a bit of a philosophical point; the notion is that the system did not make any false positive errors because it didn't try to identify any positives.

However, since no results were returned, all the actual positive instances were missed, making False Negatives equal to the total number of actual positive instances. Therefore, Recall = 0/(any positive number) = 0%.

  2. When the system returns all documents as a result

Again, using the formulas:

If the system returns all documents as results, that means it considers everything to be positive. So, False Positives will be all the instances that are actually negative but are incorrectly returned by the system. True Positives will be all the actual positive instances because the system didn't miss any.

Hence, Recall = (All actual positives)/(All actual positives + 0) = 100%, because the system has identified all actual positive instances.

However, Precision will be affected because while the system did correctly identify all the true positive instances, it also incorrectly identified many negative instances as positive. If there are many such incorrect identifications (i.e., high False Positives), Precision will be low. In the extreme case where there are no actual positives but the system identifies everything as positive, Precision = 0/(0 + all documents) = 0%.

In summary, precision gives you an idea of how many of the identified positives are actually correct, while recall gives you an idea of how many of the actual positives were identified by the system.
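
A small Python sketch of these formulas, including the 0/0 convention for precision discussed above (treating precision as 1.0 when the system produces no positives is one common convention, not the only possible one):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from confusion-matrix counts, handling the extreme cases above."""
    returned = true_positives + false_positives           # everything the system labeled positive
    actual_positives = true_positives + false_negatives   # everything that is truly relevant

    # Extreme case 1: nothing returned -> precision defined as 1.0 (no false-positive errors made)
    precision = true_positives / returned if returned else 1.0
    # Assumption: if there are no actual positives at all, recall is reported as 1.0 here
    recall = true_positives / actual_positives if actual_positives else 1.0
    return precision, recall

print(precision_recall(0, 0, 160))     # nothing returned: precision 1.0 (by convention), recall 0.0
print(precision_recall(160, 840, 0))   # everything returned: precision 0.16, recall 1.0
```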

However, real IR systems use more complex evaluation measures.

Evaluation in IR: More Realistic Measures

  • Realistic measures look at the quality at the beginning of the ranking - the top-10 results in particular: how are they ranked?

  • They also need to recognize that there's more than just relevant and irrelevant. There's very relevant, relevant, slightly relevant - a bunch of different levels of relevance. Typically, people will look at four or five different levels.

There are a number of more realistic measures, but a popular one is precision@k (p@k).

For example, p@10: what fraction of the top-10 results is relevant? That is, precision within the top-10 results. If $p@10 = 0.59$, it means that over many queries, about 59% of the top-10 results are relevant.
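
A sketch of p@k for a single query, assuming binary relevance judgments; the ranking and the relevant set below are made up for illustration. A reported figure such as p@10 = 0.59 is this per-query value averaged over many queries.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Hypothetical ranking and relevance judgments for one query
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d5", "d4", "d6", "d0"]
relevant = {"d1", "d2", "d3", "d5", "d7", "d9"}

print(precision_at_k(ranking, relevant, k=10))   # 6 of the top 10 are relevant -> 0.6
```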

But we should also take the degree of relevance and the position of relevant results into account.

IR vs. String Processing

String processing basically deals with simple pattern matching - finding certain words or strings in text.
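
A toy contrast, with made-up documents: string processing asks whether an exact pattern occurs, while IR ranks documents by how well they match a query.

```python
docs = {
    "d1": "kangaroos are large marsupials native to australia",
    "d2": "the red kangaroo is the largest living marsupial",
    "d3": "pythons are large snakes found in africa asia and australia",
}

# String processing: exact pattern matching (a grep-style yes/no per document)
matches = [doc_id for doc_id, text in docs.items() if "kangaroo" in text]

# IR: rank every document by a (very crude) relevance score for a query
query_terms = {"large", "kangaroo", "australia"}
scores = {doc_id: sum(text.count(term) for term in query_terms) for doc_id, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(matches)    # ['d1', 'd2']
print(ranking)    # all documents, ordered by score: ['d1', 'd2', 'd3']
```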

IR vs. NLP

IR:

  • IR usually does not analyze grammar or local/sentence structure. It mostly sees a document as a set or bag of words (see the sketch after this list).

  • IR uses simple statistical methods, which are good for search and categorization.

  • IR is largely language-independent
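
A minimal sketch of the bag-of-words view mentioned above: word order and sentence structure are thrown away, and only the words and their counts remain.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Represent a document as an unordered bag (multiset) of lowercased word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

doc = "The kangaroo is a marsupial. The kangaroo lives in Australia."
print(bag_of_words(doc))
# Counter({'the': 2, 'kangaroo': 2, 'is': 1, 'a': 1, ...}) -- word order and grammar are gone
```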

NLP:

  • NLP analyzes sentence structure by shallow or deep parsing

  • NLP is good for automatic translation, summarization, or extraction.

  • NLP uses more knowledge about the language (grammar, thesaurus) and is rooted in linguistics (formal grammars and context-free languages; statistical NLP).

In web search, NLP has proven useful in many cases - extraction, tagging, and analyzing and understanding user queries. It is often more important to understand the queries than the documents (user intent).

IR vs. Recommender Systems

IR and RecSys are converging, as IR systems need to model personal preferences and RecSys need to support queries on large data sets.

IR:

IR usually assumes some commonality of user interests - commonality between different people that are searching for information. In IR, the focus is on answering search queries.

Recommender Systems:

RecSys assume users have different preferences, so RecSys do not necessarily need queries - "find some music I like".

Zipf and Bibliometric/Informetric Distributions

Zipf's distribution (or Zipf's law) is a special case of the power-law distribution. In the context of linguistics, where it's perhaps most famously applied, it states that the frequency of a word in a natural language corpus is inversely proportional to its rank in the frequency table. That is, the most common word appears about twice as often as the second most common word, three times as often as the third most common word, and so on.

Some things occur much more frequently than others:

"some authoers are much more often cited than others"

"some web pages have more in-links than others"

"some books or songs are more popular than others"

"some owrds are much more often used than others"

"some people have much more money than others"

This is a fundamental observation about many real-world phenomena, and it's what Zipf's distribution describes. Whether you're talking about word frequencies, city populations, income distributions, or the popularity of online articles, some items in these categories occur (or are present) way more often than others.

Larger z means larger skew

In the mathematical representation of Zipf's law, the exponent parameter $z$ (sometimes written as $s$) determines the skew of the distribution. A larger $z$ results in a steeper distribution in which fewer items have much higher frequencies and most items have very low frequencies.
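
A small sketch of the rank-frequency relationship $f(r) \propto 1/r^z$, showing how a larger exponent concentrates occurrences on the top-ranked items (the normalization is just for illustration):

```python
def zipf_frequencies(n_items, z):
    """Relative frequencies f(r) proportional to 1 / r**z for ranks r = 1..n_items."""
    weights = [1.0 / rank ** z for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for z in (0.5, 1.0, 2.0):
    freqs = zipf_frequencies(1000, z)
    print(f"z = {z}: top-10 items account for {sum(freqs[:10]):.0%} of all occurrences")
# Larger z -> steeper curve -> a few items dominate more strongly
```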

  • Heavy-tailed: The distribution has a long tail, meaning that while many items have a very low frequency, there are a few items with extremely high frequency. This aligns with the statement: "some may have a lot more, but most stuff is owned by the many." This essentially means that even though a few entities (like words, authors, or websites) are highly prevalent or own a lot, the majority still belong to a broader set of less prevalent entities.

  • Non-heavy-tailed: This is almost the opposite. A few entities dominate so overwhelmingly that they nearly "own" or represent everything. This aligns with: "some have a lot more, and own almost everything."

To put it simply, in many real-world datasets, a few items dominate in frequency or representation. This skew in distributions, often described by Zipf's law or other power-law distributions, has profound implications for areas like linguistics, economics, informatics, and more.
