Introduction to Information Retrieval
Information Retrieval (IR)
IR is concerned with the representation, storage, organization of, and access to information items.
IR focuses on automatic processing (indexing, clustering, and search) of unstructured data such as text, images, audio, or poems. We'll mainly focus on text data.
Unstructured data is not really unstructured: a novel, a poem, or a joke is not structureless, it is just structured in such a complex manner that basic data models cannot capture it accurately, so we treat it as unstructured. Structured data, then, is data with an obvious or simple structure, and that goes a long way for many business applications. Unstructured data is data whose structure is so difficult that we don't model it in simple data models.
Information retrieval focuses on human users and their needs: a person writes a query to find the information they want.
Databases, by contrast, are optimized to serve as building blocks for higher-level apps: low-level components on top of which you can write all kinds of applications. A DB should be really fast and also standardized.
Evaluation in IR: Precision versus Recall
(Also used for many ML problems)

Recall
The fraction of the relevant documents that are retrieved.
How many of the good documents are actually returned to the user?
Precision
The fraction of retrieved documents that are relevant.
How many of the documents returned to the user are actually good?
Example:
Search Problem: Given 1 million text documents, find all documents that are about kangaroos.
Suppose there are 160 documents about kangaroos, and our search tool returns 100 documents; of these 100 documents, 80 are about kangaroos.
Recall: 80/160 = 0.5
Precision: 80/100 = 0.8
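These numbers are easy to verify. Here is a minimal Python sketch of both measures for the kangaroo example (the function names are just illustrative):

```python
# Precision and recall for the kangaroo example:
# 160 relevant documents exist, 100 are retrieved, 80 of those are relevant.

def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant

def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved

print(recall(80, 160))     # 0.5
print(precision(80, 100))  # 0.8
```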
How do you know that a document is relevant to a topic or query?
People get paid to label documents as relevant or not.
There is a trade-off between precision and recall. If you give me a query and I return no answers, my precision is 100% but my recall is 0%. If I return everything, my recall is 100%, but my precision is going to be close to 0. Somewhere in the middle is the sweet spot you want, and where it lies may depend on your application.

However, real IR systems use more complex evaluation measures.
Evaluation in IR: More Realistic Measures
Realistic measures look at quality at the beginning of the ranking, the top-10 results in particular: how good are they, and how are they ranked?
They also need to recognize that there is more than just relevant and irrelevant. There is very relevant, relevant, slightly relevant - a bunch of different levels of relevance. Typically, people use four or five different levels.
There are many more realistic measures, but a popular one is precision@k (p@k).
For example, p@10 asks: what fraction of the top-10 results is relevant? That is, precision within the top-10 results. If p@10 = 0.59, it means that over many queries, about 59% of the top-10 results are relevant.
But we should also take the degree of relevance and the position of relevant results into account.
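As a sketch, p@k can be computed like this, assuming a hypothetical ranked list of document IDs and a set of known-relevant IDs (both made up for illustration):

```python
# A minimal precision@k sketch; the document IDs and relevance
# judgments below are invented for the example.

def precision_at_k(ranked_results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_results[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k

ranked = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d6", "d5", "d0"]
relevant_docs = {"d7", "d9", "d1", "d8", "d3", "d0"}
print(precision_at_k(ranked, relevant_docs, k=10))  # p@10 = 0.6
```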
IR vs. String Processing
String processing basically deals with simple, grep-like matching of strings to find certain words.
IR vs. NLP
IR:
IR usually does not analyze grammar or local/sentence structure. It mostly sees a document as a set or bag of words (see the sketch after this comparison).
IR uses simple statistical methods, which are good for search and categorization.
IR is largely language-independent
NLP:
NLP analyzes sentence structure by shallow or deep parsing
NLP is good for automatic translation, summarization, or extraction.
NLP uses more knowledge about the language (grammar, thesauri) and is rooted in linguistics (grammars and context-free languages; statistical NLP).
In web search, NLP has proven useful in many cases: extraction, tagging, and analyzing and understanding user queries. It is often more important to understand queries than documents (user intent).
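Here is a minimal sketch of the bag-of-words view referenced above: the representation keeps only term counts and discards word order and sentence structure.

```python
# Bag-of-words sketch: only term counts survive; word order is discarded.
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Lowercase, split on whitespace, and count terms."""
    return Counter(document.lower().split())

print(bag_of_words("the kangaroo saw the dog"))
# Counter({'the': 2, 'kangaroo': 1, 'saw': 1, 'dog': 1})
```

A real system would also strip punctuation, apply stemming, and so on; this only shows the order-free representation itself.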
IR vs. Recommender Systems
IR and RecSys are converging, as IR systems need to model personal preferences and RecSys need to support queries on large data sets.
IR:
IR usually assumes some commonality of user interests, i.e., commonality between the different people searching for information. In IR, the focus is on answering search queries.
Recommender Systems:
RecSys assume users have different preferences, so RecSys do not need queries: "find some music I like".
Zipf and Bibliometric/Informetric Distributions

Zipf's distribution (or Zipf's law) is a special case of the power-law distribution. In the context of linguistics, where it's perhaps most famously applied, it states that the frequency of a word in a natural language corpus is inversely proportional to its rank in the frequency table. That is, the most common word appears about twice as often as the second most common word, three times as often as the third most common word, and so on.
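In symbols (a standard formulation; f(r) is the frequency of the item at rank r, and z is the skew exponent discussed below, with z ≈ 1 in the classic case):

$$
f(r) \propto \frac{1}{r^{z}}
$$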
Some things occur much more frequently than others:
"some authoers are much more often cited than others"
"some web pages have more in-links than others"
"some books or songs are more popular than others"
"some owrds are much more often used than others"
"some people have much more money than others"
This is a fundamental observation about many real-world phenomena, and it's what Zipf's distribution describes. Whether you're talking about word frequencies, city populations, income distributions, or the popularity of online articles, some items in these categories occur (or are present) way more often than others.
Larger z means larger skew
In the mathematical representation of Zipf's law, the exponent z (sometimes written as s) determines the skew of the distribution. A larger z results in a steeper distribution, where fewer items have much higher frequencies and most items have very low frequencies.
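This is easy to see numerically. A minimal sketch, assuming the normalized form of the formula above (all names are illustrative):

```python
# How the exponent z controls skew: the probability of rank r is
# (1/r**z), normalized over all n_items ranks.

def zipf_probs(n_items: int, z: float) -> list[float]:
    """Probability of each rank 1..n_items under Zipf with exponent z."""
    weights = [1.0 / rank**z for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

for z in (0.5, 1.0, 2.0):
    probs = zipf_probs(n_items=1000, z=z)
    print(f"z={z}: top-10 ranks carry {sum(probs[:10]):.0%} of the mass")

# Larger z -> the top ranks dominate more: a steeper, more skewed distribution.
```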
Heavy-tailed: The distribution has a long tail, meaning that while many items have a very low frequency, there are a few items with extremely high frequency. This aligns with the statement: "some may have a lot more, but most stuff is owned by the many." This essentially means that even though a few entities (like words, authors, or websites) are highly prevalent or own a lot, the majority still belong to a broader set of less prevalent entities.
Non-heavy-tailed: This is almost the opposite. A few entities dominate so overwhelmingly that they nearly "own" or represent everything. This aligns with: "some have a lot more, and own almost everything."
To put it simply, in many real-world datasets, a few items dominate in frequency or representation. This skew in distributions, often described by Zipf's law or other power-law distributions, has profound implications for areas like linguistics, economics, informatics, and more.