Inverted Index

Definition

An inverted index is a data structure that maps content (such as keywords or terms) to its locations within a dataset, enabling fast lookup and filtering operations. In Qdrant, the inverted index backs payload filtering: it allows efficient retrieval of vectors whose payload satisfies specific conditions, such as a metadata field or tag associated with the stored vectors.

This mechanism is particularly useful in hybrid search scenarios where keyword-based or metadata filtering is combined with dense (vector-based) similarity search.


Example in Qdrant

Imagine you are building a recommendation engine for an e-commerce platform. Each vector represents a product, and payloads (metadata) include fields such as category, price, and brand.

Creating a Collection with an Inverted Index

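PUT /collections/products
{
  "vectors": {
    "size": 128,
    "distance": "Cosine"
  }
}

Payload indexes are then created per field with separate requests:

PUT /collections/products/index
{
  "field_name": "category",
  "field_schema": "keyword"
}

PUT /collections/products/index
{
  "field_name": "price",
  "field_schema": "integer"
}

PUT /collections/products/index
{
  "field_name": "brand",
  "field_schema": "keyword"
}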

In this configuration:

  • category and brand are indexed as keyword, allowing filtering by exact matches.
  • price is indexed as integer, enabling range queries (e.g., products priced between $10 and $50).
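To see why an index on a numeric field makes range conditions cheap, here is a minimal Python sketch that keeps (price, product ID) pairs sorted and answers a range query with two binary searches. It only illustrates the general idea and is not Qdrant's internal implementation; the class name NumericIndex and the sample prices are made up for this example.

```python
from bisect import bisect_left, bisect_right


class NumericIndex:
    """Toy range index over sorted (value, point_id) pairs (hypothetical, not Qdrant code)."""

    def __init__(self, values_by_id):
        # Sort once by value so that range lookups can use binary search.
        self._entries = sorted((value, pid) for pid, value in values_by_id.items())
        self._values = [value for value, _ in self._entries]

    def range(self, gte, lte):
        """Return the IDs whose value satisfies gte <= value <= lte."""
        lo = bisect_left(self._values, gte)
        hi = bisect_right(self._values, lte)
        return [pid for _, pid in self._entries[lo:hi]]


# Illustrative prices keyed by product ID (assumed data).
prices = {1: 15, 2: 60, 3: 10, 4: 50, 5: 35}
price_index = NumericIndex(prices)

# Products priced between $10 and $50, inclusive.
matches = price_index.range(gte=10, lte=50)
print(sorted(matches))
```

Each lookup costs two binary searches plus the size of the result, instead of a pass over every stored price.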


Query Example

To retrieve vectors for all products in the electronics category with a price between $50 and $200:

POST /collections/products/points/search
{
  "filter": {
    "must": [
      { "key": "category", "match": { "value": "electronics" } },
      { "key": "price", "range": { "gte": 50, "lte": 200 } }
    ]
  },
  "vector": [0.1, 0.2, 0.3, ...],
  "limit": 10
}

Result

The inverted index lets Qdrant evaluate the filter conditions without scanning every point's payload, so the similarity search only has to consider points that match the filter.
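The interaction between filtering and similarity ranking can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions (a tiny in-memory dataset, 2-dimensional vectors, filter applied strictly before scoring); Qdrant's actual query planning may differ, and every name and value below is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy data (made up): product ID -> (payload, vector).
points = {
    1: ({"category": "electronics", "price": 150}, [1.0, 0.0]),
    2: ({"category": "electronics", "price": 200}, [0.6, 0.8]),
    3: ({"category": "furniture", "price": 120}, [1.0, 0.1]),
    4: ({"category": "electronics", "price": 30}, [0.9, 0.1]),
}

# Inverted index on the category field, built ahead of query time.
category_index = {}
for pid, (payload, _) in points.items():
    category_index.setdefault(payload["category"], set()).add(pid)


def search(query, category, gte, lte, limit):
    """Narrow to points matching the filter, then rank only those by cosine similarity."""
    candidates = {
        pid
        for pid in category_index.get(category, set())
        if gte <= points[pid][0]["price"] <= lte
    }
    scored = sorted(
        ((cosine(query, points[pid][1]), pid) for pid in candidates),
        reverse=True,
    )
    return [pid for _, pid in scored[:limit]]


results = search([1.0, 0.0], "electronics", gte=50, lte=200, limit=10)
print(results)
```

Products 3 and 4 have vectors close to the query, but neither is scored because each fails a filter condition (category and price, respectively).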


Why It Matters

An inverted index lets Qdrant combine metadata filtering and semantic similarity in a single query without scanning every payload, which keeps filtered searches fast while the results stay relevant to the stated constraints.


Tabular Example

Suppose we have the following dataset of products:

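| Product ID | Category        | Brand   | Price ($) |
|------------|-----------------|---------|-----------|
| 1          | Electronics     | Samsung | 150       |
| 2          | Electronics     | Apple   | 200       |
| 3          | Home Appliances | Samsung | 300       |
| 4          | Electronics     | Sony    | 100       |
| 5          | Furniture       | IKEA    | 250       |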

Based on this data, an inverted index could look like this:

| Key                      | Product IDs |
|--------------------------|-------------|
| category:electronics     | [1, 2, 4]   |
| category:home appliances | [3]         |
| category:furniture       | [5]         |
| brand:samsung            | [1, 3]      |
| brand:apple              | [2]         |
| brand:sony               | [4]         |
| brand:ikea               | [5]         |
| price_range:0-100        | [4]         |
| price_range:101-200      | [1, 2]      |
| price_range:201-300      | [3, 5]      |
| price_range:301-400      | []          |

Explanation:

  • The inverted index maps keys (like category:electronics or brand:samsung) to a list of Product IDs.
  • It can also include derived keys, such as price_range, which groups prices into ranges.

This structure allows efficient filtering, as you can quickly retrieve all product IDs for a specific category, brand, or price range without scanning the entire dataset.
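As a rough illustration, the following Python sketch builds such an index from the product rows and answers a combined filter by intersecting ID sets. The helper names (price_bucket, build_index) are invented for this example, and the bucket boundaries follow the price_range keys (0-100, 101-200, and so on); it is a teaching sketch rather than how Qdrant stores its indexes.

```python
from collections import defaultdict

# Product rows from the dataset above.
products = [
    {"id": 1, "category": "Electronics", "brand": "Samsung", "price": 150},
    {"id": 2, "category": "Electronics", "brand": "Apple", "price": 200},
    {"id": 3, "category": "Home Appliances", "brand": "Samsung", "price": 300},
    {"id": 4, "category": "Electronics", "brand": "Sony", "price": 100},
    {"id": 5, "category": "Furniture", "brand": "IKEA", "price": 250},
]


def price_bucket(price):
    """Map a price to a bucket label: 0-100, 101-200, 201-300, ..."""
    if price <= 100:
        return "0-100"
    low = ((price - 1) // 100) * 100 + 1
    return f"{low}-{low + 99}"


def build_index(rows):
    """Map each 'field:value' key to the set of product IDs that carry it."""
    index = defaultdict(set)
    for row in rows:
        index[f"category:{row['category'].lower()}"].add(row["id"])
        index[f"brand:{row['brand'].lower()}"].add(row["id"])
        index[f"price_range:{price_bucket(row['price'])}"].add(row["id"])
    return index


index = build_index(products)

# Combined filter: electronics priced in the 101-200 bucket.
hits = index["category:electronics"] & index["price_range:101-200"]
print(sorted(hits))
```

Combining conditions becomes a set intersection over precomputed ID sets, so no product row is read at query time.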

What is a Data Model for Unstructured Data?

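A data model for unstructured data pairs a vector representation of the raw content with structured fields that describe it, so the same record supports both similarity search and filtering.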

Data Model Example: Document Search System

This is an example of a data model tailored for unstructured data in the context of a document search system, such as one built using a vector database like Qdrant.


Entity: Document

| Field Name     | Data Type        | Description |
|----------------|------------------|-------------|
| id             | String (UUID)    | Unique identifier for the document. |
| title          | String           | The title of the document. |
| content_vector | Float Array      | Dense vector representation of the document content, generated using a pre-trained language model (e.g., an OpenAI embedding model or BERT). |
| metadata       | Object (JSON)    | Key-value pairs storing metadata about the document (e.g., author, date, tags). |
| categories     | Array of Strings | List of categories the document belongs to (e.g., "contract law", "intellectual property"). |
| created_at     | DateTime         | Timestamp when the document was created. |
| updated_at     | DateTime         | Timestamp when the document was last updated. |

Example JSON Representation

{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "title": "Copyright Law in the Digital Age",
  "content_vector": [0.123, 0.987, 0.456, ...], 
  "metadata": {
    "author": "Jane Doe",
    "publish_date": "2024-01-15",
    "language": "English"
  },
  "categories": ["copyright law", "digital media"],
  "created_at": "2024-01-15T10:00:00Z",
  "updated_at": "2024-11-30T12:00:00Z"
}
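In application code, the same entity could be expressed as a typed record. The sketch below uses a Python dataclass whose fields mirror the Document entity; the to_json helper and the three-element vector are assumptions made for illustration (a real embedding would have as many dimensions as the model produces).

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Document:
    """Typed mirror of the Document entity (illustrative, not a Qdrant type)."""

    id: str
    title: str
    content_vector: list[float]
    metadata: dict = field(default_factory=dict)
    categories: list[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_json(self) -> str:
        """Serialize to JSON, writing timestamps as ISO 8601 strings."""
        record = asdict(self)
        record["created_at"] = self.created_at.isoformat()
        record["updated_at"] = self.updated_at.isoformat()
        return json.dumps(record)


doc = Document(
    id="123e4567-e89b-12d3-a456-426614174000",
    title="Copyright Law in the Digital Age",
    content_vector=[0.123, 0.987, 0.456],  # shortened placeholder vector
    metadata={"author": "Jane Doe", "publish_date": "2024-01-15", "language": "English"},
    categories=["copyright law", "digital media"],
    created_at=datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc),
    updated_at=datetime(2024, 11, 30, 12, 0, tzinfo=timezone.utc),
)
payload = json.loads(doc.to_json())
print(payload["title"], payload["categories"])
```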

Why This Model?

  1. Flexibility: The content_vector captures the unstructured text for similarity search, while the structured metadata supports filtering and faceting.
  2. Extensibility: You can add new fields (e.g., “related documents”) without major schema changes.
  3. Efficiency: Vector-based retrieval is efficient for unstructured text, while metadata aids precise filtering.