What is a Vector Database?
An inverted index is a data structure that maps content (such as keywords or terms) to its locations within a dataset, enabling fast lookup and filtering. In Qdrant, inverted indexes speed up filtering by efficiently retrieving the vectors that satisfy specific payload conditions, such as metadata or tags associated with the stored vectors.
This mechanism is particularly useful in hybrid search scenarios where sparse (keyword-based) filtering is combined with dense (vector-based) similarity searches.
Imagine you are building a recommendation engine for an e-commerce platform. Each vector represents a product, and payloads (metadata) include fields such as `category`, `price`, and `brand`.
    PUT /collections/products
    {
      "vectors": {
        "size": 128,
        "distance": "Cosine"
      }
    }

    PUT /collections/products/index
    { "field_name": "category", "field_schema": "keyword" }

    PUT /collections/products/index
    { "field_name": "price", "field_schema": "integer" }

    PUT /collections/products/index
    { "field_name": "brand", "field_schema": "keyword" }
In this configuration:
- `category` and `brand` are indexed as `keyword`, allowing filtering by exact matches.
- `price` is indexed as `integer`, enabling range queries (e.g., products priced between $10 and $50).
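To make the difference between the two index types concrete, here is a minimal Python sketch. It is not Qdrant's internal implementation: the class names `KeywordIndex` and `IntegerIndex` and the sample prices are hypothetical, and integer point IDs are assumed. Exact matches are served from a dictionary, while range lookups use a sorted list and binary search.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class KeywordIndex:
    """Exact-match index: payload value -> set of point IDs (hypothetical helper)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, point_id, value):
        self._postings[value].add(point_id)

    def match(self, value):
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get(value, ()))


class IntegerIndex:
    """Range index: sorted (value, point_id) pairs searched with bisect."""

    def __init__(self):
        self._entries = []  # kept sorted by (value, point_id)

    def add(self, point_id, value):
        entry = (value, point_id)
        self._entries.insert(bisect_left(self._entries, entry), entry)

    def range(self, gte, lte):
        # Locate the slice of entries whose value lies in [gte, lte].
        lo = bisect_left(self._entries, (gte, float("-inf")))
        hi = bisect_right(self._entries, (lte, float("inf")))
        return {point_id for _, point_id in self._entries[lo:hi]}


# Illustrative data only: (point ID, brand, price).
brands = KeywordIndex()
prices = IntegerIndex()
for point_id, brand, price in [(1, "acme", 15), (2, "acme", 40), (3, "zenith", 75)]:
    brands.add(point_id, brand)
    prices.add(point_id, price)

acme_ids = brands.match("acme")
cheap_ids = prices.range(gte=10, lte=50)
```

The dictionary lookup costs roughly constant time per value, while the range lookup needs only two binary searches plus a scan of the matching slice, which is why a numeric field benefits from a different index type than a keyword field.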
To retrieve vectors for all products in the `electronics` category with a price between $50 and $200:
    POST /collections/products/points/search
    {
      "filter": {
        "must": [
          { "key": "category", "match": { "value": "electronics" } },
          { "key": "price", "range": { "gte": 50, "lte": 200 } }
        ]
      },
      "vector": [0.1, 0.2, 0.3, ...],
      "limit": 10
    }
The inverted index ensures that the filter step is efficient, significantly reducing the search space before the similarity search is performed.
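The filter-then-rank idea can be sketched in a few lines of Python. This is a simplification under stated assumptions, not a description of Qdrant's query planner, which is more involved: the helper names `cosine_similarity` and `filtered_search`, the three-dimensional vectors, and the prices are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_search(query, vectors, category_index, prices, category, gte, lte, limit):
    """Apply the payload filter first, then rank only the surviving candidates."""
    candidates = {
        point_id
        for point_id in category_index.get(category, set())
        if gte <= prices[point_id] <= lte
    }
    scored = [(cosine_similarity(query, vectors[pid]), pid) for pid in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:limit]]


# Toy data for illustration only.
vectors = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
    4: [0.0, 0.0, 1.0],
}
prices = {1: 150, 2: 200, 3: 300, 4: 30}
category_index = {"electronics": {1, 2, 4}, "home appliances": {3}}

result = filtered_search(
    query=[1.0, 0.0, 0.0],
    vectors=vectors,
    category_index=category_index,
    prices=prices,
    category="electronics",
    gte=50,
    lte=200,
    limit=10,
)
```

Only the points that pass the filter are ever scored, so the similarity computation runs over two vectors here instead of four.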
An inverted index in Qdrant allows developers to create powerful, real-time search applications that combine metadata filtering and semantic similarity, optimizing both speed and relevance.
Suppose we have the following dataset of products:
Product ID | Category | Brand | Price |
---|---|---|---|
1 | Electronics | Samsung | 150 |
2 | Electronics | Apple | 200 |
3 | Home Appliances | Samsung | 300 |
4 | Electronics | Sony | 100 |
5 | Furniture | IKEA | 250 |
Based on this data, an inverted index could look like this:
Key | Value |
---|---|
category:electronics | Product IDs: [1, 2, 4] |
category:home appliances | Product IDs: [3] |
category:furniture | Product IDs: [5] |
brand:samsung | Product IDs: [1, 3] |
brand:apple | Product IDs: [2] |
brand:sony | Product IDs: [4] |
brand:ikea | Product IDs: [5] |
price_range:0-100 | Product IDs: [4] |
price_range:101-200 | Product IDs: [1, 2] |
price_range:201-300 | Product IDs: [3, 5] |
price_range:301-400 | Product IDs: [] |
Each key (for example, `category:electronics` or `brand:samsung`) maps to a list of Product IDs. Numeric fields such as price are handled through `price_range` keys, which group prices into ranges. This structure allows efficient filtering, as you can quickly retrieve all product IDs for a specific category, brand, or price range without scanning the entire dataset.
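As a rough illustration, the following Python sketch builds such an index from the five products in the dataset. The function names `price_bucket` and `build_inverted_index` are hypothetical, and the buckets are assumed to be inclusive (0-100, 101-200, and so on), so a price of exactly 200 falls into `101-200`.

```python
from collections import defaultdict

products = [
    {"id": 1, "category": "Electronics", "brand": "Samsung", "price": 150},
    {"id": 2, "category": "Electronics", "brand": "Apple", "price": 200},
    {"id": 3, "category": "Home Appliances", "brand": "Samsung", "price": 300},
    {"id": 4, "category": "Electronics", "brand": "Sony", "price": 100},
    {"id": 5, "category": "Furniture", "brand": "IKEA", "price": 250},
]


def price_bucket(price, width=100):
    """Return the label of the inclusive bucket that contains `price`."""
    if price <= width:
        return f"0-{width}"
    lower = ((price - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


def build_inverted_index(rows):
    """Map each 'field:value' key to the list of product IDs that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[f"category:{row['category'].lower()}"].append(row["id"])
        index[f"brand:{row['brand'].lower()}"].append(row["id"])
        index[f"price_range:{price_bucket(row['price'])}"].append(row["id"])
    return dict(index)


index = build_inverted_index(products)

# A combined filter is an intersection of posting lists:
# "category is electronics AND brand is samsung".
hits = set(index["category:electronics"]) & set(index["brand:samsung"])
```

Answering the combined filter touches only two short lists rather than every product record, which is the property that keeps filtering cheap as the dataset grows.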
Here is an example of a data model tailored for unstructured data in a document search system, such as one built on a vector database like Qdrant.
Field Name | Data Type | Description |
---|---|---|
id | String (UUID) | Unique identifier for the document. |
title | String | The title of the document. |
content_vector | Float Array | Dense vector representation of the document content, generated using a pre-trained language model (e.g., OpenAI, BERT). |
metadata | Object (JSON) | Key-value pairs storing metadata about the document (e.g., author, date, tags). |
categories | Array of Strings | List of categories the document belongs to (e.g., "contract law", "intellectual property"). |
created_at | DateTime | Timestamp when the document was created. |
updated_at | DateTime | Timestamp when the document was last updated. |
    {
      "id": "123e4567-e89b-12d3-a456-426614174000",
      "title": "Copyright Law in the Digital Age",
      "content_vector": [0.123, 0.987, 0.456, ...],
      "metadata": {
        "author": "Jane Doe",
        "publish_date": "2024-01-15",
        "language": "English"
      },
      "categories": ["copyright law", "digital media"],
      "created_at": "2024-01-15T10:00:00Z",
      "updated_at": "2024-11-30T12:00:00Z"
    }
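One way to enforce this data model in application code is a small typed wrapper that validates a record before it is stored. The sketch below is an assumption rather than part of any Qdrant client: the `Document` class and the `parse_timestamp` helper are hypothetical, Python 3.9 or later is assumed, and the sample vector is truncated to the three values shown above purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


def parse_timestamp(value: str) -> datetime:
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Document:
    id: UUID
    title: str
    content_vector: list[float]
    metadata: dict[str, str]
    categories: list[str]
    created_at: datetime
    updated_at: datetime

    @classmethod
    def from_record(cls, record: dict) -> "Document":
        """Build a Document from a JSON-like dict, converting each field's type."""
        return cls(
            id=UUID(record["id"]),  # raises ValueError on a malformed UUID
            title=record["title"],
            content_vector=[float(x) for x in record["content_vector"]],
            metadata=dict(record.get("metadata", {})),
            categories=list(record.get("categories", [])),
            created_at=parse_timestamp(record["created_at"]),
            updated_at=parse_timestamp(record["updated_at"]),
        )


record = {
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "title": "Copyright Law in the Digital Age",
    "content_vector": [0.123, 0.987, 0.456],  # truncated for illustration
    "metadata": {"author": "Jane Doe", "publish_date": "2024-01-15", "language": "English"},
    "categories": ["copyright law", "digital media"],
    "created_at": "2024-01-15T10:00:00Z",
    "updated_at": "2024-11-30T12:00:00Z",
}

doc = Document.from_record(record)
```

Converting the identifier and timestamps at the boundary means malformed records fail early, before a vector with inconsistent metadata reaches the database.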