
Abstract

There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search for a staggering half century and more. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of structural properties that can be used for indexing, and the difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware-accelerated query execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.

Details

Title
Survey of vector database management systems
Author
Pan, James Jie 1; Wang, Jianguo 2; Li, Guoliang 1

1 Tsinghua University, Department of Computer Science and Technology, Beijing, China (GRID:grid.12527.33) (ISNI:0000 0001 0662 3178)
2 Purdue University, Department of Computer Science, West Lafayette, USA (GRID:grid.169077.e) (ISNI:0000 0004 1937 2197)
Publication title
The VLDB Journal
Volume
33
Issue
5
Pages
1591-1615
Publication year
2024
Publication date
Sep 2024
Publisher
Springer Nature B.V.
Place of publication
New York
Country of publication
Netherlands
Publication subject
ISSN
1066-8888
e-ISSN
0949-877X
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
 
 
Online publication date
2024-07-15
Milestone dates
2023-10-12 (Received); 2024-06-07 (Revision received); 2024-06-24 (Accepted); 2024-06-25 (Registration)
First posting date
15 Jul 2024
ProQuest document ID
3256783920
Document URL
https://www.proquest.com/scholarly-journals/survey-vector-database-management-systems/docview/3256783920/se-2?accountid=208611
Copyright
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
Last updated
2025-10-03
Database
ProQuest One Academic