
Understanding the differences between sparse and dense semantic vectors

More and more frequently, we hear about semantic search and new ways to implement it. In the latest version of OpenSearch (2.11), semantic search through sparse vectors has been introduced. But what does "sparse vector" mean? How does it differ from a dense vector? Let's try to clarify in this article.


Table of contents

  1. At the dawn of computing and search engines: sparse vectors
  2. From sparse vectors to dense
  3. Semantic search using dense vectors
  4. Semantic search through sparse vectors
  5. Final assessments and conclusions

At the dawn of computing and search engines: sparse vectors

If you’re familiar with conventional search engines and technologies like OpenSearch and Elasticsearch, you’ve likely come across TF-IDF and BM25, two algorithms leveraging sparse structures for document representation and relevance calculation. To simplify the explanation of sparse structures, let’s consider a database with three documents:

  1. “The quick brown fox jumps.”
  2. “A lazy dog sleeps.”
  3. “The brown fox and the lazy dog.”

Representing these documents in an incidence matrix:

            quick  brown  fox  jumps  lazy  dog  sleeps  the  and  a
Document 1    1      1     1     1     0     0     0      1    0   0
Document 2    0      0     0     0     1     1     1      1    0   1
Document 3    0      1     1     0     1     1     0      2    1   0

Utilizing the same matrix, we can compute the term frequency (TF) matrix using the formula:

TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)

            quick  brown  fox   jumps  lazy  dog   sleeps  the   and   a
Document 1  0.25   0.25   0.25  0.25   0     0     0       0.25  0     0
Document 2  0      0      0     0      0.33  0.33  0.33    0.33  0     0.33
Document 3  0      0.33   0.33  0      0.33  0.33  0       0.67  0.33  0

Finally, applying the TF-IDF formula:

TF-IDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = log(N / df(t)), N is the total number of documents, and df(t) is the number of documents containing term t.

We can visualize the matrix employed by the TF-IDF algorithm to calculate relevance:

            quick  brown  fox    jumps  lazy   dog    sleeps  the  and    a
Document 1  0.044  0.044  0.044  0.146  0      0      0       0    0      0
Document 2  0      0      0      0      0.058  0.058  0.195   0    0      0.058
Document 3  0      0.058  0.058  0      0.058  0.058  0       0    0.058  0

You may have noticed, as illustrated in the matrix above, that the values are predominantly concentrated around a few non-zero entries, while the majority remain at zero. This concentration of non-zero values amidst a sea of zeros characterizes what is referred to as “sparse vectors”.

The sparsity in these vectors allows for more efficient storage and computation: it reduces redundancy, as only the non-zero elements need to be stored, leading to more compact data structures. This efficiency is particularly beneficial when working with large datasets, as it enables faster computations and reduces memory requirements.
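To make this concrete, here is a minimal Python sketch that computes TF-IDF weights for the three documents above and stores each vector sparsely, keeping only the non-zero entries. The exact numbers depend on the normalization and log base used, so they may differ slightly from the tables above:

```python
from collections import Counter
import math

documents = [
    "the quick brown fox jumps",
    "a lazy dog sleeps",
    "the brown fox and the lazy dog",
]

tokenized = [doc.split() for doc in documents]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter()
for tokens in tokenized:
    for term in set(tokens):
        df[term] += 1

# TF-IDF stored sparsely: only non-zero entries are kept in each dict.
sparse_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    total = len(tokens)
    vector = {
        term: (count / total) * math.log(N / df[term])
        for term, count in counts.items()
        if df[term] < N  # IDF is 0 for terms present in every document
    }
    sparse_vectors.append(vector)

for i, vector in enumerate(sparse_vectors, start=1):
    print(f"Document {i}: {vector}")
```

Storing each vector as a dict of non-zero entries is exactly the compactness argument above: the dictionary never spends space on the zeros that dominate the full matrix.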

From sparse vectors to dense

Using sparse vectors, we have essentially represented each document with a set of numerical values, a significant portion of them being 0s.

Dense vectors will allow us to do the same, without relying on 0s. In dense vectors, every document is represented by a vector where each value lies between 0 and 1, and each value represents a feature of our document.

Let’s dive into how dense vectors work using an example. We have some documents, each of them representing a different animal. To describe every animal we will use some common features: for example, size, friendliness, and intelligence.


For each feature we can assign a value from 0 to 1, where 0 means less of that trait and 1 means more. For example, a cat is quite friendly, small in size, and extremely clever (even though it likes to sleep all the time).

So we can transform our cat and all the animals of our collection into vectors:

Animal    Size  Friendliness  Intelligence
Cat       0.25      0.85          0.80
Dog       0.30      0.90          0.80
Elephant  0.90      0.70          0.60
Dolphin   0.60      0.95          0.85
Parrot    0.15      0.80          0.75

As you may have noticed, we represented the collection using vectors without any 0 values. Additionally, each document is characterized by specific features, which makes comparisons easy. Indeed, we can represent this 3-dimensional vector space with a 3D graph.


The process of creating dense vectors is typically accomplished using specific models. These models generate vectors with many dimensions, often in the hundreds. While the animal example was relatively straightforward with a three-dimensional vector, envision dealing with a 784-dimensional vector. This complexity means that generating these vectors can be time-consuming and can require more space than what we typically encounter with sparse vectors.
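As a minimal sketch of what this looks like in practice, here is how one might generate dense document vectors in Python with the sentence-transformers library. The model name all-MiniLM-L6-v2 is just one example of a small embedding model (it produces 384-dimensional vectors), not something prescribed by this article; any similar embedding model works:

```python
from sentence_transformers import SentenceTransformer

# A small, widely used embedding model; swap in any model you prefer.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The quick brown fox jumps.",
    "A lazy dog sleeps.",
    "The brown fox and the lazy dog.",
]

# encode() returns one dense vector per document.
embeddings = model.encode(documents)
print(embeddings.shape)  # (3, 384): every entry is a learned feature value
```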

Semantic search using dense vectors

In vector or semantic search, we are essentially comparing our vector database to a query vector. The query vector is simply our query transformed into a vector.

The comparison between the query vector and the database can be made through different mathematical functions, such as cosine similarity or the dot product. The documents most similar to our query vector are the most relevant ones and, normally, the documents we are looking for.
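Here is a minimal sketch that ranks the animal vectors from the previous section against a query vector using cosine similarity. The query values are invented for illustration, standing in for something like "a small, friendly, clever animal":

```python
import numpy as np

# The animal vectors from the table above: [size, friendliness, intelligence].
animals = {
    "cat":      np.array([0.25, 0.85, 0.80]),
    "dog":      np.array([0.30, 0.90, 0.80]),
    "elephant": np.array([0.90, 0.70, 0.60]),
    "dolphin":  np.array([0.60, 0.95, 0.85]),
    "parrot":   np.array([0.15, 0.80, 0.75]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical query vector: small, friendly, clever.
query = np.array([0.20, 0.90, 0.80])

ranked = sorted(
    ((cosine_similarity(query, vec), name) for name, vec in animals.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```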

If you want to learn more about how vector search works, and how to implement it, you may want to take a look at Exploring Vector Search with JINA: An Overview and Guide and Diving into NLP with the Elastic Stack.

Semantic search through sparse vectors

Semantic search can also be done through sparse vectors, thanks to a technique called “term expansion”. Term expansion involves broadening the representation of a document by incorporating additional relevant terms that capture its meaning. These expanded terms are indexed as if they were part of the original document. In this context, machine learning models, such as Elasticsearch’s ELSER or Amazon’s Neural Encoder for OpenSearch, can be employed for effective term expansion.

Let’s see how this works, once again, through examples:

As we did before, let’s consider these phrases

  1. “The quick brown fox jumps.”
  2. “A lazy dog sleeps.”
  3. “The brown fox and the lazy dog.”

and their incidence matrix:

            quick  brown  fox  jumps  lazy  dog  sleeps  the  and  a
Document 1    1      1     1     1     0     0     0      1    0   0
Document 2    0      0     0     0     1     1     1      1    0   1
Document 3    0      1     1     0     1     1     0      2    1   0

Term expansion aims to associate new relevant words with each document. For example, the term “quick” can be expanded with synonyms such as “fast” or “swift”. This can be done for each term in each phrase in our database.

For example:

  • Quick: fast, swift
  • Brown: reddish-brown, tan
  • Fox: mammal, animal
  • Jumps: leaping, hopping, bounding
  • Lazy: lethargic, inactive
  • Dog: canine, domesticated animal
  • Sleeps: dozes, naps

which results in a new incidence matrix (abridged here to a few of the expanded terms for readability):

            quick  fast  swift  brown  fox  animal  jumps  lazy  dog  sleeps  the  and  a
Document 1    1     1     1      1     1     1       1     0     0     0       1    0   0
Document 2    0     0     0      0     0     1       0     1     1     1       1    0   1
Document 3    0     0     0      1     1     1       0     1     1     0       2    1   0

Now, if we search for ‘naps’, we will retrieve the second document, ‘A lazy dog sleeps.’, because ‘naps’ was indexed as an expansion of ‘sleeps’.
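As a minimal sketch of this mechanism, the snippet below uses the hand-written synonym map from the list above in place of a learned model such as ELSER, and builds a simple inverted index over the original terms plus their expansions:

```python
# Hand-written synonym map standing in for a learned sparse encoder.
expansions = {
    "quick": ["fast", "swift"],
    "brown": ["reddish-brown", "tan"],
    "fox": ["mammal", "animal"],
    "jumps": ["leaping", "hopping", "bounding"],
    "lazy": ["lethargic", "inactive"],
    "dog": ["canine", "domesticated animal"],
    "sleeps": ["dozes", "naps"],
}

documents = [
    "the quick brown fox jumps",
    "a lazy dog sleeps",
    "the brown fox and the lazy dog",
]

# Inverted index mapping each indexed term to the documents containing it,
# where "containing" includes terms added by expansion.
inverted_index: dict[str, set[int]] = {}
for doc_id, doc in enumerate(documents):
    for term in doc.split():
        for indexed_term in [term, *expansions.get(term, [])]:
            inverted_index.setdefault(indexed_term, set()).add(doc_id)

# Searching for "naps" matches document 2, which never contains that word.
print(inverted_index.get("naps"))  # {1} -> "a lazy dog sleeps"
```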

As with dense vector search, sparse semantic search can also employ the machine learning model during the search phase to expand our query; by analogy with dense vector search, this is called ‘bi-encoder’ mode. If we use term expansion only during indexing, we are in ‘document-only’ mode. This second solution is much faster, since there is no term expansion at query time.
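For reference, here is a hedged sketch of what such a query could look like against the neural_sparse query type introduced in OpenSearch 2.11, using the opensearch-py client. The index name, field name, and model id below are placeholders, not values from this article; the model must already be deployed in the cluster:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Hypothetical index and field names; model_id references a deployed
# sparse encoding model that expands the query text at search time.
response = client.search(
    index="my-sparse-index",
    body={
        "query": {
            "neural_sparse": {
                "passage_embedding": {
                    "query_text": "a lazy dog naps",
                    "model_id": "<your-model-id>",
                }
            }
        }
    },
)
print(response["hits"]["hits"])
```

In document-only mode the expansion model is not invoked at query time, which is where the latency advantage described above comes from.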

Sparse search has three big advantages compared to dense vector search:

  1. The index size is much smaller, since only non-zero values are stored in sparse vectors.
  2. Reduced runtime RAM cost, especially when we are using document-only mode.
  3. Lower computational cost, as we can use a native Lucene index to store the new terms.

Final assessments and conclusions

In this exploration of sparse and dense vectors, we delved into the foundational aspects of search engines. We covered the traditional use of sparse vectors in algorithms like TF-IDF and BM25, as well as more modern techniques involving both dense and sparse vectors for semantic search.

It’s crucial to remember that the choice between keyword search and semantic search depends on the specific requirements of a given application, and vector search may not always be the most efficient solution. Furthermore, deciding between sparse and dense vectors can be challenging and requires a thorough requirements analysis.

In conclusion, the landscape of information retrieval continues to evolve. Understanding the strengths of both sparse and dense vector approaches is essential for building effective and scalable search systems, and we hope this article helped clarify the differences.
