Feedback - Indexing of media file transcripts

17/12/2021

Author(s):

Roudy Khoury

Mouhcine Boutinzer

Reading time: 4 minute(s)

all.site is a collaborative search engine. It works like Bing or Google, but goes further: it can index media content, for example, and organize data from systems such as Slack and Confluence, or any information available on a company's intranet.

Introduction

To improve the relevance of our search results, we have added to all.site the ability to find media such as videos or podcasts by searching for terms that exist only in the transcripts of those media.

Speech-to-text is a technology that has evolved enormously, especially since the advent of machine learning. It is used, for example, by voice assistants such as Alexa and Siri.

In the case of all.site, media transcripts are extracted using the speech recognition tool Vosk API.

Vosk API is an open source library that works in offline mode. It supports over 20 languages and dialects including English, French, Spanish, Chinese… Vosk provides speech recognition for chatbots or virtual assistants. The technology can also create subtitles for movies or transcripts for conferences.

Goal

The goal of this project is to transcribe media content and then index it in Elasticsearch, so that our users can extend their search to content that currently exists only in audio form. In this way, all.site returns more relevant results and offers a better user experience.

Flexibility and simplicity

Vosk API supports several programming languages (Java, PHP, Node, etc.) and runs on lightweight systems (Raspberry Pi, smartphone, etc.) without the need for a huge computing capacity. Moreover, the technology is easy to install, with light or heavy models depending on the needs. Vosk API, combined with ffmpeg, can analyze several media file formats including mp3, mp4, wav, etc. It is also possible to perform vocabulary customization and model adaptation for better speech recognition performance, a topic we will cover in a future post.

Results

In this example, we add a web source with the URL of the site to crawl:

Once the source is added, the web crawler indexes the site data and retrieves the media files it finds through HTML tags such as <audio> and <video>, then sends them to the Vosk API. This API extracts the transcripts from these files, which are then indexed in Elasticsearch, the core of all.site.
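The media discovery step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the actual all.site crawler: the class name MediaLinkParser and the function extract_media_urls are hypothetical, and the sketch assumes that media files are referenced through the src attribute of <audio>, <video> or nested <source> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaLinkParser(HTMLParser):
    """Collects the src URLs of <audio>, <video> and nested <source> tags."""

    MEDIA_TAGS = {"audio", "video"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media_urls = []
        self._media_depth = 0  # > 0 while inside an <audio> or <video> element

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag in self.MEDIA_TAGS:
            self._media_depth += 1
            if src:
                self._add(src)
        elif tag == "source" and self._media_depth > 0 and src:
            # <source> only counts when nested in a media element
            self._add(src)

    def handle_endtag(self, tag):
        if tag in self.MEDIA_TAGS and self._media_depth > 0:
            self._media_depth -= 1

    def _add(self, src):
        # Resolve relative links against the page URL and skip duplicates
        url = urljoin(self.base_url, src)
        if url not in self.media_urls:
            self.media_urls.append(url)


def extract_media_urls(html, base_url):
    """Return the absolute URLs of the media files referenced in an HTML page."""
    parser = MediaLinkParser(base_url)
    parser.feed(html)
    return parser.media_urls
```

Each URL returned by extract_media_urls would then be downloaded and handed to the transcription step.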

Users of the all.site platform can now search for terms found in the media file transcripts of the specified site:

In this case, all.site returns a portion of the media content transcript with the search term highlighted.
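A search of this kind can be expressed with a standard Elasticsearch match query combined with the highlight option. The sketch below only builds the request body; the field name transcript and the function build_transcript_query are assumptions made for illustration, since the article does not describe the actual all.site mapping.

```python
import json


def build_transcript_query(term, field="transcript", fragment_size=150):
    """Build a search body that matches `term` in the transcript field
    and asks Elasticsearch to return one highlighted fragment."""
    return {
        "query": {"match": {field: term}},
        "highlight": {
            "fields": {
                # One short excerpt around the matched term
                field: {"fragment_size": fragment_size, "number_of_fragments": 1}
            }
        },
    }


if __name__ == "__main__":
    # The resulting JSON can be sent to the _search endpoint of an index
    print(json.dumps(build_transcript_query("speech recognition"), indent=2))
```

By default, Elasticsearch wraps the matched terms of each fragment in <em> tags, which a front end can then style as highlighted text.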

Problems encountered, solutions and integration

Media file format:

Vosk only accepts files in the “.wav” format. To handle this constraint, we had to use ffmpeg to convert the other formats.
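A conversion of this kind might look like the sketch below. The exact ffmpeg options used in all.site are not given in the article, so the choices here (mono, 16-bit PCM, 16 kHz by default) are assumptions based on what speech recognition models commonly expect, and the function names are hypothetical.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build the ffmpeg command line that converts `src` to a mono PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input media file (mp3, mp4, ...)
        "-vn",                    # drop any video stream
        "-ac", "1",               # single audio channel (mono)
        "-ar", str(sample_rate),  # output sampling rate in Hz (assumed value)
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src, dst, sample_rate=16000):
    """Run ffmpeg (which must be installed) and raise if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

For example, convert_to_wav("talk.mp4", "talk.wav") would produce a WAV file ready to be sent for transcription.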

Sampling rate:

To recognize speech, Vosk API operates at a sampling rate specified in the code. Since we do not know in advance the sampling rate of the media files to be processed, we had to add these two lines to Vosk’s configuration:

--allow-downsample=true
--allow-upsample=true

so that Vosk can adapt the sampling rate of the received media.
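As a complement to these two options, and not as a description of what all.site does, the sampling rate of a WAV file can also be read from its header before transcription. The sketch below uses Python's standard wave module; the function name read_wav_format is hypothetical.

```python
import wave


def read_wav_format(path):
    """Return (sample_rate, channels, sample_width_bytes) from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getframerate(),   # sampling rate in Hz
            wav_file.getnchannels(),   # 1 = mono, 2 = stereo
            wav_file.getsampwidth(),   # bytes per sample (2 = 16-bit)
        )
```

The rate read this way could then be passed to the recognizer or used to decide whether resampling is needed at all.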

Memory issue:

A memory problem appears when the crawler finds large media files to be processed by Vosk. We solved it by using WebSockets to stream the media instead of sending each file all at once with an HTTP POST. We also configured Vosk API to clear its memory buffer by using the Print Partial Result mode, which allowed us to return the transcript progressively as the audio is processed.
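The idea behind streaming is that only one small chunk of audio is held in memory at a time. The sketch below is transport-agnostic and is not the all.site implementation: the function name iter_stream_messages is hypothetical, and the message shapes (a JSON config message first, binary audio chunks, then a JSON end-of-file message) are an assumption about the conventions of the WebSocket server being used.

```python
import io
import json


def iter_stream_messages(audio, sample_rate, chunk_size=8000):
    """Yield the messages to send over the socket, one bounded chunk at a time.

    `audio` is any binary file-like object; at most `chunk_size` bytes
    of it are read into memory at once.
    """
    # Assumed convention: announce the sampling rate before any audio
    yield json.dumps({"config": {"sample_rate": sample_rate}})
    while True:
        chunk = audio.read(chunk_size)
        if not chunk:
            break
        yield chunk
    # Assumed convention: tell the server that the stream is complete
    yield json.dumps({"eof": 1})


if __name__ == "__main__":
    fake_audio = io.BytesIO(b"\x00" * 20000)  # stands in for a large WAV file
    for message in iter_stream_messages(fake_audio, 16000):
        kind = "text" if isinstance(message, str) else "binary"
        print(kind, len(message))
```

In a real client, each yielded message would be sent over an open WebSocket connection, and the partial transcript returned by the server would be read after each chunk.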

Proper names:

The last issue encountered during the implementation was getting Vosk API to recognize proper names. To add new terms to the Vosk dictionary, you have to train the Kaldi model used by Vosk, or use Phonetisaurus, a grapheme-to-phoneme toolkit built on the OpenFst framework, to generate pronunciations for the new words. The constraint is that this training requires a powerful machine (at least 32 GB of RAM and 100 GB of disk space).

To go further

To learn more about open source technologies around voice, and for an additional demonstration of this integration in all.site, we recommend the presentation given by Aline Paponaud and Lucian Precup at the Open Source Experience conference: De la voix au texte, la puissance de l'écosystème open source ("From voice to text, the power of the Open Source ecosystem"). For a better viewing experience, we advise you to access it via the SIDO-OSXP website.

And finally, if you need help with your Search or Elasticsearch project, especially to add such advanced features to your search engine, feel free to contact us. Our consultants will be delighted to bring you their expertise.
