Abstract |
---|
Existing audio search engines use one of two approaches: matching text-text or audio-audio pairs. In the former, text queries are matched to semantically similar words in an index of audio metadata to retrieve the corresponding audio clips or segments; in the latter, audio signals are used directly to retrieve acoustically similar recordings from an audio database. However, the independent treatment of text and audio has precluded information exchange between the two modalities. This is a problem because similarity in language does not always imply similarity in acoustics, and vice versa. Moreover, independent modeling can be error-prone, especially for ad hoc, user-generated recordings, which are noisy in both the audio and the associated textual labels. To overcome this limitation, we propose a framework that learns joint embeddings in a shared lexico-acoustic space, where vectors from either modality can be mapped together and compared directly. Thus, we improve semantic knowledge and enable the use of either text or audio queries to search and retrieve audio. Our results break new ground for a cross-modal audio search engine and for further exploration of lexico-acoustic spaces. |
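The retrieval step described in the abstract can be illustrated with a minimal sketch: once text and audio are embedded in a shared space (in the paper, via a learned model such as a Siamese network), a query from either modality is compared directly against indexed audio vectors. All embedding values and file names below are hypothetical placeholders, not the paper's data.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-D "lexico-acoustic" space; in practice these vectors would be
# produced by a trained joint-embedding model. Values are made up.
audio_index = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "rain.wav":     [0.0, 0.8, 0.2],
    "siren.wav":    [0.1, 0.0, 0.9],
}

def retrieve(query_vec, index):
    # Rank indexed clips by similarity to a query embedded in the same
    # space; the query may come from either modality (text or audio).
    return max(index, key=lambda name: cosine(query_vec, index[name]))

text_query = [0.85, 0.15, 0.05]  # hypothetical embedding of a text query
print(retrieve(text_query, audio_index))  # -> dog_bark.wav
```

Because both modalities share one space, the same `retrieve` call serves text-to-audio search and audio-to-audio (query-by-example) search without separate indexes.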
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/icassp.2019.8682632 | 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Keywords | Field | DocType |
---|---|---|
Joint Audio-Text Embedding, Cross Modal Retrieval, Audio Search Engine, Content-Based Audio Retrieval, Query by Example, Siamese Neural Network | Audio signal, Mel-frequency cepstrum, Metadata, Search engine, Pattern recognition, Computer science, Information exchange, Audio search engine, Speech recognition, Artificial intelligence, Modal, Semantics | Conference |
ISSN | Citations | PageRank |
---|---|---|
1520-6149 | 0 | 0.34 |
References | Authors |
---|---|
0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Benjamin Elizalde | 1 | 359 | 22.38 |
Shuayb Zarar | 2 | 0 | 2.70 |
Bhiksha Raj | 3 | 2094 | 204.63 |