SPEECH-COCO
General information
Contributor: Laurent Besacier
Other contributors: Laurent Arnaud
Institution: Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG
Description: SPEECH-COCO is an augmentation of the MS-COCO dataset in which speech is added to image and text. Speech captions were generated with text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (>600 h) paired with images. Disfluencies and speed perturbation were added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word, syllable, and phoneme in the spoken caption. Such a corpus can be used for Language and Vision (LaVi) tasks that take speech as input or output instead of text.
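As a minimal sketch of how the per-caption JSON annotations could be consumed, the snippet below loads one timecode file and prints word-level boundaries. The key names ("words", "value", "start", "end") are assumptions made for illustration only; the authoritative schema is documented in the bundled SpeechCoco API README.

    import json

    def load_caption_alignment(json_path):
        # Load the timecode annotation paired with one spoken caption (WAV).
        # NOTE: the keys "words", "value", "start" and "end" are assumed for
        # illustration; check the dataset README for the exact structure.
        with open(json_path, encoding="utf-8") as f:
            meta = json.load(f)
        for word in meta.get("words", []):
            # Hypothetical fields: the token and its start/end time in seconds.
            print(f'{word["value"]:<15} {word["start"]:.2f}-{word["end"]:.2f} s')
        return meta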
Readme file
Archive files
DS80.zip
├ val2014.zip
├ SpeechCoco_API_README.zip
└ train2014.zip
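After extracting a split archive (e.g. train2014.zip), pairing each caption WAV with its JSON annotation might look like the sketch below. It assumes the WAV and JSON files share a basename inside the split directory, which should be verified against the SpeechCoco API README.

    from pathlib import Path

    def pair_wav_json(split_dir):
        # Pair every spoken caption (*.wav) with its timecode file (*.json),
        # assuming both share a basename inside the extracted split directory.
        split_dir = Path(split_dir)
        pairs = []
        for wav in sorted(split_dir.rglob("*.wav")):
            annotation = wav.with_suffix(".json")
            if annotation.exists():
                pairs.append((wav, annotation))
        return pairs

    # e.g. pairs = pair_wav_json("train2014")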
You need to log in to download this dataset.
Details
External identifier:
doi:10.18709/perscido.2017.06.ds80
Subjects:
Computer science,
Linguistics
Keywords:
nlp,
utd,
language and vision,
speech,
unsupervised term discovery
Encoding format:
JSON, WAV, SQLite3
Citation
Havard W. N., Besacier L., Rosec O. (2017). SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set. Grounding Language Understanding (GLU 2017), Stockholm. Data set published 2017 by Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG via Perscido-Grenoble-Alpes, doi:10.18709/PERSCIDO.2017.06.DS80.