SPEECH-COCO

29 Jun 2017 · Open · Image data

General information

Contributor: Laurent Besacier

Other contributors: Laurent Arnaud

Institution: Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG

Description: SPEECH-COCO is an augmentation of the MS-COCO dataset in which speech is added to the images and text. Speech captions were generated with text-to-speech (TTS) synthesis, yielding 616,767 spoken captions (>600 h) paired with images. Disfluencies and speed perturbation were added to the signal to make it sound more natural. Each speech signal (WAV) is paired with a JSON file containing the exact timecode of each word, syllable, and phoneme in the spoken caption. Such a corpus can be used for Language and Vision (LaVi) tasks that take speech as input or output instead of text.
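As a rough illustration of how one caption and its alignment file might be consumed, the Python sketch below loads a WAV/JSON pair and prints word-level timecodes. The file names and the JSON field names used here (words, word, begin, end) are assumptions made for the example only; the actual schema and a dedicated Python API are described in SpeechCoco_API_README.zip.

```python
import json
import wave

# Minimal sketch: read one spoken caption (WAV) and its alignment (JSON).
# NOTE: the JSON keys "words", "word", "begin", "end" are assumed for
# illustration; consult SpeechCoco_API_README.zip for the real schema.

def load_caption(wav_path, json_path):
    """Return the audio duration in seconds and a list of (word, start, end)."""
    with wave.open(wav_path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()

    with open(json_path, encoding="utf-8") as f:
        alignment = json.load(f)

    # Hypothetical structure: a list of word entries with start/end timecodes.
    words = [(w["word"], w["begin"], w["end"]) for w in alignment.get("words", [])]
    return duration, words

if __name__ == "__main__":
    duration, words = load_caption("caption_000001.wav", "caption_000001.json")
    print(f"{duration:.2f} s, {len(words)} aligned words")
    for token, start, end in words[:5]:
        print(f"{start:7.3f} - {end:7.3f}  {token}")
```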

Archive files (227 downloads)

DS80.zip

  •   val2014.zip
  •   SpeechCoco_API_README.zip
  •   train2014.zip

Details

External identifier:
doi:10.18709/perscido.2017.06.ds80

Subjects:
Computer science, Linguistics

Keywords:
nlp, utd, language and vision, speech, unsupervised term discovery

Encoding format:
JSON, WAV, SQLite3

Tasks:

clustering, pattern extraction

Citation

Havard W. N., Besacier L., Rosec O. (2017). SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set. Grounding Language Understanding (GLU 2017), Stockholm.

Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG (2017). SPEECH-COCO [data set]. Published via Perscido-Grenoble-Alpes. doi:10.18709/PERSCIDO.2017.06.DS80
