Visual Writing Prompts Dataset (VWP)


The Visual Writing Prompts (VWP) dataset contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories, which crowd workers wrote given an image sequence and up to 5 grounded characters from that sequence.


Version 2 of VWP is available on Hugging Face Datasets and in the GitHub repository.

We are presenting our paper at the EACL poster session at 9:00 on 03 May 2023.

30 Mar 2023. The first release of our dataset is online.

20 Jan 2023. The pre-MIT Press version of our paper is on arXiv.

We will make the full dataset available in the vwp repo before our conference presentation.



The Visual Writing Prompts (VWP) dataset is designed to support the development and testing of natural language processing models that generate stories from sequences of images. It comprises nearly 2,000 curated sequences of movie shots, each containing between 5 and 10 images. The images are selected so that each sequence depicts a coherent plot centered on one or more main characters, which gives the sequence a clear visual narrative structure for story generation. Aligned with these image sequences are approximately 12,000 stories written by crowd workers on Amazon Mechanical Turk. This setup aims to provide a visually grounded storytelling context that helps models generate more coherent, diverse, and engaging stories.



The dataset is in a CSV file. The explanation of each column is in this table.
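Since the data ships as a CSV file, a minimal sketch for reading it with the Python standard library might look like the following. The file name `vwp.csv` in the usage note is a placeholder, not the actual file name, and the only column name taken from this README is `imdb_id`; check the column table for the rest.

```python
import csv
from collections import Counter
from pathlib import Path


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_rows_per_movie(rows, id_column="imdb_id"):
    """Count how many rows share each value of the movie identifier column.

    Rows with a missing or empty identifier are skipped.
    """
    return Counter(row[id_column] for row in rows if row.get(id_column))
```

For example, `rows = load_rows("vwp.csv")` followed by `count_rows_per_movie(rows).most_common(5)` would list the five movies that contribute the most rows, assuming the placeholder path points at the downloaded file.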


Baseline Models

We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, more visually grounded, and more diverse than stories generated with the current state-of-the-art model.


If you use this dataset in your work, please cite this article in TACL 2023:

Xudong Hong, Asad Sayeed, Khushboo Mehra, Vera Demberg, and Bernt Schiele. 2023. Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences. Transactions of the Association for Computational Linguistics, 11:565–581.


    author = {Hong, Xudong and Sayeed, Asad and Mehra, Khushboo and Demberg, Vera and Schiele, Bernt},
    title = "{Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences}",
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {11},
    pages = {565-581},
    year = {2023},
    month = {06},
    issn = {2307-387X},
    doi = {10.1162/tacl_a_00553},
    url = {https://doi.org/10.1162/tacl\_a\_00553},
    eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00553/2134487/tacl\_a\_00553.pdf},


Xudong Hong, xLASTNAME@coli.uni-saarland.de


All images are extracted from movie shots in the MovieNet dataset. The copyrights of all movie shots belong to the original copyright holders, who are listed on the IMDb page of each movie. The IMDb page is identified by the value in the imdb_id column. For example, for the first row of our data, the imdb_id is tt0112573, so the corresponding IMDb page is https://www.imdb.com/title/tt0112573/companycredits/. Do not violate these copyrights when using the images. We use these images for academic purposes only. Please contact the author if you have any questions.
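The mapping from an imdb_id value to its company credits page can be sketched as a small helper. The URL pattern is taken from the example above; the function name and the validation rule (`tt` followed by digits) are assumptions inferred from that single example, not something the dataset defines.

```python
import re

# Assumption: identifiers look like "tt" followed by digits, as in tt0112573.
IMDB_ID_PATTERN = re.compile(r"tt\d+")


def imdb_credits_url(imdb_id):
    """Build the IMDb company credits URL for a value from the imdb_id column."""
    imdb_id = imdb_id.strip()
    if not IMDB_ID_PATTERN.fullmatch(imdb_id):
        raise ValueError(f"unexpected imdb_id format: {imdb_id!r}")
    return f"https://www.imdb.com/title/{imdb_id}/companycredits/"
```

Calling `imdb_credits_url("tt0112573")` yields the page given in the example above, which is where the copyright holders for that movie's shots are listed.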


For any questions or issues with the dataset, please open an issue on this GitHub page or contact Xudong Hong.