People capture photos and videos to relive and share memories of personal
significance. Recently, media montages (stories) have become a popular mode of
sharing these memories due to their intuitive and powerful storytelling
capabilities. However, creating such montages usually involves many manual
searches, clicks, and selections, which are time-consuming and cumbersome and
degrade the user experience.
To alleviate this, we propose task-oriented dialogs for montage creation as a
novel interactive tool to seamlessly search, compile, and edit montages from a
media collection. To the best of our knowledge, our work is the first to
leverage multi-turn conversations for such a challenging application, extending
the previous literature on simple media retrieval tasks. We collect a new
dataset, C3 (Conversational Content Creation), comprising 10k dialogs
conditioned on media montages simulated from a large media collection.
We take a simulate-and-paraphrase approach to collect these dialogs, which is
both cost- and time-efficient while still drawing from the natural language
distribution.
Our analysis and benchmarking of state-of-the-art language models showcase the
multimodal challenges present in the dataset. Lastly, we present a mobile demo
application that shows the feasibility of the proposed work in real-world
settings. Our code and data will be made publicly available.
Comment: 8 pages, 6 figures, 2 tables
Subject: Computer Science - Computation and Language