A Local-to-Global Approach to
Multi-modal Movie Scene Segmentation

Anyi Rao      Linning Xu      Yu Xiong      Guodong Xu      Qingqiu Huang      Bolei Zhou      Dahua Lin
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020


The scene, as a crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical location. Identifying the composition of scenes is a critical step towards visual understanding of movies. This is very challenging: compared to the videos studied in conventional vision problems, e.g. action recognition, the scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale up the scene segmentation task by building a large video dataset, MovieScenes, which contains 21K annotated scene segments from 150 movies. We further propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie. This framework is able to distill complex semantics from the hierarchical temporal structure of a long movie, thus providing top-down guidance for scene segmentation. Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods. We also found that pretraining on MovieScenes can bring significant improvements to existing video understanding models.

Potential Application

Scene segmentation helps divide a long video into semantically continuous short videos and output a structural representation. It also provides research opportunities towards story/plot understanding in long videos, with the scene serving as the semantic unit.
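To make the idea of a structural representation concrete, here is a minimal sketch of how per-unit boundary decisions could be turned into a list of scenes. The input format (one boolean flag per gap between consecutive units, such as shots or clips) and the function name `units_to_scenes` are assumptions made for illustration; they are not the output format of the proposed framework.

```python
from typing import List, Tuple


def units_to_scenes(boundary_after: List[bool]) -> List[Tuple[int, int]]:
    """Group consecutive video units into scenes.

    boundary_after[i] is True when a scene boundary is assumed to fall
    between unit i and unit i + 1 (hypothetical input format), so a video
    with n units has n - 1 flags. Returns inclusive (first, last) index
    pairs, one per scene.
    """
    n_units = len(boundary_after) + 1
    scenes = []
    start = 0
    for i, is_boundary in enumerate(boundary_after):
        if is_boundary:
            # Close the current scene at unit i and start a new one.
            scenes.append((start, i))
            start = i + 1
    # The last scene always runs to the final unit.
    scenes.append((start, n_units - 1))
    return scenes


if __name__ == "__main__":
    # Six units with boundaries after unit 1 and after unit 4.
    print(units_to_scenes([False, True, False, False, True]))
```

Each returned pair can then be mapped back to timestamps to cut the long video into short, scene-level videos.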

Human Interaction Graph Generation

The first row shows the scene segmentation of American Hustle produced by scene detection, where dark blue and light blue alternate to mark different scenes. The second to fourth rows show the scene occurrence timelines of Christian Bale (C.B.), Bradley Cooper (B.C.), and Amy Adams (A.A.) in this movie, respectively, where dark green means the actor appears in a scene and light green means the actor does not. The graph below represents their interactions over the storyline. Dark red represents a closer relationship and light red a more distant one; two example pictures are shown for the closer relationships.
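One simple way to approximate such an interaction graph from per-character scene timelines is to measure how often two characters appear in the same scenes. The sketch below uses the Jaccard overlap of their occurrence timelines as the edge strength. This measure, the function name `cooccurrence_strength`, and the toy timelines are all assumptions for illustration; the source does not state how the graph in the figure was computed.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence_strength(
    timelines: Dict[str, List[bool]]
) -> Dict[Tuple[str, str], float]:
    """Score each pair of characters by the Jaccard overlap of their scenes.

    timelines[name][k] is True when the character appears in scene k.
    A score near 1 suggests a closer relationship, a score near 0 a more
    distant one (assumed proxy, not the paper's definition).
    """
    strengths = {}
    for a, b in combinations(sorted(timelines), 2):
        pairs = list(zip(timelines[a], timelines[b]))
        both = sum(x and y for x, y in pairs)    # scenes shared by both
        either = sum(x or y for x, y in pairs)   # scenes with at least one
        strengths[(a, b)] = both / either if either else 0.0
    return strengths


if __name__ == "__main__":
    # Toy timelines over four scenes; these are made-up values.
    toy = {
        "A": [True, False, False, False],
        "B": [False, True, False, True],
        "C": [True, True, False, True],
    }
    for pair, score in cooccurrence_strength(toy).items():
        print(pair, round(score, 3))
```

Jaccard overlap is used here instead of a raw co-occurrence count so that characters with many scenes do not dominate the graph.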

Cross Movie Scene Retrieval

Given a specific scene in one movie, we retrieve similar scenes from other movies. For example, we choose a conversation scene and a party scene from Ted (2012) as queries and retrieve similar scenes from other movies.
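Retrieval of this kind is commonly implemented as nearest-neighbour search over scene-level feature vectors. The sketch below ranks candidate scenes by cosine similarity to a query. It assumes each scene has already been summarised as a fixed-length feature vector; the similarity measure, the function names, and the scene identifiers are hypothetical and are not taken from the source.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(
    query: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k gallery scenes most similar to the query scene."""
    scored = [(scene_id, cosine(query, feat)) for scene_id, feat in gallery.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Made-up 2-D features; real scene features would be much longer.
    gallery = {
        "movie1_scene3": [1.0, 0.0],
        "movie2_scene7": [0.0, 1.0],
        "movie3_scene1": [1.0, 1.0],
    }
    for scene_id, score in retrieve([1.0, 0.1], gallery, top_k=2):
        print(scene_id, round(score, 3))
```

For a large gallery, the exhaustive scan would typically be replaced by an approximate nearest-neighbour index, but the ranking criterion stays the same.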