Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

NeurIPS 2022

Elad Ben-Avraham1   Roei Herzig1,3   Karttikeya Mangalam2  
Amir Bar1   Anna Rohrbach2   Leonid Karlinsky4   Trevor Darrell2   Amir Globerson1,5  

1Tel-Aviv University   2UC Berkeley   3IBM Research   4MIT-IBM Lab   5Google Research

[SViT Framework Animated]

Our approach SViT brings structured scene representations from still images into video. We use the HAnd-Object Graph (HAOG) annotations of still images that are only available during training, and videos, which may come from a different domain, with their downstream task annotations. We design a shared video & image transformer that can handle both video and image inputs. During training, given an image, the patch and object tokens are processed together, and the object tokens learn to predict the HAOG, whereas given a video input, the transformer predicts the video label (e.g. Washing) based on the patch tokens. During inference, the learned object prompts, which have captured the structured scene information from images, are used to predict the video label.


Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage these for a video downstream task? We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual frames in video should "align" with those of still images. This is achieved via a Frame-Clip Consistency loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets.


Semantic understanding of videos is a key challenge for machine vision and artificial intelligence. It is intuitive that video models should benefit from incorporating scene structure, e.g., the objects that appear in a video, their attributes, and the way they interact. In fact, several works have shown that such "object-centric" models perform well on tasks such as action recognition. However, most of these models require ground-truth structured annotations of video during training. This is clearly very costly, time-consuming, and not scalable.
This raises the question: could we still benefit from scene structure in a less costly way? A possible direction is to explore methods that would only require sparsely annotated frames within a downstream video domain. Moreover, when it comes to structured annotations of static images, there are numerous datasets with annotations such as boxes, visual relationships, attributes, and segmentations. Since we cannot always expect to have annotated images that perfectly align with our downstream video task, how can these resources be used to build better video models? In this work, we propose such an approach, which is particularly well suited for transformer architectures and offers an effective mechanism to leverage image structure to improve video transformers.

SViT Approach

Our approach SViT learns a structured shared representation that arises from the interaction between the two modalities of images and videos. We consider the setting where the main goal is to learn video understanding downstream tasks while leveraging structured image data. In this paper, we focus on the following video tasks: action recognition and object state classification & localization. In training, we assume that we have access to task-specific video labels and structured scene annotations for images, namely Hand-Object Graphs. Based on the structured representations obtained by using the object tokens and regularized by the Frame-Clip Consistency loss, inference is performed only for the downstream video task, without explicitly using the structured image annotations.
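One way to read the training setup above is as a weighted sum of three terms: the video task loss, the HAOG prediction loss on still images, and the Frame-Clip Consistency loss. The expression below is only a sketch of that reading; the symbols and the weights \lambda_{\text{HAOG}} and \lambda_{\text{con}} are assumed hyperparameters introduced here for illustration and are not stated on this page.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{video}}
  + \lambda_{\text{HAOG}} \, \mathcal{L}_{\text{HAOG}}
  + \lambda_{\text{con}} \, \mathcal{L}_{\text{con}}
```

At inference time only the video prediction is needed, so the image annotations behind \mathcal{L}_{\text{HAOG}} are not required.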

As mentioned above, our motivation is to bring scene structure from still images into video. To achieve this, one component of this work is the construction of a semantic representation of the interactions between the hands and objects in the scene. Specifically, we propose to use a graph structure that we call the Hand-Object Graph (HAOG). The nodes of the HAOG represent two hands and two objects with their locations, whereas the edges correspond to physical properties such as contact/no contact.
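To make the HAOG description concrete, here is a minimal sketch of how such a graph could be stored. The class names, the `(x1, y1, x2, y2)` box format, and the choice to attach contact edges to hand-object pairs are assumptions made for illustration, not the authors' actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Node:
    """A hand or object node; `box` is None when the entity is absent."""
    kind: str  # "hand" or "object"
    box: Optional[Box] = None


@dataclass
class HandObjectGraph:
    """Hypothetical container for a Hand-Object Graph (HAOG)."""
    nodes: List[Node]
    # Maps (hand_index, object_index) -> True for contact, False for no contact.
    edges: Dict[Tuple[int, int], bool] = field(default_factory=dict)

    def add_contact_edge(self, hand_idx: int, object_idx: int, in_contact: bool) -> None:
        # Assumption: each edge links one hand node to one object node.
        if self.nodes[hand_idx].kind != "hand" or self.nodes[object_idx].kind != "object":
            raise ValueError("an edge must connect a hand node to an object node")
        self.edges[(hand_idx, object_idx)] = in_contact

    def contacts(self) -> List[Tuple[int, int]]:
        """Return the (hand, object) index pairs that are in contact."""
        return [pair for pair, touching in sorted(self.edges.items()) if touching]


def example_graph() -> HandObjectGraph:
    """Two hands and two objects; only the first hand touches an object."""
    nodes = [
        Node("hand", (10.0, 20.0, 60.0, 80.0)),
        Node("hand", None),  # second hand not visible
        Node("object", (50.0, 40.0, 120.0, 110.0)),
        Node("object", None),  # second object not visible
    ]
    graph = HandObjectGraph(nodes)
    graph.add_contact_edge(0, 2, True)
    graph.add_contact_edge(1, 3, False)
    return graph
```

Calling `example_graph().contacts()` lists the hand-object pairs labeled as being in contact, which is the kind of target the object tokens would be trained to predict for a still image.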

Visualization of Hand-Object Graph (HAOG).


Our shared video & image transformer model processes two different types of tokens: standard patch tokens from the images and videos (blue) and the object prompts (purple), which are transformed into object tokens (purple) in the output. During training, the object tokens (purple) are trained to predict the HAOG for still images. For video frames that have no HAOG annotation, we use our "Frame-Clip" loss to ensure consistency between the "frame object tokens" (resulting from processing the frames separately) and the "clip object tokens" (resulting from processing the frame as part of the video). Finally, the video downstream task prediction results from applying a video downstream task head on the average of the patch tokens in the transformer output (after they have interacted with the clip object tokens (purple)).
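The Frame-Clip consistency idea can be sketched as a distance between the two sets of object tokens. The example below assumes a mean absolute (L1) difference and uses nested Python lists shaped `[frames][object tokens][dims]`; both the distance and the function name `frame_clip_consistency` are illustrative assumptions, not a statement of the exact loss used in SViT.

```python
from typing import List

# Assumed layout: tokens[t][k][j] is dimension j of object token k in frame t.
Tokens = List[List[List[float]]]


def frame_clip_consistency(frame_tokens: Tokens, clip_tokens: Tokens) -> float:
    """Mean absolute difference between frame and clip object tokens.

    `frame_tokens` come from processing each frame on its own;
    `clip_tokens` come from processing the same frames as one video clip.
    """
    if len(frame_tokens) != len(clip_tokens):
        raise ValueError("both inputs must cover the same number of frames")

    total = 0.0
    count = 0
    for frame_objs, clip_objs in zip(frame_tokens, clip_tokens):
        if len(frame_objs) != len(clip_objs):
            raise ValueError("each frame must have the same number of object tokens")
        for frame_tok, clip_tok in zip(frame_objs, clip_objs):
            if len(frame_tok) != len(clip_tok):
                raise ValueError("token dimensions must match")
            for f, c in zip(frame_tok, clip_tok):
                total += abs(f - c)
                count += 1

    if count == 0:
        raise ValueError("no token values to compare")
    return total / count
```

A value of zero means the object tokens of a frame are identical whether the frame is seen alone or inside the clip; minimizing this quantity is one way to push structure learned from still images into the video representation.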


Hand-Object Graphs detected by SViT.


Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Leonid Karlinsky, Anna Rohrbach, Trevor Darrell, Amir Globerson
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Hosted on arXiv

Related Works

Our work builds on and borrows code from MViT. If you find our work helpful, consider citing MViT as well.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD, including DARPA's XAI and LwLL programs, as well as BAIR's industrial alliance programs.