Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

ECCV 2024 Workshop @ Milano, Italy, Sep 30th Monday


Street view videos generated with MagicDrive.


Track 2: Corner Case Scene Generation

This track focuses on improving the capabilities of diffusion models to create multi-view street-scene videos that are consistent with 3D geometric scene descriptors, including Bird's Eye View (BEV) maps and 3D LiDAR bounding boxes. Building on MagicDrive for controllable 3D video generation, this track aims to advance scene generation for autonomous driving, ensuring better consistency, higher resolution, and longer duration.


Challenge Progress

Leaderboard


Task and Baseline

Task: In this challenge, participants should train a controllable multi-view street-view video generation model, as in MagicDrive. The generated videos should accurately reflect the control signals from the BEV road map, 3D bounding boxes, and text descriptions of weather and time-of-day. We will evaluate the generated videos on generation quality and controllability (detailed below).

Baseline: MagicDrive. Please check out the video branch for video generation. To help participants get started, we have provided a 16-frame version with config and pre-trained weights. Enjoy!


Data Description

We use the nuScenes dataset as the testbench for this challenge. nuScenes is a commonly used autonomous driving dataset that provides 6 camera views at 12 Hz, covering 360 degrees of the scene. Following the official train/val split, teams should use the 750 training scenes for training and the 150 validation scenes for evaluation and submission only.


Detailed Rules and Guidelines

Data Preparation

To start, participants should download the original data from the nuScenes official site. Please note that you need to download the camera sweeps (12 Hz); keyframes (2 Hz) alone are not enough. The provided codebase is based on mmdet3d, which re-organizes the metadata for loading. We provide the following data to facilitate participants:

Name | Description | OneDrive
data/nuscenes_mmdet3d_2 | mmdet3d metadata for nuScenes (keyframes only; for perception models only) | link
data/nuscenes/interp_12Hz_trainval and data/nuscenes_mmdet3d-12Hz/*interp*.pkl | interpolated 12 Hz annotations and mmdet3d metadata (for both training and testing) | link
data/nuscenes/advanced_12Hz_trainval and data/nuscenes_mmdet3d-12Hz/*advanced*.pkl | interpolated 12 Hz annotations with the advanced method from ASAP, and mmdet3d metadata (optional for training) | link
nuscenes_interp_12Hz_infos_track2_eval.pkl | sampled from "interp", for 1st-round evaluation and submission | link
nuscenes_interp_12Hz_infos_track2_eval_long.pkl | sampled from "interp", for 2nd-round evaluation and submission | link

Since the official annotations are only at 2 Hz, we provide the interpolated annotations from ASAP in the table above. After downloading the files above, you should also organize them as described in Datasets.
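As a minimal sketch (not part of the official toolkit), the snippet below sanity-checks that the downloaded files sit where the table above expects them; the repository root and the location of the eval pkl files are assumptions you may need to adjust.

```python
from pathlib import Path

# Expected locations, taken from the data table above. The root and the
# placement of the eval pkl files are assumptions for illustration only.
ROOT = Path(".")
expected = [
    "data/nuscenes/interp_12Hz_trainval",               # interpolated 12 Hz annotations
    "data/nuscenes_mmdet3d-12Hz",                       # mmdet3d metadata (*interp*.pkl / *advanced*.pkl)
    "data/nuscenes_mmdet3d_2",                          # keyframe-only metadata (perception models)
    "nuscenes_interp_12Hz_infos_track2_eval.pkl",       # 1st-round evaluation scenes
    "nuscenes_interp_12Hz_infos_track2_eval_long.pkl",  # 2nd-round evaluation scenes
]

for rel in expected:
    path = ROOT / rel
    status = "ok" if path.exists() else "MISSING"
    print(f"[{status:7s}] {path}")
```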

If you find any resources unavailable, or need other resources to run our baseline, please do not hesitate to let us know!

Training

Please set up your Python environment as described in Environment Setup and prepare the pretrained weights from Pretrained Weights. Then start your training/testing as described in Train MagicDrive-t and Video Generation.

To run our baseline, please download our pretrained model from OneDrive.

Note: Our baseline uses the config from rawbox_mv2.0t_0.4.3.

Participate

Register. Before submission, please register your team information here. Each team can only have one registration, and each registration is valid for one single track. Note that your Team Name will be used as the reference for evaluation later.

Submission. For submission, we require each team to:

  1. generate four 16-frame videos for each of the 150 scenes from the eval-standard set (600 videos).
  2. generate three videos of any length for each of the 3 scenes from the eval-long set (9 videos).
For simplicity, you can use the metadata in nuscenes_interp_12Hz_infos_track2_eval.pkl to generate the 16-frame videos for submission. If your model is capable of longer video generation, please use nuscenes_interp_12Hz_infos_track2_eval_long.pkl and add clips of the first 16 frames to your round-1 submission (see the sketch below). In other words, for the 9 long videos, one of the three for each scene should contain the exact content of the corresponding video among the 600 (see Evaluation for more details). Videos should be generated starting from the first frame of each scene.
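If you generate long videos and need matching 16-frame clips for round 1, a rough sketch of the clipping step is shown below using OpenCV. The file names are hypothetical, and the provided compression/renaming scripts should still be used for the actual submission.

```python
import cv2

def clip_first_frames(src_path: str, dst_path: str, num_frames: int = 16) -> None:
    """Copy the first `num_frames` frames of a generated video into a new mp4."""
    reader = cv2.VideoCapture(src_path)
    fps = reader.get(cv2.CAP_PROP_FPS) or 12.0  # generated videos are 12 fps
    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(
        dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height)
    )
    for _ in range(num_frames):
        ok, frame = reader.read()
        if not ok:
            break
        writer.write(frame)
    reader.release()
    writer.release()

# Hypothetical file names, for illustration only.
clip_first_frames("scene-0001_long.mp4", "scene-0001_16frames.mp4")
```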

Videos (mp4 files) should be compressed with the provided scripts and renamed accordingly. Please check the readme for more details. We also provide a submission sample to help participants better understand the submission format.
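As a rough illustration of the packaging step (the provided scripts, readme, and submission sample remain the authoritative reference), the sketch below zips a folder of compressed mp4 files into the required results.zip; the source folder name and the flat internal layout are assumptions.

```python
import zipfile
from pathlib import Path

# The archive must be named exactly "results.zip".
# "submission_videos" is a hypothetical folder; the internal file naming
# must follow the provided readme and submission sample.
videos_dir = Path("submission_videos")
with zipfile.ZipFile("results.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for mp4 in sorted(videos_dir.glob("*.mp4")):
        zf.write(mp4, arcname=mp4.name)
```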

Please upload your submission via the Upload link on OneDrive. Enter your Track and Team Name and click Upload (Our server gathers results every 5 minutes. Please refrain from uploading too frequently).

Please strictly follow the naming format below when submitting, otherwise your submission will be considered invalid!
  • File name: must be results.zip, instead of answer.zip, sample_answer.zip, etc.
  • First name: must be Track2, instead of Track 2, Track_2, etc.
  • Last name: must be your full Team name in the registration form, which will be used as the key to match your registration information. You must not separate your Team name into different parts!
  • Any violation will cause an error when processing the submission file, and the submission will thus be considered invalid.

Evaluation

There will be 2 rounds for evaluation:

  • 1st round (Jun. 15-Aug. 15): score evaluation for 16-frame videos
  • 2nd round (Aug. 15): human evaluation for long videos

For the 1st round: we evaluate the generated videos on generation quality and controllability with three metrics:

  • FVD: commonly used metric for video generation quality evaluation.
  • mAP: from 3D object detection with BEVFormer, as defined by nuScenes.
  • mIoU: from BEV segmentation with BEVFormer, as defined in their paper.
The final score is calculated as
$$\text{score}_{1}= \frac{A-\text{FVD}}{A} + \frac{\text{mAP}-B}{B} + \frac{\text{mIoU}-C}{C}\text{, where } A=218.1200,\ B=11.8617,\ C=18.3429.$$
For more details about how each metric is calculated, please check the provided evaluation code.
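For reference, the round-1 score can be reproduced directly from the formula above; the metric values passed in below are placeholders, not real results.

```python
# Reference constants from the score formula above (FVD / mAP / mIoU baselines).
A, B, C = 218.1200, 11.8617, 18.3429

def round1_score(fvd: float, map_3d: float, miou: float) -> float:
    """score_1 = (A - FVD)/A + (mAP - B)/B + (mIoU - C)/C."""
    return (A - fvd) / A + (map_3d - B) / B + (miou - C) / C

# Placeholder example: metrics exactly at the reference values give a score of 0.
print(round1_score(fvd=218.1200, map_3d=11.8617, miou=18.3429))  # -> 0.0
```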

For the 2nd round: each team is required to submit 9 videos (12 fps) from 3 scenes, each with 3 different conditions (sunny day, rainy day, and night). Our human evaluators will rank each video by quality and controllability. In general, we prefer longer videos with higher resolution.

We only consider the top-10 participants from the 1st round and use the submitted long videos for the 2nd round.

Note that we require results generated from one single model for both rounds. To ensure this, the "sunny day" video in the 2nd round should contain the exact content submitted for the 1st round (i.e., the 16-frame videos can be clips of the longer videos).


General Rules

  • To ensure fairness, the top-3 winners are required to submit a technical report.
  • Each entry is required to be associated with a team and its affiliation (all members of one team must register as one).
  • Using multiple accounts to increase the number of submissions is strictly prohibited.
  • Results should follow the correct format and must be uploaded to OneDrive (refer to the Submission section); otherwise, they will not be considered valid submissions. Detailed information about how results will be evaluated is provided in the Evaluation section.
  • The best entry of each team is shown publicly on the leaderboard, which we will update once a week.
  • The organizer reserves the absolute right to disqualify entries that are incomplete, illegible, late, or violating the rules.


Awards

Participants with the most successful and innovative entries will be invited to present at this workshop and receive awards. A 1,000 USD cash prize will be awarded to the top team, while the 2nd- and 3rd-place teams will receive 800 USD and 600 USD, respectively.


Contact

To contact the organizers, please use w-coda2024@googlegroups.com.