Task 1: Pedestrian & Vehicle Detection

Introduction

Deep-learning-based computer vision algorithms have surpassed human-level performance on many CV tasks, such as object recognition and face verification. Object detection is a fundamental task for human-centric visual analysis. The extremely high resolution of PANDA makes it possible to detect objects from a long distance. However, the large variations in scale, posture, and occlusion severely degrade detection performance.

This task is designed to push forward the state of the art in object detection on giga-pixel images. Teams are required to predict bounding boxes for pedestrians and vehicles, each with a real-valued confidence.

Challenge participants are required to detect two types of targets: pedestrians and vehicles. For each pedestrian, three bounding boxes should be submitted: a visible body bbox, a full body bbox, and a head bbox. For each vehicle, a visible part bbox needs to be submitted. Some special regions (e.g., fake persons, extremely crowded regions, heavily occluded persons) are ignored in evaluation.

The challenge is based on the PANDA-Image dataset, which contains 555 static giga-pixel images (390 for training, 165 for testing) captured by a giga-pixel camera at different places and heights. We manually annotated the bounding boxes of the different categories of objects in each image. Specifically, each person is annotated with three boxes: a visible body box, a full body box, and a head box. All data and annotations for the training set are publicly available. Please see the Download page for more details about the annotations.

Results Format

The format of the result file is the same as that of the COCO Challenge. Participants are required to submit the results as a single det_results.json file (saved via gason in MATLAB or json.dump in Python). This .json file should contain a list in which each element is a dictionary describing one result box, in the following format (see the sketch after the category table below):

[{ "image_id": int, "category_id": int, "bbox": [bbox_left, bbox_top, bbox_width, bbox_height], "score": float }]

The meaning of each value is listed as follows:

Key Description
image_id The serial number of the image, which shall be consistent with the annotation file
category_id The category of the detected box, which shall be consistent with the table below
bbox_left The x coordinate of the top-left corner of the predicted bounding box
bbox_top The y coordinate of the top-left corner of the predicted bounding box
bbox_width The width, in pixels, of the predicted bounding box
bbox_height The height, in pixels, of the predicted bounding box
score The confidence that the predicted bounding box encloses an object instance

Object category_id
person visible body 1
person full body 2
person head 3
vehicle visible part 4
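
For reference, here is a minimal Python sketch of writing a valid det_results.json. The image IDs and box values are made up for illustration; real coordinates are in pixels of the full-resolution image.

import json

# Hypothetical detections: each entry pairs an image_id with one predicted box.
# category_id follows the table above (1: person visible body, 2: person full
# body, 3: person head, 4: vehicle visible part).
results = [
    {
        "image_id": 1,
        "category_id": 1,                         # person visible body
        "bbox": [10524.0, 7631.0, 152.0, 410.0],  # [left, top, width, height]
        "score": 0.93,
    },
    {
        "image_id": 1,
        "category_id": 4,                         # vehicle visible part
        "bbox": [2210.0, 11080.0, 880.0, 460.0],
        "score": 0.87,
    },
]

with open("det_results.json", "w") as f:
    json.dump(results, f)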

Evaluation Metrics

We require each evaluated algorithm to output a list of detected bounding boxes with confidence scores for each test image in the predefined format (see Results Format above). Similar to the evaluation protocol of the COCO Challenge [1], we use the AP, AP_IoU=0.50, AP_IoU=0.75, AR_max=10, AR_max=100, and AR_max=500 metrics to evaluate detection algorithms. Unless otherwise specified, AP and AR are averaged over multiple intersection-over-union (IoU) values; specifically, we use the ten IoU thresholds [0.50:0.05:0.95]. All metrics are computed allowing at most 500 top-scoring detections per image (across all categories). These criteria penalize both missed objects and duplicate detections (two detection results for the same object instance). AP is the primary metric for ranking the algorithms. The metrics are described in the following table.

The above metrics are calculated over the object categories of interest. For a comprehensive evaluation, we also report the performance on each object category. Some special regions (e.g., fake persons, extremely crowded regions, heavily occluded persons) are ignored in evaluation. Please see the Download page for more details about the annotations. The evaluation code for object detection in images is available in the PANDA-Toolkit; a rough illustration of how these metrics are computed follows the table below.

Measure Perfect score Description
AP 100% The average precision over all 10 IoU thresholds (i.e., [0.50:0.05:0.95]) over all object categories
AP_IoU=0.50 100% The average precision over all object categories when the IoU overlap with ground truth is larger than 0.50
AP_IoU=0.75 100% The average precision over all object categories when the IoU overlap with ground truth is larger than 0.75
AR_max=10 100% The maximum recall given 10 detections per image
AR_max=100 100% The maximum recall given 100 detections per image
AR_max=500 100% The maximum recall given 500 detections per image
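
The official evaluation code ships with the PANDA-Toolkit. As a rough sketch of how these COCO-style metrics can be computed with pycocotools, assuming placeholder file names (the toolkit's own script may differ):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: use the annotation file from the Download page
# and your own det_results.json.
coco_gt = COCO("panda_test_annotations.json")   # ground truth
coco_dt = coco_gt.loadRes("det_results.json")   # your detections

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
# PANDA allows up to 500 top-scoring detections per image,
# so AR is reported at 10, 100, and 500 detections.
coco_eval.params.maxDets = [10, 100, 500]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # prints AP, AP_IoU=0.50, AP_IoU=0.75, AR_max=10/100/500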

Baseline Results

Table: Performance of detection methods on PANDA. FR, CR, and RN denote Faster R-CNN, Cascade R-CNN, and RetinaNet, respectively. Sub denotes the subset of targets of a given size, where Small, Middle, and Large indicate objects smaller than 32 × 32, between 32 × 32 and 96 × 96, and larger than 96 × 96, respectively.

Data and Annotations

For PANDA-Image, all data and annotations for the training set are available on the Download page.

Tools and Instructions

We provide extensive toolkit support for PANDA, with APIs for data visualization, image splitting and merging, and result evaluation. Please visit our GitHub repository page. For additional questions, please consult the FAQ or contact us.
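
The split/merge APIs are documented in the repository itself. As a generic illustration of why splitting and merging are needed (this is not the toolkit's actual API), the sketch below tiles a giga-pixel image into overlapping patches, runs a hypothetical per-patch detector, shifts boxes back to global coordinates, and merges overlaps with plain NMS:

import numpy as np

def tile_offsets(width, height, tile=2048, overlap=256):
    """Yield top-left corners of overlapping tiles covering the image."""
    step = tile - overlap
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            yield x, y

def nms(boxes, scores, iou_thr=0.5):
    """Plain NMS; boxes are [left, top, width, height] in global pixels."""
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    areas = boxes[:, 2] * boxes[:, 3]
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]
    return keep

def detect_gigapixel(image, detector, tile=2048, overlap=256):
    """detector(patch) -> (boxes[N,4] tile-local, scores[N]); both hypothetical."""
    all_boxes, all_scores = [], []
    h, w = image.shape[:2]
    for x, y in tile_offsets(w, h, tile, overlap):
        patch = image[y:y + tile, x:x + tile]
        boxes, scores = detector(patch)
        boxes = np.asarray(boxes, dtype=float)
        if boxes.size:
            boxes[:, 0] += x          # shift back to global coordinates
            boxes[:, 1] += y
            all_boxes.append(boxes)
            all_scores.append(np.asarray(scores, dtype=float))
    if not all_boxes:
        return np.empty((0, 4)), np.empty(0)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]

The overlap between tiles ensures that objects straddling a tile boundary are fully contained in at least one patch; the final NMS then removes the duplicates this overlap creates.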
