Introduction

The Human Skeletal System Keypoints Detection competition track requires precise localization of human body keypoints under challenging and varied conditions. Such keypoint detection is widely used in movement classification, abnormal behavior detection, and autonomous driving. The track involves simultaneously detecting human figures and localizing their skeletal keypoints. For evaluation, results are scored by the similarity between predicted keypoints and ground truth keypoints, and this score is combined with the contestants' final presentations.

Annotations

The dataset is split into training, validation, test A, and test B subsets. Each salient human figure in the dataset is labeled with 14 human skeletal keypoints, whose numeric order, shown in Table 1, is: 1-right shoulder, 2-right elbow, 3-right wrist, 4-left shoulder, 5-left elbow, 6-left wrist, 7-right hip, 8-right knee, 9-right ankle, 10-left hip, 11-left knee, 12-left ankle, 13-top of the head and 14-neck. Each keypoint has one of three visibility flags: labeled and visible, labeled but not visible, or not labeled.

Table 1: The 14 human skeletal keypoints and their id numbers

 1  right shoulder      8  right knee
 2  right elbow         9  right ankle
 3  right wrist        10  left hip
 4  left shoulder      11  left knee
 5  left elbow         12  left ankle
 6  left wrist         13  top of the head
 7  right hip          14  neck

A visualization of an annotated figure is shown in Figure 1, where red dots indicate visible keypoints and gray dots indicate obscured keypoints. The numbers next to the dots are the id numbers of the corresponding keypoints:

Figure 1: An annotation example of human skeletal keypoints

The annotations are stored in JSON format, with one JSON file per dataset split. For the training and validation splits, the JSON file stores the human bounding boxes and the skeletal keypoint positions for all images in the split. A sample is shown as follows:

    
[
  {
    "image_id": "a0f6bdc065a602b7b84a67fb8d14ce403d902e0d",
    "human_annotations": {
      "human1": [178, 250, 290, 522],
      "human2": [293, 274, 352, 473],
      "human3": [315, 236, 389, 495],
      ...
    },
    "keypoint_annotations": {
      "human1": [261, 294, 1, 281, 328, 1, 259, 314, 2,
                 213, 295, 1, 208, 346, 1, 192, 335, 1,
                 245, 375, 1, 255, 432, 1, 244, 494, 1,
                 221, 379, 1, 219, 442, 1, 226, 491, 1,
                 226, 256, 1, 231, 284, 1],
      "human2": [313, 301, 1, 305, 337, 1, 321, 345, 1,
                 331, 316, 2, 331, 335, 2, 344, 343, 2,
                 313, 359, 1, 320, 409, 1, 311, 454, 1,
                 327, 356, 2, 330, 409, 1, 324, 446, 1,
                 337, 284, 1, 327, 302, 1],
      "human3": [373, 304, 1, 346, 286, 1, 332, 263, 1,
                 363, 308, 2, 342, 327, 2, 345, 313, 1,
                 370, 385, 2, 368, 423, 2, 370, 466, 2,
                 363, 386, 1, 361, 424, 1, 361, 475, 1,
                 365, 273, 1, 369, 297, 1],
      ...
    }
  },
  ...
]



"image_id": the filename of the given image.

"human_annotations": contains the human bounding boxes. The first two parameters give the coordinates of the top-left corner of the bounding box, and the last two give the coordinates of the lower-right corner.

"keypoint_annotations": contains the positions of the skeletal keypoints for each human figure. Each array follows the format $$[x_1,y_1,v_1,x_2,y_2,v_2,\cdots,x_{14},y_{14},v_{14}]$$, in which $$(x_i,y_i)$$ is the location of keypoint $$i$$ and $$v_i$$ is its visibility flag ($$v_i=1$$: labeled and visible, $$v_i=2$$: labeled but not visible, $$v_i=3$$: not labeled). The order of the keypoints, also shown in Table 1, is: 1-right shoulder, 2-right elbow, 3-right wrist, 4-left shoulder, 5-left elbow, 6-left wrist, 7-right hip, 8-right knee, 9-right ankle, 10-left hip, 11-left knee, 12-left ankle, 13-top of the head and 14-neck.
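For concreteness, the annotation format above can be parsed with a few lines of Python. The sample entry below is taken from the listing above; the field names match the annotation files:

```python
import json

# A minimal annotation entry, copied from the sample above.
sample = json.loads("""
[{"image_id": "a0f6bdc065a602b7b84a67fb8d14ce403d902e0d",
  "human_annotations": {"human1": [178, 250, 290, 522]},
  "keypoint_annotations": {"human1": [261, 294, 1, 281, 328, 1, 259, 314, 2,
                                      213, 295, 1, 208, 346, 1, 192, 335, 1,
                                      245, 375, 1, 255, 432, 1, 244, 494, 1,
                                      221, 379, 1, 219, 442, 1, 226, 491, 1,
                                      226, 256, 1, 231, 284, 1]}}]
""")

def parse_keypoints(flat):
    """Group a flat 42-element keypoint array into 14 (x, y, v) triples."""
    return [tuple(flat[i:i + 3]) for i in range(0, 42, 3)]

entry = sample[0]
keypoints = parse_keypoints(entry["keypoint_annotations"]["human1"])
right_shoulder = keypoints[0]  # keypoint 1: (261, 294, 1), labeled and visible
right_wrist = keypoints[2]     # keypoint 3: (259, 314, 2), labeled but not visible
```

The same loop applies unchanged to the full training and validation JSON files once loaded with `json.load`.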

Submission

Participants are expected to predict the precise locations of human skeletal keypoints. The results should be saved as a JSON file in the following format:

    
[
  {
    "image_id": "a0f6bdc065a602b7b84a67fb8d14ce403d902e0d",
    "keypoint_annotations": {
      "human1": [261, 294, 1, 281, 328, 1, 0, 0, 0,
                 213, 295, 1, 208, 346, 1, 192, 335, 1,
                 245, 375, 1, 255, 432, 1, 244, 494, 1,
                 221, 379, 1, 219, 442, 1, 226, 491, 1,
                 226, 256, 1, 231, 284, 1],
      "human2": [313, 301, 1, 305, 337, 1, 321, 345, 1,
                 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 313, 359, 1, 320, 409, 1, 311, 454, 1,
                 0, 0, 0, 330, 409, 1, 324, 446, 1,
                 337, 284, 1, 327, 302, 1],
      "human3": [373, 304, 1, 346, 286, 1, 332, 263, 1,
                 0, 0, 0, 0, 0, 0, 345, 313, 1,
                 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 363, 386, 1, 361, 424, 1, 361, 475, 1,
                 365, 273, 1, 369, 297, 1],
      ...
    }
  },
  ...
]



For each human figure, the "keypoint_annotations" field contains an array of 42 integers. The keypoint coordinates must follow the same order as in the annotation files, and undetected keypoints should be marked as (0, 0, 0).
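A submission entry can be assembled as follows. This is a minimal sketch: the helper name `make_entry` and the convention of passing `None` for an undetected keypoint are illustrative, not part of the official format.

```python
import json

UNDETECTED = [0, 0, 0]  # placeholder triple for an undetected keypoint

def make_entry(image_id, humans):
    """Build one submission entry.

    humans: dict mapping a human id (e.g. "human1") to a list of 14
    predicted (x, y) points, with None for keypoints that were not detected.
    """
    keypoint_annotations = {}
    for human, points in humans.items():
        flat = []
        for pt in points:
            # Detected keypoints get visibility flag 1; undetected get (0, 0, 0).
            flat.extend(UNDETECTED if pt is None else [pt[0], pt[1], 1])
        assert len(flat) == 42  # 14 keypoints x (x, y, v)
        keypoint_annotations[human] = flat
    return {"image_id": image_id, "keypoint_annotations": keypoint_annotations}

# One figure with only the right shoulder detected, as a toy example.
entry = make_entry("a0f6bdc065a602b7b84a67fb8d14ce403d902e0d",
                   {"human1": [(261, 294)] + [None] * 13})
submission = json.dumps([entry])  # the file to upload is a JSON list of entries
```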

Evaluation

The evaluation metric for human skeletal keypoint detection is similar to that used in common object detection tasks: submissions are scored by mean Average Precision (mAP). In common object detection, Intersection over Union (IoU) is used to evaluate the similarity between a predicted bounding box and a ground truth bounding box. In the human skeletal keypoints detection task, we use Object Keypoint Similarity (OKS) instead of IoU to measure the similarity between the predicted keypoints and the ground truth keypoints [1].

The mAP score is the mean value of the Average Precision (AP) score under different OKS thresholds, and is calculated as follows:

$$mAP=mean\left\{AP@\left(0.50:0.05:0.95\right)\right\}$$

where $$s$$ denotes the OKS threshold; the AP scores are averaged over thresholds from 0.50 to 0.95 in steps of 0.05.

The AP (Average Precision) score is calculated in the same way as in common object detection, but with OKS in place of IoU as the similarity metric. Given an OKS threshold s, the AP under s (AP@s) over the entire test set is calculated as follows:

$$AP@s=\frac{\sum_{p}\delta\left(OKS_p>s\right)}{\sum_{p}1}$$
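The two formulas above translate directly into code. The sketch below assumes each ground truth figure p has already been matched to a prediction and scored with OKS; the example scores are hypothetical:

```python
def ap_at(oks_scores, s):
    """AP@s: the fraction of figures whose OKS exceeds the threshold s."""
    return sum(1 for oks in oks_scores if oks > s) / len(oks_scores)

def map_score(oks_scores):
    """mAP: the mean of AP@s over thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [t / 100 for t in range(50, 100, 5)]
    return sum(ap_at(oks_scores, s) for s in thresholds) / len(thresholds)

# Four matched figures with hypothetical per-figure OKS scores.
scores = [0.92, 0.61, 0.33, 0.85]
result = map_score(scores)  # 19 of the 40 (figure, threshold) pairs pass -> 0.475
```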

The OKS score plays the same role as the IoU score in common object detection: it measures the similarity between the prediction and the ground truth. OKS is based on the weighted Euclidean distances between the predicted keypoints and the ground truth keypoints, and for each human figure p, the OKS score is defined as follows:

$$OKS_p=\frac{\sum_{i}exp\left\{-d_{pi}^2/2s_p^2\sigma_{i}^2\right\}\delta\left(v_{pi}=1\right)}{\sum_{i}\delta\left(v_{pi}=1\right)}$$

where $$p$$ is the index of the human annotation; $$i$$ is the id number of the skeletal keypoint; $$d_{pi}$$ is the Euclidean distance between the predicted keypoint position and the ground truth; $$s_p$$ is the scale factor of human figure $$p$$, defined as the square root of the area of its bounding box; $$\sigma_{i}$$ is the normalization factor for keypoint $$i$$, computed as the standard deviation of human annotation results; $$v_{pi}$$ is the visibility flag of the $$i$$-th keypoint of human figure $$p$$; and $$\delta\left(\cdot\right)$$ is the Kronecker function, meaning that only visible keypoints ($$v=1$$) are considered during evaluation.
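The OKS formula can be transcribed directly into code. A minimal sketch, in which the per-keypoint sigma values are placeholders (the real normalization factors accompany the official evaluation script):

```python
import math

def oks(pred_xy, gt, box, sigmas):
    """Compute OKS_p for one human figure.

    pred_xy: 14 predicted (x, y) pairs.
    gt:      14 ground-truth (x, y, v) triples.
    box:     bounding box [x1, y1, x2, y2]; s_p is the square root of its area.
    sigmas:  14 per-keypoint normalization factors (placeholder values here).
    """
    s_p = math.sqrt((box[2] - box[0]) * (box[3] - box[1]))
    num, den = 0.0, 0
    for (px, py), (gx, gy, v), sigma in zip(pred_xy, gt, sigmas):
        if v != 1:  # only visible keypoints (v = 1) count, per the delta term
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2 * s_p ** 2 * sigma ** 2))
        den += 1
    return num / den if den else 0.0

# A perfect prediction on synthetic data scores exactly 1.0.
gt = [(20 * i, 30 * i, 1) for i in range(14)]
pred = [(x, y) for x, y, v in gt]
score = oks(pred, gt, box=[0, 0, 100, 200], sigmas=[0.1] * 14)
```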

An evaluation script will be provided to facilitate offline evaluation. The script will be released soon alongside the validation set.

[1] Lin, Tsung-Yi; Maire, Michael; Belongie, Serge; Hays, James; Perona, Pietro; Ramanan, Deva; Dollar, Piotr; Zitnick, C. Lawrence, "Microsoft COCO: Common objects in context", European Conference on Computer Vision, pp. 740-755, Springer, 2014.
Training Set (14.8G)
    sha1sum: 6a9f0fb8b5562ffcbce8731b4142421a836c4603
Validation Set (2.2G)
    sha1sum: 16cd0b5ac4d38e664806c80028f8d6670f918f15
Test A (1.9G)
    sha1sum: 7e81d0b9b885b7ba27acb65eb27c34c707fdab69
Test B (2.0G)
    sha1sum: