Evaluation of the performance

Polyp localization

A polyp localization provided by a given method is considered a true positive (TP) if it falls inside the ground truth region. If multiple detections fall inside the same ground truth region, they are counted as a single true positive. Each detection that falls outside the ground truth region is counted as one false positive (FP). A false negative (FN) is a polyp that has not been detected by a given method, that is, no detection has fallen inside the corresponding ground truth region. This is illustrated in Fig. 1, where a colonoscopy frame, the corresponding binary mask (ground truth), and some detections are depicted. According to the above definitions, this frame has one true positive, two false positives, and zero false negatives.

Figure 1: Example of evaluation of a given polyp localization method. Positive localizations (TP) are marked as green crosses whereas erroneous localizations (FP) are marked as red crosses. According to our definition, we have only one TP and two FP for this particular frame.
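
The counting rules for a single frame can be sketched in a few lines of Python. This is only an illustration, not the challenge's official evaluation code: the function name, the (row, column) point format for detections, and the restriction to one polyp per frame are all assumptions made for the example.

```python
import numpy as np

def count_localization(mask, detections):
    """Count (TP, FP, FN) for one frame.

    Assumptions (not from the challenge text): the frame holds at most one
    polyp, `mask` is a 2D array that is nonzero inside the ground truth
    region, and `detections` are in-bounds (row, col) pixel coordinates.
    """
    inside = [bool(mask[r, c]) for r, c in detections]
    tp = 1 if any(inside) else 0                  # several hits count as one TP
    fp = sum(1 for hit in inside if not hit)      # every miss is one FP
    fn = 1 if np.any(mask) and tp == 0 else 0     # polyp present but never hit
    return tp, fp, fn

# Toy frame mirroring the situation described for Fig. 1.
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 2:5] = 1
detections = [(3, 3), (4, 2), (8, 8), (0, 9)]
print(count_localization(mask, detections))  # (1, 2, 0)
```

Two detections fall inside the mask and are merged into one TP, while the two outside it each count as an FP. Frames with several polyps would need one region per polyp (for example via connected components), which this sketch does not attempt.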

Polyp detection

In this challenge, we treat polyp detection as a further stage of polyp localization. A proposed method should locate the polyp whenever one is present in the image but, in order to help clinicians in polyp detection tasks, it should give no output when there is no polyp in the image. We therefore expect the methods submitted by participants to determine polyp presence by means of the same appearance model used for polyp localization. For instance, if a method uses a confidence value to localize the polyp, and no confidence value in a given image reaches a given threshold, we can interpret that there is no polyp in the image.

Figure 2: Example of the output of a polyp detection method.

We can observe in Figure 2 that the example method performs well for polyp localization (only one green cross in Figure 2 (a)) but poorly for polyp detection, because it reports a potential polyp localization (red cross) in an image that contains no polyp.
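
The thresholding idea described above might look like the following sketch. The candidate format (row, column, confidence), the function name, and the threshold value of 0.5 are hypothetical; each participating method would define its own confidence measure and operating point.

```python
def detect_polyp(candidates, threshold):
    """Return the most confident candidate, or None if the frame is judged polyp-free.

    `candidates` is a list of (row, col, confidence) tuples; this format is
    an assumption made for the example.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c[2])
    # No candidate reaches the threshold: give no output for this frame.
    return best if best[2] >= threshold else None

print(detect_polyp([(40, 52, 0.91), (10, 7, 0.30)], threshold=0.5))  # (40, 52, 0.91)
print(detect_polyp([(12, 80, 0.22)], threshold=0.5))                 # None
```

A method that always returned its best candidate, regardless of confidence, would behave like the example in Figure 2: correct on the frame with a polyp, but producing a false alarm on the frame without one.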

Performance metrics

Polyp localization metrics (databases: CVC-ClinicDB and ASTRE-Frame DB)

Localization score: Loc-score, measured as Loc-score = TP/(TP+FN).
Localization precision: Loc-prec, measured as Loc-prec = TP/(TP+FP).

In this sense, a good localization method should provide both high Loc-score and high Loc-prec values, so that very few polyps are missed while the number of false alarms (FP) stays low.
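
As a small worked example, both localization metrics can be computed from counts accumulated over a database. The counts below are invented for illustration, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than a rule stated by the challenge.

```python
def localization_metrics(tp, fp, fn):
    """Return (Loc-score, Loc-prec) from accumulated TP, FP and FN counts."""
    # Loc-score = TP / (TP + FN): fraction of polyps that were localized.
    loc_score = tp / (tp + fn) if (tp + fn) else 0.0
    # Loc-prec = TP / (TP + FP): fraction of localizations that were correct.
    loc_prec = tp / (tp + fp) if (tp + fp) else 0.0
    return loc_score, loc_prec

# Hypothetical totals: 90 polyps found, 30 false alarms, 10 polyps missed.
print(localization_metrics(tp=90, fp=30, fn=10))  # (0.9, 0.75)
```

Here 10 of 100 polyps are missed (Loc-score 0.9), and one in four localizations is a false alarm (Loc-prec 0.75).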

Polyp detection metrics (database: Asu-Mayo Testing Video Database)

Overall F-score: We rank the polyp detection systems according to the overall F-score, defined as 2 * (precision * recall) / (precision + recall), where precision = TP/(TP+FP) and recall (sensitivity) = TP/(TP+FN). In this case, we treat all the test videos as one single dataset, meaning that we compute one precision and one recall over all the test videos.
Average F-score: The ranking according to the overall F-score may be dominated by the results for videos that have a very large number of frames with polyps. To overcome this problem, we rank the methods according to the F-score computed for each individual video. In this way, we obtain as many rankings as there are test videos (e.g., 10 rankings if 10 test videos are given). The final placing of the methods is based on the average ranking over all the videos.
FROC analysis: For each polyp detection method, we vary a threshold on the confidence of the detection results and compute the FROC curve, where the horizontal axis is the number of false positives per frame and the vertical axis is the sensitivity to polyps (recall). We then sort the methods according to their sensitivity at a low false positive rate.
Detection latency: In colonoscopy, it is imperative to detect a polyp as soon as it appears in the video. The longer a polyp stays in the colonoscopic view, the more likely colonoscopists are to detect it on their own. To evaluate the promptness of the polyp detection systems, we use a new performance curve. As with FROC, we vary a threshold on the detection results and, at each operating point, compute the median detection latency over all videos as well as the number of false positives generated per frame. The detection latency is defined as the time from the first appearance of a polyp in the colonoscopy video to the time of its first detection by a given system. For instance, if a polyp appears in frame 10 and a given method first detects it in frame 50, then the polyp detection latency for this video is (50 - 10) / frame_rate.
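
The overall F-score, the average ranking, and the detection latency defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' scoring code: the function names, the (TP, FP, FN) tuple layout, the tie-breaking by method name, the zero-denominator handling, and the 25 frames-per-second rate in the example are all hypothetical.

```python
def f_score(tp, fp, fn):
    """F-score = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # assumption: score 0 when nothing was detected correctly
    return 2 * precision * recall / (precision + recall)

def overall_f_score(per_video_counts):
    """Pool (TP, FP, FN) over all videos, then compute one F-score."""
    tp = sum(c[0] for c in per_video_counts)
    fp = sum(c[1] for c in per_video_counts)
    fn = sum(c[2] for c in per_video_counts)
    return f_score(tp, fp, fn)

def average_rank(scores_by_method):
    """Average per-video rank (1 = best) for {method: [F-score per video]}.

    Ties are broken by method name, which is an assumption of this sketch.
    """
    methods = sorted(scores_by_method)
    n_videos = len(scores_by_method[methods[0]])
    totals = {m: 0 for m in methods}
    for v in range(n_videos):
        # sorted() is stable, so equal scores keep alphabetical order.
        ordered = sorted(methods, key=lambda m: -scores_by_method[m][v])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_videos for m in methods}

def detection_latency(first_appearance, first_detection, frame_rate):
    """Latency in seconds between a polyp's appearance and its first detection."""
    return (first_detection - first_appearance) / frame_rate

# One long video with many polyp frames and one short video.
counts = [(80, 20, 20), (5, 5, 15)]
print(round(overall_f_score(counts), 3))   # 0.739
print(average_rank({"A": [0.9, 0.4, 0.7], "B": [0.8, 0.6, 0.5]}))
print(detection_latency(10, 50, frame_rate=25))  # 1.6
```

In the pooled example, the long video dominates the overall F-score, which is the effect the per-video ranking is meant to counter. In the ranking example, method A is first on two of three videos and averages rank 4/3, while B averages 5/3.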