Surprisingly, focal loss is seldom used for multi-label classification, and cross-entropy is often the default choice.
Since multi-label classification also exhibits high negative-positive imbalance, focal loss might provide better results: it encourages focusing on relevant hard-negative samples, which mostly come from images that do not contain the positive class but do contain other confusing categories.
Nevertheless, for multi-label classification, treating positive and negative samples symmetrically, as focal loss does, is sub-optimal: it accumulates more loss gradients from negative samples and down-weights the important contributions of the rare positive samples. In other words, the network may focus on learning features from negative samples while under-emphasizing learning features from positive samples.
In this paper, we introduce an asymmetric loss (ASL) for multi-label classification, which explicitly addresses the negative-positive imbalance. ASL is based on two key properties:
First, to focus on hard negatives while maintaining the contribution of positive samples, we decouple the modulations of the positive and negative samples and assign them different exponential decay factors.
Second, we propose to shift the probabilities of negative samples so that very easy negatives are completely discarded (hard thresholding). By formulating the loss derivatives, we demonstrate that probability shifting also makes it possible to discard very hard negative samples, suspected of being mislabeled, which are common in multi-label problems [10].
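The two properties above can be sketched in a few lines. The following is a minimal, illustrative implementation, not the paper's official code: `gamma_pos`, `gamma_neg`, and `margin` are hypothetical parameter names, and the default values are only plausible choices for an asymmetric loss of this form.

```python
import math

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sketch of an asymmetric loss (ASL) for multi-label classification.

    Property 1: positive and negative terms get decoupled focusing
    exponents (gamma_pos vs. gamma_neg).
    Property 2: negative probabilities are shifted down by a margin and
    clipped at zero, so very easy negatives contribute no loss at all.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # per-label sigmoid probability
        if y == 1:
            # positive term: focal modulation with its own exponent
            total -= (1 - p) ** gamma_pos * math.log(p + eps)
        else:
            # negative term: probability shifting (hard thresholding)
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(1 - p_m + eps)
    return total
```

With these settings, a negative label predicted below the margin (e.g. a logit of -10) yields exactly zero loss, while positive labels keep their full cross-entropy contribution because `gamma_pos = 0`.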
The positive and negative classes are decoupled and assigned different exponential decay factors.
This is because, in multi-label classification, learning is hard when a sample does not contain the positive class but does contain very difficult, confusing objects.
In that case, the authors argue, the loss from these samples accumulates too heavily, so the gradient ends up dominated by too many hard negative samples instead of the network learning from hard positive samples.
The approach above is therefore intended to compensate for this.