Introduction
The success of deep networks can be attributed in large part to their extreme modularity and compositionality, coupled with the power of automatic optimization routines such as stochastic gradient descent [73]. While a number of innovative components have been proposed recently (such as attention layers [66], neural ODEs [16], and graph modules [58]), the vast majority of deep networks are designed as sequential stacks of (differentiable) layers, trained by propagating the gradient from the final layer inwards.
Even though the optimization of very large stacks of layers can nowadays be greatly improved with modern techniques such as residual connections [72], their implementation still carries a number of potential drawbacks. Firstly, very deep networks are hard to parallelize because of the gradient locking problem [49] and the purely sequential nature of their information flow. Secondly, in the inference phase, these networks are difficult to deploy in resource-constrained or distributed scenarios [32, 53]. Thirdly, overfitting and vanishing gradient phenomena can still occur even with strong regularization, owing to the capacity for raw memorization intrinsic to these architectures [27].
Overfitting and model selection are generally treated as properties of a given network applied to a full dataset. However, a large number of recent contributions, e.g., [3, 7, 36, 45, 68], have shown that, even on very complex datasets such as ImageNet, the majority of patterns can be classified correctly by resorting to smaller architectures. For example, [38] shows that the features extracted by a single convolutional layer already achieve a top-5 accuracy exceeding 30% on ImageNet. Even more notably, [30, 70] show that predictions that would be correct with smaller architectures can become incorrect with progressively deeper ones, a phenomenon the authors call over-thinking.
In this paper, we explore a way to tackle all these problems simultaneously, by endowing a given deep network with multiple early exits (auxiliary classifiers) departing from separate points in the architecture (see Fig. 1). Adding one or two auxiliary classifiers is a common technique to simplify gradient propagation in deep networks, such as in the original Inception architecture [62]. However, it has recently been recognized that these multi-exit networks have a number of benefits, including the possibility of devising layered training strategies, improving the efficiency of the inference...
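To make the idea of a multi-exit architecture concrete, the following is a minimal PyTorch-style sketch, not the architecture used in the paper: a backbone split into stages, with one auxiliary classifier attached after each stage. All layer sizes and names (e.g., MultiExitNet) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiExitNet(nn.Module):
    """Illustrative backbone split into stages, each followed by an early exit."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Hypothetical three-stage convolutional backbone (sizes are arbitrary).
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)),
        ])
        # One auxiliary classifier per stage; earlier exits operate on coarser features.
        self.exits = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes)),
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)),
            nn.Sequential(nn.Flatten(), nn.Linear(128, num_classes)),
        ])

    def forward(self, x):
        # Collect the prediction of every exit: training can combine their losses,
        # while inference can stop at the first sufficiently confident exit.
        outputs = []
        for stage, exit_head in zip(self.stages, self.exits):
            x = stage(x)
            outputs.append(exit_head(x))
        return outputs
```

Under this sketch, a layered training strategy would attach a loss to each element of the returned list, while fast inference would evaluate the stages in order and return the first exit whose confidence (e.g., maximum softmax probability) exceeds a chosen threshold; the specific training and inference schemes are those discussed later in the paper.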