Deep Sparse Rectifier Neural Networks
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 15, 315–323.
- The rectifier function $\max(0,x)$ is both a useful model for neuron activation in neuroscience and an efficient activation function in neural networks.
- Neuron activation functions can be antisymmetric about zero (range $-1$ to $1$, e.g. tanh), asymmetric (range $0$ to $1$, e.g. sigmoid), or one-sided (zero below a threshold, e.g. the rectifier); biological firing rates are one-sided.
- Biological neurons encode information sparsely: only 1–4% are active at any one time, a trade-off between richness of representation and energy expenditure. Neural nets do not have this property unless an $\ell_1$-style penalty is applied.
- Biological neurons use very different activations from standard NNs (compared numerically in the sketch after this list):
    - their firing rate is zero below a threshold current, then increases roughly sub-linearly;
    - artificial neurons typically use sigmoid or tanh activations; tanh is antisymmetric about zero and sigmoid is its shifted, rescaled counterpart, neither of which resembles the one-sided biological response;
    - tanh is usually preferred to sigmoid because its steady state (its output at zero input) is zero.
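As a quick illustration (my own NumPy sketch, not from the paper), the three activations can be evaluated at a few points; this makes the one-sided shape of the rectifier and the zero steady state of tanh (versus sigmoid's 0.5) concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # range (0, 1); value 0.5 at x = 0

def rectifier(x):
    return np.maximum(0.0, x)              # range [0, inf); exactly 0 for x <= 0

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("x        ", xs)
print("sigmoid  ", np.round(sigmoid(xs), 3))
print("tanh     ", np.round(np.tanh(xs), 3))   # range (-1, 1); antisymmetric, 0 at x = 0
print("rectifier", rectifier(xs))              # one-sided, like a biological firing-rate curve
```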
Consequences of sparsity
- Information disentangling – in a dense representation almost any change to the input perturbs every entry; a sparse representation keeps the set of non-zero features roughly stable, making it robust to small input changes.
- Efficient variable-size representation – the number of active neurons can vary from input to input, so the effective dimensionality of the representation adapts to the information content of each input (see the sketch after this list).
- More likely to be linearly separable
- Excess sparsity may reduce predictive capability.
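The variable-size aspect can be seen with a toy rectifier layer (a sketch under assumed random weights and inputs, not the paper's setup): many entries are exactly zero, and the count of active units differs from input to input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, batch = 20, 100, 5

W = rng.normal(scale=0.1, size=(n_in, n_hidden))   # zero-mean random weights
X = rng.normal(size=(batch, n_in))                 # a handful of random inputs

H = np.maximum(0.0, X @ W)                         # rectifier layer: max(0, x . W)

print("active units per input:", (H > 0).sum(axis=1))    # varies across inputs
print("fraction of exact zeros:", 1.0 - (H > 0).mean())  # roughly 0.5 at initialisation
```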
Rectifier neurons
- Because real neurons rarely reach their saturation (where increases in current no longer affect firing rate), they can be well approximated with the rectifier function.
- A rectifier produces exact sparsity automatically: with zero-mean random initialisation, roughly 50% of hidden-unit activations are exactly zero at initialisation.
- Much easier to compute than tanh or sigmoid.
- If one is worried about the non-differentiability at zero, the softplus function $f(x) = \ln(1+e^x)$, a smooth approximation of the rectifier, can be used. However, experiments suggest that the hard zeroes actually help performance.
- Because the rectifier is one-sided, rectifier nets need roughly twice as many hidden units to represent antisymmetric behaviour in the data: two rectifier units with opposite-sign pre-activations can be paired to mimic one antisymmetric unit (see the sketch after this list).
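A short sketch (mine, with made-up inputs) contrasting the hard zeros of the rectifier with softplus, and showing how two one-sided units with opposite-sign pre-activations combine into an antisymmetric (odd) response:

```python
import numpy as np

def rectifier(z):
    return np.maximum(0.0, z)

def softplus(z):
    return np.log1p(np.exp(z))             # ln(1 + e^z), smooth approximation of the rectifier

z = np.linspace(-2.0, 2.0, 9)
print("rectifier:", rectifier(z))               # exact zeros for z <= 0
print("softplus :", np.round(softplus(z), 3))   # strictly positive, never exactly zero

# Pairing two rectifiers with opposite-sign pre-activations gives an odd
# (antisymmetric) function of z, at the cost of doubling the unit count.
print("paired   :", rectifier(z) - rectifier(-z))  # equals z for every input
```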
Experimental results
- For image classification, increasing sparsity does not hurt performance until around 85% of the activations are zero (the sparsity level can be pushed up with an $\ell_1$ penalty on the activations; a sketch of such a penalty follows this list).
- Rectifiers outperform softplus activations
- Unsupervised pre-training with autoencoders brings little or no improvement for rectifier nets when plenty of labelled data is available, which makes them easier to use.
- Rectifiers work very well on both supervised and semi-supervised problems, but in the semi-supervised setting (little labelled data) unsupervised pre-training is still beneficial.
- Very strong performance on text sentiment analysis: lower RMSE than tanh at 50% sparsity.
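For reference, a minimal sketch of an $\ell_1$ activation penalty of the kind mentioned above (the coefficient and layer sizes are made up; the task loss is a placeholder): the penalty is added to the training objective, and its gradient pushes rectified activations towards exact zero.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(20, 100))
X = rng.normal(size=(32, 20))

H = np.maximum(0.0, X @ W)                   # rectified hidden activations for a mini-batch

l1_coef = 1e-3                               # hypothetical penalty strength
l1_penalty = l1_coef * H.sum(axis=1).mean()  # sum of |h| per example; |h| = h since h >= 0

task_loss = 0.0                              # placeholder, e.g. cross-entropy on labels
total_loss = task_loss + l1_penalty          # minimising this encourages sparser H

print("sparsity:", 1.0 - (H > 0).mean(), " L1 penalty:", l1_penalty)
```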