## Deep Sparse Rectifier Neural Networks

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. AISTATS ’11: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 15, 315–323. https://doi.org/10.1.1.208.6449

• The rectifier function $\max(0,x)$ is both a useful model for neuron activation in neuroscience and an efficient activation function in neural networks.
• An activation function's response to the opposite of an excitatory input can be antisymmetric ($1 \rightarrow -1$), symmetric ($1 \rightarrow 1$), or one-sided ($1 \rightarrow 0$). The rectifier is one-sided.
• Neurons encode information in a sparse manner: only an estimated 1–4% are active at once. This is a tradeoff between expressiveness and low energy use. Neural nets trained without extra regularisation, such as an $\ell_1$ penalty, do not have this property.
• Biological neurons use very different activations from those in NNs:
  • Their firing rate is zero for input currents below a threshold, then increases sub-linearly.
  • Artificial neurons typically use sigmoid or tanh activations. tanh is antisymmetric about zero, a property not present in biology.
  • tanh is preferred to sigmoid because its steady state (the output at zero input) is zero, whereas the sigmoid's is $1/2$.
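
The symmetry and steady-state properties listed above can be checked numerically. The following is a minimal sketch using NumPy; the function names `rectifier` and `sigmoid` and the sample inputs are illustrative choices, not taken from the paper.

```python
import numpy as np

def rectifier(x):
    """One-sided rectifier: max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sample inputs (an assumption for this sketch).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Steady state: the output of each activation when the input is zero.
steady_state = {
    "rectifier": float(rectifier(0.0)),
    "sigmoid": float(sigmoid(0.0)),
    "tanh": float(np.tanh(0.0)),
}

# tanh is antisymmetric about zero: f(-x) == -f(x).
tanh_is_antisymmetric = bool(np.allclose(np.tanh(-x), -np.tanh(x)))

# The rectifier is one-sided: every non-positive input maps to zero.
rectifier_is_one_sided = bool(np.all(rectifier(-np.abs(x)) == 0.0))

print(steady_state)
print(tanh_is_antisymmetric, rectifier_is_one_sided)
```

Only the sigmoid has a non-zero steady state, which is the reason given above for preferring tanh.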

### Consequences of sparsity

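• Information disentangling – a sparse representation is robust to small input changes, because the set of non-zero features stays roughly the same.
• Efficient variable-size representation – the number of active neurons can vary with the input, so the effective size of the representation adapts to how much information each input carries.
• Linear separability – sparse representations are more likely to be linearly separable.
• Excess sparsity may reduce predictive capability, since it lowers the effective capacity of the model.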
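
To make these quantities concrete, the sketch below measures sparsity as the fraction of exactly-zero activations, counts the active units per input, and computes an $\ell_1$ penalty on the activations. The helper names `sparsity` and `l1_penalty`, the `strength` parameter, and the small activation matrix `h` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sparsity(h):
    """Fraction of activations that are exactly zero."""
    return float(np.mean(h == 0.0))

def l1_penalty(h, strength):
    """L1 penalty on activations: strength * sum(|h|).

    `strength` is a hypothetical regularisation coefficient.
    """
    return strength * float(np.sum(np.abs(h)))

# Hypothetical hidden activations: 2 inputs (rows) x 4 hidden units (columns).
h = np.array([[0.0, 1.5, 0.0, 0.0],
              [0.2, 0.0, 0.0, 3.0]])

# Variable-size representation: each input uses a different number of units.
active_per_input = np.count_nonzero(h, axis=1)

print(sparsity(h))              # 5 of the 8 entries are zero
print(active_per_input)         # number of active units for each input
print(l1_penalty(h, 0.01))      # term that would be added to a training loss
```

Adding such a penalty to the training loss pushes more activations to zero; a larger `strength` raises sparsity, with the risk noted above of hurting predictive capability.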

### Rectifier neurons

• Because real neurons rarely reach their saturation (where increases in current no longer affect firing rate), they can be well approximated with the rectifier function.
• A rectifier automatically produces sparsity: after random initialisation of the weights, about 50% of hidden activations are exactly zero.
• The rectifier is much cheaper to compute than tanh or sigmoid, since it needs no exponentials.
• If one is worried about differentiability, the softplus function $f(x) = \ln(1+e^x)$ can be used. However, experiments suggest that the hard zeroes are good for NN performance.
• Because the rectifier is one-sided, rectifier nets need more hidden units (up to twice as many) to represent symmetric or antisymmetric behaviour in the data.
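
The points above can be illustrated with a small numerical sketch. It assumes a single hypothetical layer with symmetric uniform weights, zero biases, and Gaussian inputs; the layer sizes, the scaling, and the random seed are arbitrary choices for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectifier(x):
    """Rectifier: max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Softplus ln(1 + e^x), computed stably as logaddexp(0, x)."""
    return np.logaddexp(0.0, x)

# Hypothetical layer: 1000 inputs -> 500 hidden units, zero biases.
n_in, n_hidden = 1000, 500
W = rng.uniform(-1.0, 1.0, size=(n_in, n_hidden)) / np.sqrt(n_in)
x = rng.normal(size=(256, n_in))
pre = x @ W  # pre-activations, symmetric about zero

# Roughly half of the rectifier outputs are exact zeros at initialisation.
rect_zero_fraction = float(np.mean(rectifier(pre) == 0.0))

# Softplus is smooth and strictly positive, so it yields no hard zeros here.
softplus_zero_fraction = float(np.mean(softplus(pre) == 0.0))

# Antisymmetry costs two units: x = max(0, x) - max(0, -x).
pair_reconstructs_identity = bool(
    np.allclose(rectifier(pre) - rectifier(-pre), pre)
)

print(rect_zero_fraction, softplus_zero_fraction, pair_reconstructs_identity)
```

The last check shows why antisymmetric behaviour is more expensive for rectifiers: a pair of units with opposite-sign weights is needed to reproduce a response that a single antisymmetric unit would give directly.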

### Experimental results

• For image classification, sparsity does not hurt performance until about 85% of activations are zero.
• Rectifiers outperform softplus activations.
• Unsupervised pretraining with autoencoders brings no improvement on purely supervised tasks, which makes rectifier nets easier to use.
• Rectifiers work very well on both supervised and semi-supervised problems, but in the semi-supervised case pretraining is still needed.
• Very strong performance on text sentiment analysis: lower RMSE than tanh at 50% sparsity.