Appendix B — Universal Approximation Theorem
The universal approximation theorem (Hornik, 1991) says that ``a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units''. The approximation is understood over a compact domain, and the result applies to sigmoid, tanh and many other hidden-layer activation functions.
The aim of this note is to give an indication of why this is the case. To keep things simple, we consider 1D functions $x \mapsto f(x)$ and show that the theorem holds for binary neurons. Binary neurons are units whose activation is a simple threshold, i.e. $[{\bf x}^{\top}{\bf w} \geq 0]$, where $[\cdot]$ denotes the Iverson bracket (equal to 1 when the condition holds and 0 otherwise). In our case, with a 1D input:
\begin{equation} x \mapsto [ xw_1 + w_0 \geq 0] = \begin{cases} 1 & \text{if } xw_1 + w_0 \geq 0 \\ 0 & \text{otherwise} \end{cases} \end{equation}
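As a concrete illustration, such a unit can be written in a couple of lines of NumPy (a minimal sketch; the helper name binary_unit is ours):
\begin{verbatim}
import numpy as np

def binary_unit(x, w1, w0):
    # Binary neuron [x*w1 + w0 >= 0]: returns 1.0 where the condition
    # holds and 0.0 otherwise (elementwise for array inputs).
    return np.where(np.asarray(x) * w1 + w0 >= 0, 1.0, 0.0)
\end{verbatim}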
We use binary neurons here, but the idea extends to multiple inputs and to other activation functions (e.g. ReLU or sigmoid).
The argument starts with the observation that any continuous function $f$ can be approximated by a piecewise-constant discretisation as follows:
\begin{equation} f(x) \approx \tilde{f}(x) = \sum_{i=0}^n f(x_i) [ x_i \leq x < x_{i+1}]\,, \end{equation}
where $x \mapsto [ x_i \leq x < x_{i+1}]$ is a pulse function that equals 1 on the interval $[x_i, x_{i+1}[$ and 0 outside it, and $x_0 < x_1 < \dots < x_{n+1}$ are the discretisation points.
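As a sanity check, this discretisation is easy to compute numerically (a minimal NumPy sketch; the helper name discretise and the sine example are ours):
\begin{verbatim}
import numpy as np

def discretise(f, grid, x):
    # Piecewise-constant approximation: sum_i f(x_i) [x_i <= x < x_{i+1}],
    # where grid holds the points x_0 < x_1 < ... < x_{n+1}.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for x_i, x_next in zip(grid[:-1], grid[1:]):
        pulse = (x >= x_i) & (x < x_next)   # [x_i <= x < x_{i+1}]
        out += f(x_i) * pulse
    return out

# Example: approximate sin on [0, 2*pi) with 20 intervals.
grid = np.linspace(0.0, 2.0 * np.pi, 21)
xs = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
approx = discretise(np.sin, grid, xs)
\end{verbatim}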
The trick is that this pulse can be modelled as the difference of two binary neuron units (see Figure B.2):
\begin{equation} [ x_i \leq x < x_{i+1}] = [ x - x_{i}\geq 0] - [ x - x_{i+1} \geq 0] \end{equation}
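To see why this identity holds, one can check the three possible positions of $x$ explicitly:
\begin{equation}
[ x - x_{i}\geq 0] - [ x - x_{i+1} \geq 0] =
\begin{cases}
0 - 0 = 0 & \text{if } x < x_i\,,\\
1 - 0 = 1 & \text{if } x_i \leq x < x_{i+1}\,,\\
1 - 1 = 0 & \text{if } x \geq x_{i+1}\,.
\end{cases}
\end{equation}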
We can substitute this expression for all the pulses:
\begin{equation} \begin{aligned} f(x) &\approx \sum_{i=0}^n f(x_i) \left( [ x \geq x_{i}] - [ x \geq x_{i+1}] \right)\\ &\approx f(x_0) + \sum_{i=1}^n \left(f(x_i) - f(x_{i-1})\right) [ x \geq x_{i}]\,, \end{aligned} \end{equation}
where the second line collects, for each threshold $x_i$, the two terms that involve $[ x \geq x_{i}]$, and uses the fact that $[ x \geq x_{0}] = 1$ and $[ x \geq x_{n+1}] = 0$ for $x$ inside the discretised interval. The result is a single hidden layer of $n$ binary units $[ x \geq x_{i}]$ with output weights $f(x_i) - f(x_{i-1})$ and output bias $f(x_0)$, i.e. precisely a single-hidden-layer network with a linear output unit.
The network corresponding to the discretisation of Figure B.1 is illustrated in Figure B.3 below.
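For readers who want a numerical check of the whole construction, here is a minimal NumPy sketch of this single-hidden-layer network, using the step form of the last equation (the helper name binary_net and the sine test function are ours):
\begin{verbatim}
import numpy as np

def binary_net(f, grid, x):
    # Single hidden layer of binary units [x >= x_i] (i = 1..n),
    # output weights f(x_i) - f(x_{i-1}), output bias f(x_0).
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, f(grid[0]))                # bias f(x_0)
    for x_prev, x_i in zip(grid[:-1], grid[1:]):
        step = np.where(x >= x_i, 1.0, 0.0)          # hidden unit [x >= x_i]
        out += (f(x_i) - f(x_prev)) * step           # output weight
    return out

# The maximum error decreases as the number of hidden units grows.
xs = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
for n in (10, 100, 1000):
    grid = np.linspace(0.0, 2.0 * np.pi, n + 1)
    err = np.max(np.abs(binary_net(np.sin, grid, xs) - np.sin(xs)))
    print(f"n = {n:4d}   max |error| = {err:.4f}")
\end{verbatim}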