Implementing neural networks in C# - Part 4

In this post, we explore how we can naturally derive neural networks from logistic regression and enable them to automatically handle complex data configurations.

What are neural networks?

We will discuss the rationale behind neural networks for multiclass classification (with $K$ classes), but the following can be readily adapted to regression.

It's crucial to recall that the logistic regression algorithm relies on linear combinations of the features $x_{1}$, ..., $x_{D}$ of the input variable $x$, where $D$ is the number of features. It endeavors to model the probability of class $C_{1}$ according to the following formula. $$p(C_{1}|x)=\sigma (\displaystyle\sum_{i=1}^D w_{i}x_{i})$$
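As a quick reminder in code, here is a minimal sketch of this prediction in C#. The class and method names are purely illustrative and not part of the series' code base.

```csharp
using System;

// Illustrative sketch: p(C1|x) as the sigmoid of a linear combination of the features.
public static class LogisticRegressionSketch
{
    // σ(a) = 1 / (1 + e^(-a))
    public static double Sigmoid(double a) => 1.0 / (1.0 + Math.Exp(-a));

    // p(C1|x) = σ(Σ_i w_i * x_i)
    public static double PredictC1(double[] w, double[] x)
    {
        double a = 0.0;
        for (int i = 0; i < x.Length; i++)
            a += w[i] * x[i];
        return Sigmoid(a);
    }
}
```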

In the last post, we observed that instead of working directly with the original inputs, we could apply a fixed non-linear transformation to them. $$p(C_{1}|x)=\sigma (\displaystyle\sum_{i=1}^D w_{i}\phi_{i}(x))$$
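The only change with respect to the previous sketch is that the inputs are first mapped through fixed basis functions. In the hypothetical sketch below, the basis functions are passed as delegates; this is an illustrative choice, not the representation used later in the series.

```csharp
using System;

// Illustrative sketch: same prediction, but each term uses a fixed
// (non-learned) non-linear basis function φ_i applied to the input.
public static class BasisFunctionSketch
{
    public static double Sigmoid(double a) => 1.0 / (1.0 + Math.Exp(-a));

    // p(C1|x) = σ(Σ_i w_i * φ_i(x))
    // e.g. phi[0] = v => v[0] * v[0], phi[1] = v => v[0] * v[1], ...
    public static double PredictC1(double[] w, double[] x, Func<double[], double>[] phi)
    {
        double a = 0.0;
        for (int i = 0; i < phi.Length; i++)
            a += w[i] * phi[i](x);
        return Sigmoid(a);
    }
}
```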

The goal is to extend this model by making the basis functions $\phi_{1}$, ..., $\phi_{D}$ depend on parameters and then to allow these parameters to be adjusted along with the coefficients $w_{1}$, ..., $w_{D}$ during training.

The idea of neural networks is to use basis functions that are themselves nonlinear functions of a linear combination of the inputs. That leads to the basic neural network model which can be described as a series of functional transformations.

How does it work in concrete terms?

We continue to assume that we have a dataset composed of $N$ records, each of which possesses $D$ features. To illustrate, our previous toy example had 2 features ($X$ and $Y$).

  • Construct $M$ linear combinations $a_{1}$, ..., $a_{M}$ of the input variables $x_{1}$, ..., $x_{D}$

$$a_{j}=\displaystyle\sum_{i=1}^D w_{ji}x_{i} + w_{j0}$$

Information

What is $M$, and what does it stand for? That is not very important for the moment, but the intuitive idea behind this formula is to construct a mapping from our original $D$-dimensional input space to another $M$-dimensional feature space. We will revisit this point later in this post.

  • Each of these quantities is transformed using a differentiable, nonlinear activation function $h$.

$$z_{j}=h(a_{j})$$

  • These values are again linearly combined to give $K$ output unit activations $b_{1}$, ..., $b_{K}$ (remember that $K$ is the number of classes).

$$b_{k}=\displaystyle\sum_{j=1}^M wo_{kj}z_{j} + wo_{k0}$$

  • Finally, the output unit activations are transformed using an appropriate activation function to give a set of network outputs $y_{1}$, ..., $y_{K}$.

$$y_{k}=\sigma (b_{k})$$

$\sigma$ can be the sigmoid in the case of binary classification, the softmax in the case of multiclass classification, or the identity in the case of regression.
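The sketch below illustrates these three possible output activations in C#. The helper class and its names are assumptions made for this post, not the series' API.

```csharp
using System;
using System.Linq;

// Illustrative output activations for the last stage of the network.
public static class OutputActivations
{
    // Binary classification: a single output unit with a sigmoid.
    public static double Sigmoid(double b) => 1.0 / (1.0 + Math.Exp(-b));

    // Regression: the identity.
    public static double Identity(double b) => b;

    // Multiclass classification: softmax over the K output activations b_1, ..., b_K.
    public static double[] Softmax(double[] b)
    {
        double max = b.Max();                                   // subtract the max for numerical stability
        double[] exp = b.Select(v => Math.Exp(v - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(v => v / sum).ToArray();
    }
}
```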

We can combine these various stages to give the overall network function.

$$y_{k}(x, w)=\sigma (\displaystyle\sum_{j=1}^M wo_{kj}h(\displaystyle\sum_{i=1}^D w_{ji}x_{i} + w_{j0}) + wo_{k0})$$
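To make this formula concrete, here is a minimal, illustrative forward pass for a single hidden layer in C#. The class name, the jagged-array layout of the weights, and the choice of $\tanh$ for $h$ are assumptions made for this sketch, not the design adopted later in the series.

```csharp
using System;

// Illustrative forward pass for one hidden layer.
// w[j] holds the hidden weights of unit j (w[j][0] is the bias w_j0),
// wo[k] holds the output weights of unit k (wo[k][0] is the bias wo_k0).
public sealed class OneHiddenLayerNetwork
{
    private readonly double[][] w;   // M x (D + 1) hidden-layer weights
    private readonly double[][] wo;  // K x (M + 1) output-layer weights

    public OneHiddenLayerNetwork(double[][] hiddenWeights, double[][] outputWeights)
    {
        w = hiddenWeights;
        wo = outputWeights;
    }

    private static double Sigmoid(double a) => 1.0 / (1.0 + Math.Exp(-a));

    // y_k(x, w) = σ(Σ_j wo_kj * h(Σ_i w_ji * x_i + w_j0) + wo_k0)
    public double[] Forward(double[] x)
    {
        int M = w.Length;
        int K = wo.Length;

        // Steps 1 and 2: a_j = Σ_i w_ji * x_i + w_j0, then z_j = h(a_j).
        var z = new double[M];
        for (int j = 0; j < M; j++)
        {
            double a = w[j][0];                    // bias w_j0
            for (int i = 0; i < x.Length; i++)
                a += w[j][i + 1] * x[i];
            z[j] = Math.Tanh(a);                   // h chosen as tanh for this sketch
        }

        // Steps 3 and 4: b_k = Σ_j wo_kj * z_j + wo_k0, then y_k = σ(b_k).
        var y = new double[K];
        for (int k = 0; k < K; k++)
        {
            double b = wo[k][0];                   // bias wo_k0
            for (int j = 0; j < M; j++)
                b += wo[k][j + 1] * z[j];
            y[k] = Sigmoid(b);
        }
        return y;
    }
}
```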

Thus the neural network model is simply a nonlinear function from a set of input variables $x_{1}$, ..., $x_{D}$ to a set of output variables $y_{1}$, ..., $y_{K}$ controlled by a vector $w$ of adjustable parameters.
Bishop (Pattern Recognition and Machine Learning)

Information 1

We can observe that the final expression of the output is much more complex than that of simple logistic regression. This complexity brings flexibility, since we can now model complex configurations (such as data that is not linearly separable), but it comes at the cost of learning complexity: we now have many more weights to adjust, and the algorithms dedicated to this task are not trivial. It is this complexity that hindered the development of neural networks in the late 1980s.

Information 2

The neural network can be further generalized by customizing the activation functions; for example, we can use $\tanh$ instead of the sigmoid.
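One simple way to allow this customization in C# is to pass the activation function as a delegate; the sketch below is illustrative and its names are not taken from the series' code.

```csharp
using System;
using System.Linq;

// Illustrative sketch: the activation function as a delegate, so the sigmoid
// can be swapped for tanh (or any other differentiable function).
public static class ActivationSketch
{
    public static readonly Func<double, double> Sigmoid = a => 1.0 / (1.0 + Math.Exp(-a));
    public static readonly Func<double, double> Tanh = Math.Tanh;

    // z_j = h(a_j), with h supplied by the caller.
    public static double[] Apply(double[] activations, Func<double, double> h)
        => activations.Select(h).ToArray();
}
```

A hidden layer could then call, for instance, `ActivationSketch.Apply(a, ActivationSketch.Tanh)` instead of hard-coding the sigmoid.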

It is quite customary to represent a neural network with a graphical form, as shown below.

This graphical notation has several advantages over the mathematical formalism, as it can emphasize two key points.

  • First, we can consider additional layers of processing.

Information

The more layers, the more flexible the neural network becomes. In this context, one might think that increasing the number of layers will eventually enable the model to handle all real-world situations (which is the premise of deep learning). However, this approach significantly increases the complexity of learning parameters and, consequently, demands substantial computing resources.

  • Second, we can see that $M$ represents the number of "hidden" units. The term "hidden" stems from the fact that these units are not part of the input or output layers but exist in intermediate layers, where they contribute to the network's ability to learn complex patterns and representations from the data.
Information

$M$ must be chosen judiciously: if it's too small, the neural network may struggle to generalize accurately. On the other hand, if it's too large, there is a risk of overfitting, accompanied by an increase in the number of parameters to learn and therefore in training cost.

The number of input and output units in a neural network is generally determined by the dimensionality of the dataset, whereas the number $M$ of hidden units is a free parameter that can be adjusted to give the best predictive performance.
Bishop (Pattern Recognition and Machine Learning)

Now that we have introduced the formalism and demonstrated how complex configurations can be represented by neural networks, it's time to explore how the parameters and weights involved in the process can be learned. We will delve into the dedicated procedure developed for this purpose.

Implementing neural networks in C# - Part 5