Title:

Finite size effects in neural network algorithms

One approach to the study of learning in neural networks within the physics community has been to use statistical mechanics to calculate the expected error that a network will make on a typical novel example, termed the generalisation error. Such average-case analyses have mainly been carried out with recourse to the thermodynamic limit, in which the size of the network is taken to infinity. For a finite-sized network, however, the error is not self-averaging, i.e. it remains dependent upon the actual set of examples used to train and test the network. The error estimated on a specific test set realisation, termed the test error, forms a finite-sample approximation to the generalisation error. We present in this thesis a systematic examination of test error variances in finite-sized networks trained by stochastic learning algorithms. Beginning with simple single-layer systems, in particular the linear perceptron, we calculate the test error variance arising from randomness in both the training examples and the stochastic Gibbs learning algorithm. This quantity enables us to examine the performance of networks in a limited-data scenario, including the optimal partitioning of a data set into a training set and a test set so as to minimise the average error that the network makes, whilst remaining confident that the average test error is representative. A detailed study of the variance of cross-validation errors is carried out, and a comparison is made between different cross-validation schemes. We also examine the test error variance of the binary perceptron, comparing the results with the linear case. Employing the results for the variance of errors, we calculate how likely the worst-case errors derived from PAC theory are, finding that the probability of such worst-case occurrences is extremely small.
In addition, we study the effect of a finite system size on the online training of multilayer networks, tracking the dynamic evolution of the error variance under the stochastic gradient descent algorithm used to train the network on an increasing amount of data. We find that the hidden-unit symmetries of the multilayer network give rise to relatively large finite-size effects around the point at which the symmetries are broken.
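The online training scenario described above can be sketched with a toy soft-committee-machine simulation: a student network with a near-symmetric initialisation learns a teacher of the same architecture, one fresh example per update. The setup, the choice of `tanh` activation, and all parameter values below are illustrative assumptions rather than the thesis's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

N, K = 100, 2        # input dimension and number of hidden units (illustrative)
g = np.tanh          # hidden-unit activation (an assumption; erf is also common)

def dg(a):
    """Derivative of the tanh activation."""
    return 1.0 - np.tanh(a) ** 2

B = rng.standard_normal((K, N))           # teacher hidden weights
W = 0.01 * rng.standard_normal((K, N))    # near-symmetric student initialisation
eta = 1.0                                 # learning rate

errors = []
for step in range(20000):
    x = rng.standard_normal(N)            # one fresh example per update (online)
    a_t = B @ x / np.sqrt(N)              # teacher hidden activations
    a_s = W @ x / np.sqrt(N)              # student hidden activations
    delta = g(a_t).sum() - g(a_s).sum()   # output error on this example
    # stochastic gradient descent step on the instantaneous squared error
    W += (eta / np.sqrt(N)) * delta * np.outer(dg(a_s), x)
    errors.append(0.5 * delta ** 2)

print("early error:", np.mean(errors[:1000]),
      "late error:", np.mean(errors[-1000:]))
```

Because the hidden units start nearly symmetric, trajectories of this kind typically pass through a symmetric phase before the units specialise to distinct teacher units; it is around that symmetry-breaking point that the text above locates the largest finite-size effects.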
