The following applet can be used to experiment with backprop learning
for function approximation problems. You can choose an underlying
function to be approximated, then choose a number of training samples,
network size, and learning rate.
It is interesting to observe the generalization that the multilayer
neural network achieves as a function of training cycles.
In a recent paper (see below) we advocated a method for stopping training early
in order to achieve good generalization. This method does not require
three separate data sets (training, validation, and testing), as
the method of cross-validation does.
Generalization in backprop-trained multilayer neural networks is discussed for problems where training data is either scarce or else extremely costly to obtain. In this case, the usual method of cross-validation, whereby the data set is partitioned into training, testing, and validation sets, is not feasible. In this paper, we demonstrate that it is sometimes possible to use all available data for training a large network (a network capable of overfitting the data) and yet still determine an appropriate stopping point to ensure that the network generalizes properly.
1. Introduction
A single hidden layer neural net structure for the approximation of a real-valued function is shown in Figure 1. Assuming noisy training data and a large hidden layer trained with backprop (Rumelhart et al., 1986; Werbos, 1994), it is well known that poor generalization performance (Weigend et al., 1991) can result from "overtraining" the network (e.g., see Hassoun, 1995). The reason for this phenomenon is that the network "overfits" the data, i.e., it tries to fit the noise in the data as well as the underlying function to be approximated.
The method of cross-validation (Morgan and Bourlard, 1990; Weigend et al., 1991; Herget et al., 1992) is commonly used to overcome the problem of overfitting. In this case, the data set is broken into three sets: a training set, a validation set, and a testing set. The network's performance on the validation set is used to determine when to stop training.
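For concreteness, the following is a minimal sketch of this conventional hold-out approach. The `train_one_cycle` and `rms_error` hooks, the 60/20/20 split, and the patience rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def holdout_early_stopping(x, y, train_one_cycle, rms_error,
                           patience=50, max_cycles=40000):
    """Hold-out validation: split the data three ways and stop training when
    the validation error has not improved for `patience` cycles.

    `train_one_cycle(x, y)` and `rms_error(x, y)` are hypothetical hooks into
    the network implementation; `x` and `y` are 1-D NumPy arrays."""
    idx = np.random.permutation(len(x))
    n_tr = int(0.6 * len(x))                 # illustrative 60/20/20 split
    n_val = int(0.2 * len(x))
    tr, val, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]

    best_val, best_cycle, since_best = np.inf, 0, 0
    for k in range(1, max_cycles + 1):
        train_one_cycle(x[tr], y[tr])        # one backprop pass over the training set
        e_val = rms_error(x[val], y[val])
        if e_val < best_val:
            best_val, best_cycle, since_best = e_val, k, 0
        else:
            since_best += 1
        if since_best >= patience:           # validation error no longer improving
            break
    # In practice the weights saved at `best_cycle` would be restored before
    # measuring performance on the held-out test set.
    return best_cycle, rms_error(x[te], y[te])
```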
In this paper, we discuss a method of validation for problems where training data is limited or extremely costly to obtain. In such cases, it is desirable to use all the data for training and not reserve a separate portion for validation. The validation method proposed here is to train with all available data and then terminate training by observing the behavior of the training error curve. In particular, training is terminated after the first steep drop in the training error curve.
To be precise, the task is to approximate a given real-valued function from noisy samples of the function; hence we have a training set of the form $\{(x_i, y_i)\}_{i=1}^{N}$. We assume

$$y_i = g(x_i) + \varepsilon(x_i), \qquad i = 1, \ldots, N, \tag{1}$$

where $g(x)$ is the underlying function to be approximated and $\varepsilon(x)$ is a noise or perturbation term which creates a fine-grained nonlinear structure in the graph of $f(x) = g(x) + \varepsilon(x)$.
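As an illustration of (1), the sketch below generates a noisy training set. The stand-in function `g` and the sampling interval are assumptions (the paper's first example uses a rational function); the sample count and noise statistics follow the values given in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Stand-in for the underlying function; the paper's first example uses a
    # rational function, but any smooth g(x) illustrates Eq. (1).
    return np.sin(x)

N = 15                                    # 15 samples, as in the first example
x = rng.uniform(-1.0, 1.0, size=N)        # sample locations (interval assumed)
eps = rng.normal(0.0, np.sqrt(0.3), N)    # Gaussian noise: mean 0, variance 0.3
y = g(x) + eps                            # noisy targets per Eq. (1)
```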
Figure 1. An artificial neural network for function approximation.
The idea here is that the fitting of $f(x)$ is a two-step process: the network first fits the underlying function $g(x)$ and then, after some time, fits the perturbation $\varepsilon(x)$. Positioning the network to fit the noise slows down training because of local competition among training samples within the fine-grained structure. This produces a signature in the training error curve which can be detected in order to terminate training, as shown in Figure 2. In this case, a validation set is not required, and training can be terminated shortly after the initial drop in training error. In the next section, simulation results are presented which demonstrate the usefulness of this approach.
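One way to mechanize the detection of this signature is sketched below; the window size and thresholds are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def stop_after_first_drop(errors, window=100, drop_frac=0.5, flat_tol=1e-4):
    """Return the cycle at which to stop, given the training-error history.

    One possible formalization of the signature in Figure 2 (thresholds are
    illustrative): wait until the error has fallen below `drop_frac` of its
    initial value (the first steep drop), then stop once the average decrease
    over the last `window` cycles is below `flat_tol` (the flat spot that
    precedes noise fitting)."""
    e = np.asarray(errors, dtype=float)
    for k in range(window, len(e)):
        had_first_drop = e[k] < drop_frac * e[0]
        recent_slope = (e[k - window] - e[k]) / window   # average decrease per cycle
        if had_first_drop and recent_slope < flat_tol:
            return k
    return None   # signature not seen yet; continue training
```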
Figure 2. A conceptual diagram of how the training error function can be used to tell when to stop training.
2. Simulation Results
In this section, we present simulation results which verify our hypothesis that the training set error plot can be used to determine when to stop training. First, suppose that the underlying function to be approximated, $g(x)$, is the rational function shown as the dotted line in Figure 3(b)-(g).
To construct the training set, we produced 15 noisy samples according to (1) using a Gaussian noise term with mean 0 and variance 0.3. The network parameters we used are as follows: the number of hidden units was set at , each with bipolar sigmoidal activation (slope ), and the single output unit had a linear activation function with slope . The hidden and output layer learning rates were set at and , respectively. Note that no attempt has been made to optimize the network size or the value of the learning rates.
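The sketch below shows the kind of network and incremental (per-sample) backprop update described above, in minimal NumPy form. Since the exact hidden-layer size, activation slopes, and learning rates are not reproduced above, the values in the code are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed sizes and rates for illustration only; the paper's exact values are
# not reproduced here.
H = 10            # hidden units (assumption)
eta_h = 0.05      # hidden-layer learning rate (assumption)
eta_o = 0.01      # output-layer learning rate (assumption)

W1 = rng.normal(0, 0.5, (H, 2))   # hidden weights; second column is the bias
w2 = rng.normal(0, 0.5, H + 1)    # output weights; last entry is the bias

def forward(x):
    a = np.tanh(W1 @ np.array([x, 1.0]))          # bipolar sigmoid hidden layer
    return float(w2[:-1] @ a + w2[-1]), a         # linear output unit

def train_cycle(xs, ys):
    """One incremental-backprop cycle: weights are updated after every sample."""
    global W1, w2
    for x, y in zip(xs, ys):
        out, a = forward(x)
        err = out - y
        grad_w2 = err * np.append(a, 1.0)
        delta_h = err * w2[:-1] * (1.0 - a ** 2)  # tanh derivative
        grad_W1 = np.outer(delta_h, np.array([x, 1.0]))
        w2 -= eta_o * grad_w2
        W1 -= eta_h * grad_W1

def rms_error(xs, ys):
    preds = np.array([forward(x)[0] for x in xs])
    return float(np.sqrt(np.mean((preds - ys) ** 2)))
```

Recording `rms_error` after each call to `train_cycle` produces the training error curve whose first steep drop and flat spot are used as the stopping signal.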
The results of the neural net approximation are shown in Figure 3(a)-(g). In (a), the training set error (rms error) is shown as a solid line as a function of the number of training cycles, $k$ (incremental backprop was used). In (b)-(g), the underlying function $g(x)$ is shown as a dotted line, and the neural net approximation (based on the noisy samples shown as circles) is shown as the solid line.
Notice how the network quickly moves to approximate $g(x)$, with a corresponding quick drop in training set error. It takes some time, however, for the network to be able to fit the noise; this slowing down in the learning appears as a "flat spot" in the training set error. Eventually, the network is in a position to fit the noise, and the error undergoes another steep decrease.
Figure 3. (a) The training set error versus cycle number. (b) The neural net approximation after 10 learning cycles, (c) 1000 cycles, (d) 5000 cycles, (e) 10,000 cycles, (f) 20,000 cycles, (g) 40,000 cycles. In each case, the dotted line is the underlying function to be approximated, the solid line is the neural net output, and the open circles indicate the data points used for training.
As a second example, consider the function shown in Figure 4(b). Here, the underlying function $g(x)$ is a sine wave, and $\varepsilon(x)$ is a perturbation term producing small "blips" or kinks near the peaks of the sine curve.
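The exact perturbation used in this example is not reproduced here; the sketch below simply constructs a sine wave with small Gaussian-bump "blips" near its peaks as one plausible stand-in, together with 50 random training samples, matching the count used in this example.

```python
import numpy as np

def f_perturbed(x, blip_height=0.15, blip_width=0.1):
    """Sine wave with small Gaussian 'blips' added near its peaks.

    An illustrative stand-in only; the exact perturbation used in the paper's
    second example is not reproduced here."""
    g = np.sin(x)
    peaks = np.array([np.pi / 2, 5 * np.pi / 2])     # peaks of sin(x) on [0, 3*pi]
    eps = sum(blip_height * np.exp(-((x - p) / blip_width) ** 2) for p in peaks)
    return g + eps

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 3 * np.pi, size=50)       # 50 random samples, as in the text
y_train = f_perturbed(x_train)
```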
Figure 4. (a) The training set error versus cycle number. (b) The neural net approximation after 10 learning cycles, (c) 1000 cycles, (d) 5000 cycles, (e) 10,000 cycles, (f) 40,000 cycles, (g) 100,000 cycles. In each case, the dotted line is the underlying function to be approximated, the solid line is the neural net output, and the open circles indicate the data points used for training.
We trained a 35-hidden-unit neural net with 50 random samples from the perturbed function. As in the previous case, the error plot is shown in (a), and the neural net approximations at various stages during learning are shown in (b)-(g).
Notice how the network quickly discovers an approximation to the sine wave, but then takes time to fit each perturbation (kink) in the sine wave. The fitting of these perturbations can be correlated with the shape of the error function. There is a flat spot in the error plot just before the network is able to fit each of the kinks. Another interesting phenomenon is that the kinks are consistently fit from left to right as training proceeds.
Of course, if the task is to approximate the function $f(x)$ (including its fine-grained structure) and not just $g(x)$, then the plateau in the error function is not an indication of a stopping criterion. With exact training data, one would simply continue training until the training set error was as small as possible. In practice, though, the data is almost always noisy, and some stopping criterion is needed to achieve good generalization.
3. Conclusion
The method of cross-validation works well for problems where data is plentiful. Unfortunately, there is a wide range of engineering problems where data is scarce or expensive to obtain. In such cases, cross-validation is not feasible since there is simply not enough data to divide among training, validation, and testing sets. In this paper, we presented a method whereby validation can be accomplished without the need for a separate validation set. Rather, we look directly at a plot of the training error to determine an appropriate point to stop training. It is clear, though, that this method of validation may not be applicable to noise-free function approximation problems.
There are several advantages to the validation method outlined in this paper. First, all of the available problem data can be used for training the network. Second, and in common with all validation methods, the problem of choosing an "optimal" network size (number of hidden nodes) becomes much less critical for ensuring a good function approximation: it is possible to overestimate the number of required hidden units and rely on the validation process to prevent the network from overfitting the data. In fact, for the simulations presented above, no attempt was made to optimize the size of the network. Finally, the proposed validation method is easy to apply.
In future work, we will apply this stopping criterion to large-scale practical problems (higher-dimensional problems) and develop automatic methods for detecting the optimal stopping point, rather than relying on visual inspection of the training error plot. We will also study further the process by which the neural net fits a given data set; for example, we will investigate why the network exhibits a left-to-right bias when fitting the fine-grained noise structure, as seen in Figure 4.
4. References
1. Hassoun, M. H. (1995). Fundamentals of Artificial Neural Networks, MIT Press, Cambridge, Mass.
2. Herget, F., Finnoff, W., and Zimmermann, H. (1992). "A Comparison of Weight Elimination Methods for Reducing Complexity in Neural Networks," in Proceedings of the International Joint Conference on Neural Networks (Baltimore, 1992), vol. III, pp. 980-987. IEEE, New York.
3. Morgan, N., and Bourlard, H. (1990). "Generalization and Parameter Estimation in Feedforward Nets: Some Experiments," in Advances in Neural Information Processing Systems 2 (Denver, 1989), D. S. Touretzky, ed., pp. 630-637. Morgan Kaufmann, San Mateo, Calif.
4. Rumelhart, D., Hinton, G. E., and Williams, R. (1986). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. Rumelhart and J. McClelland (Eds.), MIT Press, Cambridge, Mass., pp. 318-362.
5. Weigend, A., Rumelhart, D., and Huberman, B. (1991). "Generalization by Weight-Elimination with Application to Forecasting," in Advances in Neural Information Processing Systems 3 (Denver, 1990), R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, Calif.
6. Werbos, P. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, John Wiley & Sons, New York.