Supervised learning algorithms for multilayer neural networks face two problems: They require a teacher to specify the desired output of the network, and they require some method of communicating error information to all of the connections. The wake-sleep algorithm avoids both of these problems. When there is no external teaching signal to be matched, some other goal is required to force the hidden units to extract underlying structure. In the wake-sleep algorithm, the goal is to learn representations that are economical to describe but allow the input to be reconstructed accurately. We can quantify this goal by imagining a communication game in which each vector of raw sensory inputs is communicated to a receiver by first sending its hidden representation and then sending the difference between the input vector and its top-down reconstruction from the hidden representation. The aim of learning is to minimize the "description length," which is the total number of bits that would be required to communicate the input vectors in this way (1). No communication actually takes place, but minimizing the description length that would be required forces the network to learn economical representations that capture the underlying regularities in the data (2).
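To make the communication game concrete, the total cost decomposes into two additive terms. The symbols below ($C$, $d$, $\alpha$) are illustrative shorthand introduced here, not notation from the original text:

```latex
% Illustrative decomposition of the description length of one input
% vector d with hidden representation \alpha (notation assumed here):
% bits to send the representation, plus bits to send the residual
% between the input and its top-down reconstruction.
\[
  C(d) = \underbrace{C(\alpha)}_{\mbox{representation}}
       + \underbrace{C(d \mid \alpha)}_{\mbox{reconstruction error}}
\]
```

Minimizing the first term rewards economical representations; minimizing the second rewards representations from which the input can be reconstructed accurately, which is exactly the trade-off described above.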
The neural network has two quite different sets of connections. The bottom-up "recognition" connections are used to convert the input vector into a representation in one or more layers of hidden units. The top-down "generative" connections are then used to reconstruct an approximation of the input vector from its underlying representation. The training algorithm for these two sets of connections can be used with many different types of stochastic neurons, but for simplicity, we use only stochastic binary units that have states of 1 or 0. The state of unit $v$ is $s_v$, and the probability that it is on is
\[
  \Pr(s_v = 1) = \frac{1}{1 + \exp\left(-b_v - \sum_u s_u w_{uv}\right)} \qquad (1)
\]
where $b_v$ is the bias of the unit and $w_{uv}$ is the weight on a connection from unit $u$. Sometimes the units are driven by the generative weights, and at other times they are driven by the recognition weights, but the same equation is used in both cases (Fig. 1). (Fig. 1 omitted.)
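A minimal sketch of Equation 1 in Python, assuming numpy; the function name sample_unit and the argument names are illustrative, not from the paper:

```python
import numpy as np

def sample_unit(s, w, b, rng):
    """Sample the binary state of one unit according to Eq. 1.

    s   : 1-D array of binary states of the units driving this unit
    w   : 1-D array of connection weights w_uv into this unit
    b   : scalar bias b_v
    rng : a numpy Generator, e.g. np.random.default_rng()
    """
    p_on = 1.0 / (1.0 + np.exp(-b - np.dot(s, w)))  # Eq. 1: logistic of total input
    return int(rng.random() < p_on)                 # 1 with probability p_on, else 0

# Example: one unit driven by three other units.
rng = np.random.default_rng(0)
state = sample_unit(np.array([1, 0, 1]), np.array([0.5, -0.3, 0.8]), b=-0.2, rng=rng)
```

As the text notes, the same function serves both directions: during recognition, s and w come from the layer below; during generation, from the layer above.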
In the "wake" phase, the units are driven bottom-up with the recognition weights; this produces a representation of the input vector in each successive hidden layer. The generative weights are then adjusted by a purely local delta rule so that each layer becomes better at reconstructing the activity in the layer below.
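A hedged sketch of one wake-phase step for a single visible/hidden layer pair, assuming numpy; the function and argument names, the learning rate epsilon, and the single-hidden-layer framing are assumptions for illustration. The update follows the local delta-rule form described above: each weight change depends only on the activities at the two ends of that connection.

```python
import numpy as np

def wake_phase_step(x, W_rec, b_hid, W_gen, b_vis, epsilon, rng):
    """One wake-phase update for a visible layer and one hidden layer.

    x     : 1-D binary input vector (visible layer)
    W_rec : recognition weights, shape (n_vis, n_hid)
    W_gen : generative weights, shape (n_hid, n_vis), updated in place
    """
    # Drive the hidden units bottom-up with the recognition weights (Eq. 1).
    p_hid = 1.0 / (1.0 + np.exp(-b_hid - x @ W_rec))
    s_hid = (rng.random(p_hid.shape) < p_hid).astype(float)

    # Top-down probabilities for the visible units, given the hidden sample.
    p_vis = 1.0 / (1.0 + np.exp(-b_vis - s_hid @ W_gen))

    # Local delta rule on the generative weights and biases: move the
    # top-down reconstruction toward the actual input activity.
    W_gen += epsilon * np.outer(s_hid, x - p_vis)
    b_vis += epsilon * (x - p_vis)
    return s_hid
```

In a multilayer network, the same step would be applied to each adjacent pair of layers, with each hidden layer's sampled states serving as the "input" to the layer above.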





