Restricted boltzman Machine

Restricted Boltzman Machine (RBM) is a probabilistic model represented by an undirected graph as shown in the figure, and as the name implies, has a limited graph structure. The graph is a bipartite graph consisting of a visible layer and a hidden layer, and there is no interaction within the same layer. It also has a bias 𝑏𝑖,𝑐𝑗 corresponding to each unit and a weight 𝑤𝑖𝑗 corresponding to an edge. Here in after, these parameters are collectively written as 𝜃=(𝑏,𝑐,𝑤)

The following energy function is defined for this graph:

Under this energy function, a restricted Boltzmann machine is defined to have the following probability distribution:

Since such a probability distribution is called a Boltzmann distribution, it seems that this model is called a Boltzmann machine. 𝑍(𝜃) is a normalization factor, which in physics context is called a partition function.

Conditional independence

Due to the structure of RBM, when one layer is fixed, the variables in the other layer have conditional independence.

The formula for the conditional probability of the hidden layer when the visible layer is fixed is

In this way, the inner product of the values ​​of all visible layers and the edge weights is included in the sigmoid function, and the combination of values​​taken by the hidden layer is determined by the binomial distribution that is 1 with this probability. From this point, it can be said that this is a stochastic machine learning model, which is different from a machine learning model in which node values ​​are determined deterministically such as a neural network.

Learning of a limited Boltzmann machine

RBM learning means creating a model with parameters that are most likely to explain the given training data. When you put the training data in the visible layer, the value returns to the visible layer again through the hidden layer filter. Compare the result with the input and adjust the parameters by increasing the gradient in the direction that gives the best output. This optimization, and the derivation of the resulting learning equations, can be done using maximum likelihood estimation or Kullback, Leibler divergence, but we will not go into the details here. I will proceed with the following learning equation.

The equation for updating parameters by the gradient ascent method in the learning process of RBM is as follows.

represents the ith component of the kth training data. The terms on the right side are called the positive phase and the negative phase. The first term represents the term obtained from the training data, and the second term represents the expected value of the entire probability distribution. Due to the conditional independence inherent in RBM, the calculation of the first term is very easy.

However, the second term has a problem of so-called combination explosion, in which the calculation amount increases exponentially with an increase in the number of units, and it is difficult to calculate directly.

The expected value of the whole model in the second term is calculated as follows.

Thus, for example, if there are 100 variables, 2^100

calculations are required, and the calculation amount exponentially increases as the number of variables increases. You.

Therefore, there is a method that some kind of sampling is performed under this probability model, and the expected value is approximated by taking the average. There are several methods for sampling, as described below.

Gibbs sampling

He stated that it was difficult to calculate expected values ​​for the entire model in RBM. For this purpose, a method of obtaining an approximate value of the expected value by sampling using random numbers is used. That is, to obtain the expected value for the distribution 𝑃(𝑥)

you want to examine, generate N independent samples from that distribution and substitute the sample mean for the expected value.

MCMC (Markov chain Monte Carlo method) is used for sampling. Among MCMC, it is common to use Gibbs sampling (heat bath method). RBM has conditional independence due to its graph structure. Given a hidden layer realization, the sampling of the visible layer simply samples each independent component V_𝑖

from 𝑃(𝑣_𝑖|ℎ,𝜃)

Is good. Similarly, sampling of the hidden layer can be easily obtained by fixing the value of the visible layer.

This section describes the specific Gibbs sampling procedure.

1. Initializes the state of each unit in the visible layer randomly.

2. 𝑃(ℎ_𝑗|𝑣,𝜃) the value of each unit in the hidden layer from the binomial distribution of the probability of 𝑃(ℎ_𝑗|𝑣,𝜃)

3. From the state of the hidden layer obtained in (2), sample the value of the visible layer in the next step from the binomial distribution of the probability of 𝑃(𝑣_𝑖|ℎ,𝜃)

4. Repeat steps (2) and (3) to acquire the visible and hidden layers as a sample after a sufficient time (number of steps) has elapsed.

5. Acquire samples until the required number is reached, at intervals so that there is no correlation between the samples.

The advantage of this model is that the Markov chain is realized simply by alternately sampling the visible and hidden layers.

However, even with this sampling method, it takes a long time to acquire a sample after a sufficient time. To overcome this problem, the following Contrastive Divergence method was proposed.

Contrastive Divergence method (CD method)

This method makes the following changes to Gibbs sampling: 1. Use the input data 𝑣0

as the initial state of the visible layer. 2. 𝑃(ℎ_𝑗|𝑣,𝜃) the value of each unit in the hidden layer from the binomial distribution of the probability of 𝑃(ℎ_𝑗|𝑣,𝜃) . 3. From the state of the hidden layer obtained in (2), sample the value of the visible layer in the next step from the binomial distribution of the probability of 𝑃(𝑣_𝑖|ℎ,𝜃)

Do this only once and calculate each expected value using the state of the visible layer obtained in (3).

In this way, in normal Gibbs sampling, a Markov chain that continues until a sufficient time elapses ends boldly with only the first (or several) methods. The approximation obtained in this way is surprisingly known to work.

Sampling using an annealing machine

As mentioned above, sampling using MCMC is a bottleneck, and a formulation like Contrastive Divergence has been proposed, which has greatly improved the practicality.

Here, instead of using it, we will consider embedding the RBM structure in the annealing machine and performing optimization. If the old method is soft sampling, this method is an experimental sampling by hardware. The graph of RBM is a bipartite graph, but in order to simplify the embedding procedure, we have decided that each unit is fully connected this time. This limits the number of units that can be run to half the number of possible bipartite graphs.

Reconstruct data by trained RBM

RBM was trained using the following four 20-dimensional vectors as training data. The components of 0 and 1 are replaced with □ and ■ for easy understanding.

First, the results of Gibbs sampling only and the learning results of Gibbs sampling + CD sampling are shown.

1. Gibbs sampling

The learning rate was 100,50,50

times with 𝜂=0.1,0.01,0.001 100,50,50

In one training, 100 samples are taken by Gibbs sampling, and the sample average is approximated to the expected value.

The time required for one study was about 9 seconds. To train each of the four training data once is counted in the unit of one epoch, but the time required for one epoch is 36 seconds. The total is 100+50+50=200

epoch, so it took about 2 hours to learn everything.

After learning, here is the result of inputting the same training data 5 times and testing whether it can be reproduced.

There are errors in some places, but it can be said that it has been reproduced to some extent. However, the reason why it takes 2 hours at this size is also that Gibbs sampling requires too much computation.

2. Contrastive Divergence method

Next, the training data was the same, and learning by the Contrastive Divergence method was performed.

It is a calculation time, but it takes 0.00152s for one epoch to learn four data at a time. It was over 10,000 times faster. Of course, that’s not surprising, because we’ve made 100,000 samplings twice. The learning result at this time is almost unchanged, and the parameters have converged. It is characteristic that the accuracy of the sample does not decrease even if such a bold approximation is performed.

3. Annealing machine

In the learning with the annealing machine, when updating the parameters of the RBM, it communicates with the annealing machine and receives the sampling result. Processing other than the sampling part is performed by a normal computer. So the basic process is the same as 1. Gibbs sampling.

In the sampling procedure, basically, you only need to send the graph weights, bias parameters, and the number of samples. Create a QUBO polynomial from each of the current parameters (weights, bias terms). Then, send the QUBO polynomial and the desired number of samples to the annealing machine, and then calculate the expected sampling value from the returned results.

The above is the learning procedure using the annealing machine. This time, we ran 100 epochs each with a learning rate of 0.1 and 0.01. Here is the learning result.

You may also like...