Description: |
This paper presents a Bayesian estimation procedure for single hidden-layer neural networks with $\ell_{1}$-controlled neuron weight vectors. We study the structure of the posterior density that makes it amenable to rapid sampling via Markov Chain Monte Carlo (MCMC), as well as the statistical risk guarantees of procedures based on it. Let the neural network have $K$ neurons with internal weights of dimension $d$, and fix the outer weights. With $N$ data observations, the posterior density uses a gain parameter, or inverse temperature, $\beta$. The posterior is intrinsically multimodal and not naturally suited to the rapid mixing of MCMC algorithms. For a continuous uniform prior over the $\ell_{1}$ ball, we demonstrate that the posterior density can be written as a mixture density whose mixture components are log-concave. Furthermore, when the number of parameters $Kd$ exceeds a constant times $(\beta N)^{2}\log(\beta N)$, the mixing distribution is also log-concave. Thus, neuron parameters can be sampled from the posterior by sampling only log-concave densities. For a discrete uniform prior restricted to a grid, we study the statistical risk (generalization error) of procedures based on the posterior. Using an inverse temperature that is a fractional power of $1/N$, namely $\beta = C\left[(\log d)/N\right]^{1/4}$, we demonstrate that notions of squared error are of fourth-root order $O(\left[(\log d)/N\right]^{1/4})$. If one further assumes independent Gaussian data with a variance $\sigma^{2}$ that matches the inverse temperature, $\beta = 1/\sigma^{2}$, we show that the Kullback divergence decays at an improved cube-root rate $O(\left[(\log d)/N\right]^{1/3})$. Future work aims to bridge the sampling ability of the continuous uniform prior with the risk control of the discrete uniform prior, yielding a polynomial-time Bayesian training algorithm for neural networks with statistical risk control.
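As a concrete reading of this setup (a sketch under assumptions not stated in the abstract: squared-error loss, and notation $f_{w}$, $\psi$, $c_{k}$, $p_{0}$, $\pi_{N}$ introduced here for illustration), the Gibbs-type posterior over the internal weight vectors $w = (w_{1},\dots,w_{K})$ with inverse temperature $\beta$ would take the form
$$\pi_{N}(w)\;\propto\;p_{0}(w)\,\exp\!\Big(-\tfrac{\beta}{2}\sum_{i=1}^{N}\big(y_{i}-f_{w}(x_{i})\big)^{2}\Big),\qquad f_{w}(x)\;=\;\sum_{k=1}^{K}c_{k}\,\psi\big(w_{k}^{\top}x\big),$$
where the outer weights $c_{k}$ and the activation $\psi$ are held fixed, and the prior $p_{0}$ is either the continuous uniform density on the $\ell_{1}$ ball (for the sampling results) or the discrete uniform distribution on a grid (for the risk results).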