Signal Processing and Speech Communication Laboratory

Probabilistic Methods for Resource Efficiency in Machine Learning

Wolfgang Roth
Franz Pernkopf

Deep neural networks (DNNs) have gained substantial attention in the past decade. A key factor in many of their recent successes is the steady growth in hardware capabilities, which has enabled the training of ever-larger network architectures. As a result, the computational requirements of the resulting DNNs are often too high for many interesting real-world applications, which must run on resource-constrained devices with limited memory, computation power, and battery capacity.

This thesis is dedicated to methods that improve the computational efficiency of machine learning models with a particular focus on DNNs. Our contributions are probabilistic in nature and closely related to Bayesian inference techniques. Probabilistic methods have the advantage that they offer a principled approach to obtaining prediction uncertainties. Furthermore, it is shown that probabilistic methods provide an effective means of converting combinatorial optimization problems into continuous ones that are easier to optimize.

The thesis is divided into two parts. The first part provides a thorough overview of the relevant background on supervised learning, deep neural networks, and Bayesian inference, and continues with an extensive overview of current state-of-the-art methods that improve resource efficiency in deep learning. The second part presents three individual contributions along with extensive experiments showing the effectiveness of the presented methods.

The first contribution is closely related to variational inference and considers weight and activation quantization in DNNs. Quantization reduces the memory footprint of DNNs and enables faster predictions at test time. We propose to learn discrete weights and activations by learning a distribution over the weights. This is accomplished by propagating distributions through the network: a central limit argument justifies approximating the activations as Gaussians, which are then propagated through common building blocks and nonlinear activation functions. Once the weight distribution has been learned, a discrete-valued DNN is obtained either by taking the most probable value of each weight or by sampling from the distribution.
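The central limit argument can be illustrated for a single neuron. The following is a minimal sketch, not the implementation used in the thesis: it assumes ternary weights in {-1, 0, 1}, independent weights, and a sign activation, and all function names are hypothetical. The pre-activation is a sum of many independent terms, so it is approximated by a Gaussian whose mean and variance follow from the per-weight moments.

```python
import math
import random

# Ternary weight values; this choice is an assumption made for the sketch.
VALUES = (-1.0, 0.0, 1.0)

def weight_moments(probs):
    """Mean and variance of one discrete weight with probabilities over VALUES."""
    mean = sum(p * v for p, v in zip(probs, VALUES))
    second = sum(p * v * v for p, v in zip(probs, VALUES))
    return mean, second - mean * mean

def preactivation_gaussian(x, weight_probs):
    """Approximate a = sum_i x_i * w_i by a Gaussian (independent weights assumed)."""
    mean, var = 0.0, 0.0
    for xi, probs in zip(x, weight_probs):
        m, v = weight_moments(probs)
        mean += xi * m
        var += xi * xi * v
    return mean, var

def prob_sign_positive(mean, var, eps=1e-12):
    """P(a > 0) for a ~ N(mean, var): the probability that sign(a) equals +1."""
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * (var + eps))))

def most_probable_weights(weight_probs):
    """Discrete weights obtained by taking the mode of each weight distribution."""
    return [VALUES[max(range(len(VALUES)), key=lambda k: probs[k])]
            for probs in weight_probs]

def sample_weights(weight_probs, rng):
    """Discrete weights obtained by sampling each weight independently."""
    return [rng.choices(VALUES, weights=probs, k=1)[0] for probs in weight_probs]

if __name__ == "__main__":
    x = [1.0, 2.0]
    weight_probs = [[0.1, 0.2, 0.7], [0.6, 0.4, 0.0]]
    mean, var = preactivation_gaussian(x, weight_probs)
    print("Gaussian approximation:", mean, var)
    print("P(sign = +1):", prob_sign_positive(mean, var))
    print("mode:", most_probable_weights(weight_probs))
    print("sample:", sample_weights(weight_probs, random.Random(0)))
```

With only two inputs the Gaussian approximation is of course crude; the argument becomes accurate for neurons with many inputs, which is the typical case in DNNs.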

The second contribution is concerned with weight sharing in Bayesian DNNs to reduce the memory footprint of storing a large ensemble of DNNs. The weight sharing is obtained by introducing a Dirichlet process prior on top of the weight prior. A sampling-based inference scheme is presented that alternates between sampling assignments of weights to connections and sampling the weights themselves. Several algorithmic techniques are presented that overcome the computational challenges and make the algorithm tractable.
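To see why a Dirichlet process prior induces weight sharing, it helps to draw from the prior alone. The sketch below uses the Chinese restaurant process view of the Dirichlet process and a standard normal base distribution; both the base distribution and the function names are assumptions for illustration. It covers only the prior, not the posterior sampling scheme of the thesis, which additionally conditions on the training data.

```python
import random

def crp_assignment_probs(counts, alpha):
    """Prior probabilities of joining each existing shared weight.

    The last entry is the probability of creating a new shared weight.
    """
    total = sum(counts) + alpha
    return [c / total for c in counts] + [alpha / total]

def sample_crp_assignments(num_connections, alpha, rng):
    """Sequentially assign connections to shared weights under the CRP prior."""
    counts = []        # number of connections using each shared weight
    assignments = []   # index of the shared weight used by each connection
    for _ in range(num_connections):
        probs = crp_assignment_probs(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if k == len(counts):   # a new shared weight is opened
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments, counts

if __name__ == "__main__":
    rng = random.Random(0)
    assignments, counts = sample_crp_assignments(1000, alpha=2.0, rng=rng)
    # Shared weight values drawn from the base distribution (assumed N(0, 1)).
    shared = [rng.gauss(0.0, 1.0) for _ in counts]
    weights = [shared[k] for k in assignments]
    print("connections:", len(weights), "distinct weights:", len(shared))
```

Only the table of distinct values and one index per connection need to be stored, which is where the memory savings come from.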

The third contribution complements our discussion with an outlook on how methods for improving the resource efficiency of DNNs can be transferred to other model classes. In particular, we present a structure learning algorithm and a quantization approach for Bayesian network classifiers. The presented structure learning approach is closely related to differentiable neural architecture search for DNNs. The method learns a distribution over graph structures using continuous optimization techniques and subsequently selects the most probable structure from that distribution. By introducing a model size penalty to the objective, the method can be used to effectively trade off model size against accuracy. The presented quantization approach relies on quantization-aware training using the straight-through gradient estimator. Quantization-aware training is currently the most widely used technique for weight and activation quantization in DNNs, as it allows for effective quantization with minimal changes to existing training pipelines. In extensive experiments, we contrast quantized small-scale DNNs and Bayesian network classifiers and show that both model classes offer benefits in different regimes of computational efficiency and accuracy.
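Quantization-aware training with the straight-through gradient estimator can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the setup of the thesis: a single weight, a uniform quantization grid with step 0.25 on [-1, 1], a squared error loss, and hypothetical function names. The forward pass uses the quantized weight, while the backward pass treats the quantizer as the identity, so the gradient updates a latent real-valued weight.

```python
def quantize(w, step=0.25, lo=-1.0, hi=1.0):
    """Clip w to [lo, hi] and round it to the nearest multiple of step."""
    w = min(max(w, lo), hi)
    return round(w / step) * step

def train_ste(data, lr=0.05, epochs=200, step=0.25):
    """Fit y = w * x with a quantized weight using the straight-through estimator."""
    w = 0.0  # latent real-valued weight that is kept during training
    for _ in range(epochs):
        for x, y in data:
            wq = quantize(w, step)       # forward pass uses the quantized weight
            err = wq * x - y
            grad_wq = 2.0 * err * x      # gradient of (wq * x - y)^2 w.r.t. wq
            # Straight-through: d(wq)/d(w) is treated as 1, although the true
            # derivative of the rounding function is zero almost everywhere.
            w -= lr * grad_wq
            w = min(max(w, -1.0), 1.0)   # keep the latent weight in range
    return w, quantize(w, step)

if __name__ == "__main__":
    # Toy data generated by y = 0.5 * x; 0.5 lies on the quantization grid.
    data = [(1.0, 0.5), (-1.0, -0.5), (2.0, 1.0)]
    latent, quantized = train_ste(data)
    print("latent weight:", latent, "quantized weight:", quantized)
```

After training, only the quantized weight is deployed; the latent weight exists solely to accumulate small gradient updates that rounding would otherwise discard.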

The full text of the thesis can be downloaded here.