The growing demand for both large-scale machine learning applications and AI models on embedded devices creates a need to miniaturize networks. A popular general technique to this end is to discretize weights and activation functions, reducing memory and computational demands. While many such methods have been developed, they often rely on heuristic gradients or post-training quantization. Probabilistic methods based on variational inference, in contrast, allow discrete networks to be trained directly without relying on heuristic gradients. However, this approach remains poorly studied for discrete recurrent neural networks. In this thesis, we analyze several probabilistic training algorithms that have previously been studied on feed-forward and convolutional networks, and show that the reparametrization trick and the local reparametrization trick can be used effectively to train discrete LSTM networks. We also explain and demonstrate that the local reparametrization trick is biased when applied to shared weights. Finally, we analyze the impact of binarizing individual gates of the LSTM cell, and demonstrate that binarizing the candidate gate can even lead to a performance gain. When comparing the performance of our discrete LSTMs to continuous LSTMs, we obtained mixed results. On the MNIST pen stroke dataset, our discrete LSTMs achieved an accuracy of 97.2–97.5%, similar to the continuous baseline. On the Australian Sign Language dataset, the discrete LSTMs substantially outperformed the continuous ones: using ternary weights and binarized candidate and output gates, we achieved an accuracy of 98.44%, more than 5% higher than with the continuous LSTMs. Only on the TIMIT dataset did the discrete LSTMs perform worse, falling more than 5% short of the continuous LSTM in accuracy. We thus show that discretizing LSTMs is an effective option for making them more computationally efficient without, in general, incurring a performance loss.
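
To make the shared-weight issue concrete, the following is a minimal sketch of the local reparametrization trick for a single linear pre-activation, assuming a fully factorized weight posterior with means $\mu_{ij}$ and variances $\sigma_{ij}^2$ (for discrete weight distributions these moments would come from the discrete posterior together with a central-limit approximation; this setup is an illustrative assumption, not necessarily the thesis's exact formulation):

% Illustrative sketch only: factorized posterior over w_{ij}, Gaussian/CLT
% approximation of the pre-activation, one noise variable per output unit.
\[
  b_j \;=\; \sum_i a_i w_{ij}
  \;\approx\; \underbrace{\sum_i a_i \mu_{ij}}_{\gamma_j}
  \;+\; \sqrt{\underbrace{\textstyle\sum_i a_i^2 \sigma_{ij}^2}_{\delta_j}} \, \epsilon_j,
  \qquad \epsilon_j \sim \mathcal{N}(0, 1).
\]

Drawing a fresh $\epsilon_j^{(t)}$ at every timestep of an LSTM implicitly resamples the shared weights at each step, discarding the correlations that arise from reusing a single weight draw across all timesteps; this is the source of the bias discussed above.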