Signal Processing and Speech Communication Laboratory

## Result of the Month

September 2022: Active Bayesian Causal Inference (Christian Toth)

Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference—other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.

August 2022: Location-based Initial Access for Wireless Power Transfer with Physically Large Arrays (Benjamin Deutschmann)

Within the REINDEER H2020 project, we investigate the potential of using physically large or distributed antenna arrays to transmit power wirelessly to batteryless energy-neutral (EN) devices. An enabling milestone to make the technology feasible is solving the initial-access problem, i.e., waking up an EN device with unknown channel state information (CSI).

July 2022: How prosody affects ASR performance in conversational Austrian German (Saskia Wepner)

The performance of Automatic Speech Recognition (ASR) systems varies with the speaking style of the data that is to be recognised. Whereas read speech, voice commands and also broadcast news are nowadays well recognised by standard ASR systems, conversational speech remains challenging for multiple reasons.

June 2022: Conversational Speech Recognition Needs Data? Experiments with Austrian German (Julian Linke)

Left: Histogram showing conversation-dependent WERs of low-resource (LR) and data-driven (XLSR) 4-gram models. Right: Histogram showing speaker-dependent WERs of low-resource (LR) and data-driven (XLSR) 4-gram models.

April 2022: Car Occupancy Detection Using Ultra-Wideband Radar (Jakob Möderl)

We show that the UWB nodes of the keyless-access system of a car can be used as radar sensors to detect if the car is occupied.

March 2022: Synwalk: community detection via random walk modelling (Christian Toth)

Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among other things, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, and compare Synwalk’s performance with the performance of Infomap and Walktrap (also random walk-based), Louvain (based on modularity maximization) and stochastic block model inference. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has the potential to inspire further development of community detection via synthesis of random walks, and we provide concrete ideas for future research.
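The "random walk induced by the given network" can be made concrete with a minimal sketch. The toy graph and all variable names below are hypothetical and not taken from the Synwalk paper. The code only builds the walk's transition probabilities and checks its stationary distribution; it does not implement Synwalk's synthesis step.

```python
# Hypothetical undirected toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6

# Adjacency lists of the undirected graph.
neighbors = [[] for _ in range(n)]
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

# Transition probabilities of the induced random walk:
# from node i, move to each neighbor with probability 1 / degree(i).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in neighbors[i]:
        P[i][j] = 1.0 / len(neighbors[i])

# For a connected undirected network, the stationary distribution
# is proportional to the node degrees.
total_degree = 2 * len(edges)
pi = [len(neighbors[i]) / total_degree for i in range(n)]

# One step of the walk leaves the stationary distribution unchanged.
pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
```

A walk started inside one triangle tends to stay there for a while before crossing the bridging edge, which is the intuition that random walk-based community detection methods build on.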

February 2022: On Efficient Uncertainty Estimation for Resource-Constrained Mobile Applications (Johanna Rock)

Uncertainty estimation and out-of-distribution robustness are vital aspects of modern deep learning. Predictive uncertainty supplements model predictions and enables improved functionality of downstream tasks including various resource-constrained embedded and mobile applications. Popular examples are virtual reality (VR), augmented reality (AR), sensor fusion, perception, and health applications including fitness indicators, arrhythmia detection, and skin lesion detection. Robust and reliable predictions with uncertainty estimates are increasingly important when operating on noisy in-the-wild data from sensory inputs. A large variety of deep learning architectures have been applied to various tasks with great success in terms of prediction quality; however, producing reliable and robust uncertainty estimates without additional computational and memory overhead remains a challenge. This issue is further aggravated by the limited computational and memory budget available in typical battery-powered mobile devices.

January 2022: Multipath-based Localization and Tracking considering Off-Body Channel Effects (Thomas Wilding, Erik Leitinger)

In this work we consider multipath-based positioning and tracking in off-body channels. We analyse the effects introduced by the human body, and the resulting effects that are of interest in positioning and tracking, based on channel measurements obtained in an indoor scenario. As the signal bandwidth is known to affect the achievable positioning accuracy, we investigate the bandwidth dependency of the field of view (FOV) induced by the human body via shadowing and of the number of multipath components (MPCs) detected and estimated by a deterministic maximum likelihood (ML) algorithm. A multipath-based positioning and tracking algorithm is proposed that associates estimated MPC parameters with floor plan features and exploits a human body-dependent FOV function. The proposed algorithm is able to provide accurate position estimates even for an off-body radio channel in a multipath-prone environment, with the signal bandwidth found to be a limiting factor.

December 2021: An Adaptive Algorithm for Joint Cooperative Localization and Orientation Estimation using Belief Propagation (Lukas Wielandner, Erik Leitinger)

In cooperative localization applications, the parameters of the measurement model are often assumed to be known even though they can depend strongly on the environment. This assumption can lead to a reduced localization accuracy due to parameter mismatch. In this paper, we propose an adaptive factor-graph-based algorithm for joint cooperative localization and orientation estimation which inherently estimates all unknown model parameters as well as the measurement uncertainty. We use RSS radio measurements and account for the directivity of the antennas with a parametric antenna pattern. We validate our proposed methods with simulations in a static scenario and show that there is only a small loss in positioning accuracy compared to the case of known model parameters and measurement noise.

November 2021: Multipath-based SLAM using Belief Propagation with Interacting Multiple Dynamic Models (Erik Leitinger)

In this work, we present a Bayesian multipath-based simultaneous localization and mapping (SLAM) algorithm that continuously adapts interacting multiple models (IMM) parameters to describe the mobile agent state dynamics. The time-evolution of the IMM parameters is described by a Markov chain and the parameters are incorporated into the factor graph structure that represents the statistical structure of the SLAM problem. The proposed belief propagation (BP)-based algorithm adapts, in an online manner, to time-varying system models by jointly inferring the model parameters along with the agent and map feature states. The performance of the proposed algorithm is finally evaluated in a simulated scenario. Our numerical simulation results show that the proposed multipath-based SLAM algorithm is able to cope with strongly changing agent state dynamics.

October 2021: Developing an Annotation System for Communicative Functions for a Cross-Layer ASR System (Barbara Schuppler, Anneliese Kelterer)

The investigation of conversational speech requires the close collaboration of linguists and speech technologists to develop new modeling techniques that allow the incorporation of various knowledge sources. This paper presents a progress report on the ongoing interdisciplinary project “Cross-layer language models for conversational speech” with a focus on the development of an annotation system for communicative functions. We discuss the requirements of such a system for the application in ASR as well as for the use in phonetic studies of talk-in-interaction, and illustrate emerging issues with the example of turn management.

August 2021: Deep Convolutional Neural Networks for Massive MIMO Fingerprint-Based Positioning (Erik Leitinger)

This work provides an initial investigation on the application of convolutional neural networks (CNNs) for fingerprint-based positioning using measured massive MIMO channels. When represented in appropriate domains, massive MIMO channels have a sparse structure which can be efficiently learned by CNNs for positioning purposes. We evaluate the positioning accuracy of state-of-the-art CNNs with channel fingerprints generated from a channel model with a rich clustered structure: the COST 2100 channel model. We find that moderately deep CNNs can achieve fractional-wavelength positioning accuracies, provided that a sufficiently representative data set is available for training.

July 2021: A Message Passing based Adaptive PDA Algorithm for Robust Radio-based Localization and Tracking (Alexander Venus, Erik Leitinger)

We present a message passing algorithm for localization and tracking in multipath-prone environments that implicitly considers obstructed line-of-sight situations. The proposed adaptive probabilistic data association algorithm infers the position of a mobile agent using multiple anchors by utilizing delay and amplitude of the multipath components (MPCs) as well as their respective uncertainties. By employing a nonuniform clutter model, we enable the algorithm to utilize the position information contained in the MPCs to support the estimation of the agent position without exact knowledge about the environment geometry. Our algorithm adapts in an online manner to both the time-varying signal-to-noise ratio and the line-of-sight (LOS) existence probability of each anchor. In a numerical analysis we show that the algorithm is able to operate reliably in environments characterized by strong multipath propagation, even if a temporary obstruction of all anchors occurs simultaneously.

June 2021: Complex-valued Convolutional Neural Networks for Enhanced Radar Signal Denoising and Interference Mitigation (Alexander Fuchs, Johanna Rock)

Autonomous driving depends heavily on capable sensors to perceive the environment and to deliver reliable information to the vehicles’ control systems. To increase its robustness, a diversified set of sensors is used, including radar sensors. Radar is a vital source of sensory information, providing high-resolution range as well as velocity measurements. The increased use of radar sensors in road traffic introduces new challenges. As the so far unregulated frequency band becomes increasingly crowded, radar sensors suffer from mutual interference. This interference must be mitigated in order to ensure a high and consistent detection sensitivity.

May 2021: Resource-Efficient Deep Neural Networks for Automotive Radar Interference Mitigation (Johanna Rock)

Radar sensors are crucial for environment perception of driver assistance systems as well as autonomous vehicles. Key performance factors are weather resistance and the possibility to directly measure velocity. With a rising number of radar sensors and the so-far unregulated automotive radar frequency band, mutual interference is inevitable and must be dealt with. Algorithms and models operating on radar data in early processing stages are required to run directly on specialized hardware, i.e. the radar sensor. This specialized hardware typically has strict resource constraints, i.e. a low memory capacity and low computational power.

April 2021: Data Fusion for Multipath-Based SLAM (Erik Leitinger)

Multipath-based simultaneous localization and mapping (SLAM) algorithms can detect and localize radio reflective surfaces and jointly estimate the time-varying position of mobile agents. A promising approach is to represent radio reflective surfaces by so-called virtual anchors (VAs). In existing multipath-based SLAM algorithms, VAs are modeled and inferred for each physical anchor (PA) and each propagation path individually, even if multiple VAs represent the same physical surface. This limits timeliness and accuracy of mapping and agent localization. In this paper, we introduce an improved statistical model and estimation method that enables data fusion for multipath-based SLAM by representing each surface with a single master virtual anchor (MVA). Our numerical simulation results show that the proposed multipath-based SLAM algorithm can significantly increase map convergence speed and reduce the mapping error compared to a state-of-the-art method.

March 2021: Detection and Estimation of a Spectral Line in MIMO Systems (Erik Leitinger)

We consider the problem of detecting and estimating radio channel dispersion parameters of a single specular multipath component (SMC) embedded in spatially correlated noise from observations collected using a MIMO measurement setup. The corresponding detection threshold versus the false alarm probability is derived from a $\chi^2$ random field with two degrees of freedom defined over a 5-dimensional dispersion space using the theoretical framework of the expected Euler characteristic of random excursion sets. We show that the probability of false alarm is in excellent accordance with the relative frequency of false alarms obtained using a maximum likelihood estimator and detector for acquiring the 5-dimensional dispersion parameter vector of the SMC.

February 2021: Detection and Tracking of Multipath Channel Parameters Using Belief Propagation (Erik Leitinger)

In this work we present a belief propagation (BP) algorithm with probabilistic data association (DA) for detection and tracking of specular multipath components (MPCs). In real dynamic measurement scenarios, the number of MPCs reflected from visible geometric features, the MPC dispersion parameters, and the number of false alarm contributions are unknown and time-varying. We develop a Bayesian model for the specular MPC detection and joint estimation problem, and represent it by a factor graph which enables the use of BP for efficient computation of the marginal posterior distributions. A parametric channel estimator is exploited to estimate at each time step a set of MPC parameters, which are further used as noisy measurements by the BP-based algorithm. The algorithm performs probabilistic DA and joint estimation of the time-varying MPC parameters and mean false alarm rate. Preliminary results using synthetic channel measurements demonstrate the excellent performance of the proposed algorithm in a realistic and very challenging scenario. Furthermore, it is demonstrated that the algorithm is able to cope with a high number of false alarms originating from the prior estimation stage.

January 2021: Differentiable TAN Structure Learning for Bayesian Network Classifiers (Wolfgang Roth)

December 2020: Quantized Neural Networks for Radar Interference Mitigation (Johanna Rock)

Radar sensors are crucial for environment perception of driver assistance systems as well as autonomous vehicles. Key performance factors are weather resistance and the possibility to directly measure velocity. With a rising number of radar sensors and the so-far unregulated automotive radar frequency band, mutual interference is inevitable and must be dealt with. Algorithms and models operating on radar data in early processing stages are required to run directly on specialized hardware, i.e. the radar sensor. This specialized hardware typically has strict resource constraints, i.e. a low memory capacity and low computational power.

November 2020: Towards building a cross-lingual speech recognition system for Slovenian and Austrian German (Barbara Schuppler)

Methods of cross-lingual speech recognition have a high potential to overcome limitations on resources of spoken language in under-resourced languages. Not only can they be applied to build automatic speech recognition (ASR) systems for such languages, they can also be utilized to generate further resources of spoken language. This paper presents a cross-lingual ASR system based on data from two languages, Slovenian and Austrian German. Both were used as a source and target language for cross-lingual transfer (i.e., the acoustic models were trained on material from the source language, and recognition was tested on material from the target language). The cross-lingual mapping between the Slovenian phone set (40 phones) and the Austrian German phone set (33 phones) was carried out using expert knowledge about the acoustic-phonetic properties of the phones. For the experiments, we used data from two speech corpora: the Slovenian BNSI Broadcast News speech database and the Austrian German GRASS corpus. We trained HMM and DNN acoustic models for monolingual and cross-lingual speech recognition. Evaluating the results (Tables 1 and 2), it became clear that the DNN acoustic models outperformed the HMM models. The speech recognition results (Table 2) for Austrian German as the target language clearly outperformed those with Slovenian as the target language. Possible explanations for this difference in performance are: 1) the higher number of phones in the Slovenian language, 2) the speaking style discrepancies of the databases (i.e., a mix of read and spontaneous speech in the Slovenian data vs. read speech only in the Austrian data), and 3) the recording quality mismatch (i.e., GRASS is recorded under better conditions than BNSI). The full version of the paper can be found on The Phonetician.

October 2020: Modeling Human Body Influence in UWB Channels (Thomas Wilding)

In this paper we describe a simple and intuitive model for the effects of the human body of a user close to a receiver. We specifically investigate the UWB channel in off-body conditions, where the agent device is located directly on the human body, and another device functioning as anchor is located in the environment. Due to the high time resolution of UWB signals, the effect of the human body can be modeled by means of an extended object producing multiple scattered paths.

September 2020: Reliability and Threshold-Region Performance of TOA Estimators in Dense Multipath Channels (Alexander Venus)

In this work, we investigate the reliability of time-of-arrival (TOA) based ranging using maximum-likelihood (ML) estimation in a dense multipath (DM) channel in terms of both the conventional mean squared error (MSE) and confidence bounds. We show that in the presence of DM the ML estimator distorts the signal mainlobe due to its whitening property, resulting in a bandwidth (BW) dependent bias, even before the outlier-driven threshold region is reached.

August 2020: Information-Criterion-Based Agent Selection for Cooperative Localization in Static Networks (Lukas Wielandner)

In this work, we propose a Bayesian agent network planning algorithm for information-criterion-based measurement selection for cooperative localization in static networks with anchors. This makes it possible to increase the accuracy of the agent positioning while keeping the number of measurements between agents to a minimum. The proposed algorithm is based on minimizing the conditional differential entropy (CDE) of all agent states to determine the optimal set of measurements between agents. Such combinatorial optimization problems have factorial runtime and quickly become infeasible, even for a rather small number of agents. Therefore, we propose a Bayesian agent network planning algorithm that performs a local optimization for each state. Experimental results demonstrate a performance improvement compared to a random measurement selection strategy, significantly reducing the position RMSE at a smaller number of measurements between agents.
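To give a feeling for why exhaustive measurement selection becomes infeasible, the following sketch counts candidate measurement sets. It assumes, purely for illustration, that every pair of agents could measure each other and that a fixed number of these links is to be selected; the function name and this counting model are hypothetical and not taken from the paper.

```python
import math

def num_candidate_sets(num_agents, num_selected):
    """Number of ways to pick `num_selected` of all pairwise agent-to-agent links.

    Illustrative assumption: every pair of agents is a possible link.
    """
    num_links = num_agents * (num_agents - 1) // 2
    return math.comb(num_links, num_selected)

# Growth of the search space when selecting as many links as there are agents.
for num_agents in (4, 8, 16):
    print(num_agents, num_candidate_sets(num_agents, num_agents))
```

Each candidate set would require its own entropy evaluation, which is why a local optimization per agent state is attractive.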

July 2020: CNNs for Interference Mitigation and Denoising in Automotive Radar Using Real-World Data (Johanna Rock)

June 2020: Computational Lung Sound Analysis using Deep Learning (Franz Pernkopf)

Computational methods for the analysis of lung sounds are beneficial for computer-supported diagnosis, digital storage and monitoring in critical care. Pathological changes of the lung are tightly connected to characteristic sounds enabling a fast and inexpensive diagnosis. Traditional auscultation with a stethoscope has several disadvantages: it is subjective, i.e. the lung sounds are evaluated depending on the experience of the physician; it cannot provide continuous monitoring; and it requires a trained expert. Furthermore, the characteristics of the sounds are in the low frequency range, where human hearing has limited sensitivity and is susceptible to noise artifacts.

May 2020: Phonation type contrasts and tone in Chichimec (Barbara Schuppler)

Chichimec (Otomanguean) has two tones, high and low, and a phonological three-way phonation contrast: modal /V/, breathy /V̤/ and creaky /V̰/. Tone and phonation type contrasts are used independently. This paper investigates the acoustic realization of modal, breathy and creaky vowels, the timing of phonation in non-modal vowels, and the production of tone in combination with different phonation types. The results of Cepstral Peak Prominence and three spectral tilt measures showed that phonation type contrasts are not distinguished by the same acoustic measures for women and men. In line with expectations for laryngeally complex languages, phonetic modal and non-modal phonation are sequenced in phonological breathy and creaky vowels. With respect to the timing pattern, however, the results show that non-modal phonation is not, as previously reported, mainly located in the middle of the vowel. Non-modal phonation is instead predominantly realized in the second half of phonological breathy and creaky vowels. Tone is distinguished in all three phonation types, and non-modal vowels do not exhibit distinct F0 ranges, except for creaky vowels produced by women, in which F0 declines in the creaky portion. The results of the acoustic analysis provide additional insights to phonological accounts of laryngeal complexity in Chichimec.

April 2020: Learning a Behavior Model of Hybrid Systems Through Combining Model-Based Testing and Machine Learning (Wolfgang Roth)

Models play an essential role in the design process of cyber-physical systems. They form the basis for simulation and analysis and help in identifying design problems as early as possible. However, the construction of models that comprise physical and digital behavior is challenging. Consequently, there is considerable interest in learning the behavior of such systems using machine learning. However, the performance of the machine learning techniques depends crucially on sufficient and representative training data covering the behavior of the system adequately not only in standard situations, but also in edge cases that are often particularly important.

March 2020: Deep Structured Mixtures of Gaussian Processes (Martin Trapp)

Gaussian Processes (GPs) are powerful non-parametric Bayesian regression models that allow exact posterior inference, but exhibit high computational and memory costs. In order to improve scalability of GPs, approximate posterior inference is frequently employed, where a prominent class of approximation techniques is based on local GP experts. However, the local-expert techniques proposed so far are either not well-principled, come with limited approximation guarantees, or lead to intractable models.

February 2020: Acoustic Scene Classification Using Deep Mixtures Of Pre-trained Convolutional Neural Networks (Thi Kim Truc Nguyen)

We propose a heterogeneous system of Deep Mixture of Experts (DMoEs) models using different Convolutional Neural Networks (CNNs) for acoustic scene classification (ASC). Each DMoEs module is a mixture of different parallel CNN structures weighted by a gating network. All CNNs use the same input data. The CNN architectures play the role of experts extracting a variety of features. The experts are pre-trained and kept fixed (frozen) for the DMoEs model. The DMoEs is post-trained by optimizing the weights of the gating network, which estimates the contribution of the experts in the mixture. In order to enhance the performance, we use an ensemble of three DMoEs modules, each with different pairs of inputs and individual CNN models. The input pairs are spectrogram combinations of binaural audio and mono audio as well as their pre-processed variations using harmonic-percussive source separation (HPSS) and nearest neighbor filters (NNFs). The proposed system achieves a classification result of 72.1%, improving on the baseline by around 12% (absolute) on the development data of DCASE 2018 challenge task 1A.
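The weighting of frozen experts by a gating network can be sketched as follows. This is a minimal illustration under stated assumptions: the gate is taken to be a softmax over one score per expert, and the expert outputs, gate scores and function names are hypothetical rather than the paper's actual architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_predict(expert_probs, gate_scores):
    """Combine the class probabilities of frozen experts with gating weights."""
    gate = softmax(gate_scores)  # assumed gating form: one softmax weight per expert
    n_classes = len(expert_probs[0])
    return [
        sum(g * probs[c] for g, probs in zip(gate, expert_probs))
        for c in range(n_classes)
    ]

# Hypothetical outputs of three frozen experts for one audio clip and three scene classes.
expert_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
gate_scores = [2.0, 1.0, -1.0]  # hypothetical gating-network outputs for this clip

mixed = mixture_predict(expert_probs, gate_scores)
predicted_class = max(range(len(mixed)), key=mixed.__getitem__)
```

Because the experts are frozen, only the parameters that produce `gate_scores` would be updated during post-training in such a setup.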

January 2020: Recurrent Dilated DenseNets for a Time-Series Segmentation Task (Alexander Fuchs)

Efficient real-time segmentation and classification of time-series data is key in many applications, including sound and measurement analysis. We propose an efficient convolutional recurrent neural network (CRNN) architecture that is able to deliver improved segmentation performance at lower computational cost than plain RNN methods. We develop a CNN architecture, using dilated DenseNet-like kernels and implement it within the proposed CRNN architecture. For the task of online wafer-edge measurement analysis, we compare our proposed methods to standard RNN methods, such as Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs). We focus on small models with a low computational complexity, in order to run our model on an embedded device. We show that frame-based methods generally perform better than RNNs in our segmentation task and that our proposed recurrent dilated DenseNet achieves a substantial improvement of over 1.1 % F1-score compared to other frame-based methods.

December 2019: Bayesian Learning of Sum-Product Networks (Martin Trapp)

Sum-product networks (SPNs) are flexible density estimators and have received significant attention, due to their attractive inference properties. While parameter learning in SPNs is well developed, structure learning leaves something to be desired: Even though there is a plethora of SPN structure learners, most of them are somewhat ad-hoc, and based on intuition rather than a clear learning principle.

November 2019: Prosodic effects on plosive duration in German and Austrian German (Barbara Schuppler)

This study investigates the acoustic cues used to mark prosodic boundaries in two varieties of German, with a specific focus on variations in production of fortis and lenis plosives. Based on prosodic-boundary-adjacent and non-boundary-adjacent plosives from GRASS (Austrian German) and the Kiel Corpus of Read Speech (Northern German), we found that closure and burst duration features, as well as duration of a preceding adjacent segment, vary consistently in relationship to the presence or absence of a prosodic boundary, but that the relative weights of these features differ in the two varieties studied. Whereas stress marking in plosives is being driven more by burst duration in the Kiel Corpus data, it is driven more by closure duration in the GRASS data. This study suggests that boundary detection tools require variety-specific training materials, or else information from comparative studies such as the current work, in order to attain optimal function in specific varieties or dialects.

October 2019: Training Discrete-Valued Neural Networks with Sign Activations Using Weight Distributions (Wolfgang Roth)

Since resource-constrained devices hardly benefit from the trend towards ever-increasing neural network (NN) structures, there is growing interest in designing more hardware-friendly NNs. In this paper, we consider the training of NNs with discrete-valued weights and sign activation functions that can be implemented more efficiently in terms of inference speed, memory requirements, and power consumption. We build on the framework of probabilistic forward propagations using the local reparameterization trick, where instead of training a single set of NN weights we rather train a distribution over these weights. Using this approach, we can perform gradient-based learning by optimizing the continuous distribution parameters over discrete weights while at the same time performing backpropagation through the sign activation. In our experiments, we investigate the influence of the number of weights on the classification performance on several benchmark datasets, and we show that our method achieves state-of-the-art performance.
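The idea of propagating a weight distribution through a neuron with a sign activation can be sketched for a single neuron. This is an illustrative sketch, not the paper's implementation: it assumes ternary weights in {-1, 0, +1}, independent weights, and a Gaussian (central-limit) approximation of the pre-activation. All names are hypothetical.

```python
import math

VALUES = (-1.0, 0.0, 1.0)  # assumed ternary weight alphabet

def weight_moments(probs):
    """Mean and variance of one discrete weight with probabilities over VALUES."""
    mean = sum(p * v for p, v in zip(probs, VALUES))
    second = sum(p * v * v for p, v in zip(probs, VALUES))
    return mean, second - mean * mean

def sign_activation_prob(x, weight_probs):
    """Approximate P(sign(w . x) = +1) for independent discrete weights.

    The pre-activation is approximated as Gaussian with matched mean and variance.
    """
    mean = 0.0
    var = 0.0
    for xi, probs in zip(x, weight_probs):
        m, v = weight_moments(probs)
        mean += xi * m
        var += xi * xi * v
    if var == 0.0:
        # Deterministic pre-activation; sign(0) is counted as -1 here (assumption).
        return 1.0 if mean > 0 else 0.0
    # Gaussian CDF evaluated at mean / std gives the probability of a positive sign.
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))

# Hypothetical input and weight distributions for one neuron.
x = [1.0, -2.0, 0.5]
weight_probs = [
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [1 / 3, 1 / 3, 1 / 3],
]
p_plus = sign_activation_prob(x, weight_probs)
```

The returned probability is a smooth function of the weight probabilities, which is what makes gradient-based optimization of the distribution parameters possible despite the discrete weights and the non-differentiable sign function.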

September 2019: Complex Signal Denoising and Interference Mitigation for Automotive Radar Using Convolutional Neural Networks (Johanna Rock)

Automotive radar is used to perceive the vehicle’s environment due to its capability to measure distance, velocity and angle of surrounding objects with a high resolution. With an increasing number of deployed radar sensors on the streets and because of missing regulations of the automotive radar frequency band, mutual interference must be dealt with in order to retain a sensitive detection capability.

August 2019: Belief Propagation Accurate Marginals or Accurate Partition Function (Christian Knoll)

The marginals and the partition function can be estimated in a straightforward manner for tree-structured models but require efficient approximation methods if the graphical model contains loops. One such method is Belief Propagation (BP), which exploits the structure of probabilistic graphical models in order to approximate the marginal distribution and the partition function.
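For the tree-structured case, where BP is exact, message passing can be illustrated on a small chain. The potentials below are hypothetical, and the sketch covers only the loop-free setting; it compares the BP result with brute-force enumeration.

```python
import itertools

# Hypothetical pairwise model on a 3-node chain x0 - x1 - x2 with binary states.
unary = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
pair = [[2.0, 1.0], [1.0, 2.0]]  # same symmetric potential on both edges

def chain_bp(unary, pair):
    """Sum-product message passing on a chain; exact because a chain is a tree."""
    n = len(unary)
    # fwd[i][s]: message arriving at node i from the left, for state s of node i.
    fwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(1, n):
        fwd[i] = [
            sum(fwd[i - 1][a] * unary[i - 1][a] * pair[a][b] for a in range(2))
            for b in range(2)
        ]
    # bwd[i][s]: message arriving at node i from the right, for state s of node i.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [
            sum(bwd[i + 1][b] * unary[i + 1][b] * pair[a][b] for b in range(2))
            for a in range(2)
        ]
    beliefs = [
        [fwd[i][s] * unary[i][s] * bwd[i][s] for s in range(2)] for i in range(n)
    ]
    Z = sum(beliefs[0])  # unnormalized beliefs at any node sum to the partition function
    marginals = [[b / Z for b in bel] for bel in beliefs]
    return marginals, Z

def brute_force(unary, pair):
    """Reference computation by enumerating all joint states."""
    n = len(unary)
    Z = 0.0
    marg = [[0.0, 0.0] for _ in range(n)]
    for x in itertools.product(range(2), repeat=n):
        w = 1.0
        for i in range(n):
            w *= unary[i][x[i]]
        for i in range(n - 1):
            w *= pair[x[i]][x[i + 1]]
        Z += w
        for i in range(n):
            marg[i][x[i]] += w
    return [[m / Z for m in row] for row in marg], Z

bp_marginals, bp_Z = chain_bp(unary, pair)
bf_marginals, bf_Z = brute_force(unary, pair)
```

On graphs with loops, the same message updates are iterated until convergence and yield only approximations of both quantities.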

July 2019: Multipath-based SLAM Exploiting AoA and Amplitude Information (Erik Leitinger)

Simultaneous localization and mapping (SLAM) is important in many fields including robotics, autonomous driving, location-aware communication, and robust indoor localization. Specifically, robustness, i.e. achieving a low probability of localization outage, is still a challenging task in environments with strong multipath propagation. Therefore, new systems supporting multipath channels take advantage of multipath propagation by exploiting multipath components (MPCs) for localization [5], [6], [10], exploit cooperation among agents, and/or employ signal processing that is robust against multipath propagation and clutter measurements in general.

June 2019: Analytical Investigation of Non-Coherent Mutual FMCW Radar Interference (Mate Toth)

Radar sensors are increasingly utilized in today’s cars. This inevitably leads to increased mutual sensor interference and thus a performance decrease, potentially resulting in major safety risks. Understanding signal impairments caused by interference accurately helps to devise signal processing schemes to combat said performance degradation. For the FMCW radars prevalent in automotive applications, it has been shown that so-called non-coherent interference occurs frequently and results in an increase of the noise floor. In this work we investigate the impact of interference analytically by focusing on its detailed description. We show, among other things, that the spectrum of the typical interference signal has a linear phase and a magnitude that fluctuates strongly with the phase parameters of the time domain interference signal. Analytical results are verified by simulation, highlighting the dependence on the specific phase terms that cause strong deviations from spectral whiteness.

May 2019: Bayesian Neural Networks with Weight Sharing Using Dirichlet Processes (Wolfgang Roth)

We extend feed-forward neural networks with a Dirichlet process prior over the weight distribution. This enforces a sharing on the network weights, which can reduce the overall number of parameters drastically. We alternately sample from the posterior of the weights and the posterior of assignments of network connections to the weights. This results in a weight sharing that is adapted to the given data. In order to make the procedure feasible, we present several techniques to reduce the computational burden. Experiments show that our approach mostly outperforms models with random weight sharing. Our model is capable of reducing the memory footprint substantially while maintaining a good performance compared to neural networks without weight sharing.

April 2019: On the use of acoustic features for automatic disambiguation of homophones in spontaneous German (Barbara Schuppler)

Homophones pose serious issues for automatic speech recognition (ASR) as they have the same pronunciation but different meanings or spellings. Homophone disambiguation is usually done within a stochastic language model or by an analysis of the homophonous word’s context. Whereas this method reaches good results in read speech, it fails in conversational, spontaneous speech, where utterances are often short, contain disfluencies and/or are syntactically incomplete. Phonetic studies have shown that words that are homophonous in read speech often differ in their phonetic detail in spontaneous speech. Whereas humans use phonetic detail to disambiguate homophones, this linguistic information is usually not explicitly incorporated into ASR systems.

March 2019: Single-Anchor, Multipath-Assisted Indoor Positioning with Aliased Antenna Arrays (Thomas Wilding)

Highly accurate indoor positioning is still a hard problem due to interference caused by multipath propagation and the resulting high complexity of the infrastructure. We focus on the possibility of exploiting information contained in specular multipath components (SMCs) to increase the positioning accuracy of the system and to reduce the required infrastructure, using a priori information in the form of a floor plan. The system utilizes a single anchor equipped with array antennas and wideband signals to allow separating the SMCs. We derive a closed form of the Cramér-Rao lower bound for array-based multipath-assisted positioning and examine the beneficial effect of spatial aliasing of antenna arrays on the achievable angular resolution and, as a direct consequence, on the positioning accuracy. It is shown that ambiguities that arise due to the aliasing can be resolved by exploiting the information contained in SMCs. The theoretical results are validated by simulations.

February 2019: SALMA: UWB-based Single-Anchor Localization System using Multipath Assistance (Michael Rath)

Setting up indoor localization systems is often excessively time-consuming and labor-intensive, because of the large number of anchors to be carefully deployed or the burdensome collection of fingerprints. In this work, we present SALMA, a novel low-cost ultra-wideband-based indoor localization system that makes use of only one anchor with minimized calibration and training efforts.
The system leverages the gained insights of our previous works, exploiting multipath reflections of radio signals to enhance positioning performance. To this end, only a crude floor plan is needed which enables SALMA to accurately determine the position of a mobile tag using a single anchor, hence minimizing the infrastructure costs, as well as the setup time.
We implement SALMA on off-the-shelf UWB devices based on the Decawave DW1000 transceiver and show that, by making use of multiple directional antennas, SALMA can also resolve ambiguities due to overlapping multipath components.
An experimental evaluation in an office environment with clear line-of-sight (LOS) has shown that 90% of the position estimates obtained using SALMA exhibit less than 20 cm error, with a median below 8 cm. We further study the performance of SALMA in the presence of obstructed line-of-sight (OLOS) conditions, moving objects and furniture, as well as in highly dynamic environments with several people moving around, showing that the system can sustain decimeter-level accuracy with a worst-case average error below 34 cm.

January 2019: Overcoming Covariance Matrix Phase Sensitivity in Single-Channel Speech Enhancement with Correlated Spectral Components (Johannes Stahl)

Single-channel speech enhancement refers to the reduction of noise signal components in a single-channel signal composed of both speech and noise. A wide range of single-channel speech enhancement algorithms is formulated in the short-time Fourier transform (STFT). Traditional approaches assume statistical independence between signal components from different time-frequency regions, resulting in estimators that are functions of diagonal covariance matrices. More recent approaches drop this assumption and explicitly model dependencies between STFT bins. Full covariance matrices of speech and noise are required in this case to obtain optimal estimates of the clean speech spectrum, where off-diagonal entries are complex-valued in general. We show that the performance of estimators resulting from such models is highly sensitive to the phase estimation accuracy of these off-diagonal entries. Since it is non-trivial to estimate the covariance phases from noisy speech data, we propose a linear multidimensional short-time spectral amplitude estimator that circumvents the need to estimate them. We evaluate the speech enhancement performance of this approach and compare it to relevant benchmarks that also take into account inter-channel dependencies.

December 2018: Computer-aided Lung Sound Analysis for the Diagnosis of Idiopathic Pulmonary Fibrosis - A First Study (Elmar Messner)

Early diagnosis of idiopathic pulmonary fibrosis (IPF) is of increasing importance, due to recent success in slowing down the disease progression. Auscultation is a helpful means for early diagnosis of IPF. Auscultatory findings for IPF are fine (or velcro) crackles during inspiration, which are heard over affected areas.

November 2018: Rethinking Reduction: Interdisciplinary Perspectives on Conditions, Mechanisms, and Domains for Phonetic Variation (Barbara Schuppler)

One main goal of the recently finished FWF-funded project “Cross-layer pronunciation models for conversational speech” was to investigate interdisciplinary approaches towards studying pronunciation variation and to show how researchers in the fields of automatic speech recognition, psycholinguistics and phonetics/phonology can profit from integrating findings of the respective fields. Such new approaches, covering all mentioned disciplines, are presented in the book “Rethinking Reduction”. The book contains 11 peer-reviewed chapters, of which two are overview chapters written by the editors, and 9 contain original research. With “Reduction” we refer to acoustically reduced words. In natural conversations, for instance, a word like “yesterday” might be pronounced as yeshay, and a word like “haben” might be pronounced like ham. Phonetically reduced forms are extremely plentiful (e.g., 62% of word tokens in spontaneous Austrian German conversations are reduced), theoretically interesting (e.g., how do people learn to produce and understand the multiple reduced pronunciation variants existing per word?), and a key challenge for automatic speech recognition systems (e.g., new methods for acoustic and pronunciation modelling are needed). Despite the high frequency of reduced pronunciation variants, the canonical forms are still central to models of production and perception. Drawing from different fields and diverse languages, this volume brings new insights to the debate on abstractions and canonical forms in linguistics: their psychological reality, descriptive adequacy, and technical implementability.

October 2018: A Simple and Effective Framework for A Priori SNR Estimation (Johannes Stahl)

In this work, we address the problem of estimating the a priori SNR for single-channel speech enhancement. Similar to the decision-directed (DD) approach we linearly combine the maximum likelihood estimate of the current a priori SNR with an estimate obtained from the previous frame. Based on the harmonic model for voiced speech we propose to smooth the a priori SNR estimate along harmonic trajectories instead of fixed discrete Fourier transform frequency bins. We interpolate from fixed DFT frequencies to harmonic frequencies by using a pitch-adaptive zero-padding in the time domain. The resulting pitch-adaptive decision-directed (PADDi) method increases the noise attenuation compared to the classical decision-directed approach and outperforms benchmark methods in terms of speech enhancement performance for several noise types at different SNRs, quantified by objective evaluation criteria.
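The classical decision-directed recursion that this work builds on can be sketched as follows. This is a minimal illustration for a single, fixed frequency bin, assuming a known and constant noise power and a Wiener gain; the function name, the smoothing constant `alpha`, and the floor `xi_min` are illustrative choices, and the pitch-adaptive smoothing along harmonic trajectories described above is not included.

```python
def decision_directed_snr(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Classical decision-directed a priori SNR estimate for one frequency bin.

    noisy_power: sequence of |Y(l)|^2 over frames l
    noise_power: noise power (assumed known and constant in this sketch)
    """
    xi_track = []
    prev_clean_power = None
    for y2 in noisy_power:
        gamma = y2 / noise_power            # a posteriori SNR
        xi_ml = max(gamma - 1.0, 0.0)       # maximum likelihood estimate
        if prev_clean_power is None:
            xi = xi_ml                      # first frame: no previous estimate
        else:
            # Linear combination of the previous-frame estimate and the ML estimate
            xi = alpha * prev_clean_power / noise_power + (1.0 - alpha) * xi_ml
        xi = max(xi, xi_min)                # floor to avoid a vanishing gain
        gain = xi / (1.0 + xi)              # Wiener gain (an assumption of this sketch)
        prev_clean_power = (gain ** 2) * y2 # estimated clean speech power
        xi_track.append(xi)
    return xi_track
```

A value of `alpha` close to one gives strong smoothing over time, which is the usual trade-off between noise attenuation and responsiveness to speech onsets.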

September 2018: Heart Sound Segmentation—An Event Detection Approach Using Deep Recurrent Neural Networks (Elmar Messner)

We present a method to accurately detect the state-sequence first heart sound (S1)–systole–second heart sound (S2)–diastole, i.e., the positions of S1 and S2, in heart sound recordings. We propose an event detection approach without explicitly incorporating a priori information of the state duration. This renders it also applicable to recordings with cardiac arrhythmia and extendable to the detection of extra heart sounds (third and fourth heart sound), heart murmurs, as well as other acoustic events. We use data from the 2016 PhysioNet/CinC Challenge, containing heart sound recordings and annotations of the heart sound states. From the recordings, we extract spectral and envelope features and investigate the performance of different deep recurrent neural network (DRNN) architectures to detect the state sequence. We use virtual adversarial training, dropout, and data augmentation for regularization. We compare our results with the state-of-the-art method and achieve an average score for the four events of the state sequence of F1≈96% on an independent test set.

August 2018: Hybrid Generative-Discriminative Training of Gaussian Mixture Models (Wolfgang Roth)

Recent work has shown substantial performance improvements of discriminative probabilistic models over their generative counterparts. However, since discriminative models do not capture the input distribution of the data, their use in missing data scenarios is limited. To utilize the advantages of both paradigms, we present an approach to train Gaussian mixture models (GMMs) in a hybrid generative-discriminative way. This is accomplished by optimizing an objective that trades off between a generative likelihood term and either a discriminative conditional likelihood term or a large margin term using stochastic optimization. Our model substantially improves the performance of classical maximum likelihood optimized GMMs while at the same time allowing for both a consistent treatment of missing features by marginalization, and the use of additional unlabeled data in a semi-supervised setting. For the covariance matrices, we employ a diagonal plus low-rank matrix structure to model important correlations while keeping the number of parameters small. We show that a non-diagonal matrix structure is crucial to achieve good performance and that the proposed structure can be utilized to considerably reduce classification time in case of missing features. The capabilities of our model are demonstrated in extensive experiments on real-world data.

June 2018: AoA and ToA Accuracy for Antenna Arrays in Dense Multipath Channels (Thomas Wilding)

The accuracy that can be achieved in time of arrival (ToA) estimation strongly depends on the utilized signal bandwidth. In an indoor environment multipath propagation usually causes a degradation of the achievable accuracy due to the overlapping signals. A similar effect can be observed for the angle of arrival (AoA) estimation using arrays. This paper derives a closed-form equation for the Cramér-Rao lower bound (CRLB) of the achievable AoA and the ToA error variances, considering the presence of dense multipath. The Fisher information expressions for both parameters allow an evaluation of the influence of channel parameters and system parameters such as the array geometry. Our results demonstrate that the AoA estimation accuracy is strongly related to the signal bandwidth, due to the multipath influence. The theoretical results are evaluated for experimental data, with simulations performed for ULAs with M=2 and M=16 array elements.
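For orientation, the dependence of ToA accuracy on bandwidth can be seen in the textbook bound for a single path in additive white Gaussian noise. This is not the bound derived in the paper, which additionally accounts for dense multipath; the symbols $\beta$ (effective, or root-mean-square, bandwidth of the transmitted pulse with spectrum $S(f)$) and $\mathrm{SNR}$ (ratio of received signal energy to noise power spectral density) are the usual textbook definitions and are assumptions here.

```latex
\operatorname{var}(\hat{\tau}) \;\ge\; \frac{1}{8\pi^{2}\,\beta^{2}\,\mathrm{SNR}},
\qquad
\beta^{2} = \frac{\int f^{2}\,|S(f)|^{2}\,\mathrm{d}f}{\int |S(f)|^{2}\,\mathrm{d}f}
```

Doubling the effective bandwidth thus reduces the lower bound on the delay error standard deviation by a factor of two at a fixed SNR.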

May 2018: On the Unimportance of Phase-Coherent Measurements for Beampattern-Assisted Positioning (Josef Kulmer)

Accurate indoor radio positioning requires high-resolution measurements to either utilize or mitigate the impact of multipath propagation. This high resolution can be achieved using a large signal bandwidth, leading to superior time resolution, and/or multiple antennas, leading to additional angle resolution. To facilitate multiple antennas, phase-coherent measurements are typically necessary. In this work we propose to employ non-phase-coherent measurements obtained from directional antennas for accurate single-anchor indoor positioning. The derived algorithm exploits beampatterns to jointly estimate multipath amplitudes to be used in maximum likelihood position estimation. Our evaluations based on measured and computer-generated data demonstrate only a minor degradation in comparison to a phase-coherent processing scheme.

April 2018: Acoustic Design of a Recording Room (Jamilla Balint)

It is intended to achieve similar acoustic conditions as in an already existing live room. The challenges occurring especially in small spaces are introduced and a number of acoustical absorbers are presented. The types of absorbers capable of damping the low frequency room modes are discussed. The acoustic measurements are evaluated, the reverberation time is selected as a significant criterion and a low, frequency-independent target value is chosen. A 3D-model for the acoustic simulation software is built and on the basis of the simulations, various optimisation measures are developed. Concerning an adequate damping of the room modes, edge or corner absorbers are selected as the basic concept for the enhancement and compound panel absorbers are planned to be installed on the walls. For prevention of flutter echoes and a sufficient gain of absorption and diffusion, a panel system on the ceiling is designed. Finally, the acoustical measures taken are presented and evaluated, specifically regarding the reverberation time, room modes and reflections.

March 2018: Anchorless Cooperative Tracking Using Multipath Channel Information (Josef Kulmer)

Highly accurate location information is a key facilitator to stimulate future services for the commercial and public sectors. Positioning and tracking of absolute positions of wireless nodes usually requires information provided from technical infrastructure, e.g. satellites or fixed anchor nodes, whose maintenance is costly and whose limited operating coverage narrows the positioning service. In this paper we present an algorithm aiming at tracking of absolute positions without using information from fixed anchors, odometers or inertial measurement units. We perform radio channel measurements in order to exploit position-related information contained in multipath components (MPCs). Tracking of the absolute node positions is enabled by estimation of MPC parameters followed by association of these parameters to a floorplan. To account for uncertainties in the floorplan and for propagation effects like diffraction and penetration, we recursively update the provided floorplan using the measured MPC parameters. We demonstrate the ability to localize two agent nodes without the employment of further infrastructure, using data from ultra-wideband channel measurements. Further, we show the potential performance gain if also one fixed anchor is available and we validate the algorithm for a range of different signal bandwidths and number of nodes.

February 2018: A Pitch-Synchronous Simultaneous Detection-Estimation Framework for Speech Enhancement (Johannes Stahl)

Speech enhancement methods formulated in the STFT domain vary in the statistical assumptions made on the STFT coefficients, in the optimization criteria applied or in the models of the signal components. Recently, approaches relying on a stochastic-deterministic speech model have been proposed. The deterministic part of the signal corresponds to harmonically related sinusoids, often used to represent voiced speech. The stochastic part models signal components that are not captured by the deterministic components. In this work, we consider this scenario under a new perspective yielding three main contributions. First, a pitch-synchronous signal representation is considered and shown to be advantageous for the estimation of the harmonic model parameters. Second, we model the harmonic amplitudes in voiced speech as random variables with frequency bin dependent Gamma distributions. Finally, distinct estimators for the different models of voiced speech, unvoiced speech, and speech absence are derived. To select from the arising estimates, we take into account the mutual impact of detection and estimation by proposing a binary decision framework that is derived from a Bayesian risk function.

January 2018: Competitive Linearity for Envelope Tracking (Karl Freiberger, Harald Enzinger)

The IMS student design competitions are an annual event at the IEEE MTT-S International Microwave Symposium. In 2017, the SPSC lab members Harald Enzinger and Karl Freiberger won the first prize in the competition “Power Amplifier Linearization through Digital Predistortion”. The aim of this competition was to linearize a highly efficient but nonlinear envelope tracking power amplifier in dual-band operation by means of digital predistortion. The winning solution combines several state-of-the-art methods for crest factor reduction (CFR) and digital predistortion (DPD) with new extensions, developed specifically for this competition. You can find out more about this winning solution in the January/February issue of the IEEE Microwave Magazine. A preprint version of the paper can be downloaded from the SPSC website.

December 2017: Frame and Segment Level Recurrent Neural Networks for Phone Classification (Martin Ratajczak)

We introduce a simple and efficient frame and segment level RNN model (FS-RNN) for phone classification. It processes the input at frame level and segment level by bidirectional gated RNNs. This type of processing is important to exploit the (temporal) information more effectively compared to (i) models which solely process the input at frame level and (ii) models which process the input on segment level using features obtained by heuristic aggregation of frame level features. Furthermore, we incorporated the activations of the last hidden layer of the FS-RNN as an additional feature type in a neural higher-order CRF (NHO-CRF). In experiments, we demonstrated excellent performance on the TIMIT phone classification task, reporting a performance of 13.8% phone error rate for the FS-RNN model and 11.9% when combined with the NHO-CRF. In both cases we significantly exceeded the state-of-the-art performance.

November 2017: A Rate-Distortion Approach to Caching (Bernhard Geiger)

We consider a lossy single-user caching problem with correlated sources – just think of streaming compressed videos! Most users will watch these videos in the evening, leading to network congestion. If you have a player with a cache, though, you can fill this cache with data during times of low network usage, even though you may not know which video the user wants to watch in the evening. In our paper, we characterize the transmission rate required in the evening as a function of the cache size and as a function of the distortion one accepts when watching the videos. We furthermore hint at what should be put in the cache such that it is useful for a variety of videos, and we connect these results to the common-information measures proposed by Wyner, Gacs and Koerner.

October 2017: On Loopy Belief Propagation -- Local Stability Analysis for Non-Vanishing Fields (Christian Knoll)

In this work, we obtain all fixed points of belief propagation and perform a local stability analysis. We consider pairwise interactions of binary random variables and investigate the influence of non-vanishing fields and finite-size graphs on the performance of belief propagation; local stability is heavily influenced by these properties. We show why non-vanishing fields help to achieve convergence and increase the accuracy of belief propagation. We further explain the close connections between the underlying graph structure, the existence of multiple solutions, and the capability of belief propagation (with damping) to converge. Finally, we provide insights into why finite-size graphs behave better than infinite-size graphs.
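As a point of reference, damped belief propagation on a pairwise model of binary variables with local fields can be sketched as below. This is an illustrative sketch only: it runs the message-passing iteration toward a single fixed point and compares it with brute-force marginals, and it does not enumerate all fixed points or perform the stability analysis described above. The function names and the parameterization (`theta` for fields, `J` for couplings, states in {-1, +1}) are assumptions of the sketch.

```python
import math
from itertools import product

STATES = (-1, 1)

def run_bp(theta, J, iters=100, damping=0.0):
    """Sum-product BP; theta: list of fields, J: dict {(i, j): coupling}.

    Returns the approximate marginals P(x_i = +1).
    """
    n = len(theta)
    nbrs = {i: [] for i in range(n)}
    for (i, j) in J:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def coupling(i, j):
        return J[(i, j)] if (i, j) in J else J[(j, i)]

    # One message per directed edge, initialized uniformly
    msgs = {(i, j): {s: 0.5 for s in STATES} for i in nbrs for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            out = {}
            for xj in STATES:
                total = 0.0
                for xi in STATES:
                    w = math.exp(theta[i] * xi + coupling(i, j) * xi * xj)
                    for k in nbrs[i]:
                        if k != j:
                            w *= msgs[(k, i)][xi]
                    total += w
                out[xj] = total
            z = out[-1] + out[1]
            # Damping mixes the old message with the normalized update
            new[(i, j)] = {s: damping * msgs[(i, j)][s] + (1.0 - damping) * out[s] / z
                           for s in STATES}
        msgs = new

    marginals = []
    for i in range(n):
        belief = {s: math.exp(theta[i] * s) for s in STATES}
        for k in nbrs[i]:
            for s in STATES:
                belief[s] *= msgs[(k, i)][s]
        marginals.append(belief[1] / (belief[-1] + belief[1]))
    return marginals

def exact_marginals(theta, J):
    """Brute-force P(x_i = +1), feasible only for very small models."""
    n = len(theta)
    weights = [0.0] * n
    z = 0.0
    for x in product(STATES, repeat=n):
        energy = sum(theta[i] * x[i] for i in range(n))
        energy += sum(c * x[i] * x[j] for (i, j), c in J.items())
        w = math.exp(energy)
        z += w
        for i in range(n):
            if x[i] == 1:
                weights[i] += w
    return [w / z for w in weights]
```

On a tree-structured model such as a three-node chain, `run_bp` and `exact_marginals` agree; on graphs with loops the two generally differ, and convergence depends on the couplings, the fields, and the damping factor.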

September 2017: Multipath-assisted Indoor Positioning Enabled by Directional UWB Sector Antennas (Michael Rath)

High-accuracy indoor radio positioning can be achieved by using high signal bandwidths to increase the time resolution. Multiple fixed anchor nodes are needed to compute the position or, alternatively, reflected multipath components can be exploited with a single anchor. In this work, we propose a method that explores the time and angular domains with a single anchor. This is enabled by switching between multiple directional ultra-wideband (UWB) antennas. The UWB transmission enables multipath-resolved indoor positioning, while the directionality increases the robustness to undesired, interfering multipath propagation with the benefit that the required bandwidth is reduced. The positioning accuracy and performance bounds of the switched antenna are compared to an omni-directional antenna. Two positioning algorithms are presented based on different prior knowledge available, one using floorplan information only and the other additionally using the beampatterns of the antennas. We show that the accuracy of the position estimate is significantly improved, especially in tangential direction to the anchor.

August 2017: SLIC EVM - Error Vector Magnitude without Demodulation (Karl Freiberger)

We present a method for measuring a communication signal’s inband error caused by a non-ideal device under test (DUT). In contrast to the established error vector magnitude (EVM), we do not demodulate the data symbols. Rather, we subtract linearly correlated (SLIC) parts from the DUT output and analyze the power spectral density of the remaining error signal. Consequently, we do not require in-depth knowledge of the modulation standard. This makes our method well suited for measurements with cutting-edge communication signals, without the need to purchase or implement a dedicated EVM analyzer. We show that our SLIC-EVM approach allows for estimating the subcarrier-dependent EVM for typical transceiver impairments like IQ mismatch, phase noise, and power amplifier (PA) nonlinearity. We present measurement results of a WLAN PA, showing less than 0.2 dB absolute deviation from the regular EVM with demodulation.
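The core idea of subtracting the linearly correlated part can be illustrated with a short sketch. It is a simplification under stated assumptions: the linear part is modeled as a single complex gain estimated by least squares (rather than a frequency-dependent linear response), and the result is one time-domain error-to-signal power ratio instead of a power spectral density of the error. The function name `slic_error_ratio` is hypothetical.

```python
def slic_error_ratio(x, y):
    """Subtract the part of y that is linearly correlated with x.

    x: reference (DUT input) samples, complex
    y: DUT output samples, complex
    Returns (gain, ratio), where ratio is the error power divided by the
    power of the linearly correlated part (linear scale, not dB).
    """
    # Least-squares complex gain: g = <x, y> / <x, x>
    num = sum(xi.conjugate() * yi for xi, yi in zip(x, y))
    den = sum(abs(xi) ** 2 for xi in x)
    gain = num / den
    # Remaining error after removing the linearly correlated part
    err = [yi - gain * xi for xi, yi in zip(x, y)]
    p_err = sum(abs(e) ** 2 for e in err)
    p_lin = sum(abs(gain * xi) ** 2 for xi in x)
    return gain, p_err / p_lin
```

Because the error is by construction uncorrelated with the reference, no knowledge of the modulation format is needed to compute the ratio.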

July 2017: Impact of phase estimation on single-channel speech separation based on time-frequency masking

Time-frequency masking is a common solution for the single-channel source separation (SCSS) problem where the goal is to find a time-frequency mask that separates the underlying sources from an observed mixture. An estimated mask is then applied to the mixed signal to extract the desired signal. During signal reconstruction, the time-frequency–masked spectral amplitude is combined with the mixture phase. This article considers the impact of replacing the mixture spectral phase with an estimated clean spectral phase combined with the estimated magnitude spectrum using a conventional model-based approach. As the proposed phase estimator requires an estimate of the fundamental frequency of the underlying signal from the mixture, a robust pitch estimator is proposed. The upper-bound clean phase results show the potential of phase-aware processing in single-channel source separation. Also, the experiments demonstrate that replacing the mixture phase with the estimated clean spectral phase consistently improves perceptual speech quality, predicted speech intelligibility, and source separation performance across all signal-to-noise ratios and noise scenarios.
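The reconstruction step for a single time-frequency bin can be sketched as follows. The function name `reconstruct_bin` and its arguments are illustrative assumptions; the phase estimator itself is not shown, so `estimated_phase` stands in for whatever estimate is available.

```python
import cmath

def reconstruct_bin(mixture_bin, mask, estimated_phase=None):
    """Combine a masked magnitude with either the mixture or an estimated phase.

    mixture_bin: complex STFT coefficient of the observed mixture
    mask: real-valued gain for this bin, typically in [0, 1]
    estimated_phase: phase in radians, or None to reuse the mixture phase
    """
    magnitude = mask * abs(mixture_bin)
    if estimated_phase is None:
        phase = cmath.phase(mixture_bin)   # conventional choice: mixture phase
    else:
        phase = estimated_phase            # phase-aware choice
    return cmath.rect(magnitude, phase)
```

With the mixture phase, the result is simply the mask times the mixture coefficient; supplying a different phase changes only the angle of the reconstructed coefficient, not its magnitude.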

June 2017: UHF-RFID Backscatter Channel Analysis for Accurate Wideband Ranging (Stefan Grebien)

Positioning and ranging within UHF RFID are highly dependent on the channel characteristics. The accuracy of time-of-flight based ranging systems is fundamentally limited by the available bandwidth. We thus analyze the UHF RFID backscatter channel formed by convolution of the individual constituent channels. For this purpose, we present comprehensive wideband channel measurements in two representative scenarios and an analysis with respect to the Rician K-factor for the line-of-sight component, the root-mean-square delay spread, and the coherence distance, which all influence the potential positioning performance.

May 2017: Eigenvector-based Speech Mask Estimation for Multi-Channel Speech Enhancement (Franz Pernkopf)

Using speech masks for multi-channel speech enhancement gained attention over the last years, as it combines the benefits of digital signal processing (beamforming) and machine learning (learning the speech mask from data). We demonstrate how a speech mask can be used to construct the Minimum Variance Distortionless Response (MVDR), Generalized Sidelobe Canceler (GSC) and Generalized Eigenvalue (GEV) beamformers, and an MSE-optimal postfilter. We propose a neural network architecture that learns the speech mask from the spatial information hidden in the multi-channel input data, by using the dominant eigenvector of the Power Spectral Density (PSD) matrix of the noisy speech signal as feature vector. We use CHiME-4 audio data to train our network, which contains a single speaker engulfed in ambient noise. Depending on the speaker’s location and the geometry of the microphone array the eigenvectors form local clusters, whereas they are randomly distributed for the ambient noise. The neural network learns this clustering from the training data. In a second step, we use the cosine similarity between neighboring eigenvectors as feature vector, which makes our approach less dependent on the array geometry and the speaker’s position. We compare our results against the most prominent model-based and data-driven approaches, using PESQ and PEASS/OPS scores. Our system yields superior results, both in terms of perceptual speech quality and speech mask prediction error.
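The two feature types mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the PSD matrix is given as a small Hermitian list of lists and using plain power iteration for the dominant eigenvector; the function names are hypothetical, and the estimation of the PSD matrix from multi-channel data is not shown.

```python
import math

def dominant_eigenvector(psd, iters=200):
    """Power iteration for a Hermitian positive semidefinite matrix."""
    n = len(psd)
    v = [complex(1.0, 0.0)] * n            # start vector (assumed not orthogonal)
    for _ in range(iters):
        w = [sum(psd[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    return v

def cosine_similarity(a, b):
    """Magnitude of the normalized inner product of two complex vectors."""
    inner = sum(x.conjugate() * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(abs(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(abs(y) ** 2 for y in b))
    return abs(inner) / (norm_a * norm_b)
```

Taking the magnitude of the inner product makes the similarity invariant to the arbitrary complex phase of an eigenvector, which is one reason it is a convenient feature when comparing eigenvectors of neighboring bins.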

April 2017: Low-cost or compost? Using DecaWave UWB Transceivers for High-accuracy Multipath-assisted Indoor Positioning (Josef Kulmer)

Robust indoor positioning at sub-meter accuracy typically requires highly accurate radio channel measurements to extract precise time-of-flight measurements. Emerging UWB transponders like the DecaWave DW1000 chip offer to estimate channel impulse responses with a reasonably high bandwidth, yielding a ranging precision below 10 cm. The competitive pricing of these chips allows scientists and engineers for the first time to exploit the benefits of UWB for indoor positioning without the need for a massive investment into experimental equipment.

March 2017: A Noise Power Ratio Measurement Method for Accurate Estimation of the Error Vector Magnitude (Karl Freiberger)

Error vector magnitude (EVM) and noise power ratio (NPR) measurements are well-known approaches to quantify the inband performance of communication systems and their respective components. In contrast to NPR, EVM is an important design specification and is widely adopted by modern communication standards such as 802.11 (WLAN). However, EVM requires full demodulation, whereas NPR excels with simplicity requiring only power measurements in different frequency bands. Consequently, NPR measurements avoid bias due to insufficient synchronization and can be readily adapted to different standards and bandwidths. We argue that NPR-inspired measurements can replace EVM in many practically relevant cases. We show how to set up the signal generation and analysis for power-ratio-based estimation of EVM in orthogonal frequency-division multiplexing systems impaired by additive noise, power amplifier nonlinearity, phase noise, and in-phase–quadrature (IQ) imbalance. Our method samples frequency-dependent inband errors via a single measurement and can either include or exclude the effect of IQ mismatch using asymmetric or symmetric stopband locations, respectively. We present measurement results using an 802.11ac PA at different backoffs, corroborating the practicability and accuracy of our method. Using the same measurement chain, the mean absolute deviation from the EVM is less than 0.35 dB.

February 2017: Variational Inference in Neural Networks using an Approximate Closed-Form Objective (Wolfgang Roth)

We propose a closed-form approximation of the intractable KL divergence objective for variational inference in neural networks. The approximation is based on a probabilistic forward pass where we successively propagate probabilities through the network. Unlike existing variational inference schemes, which typically rely on stochastic gradients that often suffer from high variance, our method has a closed-form gradient. Furthermore, the probabilistic forward pass inherently computes expected predictions together with uncertainty estimates at the outputs. In experiments, we show that our model improves the performance of plain feed-forward neural networks. Moreover, we show that our closed-form approximation works well compared to model averaging and that our model is capable of producing reasonable uncertainties in regions where no data is observed.

January 2017: Fixed Points of Belief Propagation (Christian Knoll)

Belief propagation is an iterative method to perform approximate inference on arbitrary graphical models. Whether it converges and whether the solution is a unique fixed point depends on both the structure and the parametrization of the model. To understand this dependence, it is interesting to find all fixed points.
We formulate a set of polynomial equations, the solutions of which correspond to BP fixed points. Experiments on binary Ising models show how our method is capable of obtaining all fixed points.

December 2016: Iterative Joint MAP Single-Channel Speech Enhancement Given Non-uniform Phase Prior

Within the last three decades, research in single-channel speech enhancement has mainly focused on filtering the noisy spectral amplitude, with little attention to the integration of phase-based signal processing. Recently, several phase-aware algorithms based on phase-sensitive signal models were proposed for speech enhancement using the minimum mean squared error (MMSE). Improved performance over the conventional phase-insensitive approaches has been achieved. In this paper, we propose an iterative joint maximum a posteriori (MAP) amplitude and phase estimator (ijMAP) assuming a non-uniform phase distribution. Experimental results demonstrate the effectiveness of the proposed method in recovering both amplitude and phase in noise, justified by perceived quality, speech intelligibility and phase estimation error instrumental measures. The proposed method brings joint improvement in perceived quality and speech intelligibility compared to the phase-blind joint MAP estimator, with a comparable performance to the complex MMSE estimator.

November 2016: Phase-Aware Signal Processing in Speech Communication: Theory and Practice

An overview on the challenging new topic of phase-aware signal processing

October 2016: Iterative Harmonic Speech Enhancement (Johannes Stahl)

In digital speech transmission, the transmitted speech signal is often corrupted by noise arising from various kinds of sources such as passing cars or chatting people in a restaurant. The aim of speech enhancement is to compensate for the detrimental effects these interferences have on the speech quality. In this work we present a method to enhance voiced speech segments only, which are often modeled as a sum of harmonically related sinusoids. We propose an iterative estimation scheme to jointly estimate the harmonic parameters, i.e., amplitude, frequency and phase of the harmonics of the underlying speech signal. Here we utilize the expectation-maximization (EM) algorithm to obtain the harmonic parameters, which are then used to reconstruct voiced speech segments. The potential of the proposed speech enhancement method in terms of harmonic parameter estimation is validated on synthetic harmonic signals. Further, by applying it to noise-corrupted speech files we demonstrate its effectiveness in improving instrumentally predicted speech intelligibility and perceived speech quality.
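
The harmonic signal model can be illustrated with a short Python sketch that synthesizes a sum of harmonically related sinusoids and recovers amplitude and phase of each harmonic. Note that this is not the EM scheme of the paper: it assumes a known fundamental frequency, a noise-free signal, and an analysis window covering an integer number of pitch periods, in which case a simple projection onto complex exponentials suffices. All function names are hypothetical.

```python
import cmath
import math


def synth_harmonic(f0, fs, n_samples, amps, phases):
    """Sum of harmonics: x[n] = sum_h A_h cos(2 pi h f0 n / fs + phi_h)."""
    return [
        sum(
            a * math.cos(2.0 * math.pi * (h + 1) * f0 * n / fs + p)
            for h, (a, p) in enumerate(zip(amps, phases))
        )
        for n in range(n_samples)
    ]


def estimate_harmonics(x, f0, fs, n_harmonics):
    """Estimate (amplitude, phase) per harmonic by projection.

    Valid only if f0 is known and len(x) spans an integer number of
    pitch periods, so that the harmonics are mutually orthogonal.
    """
    n_samples = len(x)
    estimates = []
    for h in range(1, n_harmonics + 1):
        c = sum(
            xn * cmath.exp(-2j * math.pi * h * f0 * n / fs)
            for n, xn in enumerate(x)
        )
        c *= 2.0 / n_samples
        estimates.append((abs(c), cmath.phase(c)))
    return estimates


# 100 Hz fundamental, 8 kHz sampling rate, 10 full pitch periods.
x = synth_harmonic(100.0, 8000.0, 800, amps=[1.0, 0.5], phases=[0.3, -1.0])
est = estimate_harmonics(x, 100.0, 8000.0, n_harmonics=2)
```

In noisy speech the fundamental frequency is unknown and varies over time, which is why a joint, iterative estimation of all harmonic parameters is needed in practice.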

September 2016: Virtual Adversarial Training Applied to Neural Higher-Order Factors for Phone Classification (Martin Ratajczak)

We explore virtual adversarial training (VAT) applied to neural higher-order conditional random fields for sequence labeling. VAT is a recently introduced regularization method promoting local distributional smoothness: It counteracts the problem that predictions of many state-of-the-art classifiers are unstable to adversarial perturbations. Unlike random noise, adversarial perturbations are minimal and bounded perturbations that flip the predicted label. We utilize VAT to regularize neural higher-order factors in conditional random fields. These factors are, for example, important for phone classification, where phone representations strongly depend on the context phones. However, without using VAT for regularization, the use of such factors was limited as they were prone to overfitting. In extensive experiments, we successfully apply VAT to improve performance on the TIMIT phone classification task. In particular, we achieve a phone error rate of 13.0%, exceeding the state-of-the-art performance by a wide margin.

August 2016: A Joint Linearity-Efficiency Model of Radio Frequency Power Amplifiers (Harald Enzinger)

We present an analytical model of the joint linearity-efficiency behavior of radio frequency power amplifiers. The model is derived by Fourier series analysis of a generic amplifier circuit including both strong nonlinearity due to current-clipping as well as weak nonlinearity due to transconductance variation. By selection of the biasing point, common amplifier classes like class A, class B and class AB can be modeled. For numerical evaluation, the model reduces to two lookup-tables, which makes it well suited for high-level system simulations. In an application example we demonstrate how the model can be used to simulate the error-vector-magnitude and the average efficiency for specific single-carrier and multi-carrier modulation schemes.
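
For orientation, the effect of the biasing point on efficiency can be sketched with the classic textbook result for an ideal amplifier with a clipped (reduced conduction angle) sinusoidal current and a full output voltage swing. This Python snippet only covers that idealized strong nonlinearity; it does not include the transconductance variation or the linearity part of the model described above, and the function name is an assumption.

```python
import math


def ideal_drain_efficiency(conduction_angle):
    """Peak efficiency of an ideal reduced-conduction-angle amplifier.

    conduction_angle -- in radians; 2*pi is class A, pi is class B,
                        values in between correspond to class AB.
    Derived from the Fourier series of a clipped cosine current:
        eta = (a - sin a) / (4 * (sin(a/2) - (a/2) * cos(a/2)))
    """
    a = conduction_angle
    if not 0.0 < a <= 2.0 * math.pi:
        raise ValueError("conduction angle must be in (0, 2*pi]")
    numerator = a - math.sin(a)
    denominator = 4.0 * (math.sin(a / 2.0) - (a / 2.0) * math.cos(a / 2.0))
    return numerator / denominator


eta_class_a = ideal_drain_efficiency(2.0 * math.pi)   # 50 %
eta_class_ab = ideal_drain_efficiency(1.5 * math.pi)  # between A and B
eta_class_b = ideal_drain_efficiency(math.pi)         # pi/4, about 78.5 %
```

Tabulating such expressions over the drive level is one way to arrive at lookup tables suitable for system-level simulation.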

July 2016: Localization and Characterization of Multiple Harmonic Sources (Hannes Pessentheiner)

We introduce a new and intuitive algorithm to characterize and localize multiple harmonic sources intersecting in the spatial and frequency domains. It jointly estimates their fundamental frequencies, their respective amplitudes, and their directions of arrival based on an intelligent non-parametric signal representation. To obtain these parameters, we first apply variable-scale sampling on unbiased cross-correlation functions between pairs of microphone signals to generate a joint parameter space. Then, we employ a multidimensional maxima detector to represent the parameters in a sparse joint parameter space. In comparison to others, our algorithm solves the issue of pitch-period doubling when using cross-correlation functions, it estimates multiple harmonic sources with a signal power smaller than the signal power of the dominant harmonic source, and it associates the estimated parameters to their corresponding sources in a multidimensional sparse joint parameter space, which can be directly fed into a tracker. We tested our algorithm and three others on synthetic data and speech data recorded in a real reverberant environment and evaluated their performance by employing the joint recall measure, the root-mean-square error, and the cumulative distribution function of fundamental frequencies and directions of arrival. The evaluations show promising results: Our algorithm outperforms the others in terms of the joint recall measure, and it can achieve root-mean-square errors of one Hertz or one degree and smaller, which facilitates, e.g., distant-speech enhancement or source separation.
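
The basic building block mentioned above, the unbiased cross-correlation between a pair of microphone signals, can be sketched in plain Python as follows. The conversion from a lag to a direction of arrival assumes a single far-field source, two microphones, and a speed of sound of 343 m/s; these assumptions and all names are illustrative only, and the variable-scale sampling and maxima detection of the algorithm are not reproduced.

```python
import math


def unbiased_xcorr(x1, x2, max_lag):
    """Unbiased cross-correlation r[l] = 1/(N-|l|) * sum_n x1[n] x2[n+l]."""
    n = len(x1)
    if len(x2) != n or max_lag >= n:
        raise ValueError("signals must have equal length greater than max_lag")
    r = {}
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += x1[i] * x2[j]
        r[lag] = acc / (n - abs(lag))
    return r


def best_lag(x1, x2, max_lag):
    """Lag (in samples) at which the unbiased cross-correlation peaks."""
    r = unbiased_xcorr(x1, x2, max_lag)
    return max(r, key=r.get)


def lag_to_angle_deg(lag, fs, mic_distance, c=343.0):
    """Far-field direction of arrival for a two-microphone pair."""
    s = c * lag / (fs * mic_distance)
    s = max(-1.0, min(1.0, s))  # guard against lags beyond the physical limit
    return math.degrees(math.asin(s))


# Second microphone receives the same impulse three samples later.
x1 = [0.0] * 20
x2 = [0.0] * 20
x1[5] = 1.0
x2[8] = 1.0
lag = best_lag(x1, x2, max_lag=6)
```

A positive lag here means that the second signal is a delayed copy of the first.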

June 2016: AMISCO: The Austrian German Multi-Sensor Corpus (Hannes Pessentheiner)

We introduce a unique, comprehensive Austrian German multi-sensor corpus with moving and non-moving speakers to facilitate the evaluation of estimators and detectors that jointly detect a speaker’s spatial and temporal parameters. The corpus is suitable for various machine learning and signal processing tasks, linguistic studies, and studies related to a speaker’s fundamental frequency (due to recorded glottograms). Available corpora are limited to (synthetically generated/spatialized) speech data or recordings of musical instruments that lack moving speakers, glottograms, and/or multi-channel distant speech recordings. That is why we recorded 24 spatially non-moving and moving speakers, balanced male and female, to set up a two-room and 43-channel Austrian German multi-sensor speech corpus. It contains 8.2 hours of read speech based on phonetically balanced sentences, commands, and digits. The orthographic transcriptions include around 53,000 word tokens and 2,070 word types. Special features of this corpus are the laryngograph recordings (representing glottograms required to detect a speaker’s instantaneous fundamental frequency and pitch), corresponding clean-speech recordings, and spatial information and video data provided by four Kinects and a camera.

May 2016: Advances in Phase-Aware Signal Processing in Speech Communication

During the past three decades, the issue of processing spectral phase has been largely neglected in speech applications. There is no doubt that the interest of the speech processing community in using phase information across a broad range of speech technologies, from automatic speech and speaker recognition to speech synthesis, from speech enhancement and source separation to speech coding, is constantly increasing. In this paper, we elaborate on why phase was believed to be unimportant in each application. We provide an overview of advancements in phase-aware signal processing with applications to speech, showing that phase-aware speech processing can be beneficial in many cases and can complement the solutions that magnitude-only methods suggest. Our goal is to show that phase-aware signal processing is an important emerging field with high potential in current speech communication applications.

April 2016: MIMO Gain and Bandwidth Scaling for RFID Positioning in Dense Multipath Channels (Stefan Grebien)

In my research I analyze the achievable ranging and positioning performance for a radio frequency identification (RFID) system. Two design constraints of such a system, (i) the bandwidth of the transmit signal and (ii) the use of multiple antennas at the readers are analyzed in my paper ‘MIMO gain and bandwidth scaling for RFID positioning in Dense Multipath Channels’.

March 2016: Cognitive Indoor Positioning and Tracking using Multipath Channel Information (Erik Leitinger)

During my PhD studies I have introduced and discussed a positioning and tracking system for harsh indoor environments that is aware of its surrounding environment and is furthermore able to act optimally on its environment, i.e., it controls the measurement information return. The figure illustrates the schematics of the cognitive positioning/tracking system. The main physical blocks are the cognitive perceptor (CP) and cognitive controller (CC) with built-in memories for the perceived environmental state and the (reciprocally) taken control actions on the environment. Both are linked via feedback and feedforward information, so the controller is able to choose new actions based on the perceptor’s Bayesian state information. The perception-action cycle (PAC) incorporates the sensed environment into the closed loop with the CP and CC.

February 2016: Multichannel speech processing architectures for noise robust speech recognition: 3rd CHiME Challenge results (Franz Pernkopf)

Recognizing speech under noisy conditions is an ill-posed problem. The CHiME3 challenge targets robust speech recognition in realistic environments such as street, bus, café and pedestrian areas. We study variants of beamformers used for pre-processing multi-channel speech recordings. In particular, we investigate three variants of generalized sidelobe canceller (GSC) beamformers, i.e. GSC with sparse blocking matrix (BM), GSC with adaptive BM (ABM), and GSC with minimum variance distortionless response (MVDR) and ABM. Furthermore, we apply several postfilters to further enhance the speech signal. We introduce MaxPower postfilters and deep neural postfilters (DPFs). DPFs outperformed our baseline systems significantly when measuring the overall perceptual score (OPS) and the perceptual evaluation of speech quality (PESQ). In particular, DPFs achieved an average relative improvement of 17.54% in OPS and 18.28% in PESQ when compared to the CHiME3 baseline. DPFs also achieved the best WER when combined with an ASR engine on simulated development and evaluation data, i.e. 8.98% and 10.82% WER. The proposed MaxPower beamformer achieved the best overall WER on CHiME3 real development and evaluation data, i.e. 14.23% and 22.12%, respectively.

January 2016: Adaptive Differential Microphone Arrays used as a Front-End for an Automatic Speech Recognition System (Elmar Messner)

For automatic speech recognition (ASR) systems it is important that the input signal mainly contains the desired speech signal. For a compact arrangement, differential microphone arrays (DMAs) are a suitable choice as front-end of ASR systems. The limiting factor of DMAs is the white noise gain, which can be treated by the minimum norm solution (MNS). In this work, we introduce the MNS to adaptive differential microphone arrays (ADMAs) for the first time. We compare its effect to the conventional implementation when used as front-end of an ASR system. In experiments we show that the proposed algorithms consistently increase the word accuracy by up to 50% relative to their conventional implementations. For PESQ we achieve an improvement of up to 0.1 points.

November 2015: Structured Regularizer for Neural Higher-Order Sequence Models (Martin Ratajczak)

We introduce both joint training of neural higher-order linear-chain conditional random fields (NHO-LC-CRFs) and a new structured regularizer for sequence modelling. We show that this regularizer can be derived as a lower bound from a mixture of models sharing parts, e.g. neural sub-networks, and relate it to ensemble learning. Furthermore, it can be expressed explicitly as a regularization term in the training objective.

October 2015: On Representation Learning for Artificial Bandwidth Extension (Matthias Zöhrer)

Recently, sum-product networks (SPNs) showed convincing results on the ill-posed task of artificial bandwidth extension (ABE). However, SPNs are just one of many architectures which can be summarized as representational models. In this paper, using ABE as benchmark task, we perform a comparative study of Gauss Bernoulli restricted Boltzmann machines, conditional restricted Boltzmann machines, higher order contractive autoencoders, SPNs and generative stochastic networks (GSNs). The latter in particular are promising architectures in terms of their reconstruction capabilities. Our experiments show impressive results for GSNs, which achieve on average an improvement of 3.90 dB and 4.08 dB in segmental SNR over SPNs in a speaker dependent (SD) and a speaker independent (SI) scenario, respectively.
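
For readers unfamiliar with the evaluation measure, segmental SNR averages a frame-wise SNR in dB over the signal. The following Python sketch shows one common way to compute it; the non-overlapping framing and the clamping of each frame to the range from -10 dB to 35 dB are widespread conventions assumed here, not details taken from the paper.

```python
import math


def segmental_snr_db(reference, estimate, frame_len,
                     floor_db=-10.0, ceil_db=35.0):
    """Mean of per-frame SNRs in dB between a reference and an estimate."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        est = estimate[start:start + frame_len]
        sig = sum(r * r for r in ref)
        err = sum((r - e) ** 2 for r, e in zip(ref, est))
        if sig == 0.0:
            continue  # skip silent reference frames
        snr = ceil_db if err == 0.0 else 10.0 * math.log10(sig / err)
        # Clamp so that single very good or very bad frames do not dominate.
        frame_snrs.append(min(max(snr, floor_db), ceil_db))
    if not frame_snrs:
        raise ValueError("no frame with non-zero reference energy")
    return sum(frame_snrs) / len(frame_snrs)


# An estimate that is 10 % too small everywhere gives 20 dB per frame.
seg_snr = segmental_snr_db([1.0] * 8, [0.9] * 8, frame_len=4)
```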

September 2015: Cooperative Multipath-assisted Navigation and Tracking: A Low-Complexity Approach (Josef Kulmer)

Wireless localization has become a key technology for cooperative agent networks. However, for many applications, it is still elusive to reach the desired level of accuracy and robustness, especially in indoor environments which are characterized by harsh multipath propagation.
In this work we introduce a cooperative low-complexity algorithm that utilizes multipath components for localization instead of suffering from them. The algorithm uses two types of measurements: (i) bistatic measurements between agents and (ii) monostatic (bat-like) measurements by the individual agents. Simulations that use data generated from a realistic channel model show the applicability of the methodology and the high level of accuracy that can be reached.

August 2015: Neural Higher-Order Factors in Conditional Random Fields for Phoneme Classification (Martin Ratajczak)

We explore neural higher-order input-dependent factors in linear-chain conditional random fields (LC-CRFs) for sequence labeling, i.e. the fusion of two powerful models. Higher-order LC-CRFs with linear factors are well-established for sequence labeling tasks, but they lack the ability to model non-linear dependencies. These non-linear dependencies, however, can be efficiently modelled by neural higher-order input-dependent factors which map sub-sequences of inputs to sub-sequences of outputs using distinct multilayer perceptron sub-networks. This mapping is important in many tasks, in particular, for phoneme classification where the phone representation strongly depends on the context phonemes. Experimental results for phoneme classification with LC-CRFs and neural higher-order factors confirm this fact and we achieve the best ever reported phoneme classification performance on TIMIT, i.e. a phoneme error rate of 15.8%. Furthermore, we show that the success is not obvious as linear high-order factors degrade phoneme classification performance on TIMIT.

July 2015: Automatic detection of uncertainty in spontaneous German dialogue (Tobias Schrank, Barbara Schuppler)

In this paper, we automatically detected uncertainty in naturalistic spontaneous German human-human conversations. We presented an approach which is based on linguistic, paralinguistic and extralinguistic features. We tested 9 feature classes (timing, fundamental frequency, intensity, spectrum, voice quality, lexicon, syntax, dialogue structure, external features) and evaluated their performance on 1158 dialogue acts taken from the spontaneous part of the Kiel Corpus. The results showed that it is possible to detect uncertainty in speech automatically with reasonable reliability. The accuracy with which this task is accomplished depended heavily on the feature set employed. In particular, our more complex modelling of speech rate contributed to good classification performance. Automatic feature selection could improve performance even though the machine learning algorithm employed in this paper is built to handle highly correlated feature spaces. While only 64 features in size, the resulting feature set outperformed all other feature sets. Even though all features implemented in our system are theoretically motivated and have been used in previous publications, the number of features that were uninformative regarding the detection of uncertainty in this particular speech data is surprisingly large.

June 2015: DIRHA - An Application (A Showcase Video) (Hannes Pessentheiner, Martin Hagmüller)

In December 2014 we successfully finished our international project entitled ‘Distant-speech Interaction for Robust Home Application’, also known as DIRHA. Our main goal was to set up a prototype in our laboratory that could be controlled by Austrian German speech interaction. We would like to present the prototype by showing you a video.

The video shows an application of the DIRHA distant-speech interaction prototype. Two attendees control lights and blinds by interacting with the system acoustically. They activate it by saying a keyword and instruct it to do something. In case of unclear or ambiguous instructions, the system automatically asks specific questions leading to answers containing the required information.

May 2015: Special Issue: Phase-Aware Signal Processing in Speech Communication

Following up on the special session organized last year at INTERSPEECH, Dr. Pejman Mowlaee together with Dr. Rahim Saeidi and Prof. Yannis Stylianou have proposed a special issue entitled “Phase-Aware Signal Processing in Speech Communication” to EURASIP Speech Communication. Detailed information about the special issue is available at the EURASIP website. Further information about the important deadlines and the aims and scope of this special issue is available on the Elsevier website. More recent updates, audio examples and progress made towards phase-aware signal processing are available here. For an overview on phase-aware signal processing in speech communication, see our special session paper published at INTERSPEECH last year, found here. The description of the special issue is as follows:

April 2015: Analysis of Message Scheduling for Belief Propagation (Christian Knoll, Franz Pernkopf)

The influence of different message update schemes on belief propagation (BP) highlights the need for designing an appropriate message scheduling. Yet, residual belief propagation (RBP) is the only established method that utilizes this observation and consequently increases the convergence rate. We observed that RBP fails to converge if local oscillations occur and the same messages are repeatedly updated. We propose two novel methods to prevent and correct such oscillations. First, we show how noise injection belief propagation (NIBP) detects oscillating messages and adds random noise to improve the convergence rate. The second method, weight decay belief propagation (WDP), applies a damping on the residual to gradually reduce the relevance of these messages and consequently forces convergence. Additionally, in contrast to previous work, we consider the correctness of the obtained marginals and present the remarkable performance increase on a variety of synthetic problems.

March 2015: Simultaneous Localization and Mapping using Multipath Channel Information (Paul Meissner, Erik Leitinger)

In this work, we propose a new simultaneous localization and mapping (SLAM) approach that makes it possible to learn the floor plan representation and to deal with inaccurate information. A key feature is an online estimated channel characterization that enables an efficient combination of the measurements. Starting with just the known anchor positions, the proposed method also includes the virtual anchor (VA) positions in the state space and is thus able to adapt the VA positions during tracking of the agent. Furthermore, the method is able to discover new potential VAs in a feature-based manner. The work presents a proof of concept using measurement data. The excellent agent tracking performance achieved with a known floor plan, with 90% of the errors below 5 cm, can be reproduced with SLAM.

February 2015: Speech/Non-Speech Detection for Electro-Larynx Speech Using EMG (Anna Katharina Fuchs)

Experiments:

January 2015: Learning Mixtures of Submodular Functions for Image Collection Summarization (Sebastian Tschiatschek)

We address the problem of image collection summarization by learning mixtures of submodular functions. Submodularity is useful for this problem since it naturally represents characteristics such as fidelity and diversity, desirable for any summary. Several previously proposed image summarization scoring methodologies, in fact, instinctively arrived at submodularity. We provide classes of submodular component functions (including some which are instantiated via a deep neural network) over which mixtures may be learnt.

December 2014: General Stochastic Networks for Classification (Matthias Zöhrer)

We extend generative stochastic networks to supervised learning of representations. In particular, we introduce a hybrid training objective considering a generative and discriminative cost function governed by a trade-off parameter λ. We use a new variant of network training involving noise injection, i.e. walkback training, to jointly optimize multiple network layers. Neither additional regularization constraints, such as L1 or L2 norms or dropout variants, nor pooling or convolutional layers were added. Nevertheless, we are able to obtain state-of-the-art performance on the MNIST dataset without using permutation invariant digits, and we significantly outperform baseline models on sub-variants of the MNIST and rectangles datasets.

November 2014: Room localization for distant speech recognition (Juan Andrés Morales Cordovilla)

The problem of room localization is to determine where, in a multi-room environment, a person is producing a speech utterance. At Interspeech 2014 we presented the system shown in the figure. It exploits the information gained from a network of microphones installed in a house, where the lack of calibration of the microphone energies creates an additional challenge.

October 2014: SPSC goes Antarctica (Martin Hagmüller)

Our lab is responsible for the audio recording of dinner table talk at Concordia station in Antarctica for the European Space Agency sponsored project CAPA (Psychological Status Monitoring by Content Analysis and Acoustic-Phonetic Analysis of Crew Talks and Video Diaries).

September 2014: Sum-Product Networks for Structured Prediction: Context-Specific Deep Conditional Random Fields (Martin Ratajczak)

Linear-chain conditional random fields (LC-CRFs) have been successfully applied in many structured prediction tasks. Many previous extensions, e.g. replacing local factors by neural networks, are computationally demanding. In this paper, we extend conventional LC-CRFs by replacing the local factors with sum-product networks, i.e. a promising new deep architecture allowing for exact and efficient inference.
The proposed local factors can be interpreted as an extension of Gaussian mixture models (GMMs). Thus, we provide a powerful alternative to LC-CRFs extended by GMMs. In extensive experiments, we achieved performance competitive to state-of-the-art methods in phone classification and optical character recognition tasks.

July 2014: Pronunciation variation in Austrian German: A comparison of read and conversational speech (Barbara Schuppler)

Whereas for the varieties of German spoken in Germany, conversational speech has been given noticeable attention in the fields of linguistics and automatic speech recognition (ASR), for conversational Austrian German there is a lack of speech resources and tools as well as phonetic studies. Based on the recently collected GRASS corpus, we provide rule-based methods for the creation of a pronunciation dictionary and an ASR-supported automatic method for the creation of broad phonetic transcriptions of conversational Austrian German. Our comparative analysis based on these transcriptions showed that whereas only 33.1% of the tokens in read speech show variation from the canonical transcription, this number rises to 63.2% in conversational speech. In the future, we will perform more detailed analyses concerning the conditions for pronunciation variation and incorporate our findings into models of automatic speech recognition.

June 2014: Artificial Bandwidth Extension using Sum-Product Networks (Robert Peharz)

Sum-product networks (SPNs) are a recently proposed deep network architecture for representing probability distributions. They allow a high degree of dependency among the random variables, while still allowing efficient inference. In particular, SPNs showed convincing results on the ill-posed problem of image completion, i.e. predicting missing parts of an image given the observed part. We applied SPNs to the related task of artificial bandwidth extension, i.e. recovering the lost high frequencies in telephone speech, using the observed telephone low-band. To this end, we incorporated SPNs as observation models in hidden Markov models and used most-probable explanation (MPE) inference for reconstructing the lost frequency bins. The extended signals have a natural high-frequency structure in the spectrogram, and improve the state-of-the-art in terms of log-spectral distortion and in informal listening tests.

May 2014: Multipath-Assisted Maximum-Likelihood Indoor Positioning using UWB Signals (Erik Leitinger)

Multipath-assisted indoor positioning (using ultra-wideband signals) exploits the geometric information contained in deterministic multipath components. With the help of a-priori available floorplan information, robust localization can be achieved, even in the absence of a line-of-sight connection between anchor and agent. In a recent work, the Cramer-Rao lower bound has been derived for the position estimation variance using a channel model which explicitly takes into account diffuse multipath as a stochastic noise process in addition to the deterministic multipath components. In this work, we adapt this model for position estimation via a measurement likelihood function (LHF) and evaluate the performance for real channel measurements. To find the global maximum of the highly multi-modal LHF, we introduced a particle filter method with swarm behavior optimization (PF-PSO). Performance results confirm the applicability of this approach and show the importance of considering diffuse multipath.

April 2014: Markov Aggregation and Information Bottlenecks (Bernhard Geiger)

Many scientific disciplines, such as systems biology or natural language processing, suffer from Markov chains with exploding state spaces. Markov aggregation, i.e., finding a Markov chain on the partition of the original state space, is one way to reduce the computational complexity of the model. We provide an information-theoretic cost function for the problem of Markov aggregation and show that the information bottleneck method, a popular technique in machine learning, can be used to find a solution iteratively.
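
To illustrate what a Markov chain on a partition of the state space looks like, the following Python sketch builds the aggregated transition matrix for a given partition by weighting the original transitions with the stationary distribution. This is the standard stationary-weighted projection, used here as an assumption for illustration; the information-theoretic cost function and the information bottleneck procedure for finding a good partition are not reproduced, and all names are hypothetical.

```python
def stationary_distribution(P, n_iter=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def aggregate_chain(P, partition):
    """Transition matrix of the chain aggregated according to a partition.

    partition[i] is the index of the group that original state i belongs to.
    Q[a][b] = sum_{i in a} pi_i * sum_{j in b} P[i][j] / sum_{i in a} pi_i
    """
    pi = stationary_distribution(P)
    k = max(partition) + 1
    mass = [0.0] * k
    flow = [[0.0] * k for _ in range(k)]
    for i, a in enumerate(partition):
        mass[a] += pi[i]
        for j, b in enumerate(partition):
            flow[a][b] += pi[i] * P[i][j]
    return [[flow[a][b] / mass[a] for b in range(k)] for a in range(k)]


# Three states; states 1 and 2 are merged into one group.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
Q = aggregate_chain(P, partition=[0, 1, 1])
```

In this example both merged states leave their group with the same probability, so the two-state chain `Q` describes the aggregated process without loss; for general partitions the aggregation is only an approximation, which is what a cost function has to quantify.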

March 2014: GRASS: The Graz corpus of Read And Spontaneous Speech (Barbara Schuppler)

Research in both linguistics and speech technology requires large speech corpora, recorded at sufficiently high quality and transcribed at least at the orthographic level, which can be used for the generation of further annotation layers (e.g., phonetic, morphological, syntactic and/or prosodic level). Since for Austrian German the available speech material was very limited, we have recently created the GRASS corpus, the first corpus of read and conversational Austrian German. GRASS contains phonetically balanced sentences, commands elicited by pictures, key words, telephone numbers and one hour of free conversations produced by 38 speakers originating from one of the major cities of eastern Austria (Graz, Linz, Salzburg, Vienna). Super-wideband recordings enable the simulation of different acoustic environments by filtering the speech material with different measured room impulse responses. Orthographic transcriptions were created manually and include the annotation of breathing, hesitations and laughter. More information can be found in our paper.

February 2014: Coding Efficiency and Efficiency-Linearity Trade-Off of Aliasing-Free PWM Based Burst-Mode RF Transmitters (Shuli Chi)

Coding efficiency is an important measure of burst-mode RF transmitters. In our recent publication [1] we have proposed an aliasing-free PWM (AFPWM) method which can avoid all destructive aliasing distortion due to the sampling process when the PWM process is performed in the digital domain. A side effect of the AFPWM method is that it induces amplitude variations onto the amplitude of the generated PWM signals. On the one hand, the non-ideal switching amplitude can cause nonlinear distortion due to the clipping effect, where a possible way to minimize the ripple is to choose an appropriate number of harmonics in the generated PWM signals. On the other hand, with the AFPWM method, the PA is operated over a slightly wider range of output power regions instead of operating at saturation and in cut-off, resulting in a reduced RF PA efficiency.

January 2014: A German Parallel Electro-Larynx Speech -- Healthy speech corpus (Anna Katharina Fuchs)

Experiments:
In this paper, we describe the German parallel Electro-Larynx speech – Healthy speech (ELHE) Corpus which has been recorded in our recording studio. 3 female and 4 male healthy subjects recorded up to 500 sentences spoken one time with healthy speech (HE) and one time using the Electro-Larynx (EL) device.
Analyses of signal-to-noise ratios (SNR) have shown the following: For HE speech only two levels (noise and speech) can be distinguished but there are three different levels inherent in EL speech (see figure): noise, direct-radiated noise from the EL device (DREL) and speech (corrupted with DREL). First-order IIR smoothing was used to estimate the short-term power of the signal and of the noise whereas the DREL level was found using an iteratively changing threshold.
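
The first-order IIR smoothing mentioned above can be sketched in a few lines of Python. The smoothing constant `alpha`, the initial value `p0`, and the function name are assumptions for this example; the values used for the corpus analysis are not stated here, and the iterative threshold for the DREL level is not reproduced.

```python
def short_term_power(x, alpha=0.9, p0=0.0):
    """Recursive short-term power estimate of a signal.

    p[n] = alpha * p[n-1] + (1 - alpha) * x[n]**2
    A larger alpha gives a smoother but slower-reacting estimate.
    """
    p = p0
    out = []
    for sample in x:
        p = alpha * p + (1.0 - alpha) * sample * sample
        out.append(p)
    return out


# For a constant input of amplitude 2 the estimate approaches 4.
track = short_term_power([2.0] * 3, alpha=0.9)
```

Applying the same recursion with different time constants is a common way to track a slowly varying noise level and a faster varying speech level separately.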

Conclusion:
Statistical analyses have shown that the length of EL sentences is longer than for HE sentences. Moreover, the fundamental frequency f0 of EL speech depends on the EL device and the variance of f0 is larger for HE speech due to the missing changing patterns of EL speech. This corpus can be used to analyze differences and similarities between healthy speech and disordered speech and based on this knowledge to improve the disordered speech.

December 2013: Live Demonstration of MINT at IPIN2013 (Paul Meissner)

At this year’s International Conference on Indoor Positioning and Indoor Navigation (IPIN2013), a real-time demonstration of multipath-assisted indoor navigation and tracking (MINT) was presented. Using an M-sequence based ultra-wideband (UWB) channel sounder, a mobile user is tracked by exploiting the geometric structure of deterministic multipath components (MPCs). The plot shows the hardware setup and a tracking result in a roughly 4 m x 5 m room, demonstrating the centimeter-level accuracy.

This demonstration shows the benefits and challenges of this approach: On the one hand, deterministic MPCs carry a significant amount of position-related information that can increase both accuracy and robustness of tracking algorithms. This is especially relevant in non-line-of-sight (NLOS) situations, which are the most important performance impairments for radio-based indoor localization systems still today. On the other hand, the problem is challenging as reliable detection of MPCs and data association are required. This demonstration shows real-time algorithms that allow for systematic exploitation of multipath for indoor positioning and tracking. With this approach, position errors below 10 cm can be achieved in several realistic scenarios.

At IPIN, the demo ran throughout the conference and attracted much interest from the audience. For more information on the algorithms used, please consult our paper from ENC2013!

November 2013: Modeling and Identification of Nonideal Ultra‐Wideband Multiplication Devices (Andreas Pedroß-Engel)

Analog multipliers are employed in many applications. In RF front-ends, for example, they are widely used for frequency conversion tasks. For noncoherent receivers such as energy detectors or transmitted-reference front-ends, they need to be able to multiply arbitrary (broadband) input signals. Unfortunately, no ideal hardware realization of such devices exists; hence, they inevitably create undesired signal content at their output. To deal with these effects or correct for them, we need to be able to model and identify realistic RF multipliers.

October 2013: Sum Product Networks for Reconstructing Missing Data (Robert Peharz)

Sum-Product Networks (SPNs) are a novel type of graphical model that can represent complex variable interactions while still allowing efficient inference. They show especially convincing results in reconstruction tasks, i.e., predicting missing parts of data given partial evidence. The image shows, from top to bottom: original image, covered image, reconstruction using Poon & Domingos’ SPN algorithm (2011), Dennis & Ventura’s algorithm (2012), and our recently proposed Greedy Part-wise SPN learning algorithm.

September 2013: ASR for Electro-Laryngeal Speech (Anna Katharina Fuchs, Juan Andrés Morales Cordovilla)

In this work, we apply disordered speech, namely speech produced with an Electro-Larynx (EL), to an Automatic Speech Recognition (ASR) system that was designed for normal, healthy speech. When disordered speech is applied to ASR systems, performance decreases significantly. ASR systems are increasingly becoming part of daily life. Therefore, the word accuracy rate for disordered speech should be reasonably high to make ASR technologies accessible to patients suffering from speech disorders.

August 2013: Information Preserving Markov Chain Aggregation (Bernhard Geiger)

In a recent research collaboration with the Department of Mathematical Structure Theory, we characterized state space aggregations of Markov chains which preserve the information contained in the model. Moreover, we presented an information-theoretic characterization of lumpability, i.e., of the phenomenon that a non-injective function of a Markov chain can be a Markov chain of higher order. These characterizations, together with a set of sufficient conditions on the transition graph of the Markov chain, were employed for lossless model order reduction.
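
To make the notion of aggregation concrete, the following sketch checks the classical first-order (strong) lumpability condition for a given partition of the state space. It is an illustration only: it is not the information-theoretic characterization described above, and the function names and the example matrix are assumptions made for this example.

```python
def is_strongly_lumpable(P, partition, tol=1e-12):
    """Check strong lumpability of a row-stochastic matrix P for a partition.

    `partition` is a list of blocks, each a list of state indices. The chain
    is strongly lumpable if, for every pair of blocks (A, B), all states in A
    have the same total probability of moving into B.
    """
    for block_a in partition:
        for block_b in partition:
            sums = [sum(P[i][j] for j in block_b) for i in block_a]
            if max(sums) - min(sums) > tol:
                return False
    return True


def lumped_matrix(P, partition):
    """Transition matrix of the aggregated chain (meaningful if lumpable)."""
    return [[sum(P[block_a[0]][j] for j in block_b) for block_b in partition]
            for block_a in partition]


# Hypothetical three-state chain; states 1 and 2 behave alike toward state 0.
P = [[0.5, 0.25, 0.25],
     [0.2, 0.3, 0.5],
     [0.2, 0.5, 0.3]]

print(is_strongly_lumpable(P, [[0], [1, 2]]))
print(is_strongly_lumpable(P, [[0, 1], [2]]))
print(lumped_matrix(P, [[0], [1, 2]]))
```

Merging states 1 and 2 yields a two-state Markov chain, whereas merging states 0 and 1 does not satisfy the first-order condition.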

July 2013: Performance Bounds for Multipath-Assisted Indoor Localization on Backscatter Channels (Erik Leitinger)

In this work, we derive the Cramér-Rao lower bound (CRLB) on the position error for an RFID tag positioning system exploiting multipath. The channels constituting the backscatter radio system are modeled with a hybrid deterministic/stochastic channel model. In this way, both the geometry of the deterministic multipath components (MPCs) and the diffuse multipath are taken into account properly. Computational results show the influence of the room geometry on the bound and the importance of the diffuse multipath in dense indoor environments. Time reversal (TR) processing using the deterministic MPCs is analyzed as one possibility to overcome the degenerate nature of the backscatter channel. A derivation and evaluation of the corresponding CRLB shows the potential gain of TR processing as well as its strong dependence on the geometry.

June 2013: Communication System Receiver Filter: Searching is better than designing! (Manfred Mücke, Andreas Pedroß-Engel)

Digital IIR filter implementations are important building blocks of most communication systems. Conventionally, the filters are specified via amplitude and phase in the frequency domain, as given by matched filter theory. However, digital filter implementations, nonlinear analog components, and channel characteristics introduce a multitude of additional effects that matched filter theory does not take into account, so its results provide a rough estimate at best. Our work reforms the design process: it defines the system’s bit error rate as the main objective and searches the huge, yet finite, filter design space for suitable coefficients.

May 2013: Double Pitch Marks in Diplophonic Voice (Philipp Aichinger)

The measurement of pitch marks (PMs) is an important part of voice assessment. In diplophonic voice (i.e., a pathologic voice with two pitches), PM determination is crucial, and its validity needs special attention. Hence, a new approach for PM determination from Laryngeal High-Speed Videos (LHSVs), rather than from audio signals, is proposed. In this novel approach, double PMs instead of traditional single PMs are extracted from a diplophonic voice sample in order to account for double fundamental frequencies. The dominant oscillation frequencies of the vocal folds are extracted by spectral analysis of LHSVs with respect to time. Unit pulse trains with these frequencies are created as PM trains and compensated for the phase shift. The PMs are compared to Praat’s single audio PMs. It is shown that double PMs are needed in order to analyze diplophonic voice, because traditional single PMs do not explain its double-source characteristic.

April 2013: Efficiency Optimization for Burst-Mode Multilevel Radio Frequency Transmitters (Shuli Chi)

The use of a burst-mode power amplifier (PA) together with pulse-width modulation (PWM) is a promising concept for achieving high efficiency in radio frequency (RF) transmitters. Nevertheless, such a transmitter requires bandpass filtering to suppress side-band spectral components and retrieve the wanted signal, which reduces the transmit power and the transmitter efficiency. To boost efficiency for signals with high peak-to-average power ratios (PAPRs) and signals at variable transmit power levels, burst-mode multilevel transmitters have been widely discussed as a potential solution.

March 2013: On phase estimation in single-channel speech enhancement and separation

In many speech processing applications, the spectral amplitude is the dominant information, while the phase spectrum is not so widely used. In [6] we present an overview of why the speech phase spectrum has been neglected in the conventional techniques used in different applications, including speech enhancement and source separation. Methods for recovering a target speech signal from a single-channel recording fall into two groups: 1) single-channel speech separation and 2) single-channel speech enhancement algorithms. While there has been some success in both groups, they frequently ignore the issue of phase estimation in their parameter estimation and signal reconstruction. Instead, they directly pass the noisy signal phase on for reconstructing the output signal, which leads to perceptual artifacts in the form of musical noise in speech enhancement and cross-talk in speech separation.

February 2013: Predicting human and ASR classification of plosives by their sub-phonemic properties (Barbara Schuppler)

In conversational speech, words are often realized in a reduced way compared to their citation forms. One frequent process in Germanic languages is the deletion of word-final /t/. The German word und, for instance, is often pronounced as un. In a series of studies, we investigated the role of reduced plosives for human perception compared to their role for automatic speech processing.

January 2013: Aliasing-Free Digital Pulse-Width Modulation for Burst-Mode RF Transmitters (Katharina Hausmair)

Digital pulse-width modulation (PWM) is used to encode a nonconstant-envelope signal into a train of rectangular pulses with varying widths, such that the information lying in the amplitude of the input signal is represented by the widths of the pulses. Pulsed signals can be used to drive the power amplifier in burst-mode RF transmitters. After amplification, the desired signal, which is the amplified passband equivalent of the input to the pulse-width modulator, has to be recovered by a bandpass filter. However, when generating PWM digitally, a considerable amount of distortion can be observed in and around the band of the desired signal, which prevents perfect signal recovery after amplification. Therefore, conventional PWM is unsuitable for use in burst-mode RF transmitters.
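
The basic amplitude-to-width mapping of conventional digital PWM can be sketched as follows. This is a simplified baseband illustration under stated assumptions: amplitudes are normalized to [0, 1], each input sample occupies one PWM period, and the names `pwm_encode`, `mean_per_period`, and `ticks_per_period` are hypothetical. It shows only the encoding principle and does not address the distortion discussed above.

```python
def pwm_encode(amplitudes, ticks_per_period=8):
    """Encode normalized amplitudes as a binary pulse train.

    Each input sample becomes one PWM period of `ticks_per_period` clock
    ticks; the pulse is high for round(a * ticks_per_period) ticks, starting
    at the beginning of the period.
    """
    pulses = []
    for a in amplitudes:
        if not 0.0 <= a <= 1.0:
            raise ValueError("amplitudes must be normalized to [0, 1]")
        width = round(a * ticks_per_period)
        pulses.extend([1] * width + [0] * (ticks_per_period - width))
    return pulses


def mean_per_period(pulses, ticks_per_period=8):
    """Average the pulse train over each period to recover the duty cycle."""
    return [sum(pulses[i:i + ticks_per_period]) / ticks_per_period
            for i in range(0, len(pulses), ticks_per_period)]


envelope = [0.0, 0.25, 0.5, 1.0]
train = pwm_encode(envelope, ticks_per_period=8)
print(train)
print(mean_per_period(train, ticks_per_period=8))
```

In this sketch the pulse widths are restricted to multiples of one clock tick, so amplitudes that do not fall on this grid are rounded.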

December 2012: Restructuring and Modernization of the SPSC Studio (Gerhard Graber)

The SPSC Studio is the key facility for educating students in audio recording and related fields. Quite a number of labs and seminars are held there: electro-acoustics, room-acoustics, and digital-audio-technology labs as well as the recording-studio-technology lab and recording practice, to name just a few. In recent weeks, following a process of rethinking workflows and restructuring its signal-flow concept, it was equipped with a new Lawo mc2 66 mixing desk. As this is one of the most widely used consoles in broadcast and large-scale recording and events, it enables students to be trained on the tools they will encounter in their careers after graduation.

November 2012: Bayesian Network Classifiers with Reduced Precision Parameters (Sebastian Tschiatschek)

Bayesian network classifiers (BNCs) are probabilistic classifiers that show good performance in many applications. They consist of a directed acyclic graph and a set of conditional probabilities associated with the nodes of the graph. These conditional probabilities are also referred to as the parameters of the BNC. According to common belief, these classifiers are insensitive to deviations of the conditional probabilities under certain conditions. The first condition is that these probabilities are not too extreme, i.e., not too close to 0 or 1. The second is that the posterior probabilities of the classes differ significantly. We investigated the effect of reducing the precision of the parameters on the classification performance of BNCs. The probabilities are determined either generatively or discriminatively. Discriminative probabilities are typically more extreme. However, our results indicate that BNCs with discriminatively optimized parameters are almost as robust to precision reduction as BNCs with generatively optimized parameters. Furthermore, even a large precision reduction does not decrease classification performance significantly. Our results allow the implementation of BNCs with less computational complexity. This supports applications in embedded systems that use floating-point numbers with small bit-width. Reduced bit-widths further make it possible to represent BNCs in the integer domain while maintaining the classification performance.

October 2012: Learning an Artificial F0-Contour for ALT Speech (Anna Katharina Fuchs)

The Artificial Larynx Transducer (ALT) is a means of regaining audible speech for people who had to undergo an operation in which the vocal folds were removed. It has been known for decades that the resulting speech suffers from several problems, such as very poor speech quality and an unnatural sound. One reason for the lack of naturalness is the constant vibration of the ALT, and one way to substantially improve ALT speech is to introduce a varying fundamental frequency (F0) contour. In this work, we present a new method to automatically learn an artificial F0-contour.

September 2012: Tracking of UWB Multipath Components Using Probability Hypothesis Density Filters (Markus Fröhle)

For indoor navigation and tracking using ultra wideband (UWB) radio signals, explicit use of the present multipath propagation can be made. Then, the multipath components (MPCs) need to be extracted from the measured channel impulse response (CIR). In this work we present a method to simultaneously estimate and track the number of MPCs present together with their individual state from measured CIR data using the Probability Hypothesis Density (PHD) multi-target filter.

August 2012: Error Analysis and Precision Estimation for Floating-Point Dot-Products Using Affine Arithmetic (Thang Huynh Viet)

In this work we use Affine Arithmetic (AA) to estimate the rounding error of different floating-point dot-product implementations. Two floating-point dot-product architectures - a sequential dot-product and a parallel (binary-tree) dot-product - are considered over a wide range of parameters. It is shown that an AA-based probabilistic bounding operator is able to provide a tighter rounding error bound compared to existing techniques. Furthermore, the analytical models for the rounding errors of different floating-point dot-product architectures are derived. As the estimated rounding error bounds are then used for bit width allocation for hardware implementations, the presented error models are key to floating-point code generators and efficient design space exploration.
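
The two dot-product architectures can be compared numerically with a small experiment. The sketch below measures the actual rounding error of double-precision implementations against an exact rational reference; it does not implement Affine Arithmetic or the probabilistic bounding operator, and the function names and test vectors are assumptions made for this example.

```python
from fractions import Fraction


def dot_sequential(a, b):
    """Accumulate the products one after another (sequential architecture)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_tree(a, b):
    """Add the products pairwise in a binary tree (parallel architecture)."""
    terms = [x * y for x, y in zip(a, b)]
    if not terms:
        return 0.0
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])  # odd element moves up unchanged
        terms = paired
    return terms[0]


def dot_exact(a, b):
    """Exact dot product of the given floats, computed with rationals."""
    return sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))


a = [0.1] * 1000
b = [0.3] * 1000
exact = dot_exact(a, b)
print("sequential error:", float(abs(Fraction(dot_sequential(a, b)) - exact)))
print("tree error:      ", float(abs(Fraction(dot_tree(a, b)) - exact)))
```

`Fraction(x)` converts a float without rounding, so the reference is the exact dot product of the stored input values rather than of the decimal literals.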

July 2012: Analysis of Nonideal Multipliers for Multichannel Autocorrelation UWB Receivers (Andreas Pedroß-Engel)

In this work, the hardware implementation of a noncoherent multichannel autocorrelation UWB receiver (AcR) is addressed. We focus on the multiplication device, which is a core part of the AcR and introduces strong interference due to nonlinear effects. To analyze the signal-to-interference ratio performance of the receiver system, a combined Wiener-Hammerstein system model of the multiplication device is introduced. It is shown that the receiver performance strongly depends on the input power of the nonideal multiplier devices.

June 2012: Beamforming for Distant Speech Recognition in Reverberant Environments and Double-Talk Scenarios (Hannes Pessentheiner)

Beamforming is crucial for distant-speech recognition to mitigate causes of system degradation, e.g., interfering noise sources or competing speakers. We introduced adaptations of state-of-the-art broadband data-independent and data-dependent beamformers to uniform circular arrays (UCA), such that competing speakers are attenuated sufficiently for distant speech recognition.

May 2012: Capacity and Capacity-Achieving Input Distribution of the Energy Detector (Erik Leitinger)

This work presents the channel capacity and capacity-achieving input distribution of an energy detection receiver structure. Using the Blahut-Arimoto algorithm combined with a particle method, the positions and probabilities of the optimal mass points were found. It was shown that the capacity increases with decreasing noise dimensionality M and increasing peak-to-average power ratio (PAPR, parameter r in figure) and that the capacity-achieving input distribution is discrete with a finite number of mass points.
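
The core of the Blahut-Arimoto algorithm can be sketched for a discrete memoryless channel with a fixed input alphabet. This is the plain textbook iteration, not the variant combined with a particle method used in this work; the function name, the iteration count, and the binary symmetric test channel are assumptions made for this example.

```python
import math


def blahut_arimoto(W, iterations=500):
    """Capacity (in bits) and optimal input distribution of a discrete channel.

    W[x][y] is the transition probability P(y | x); each row must sum to one.
    """
    n_in, n_out = len(W), len(W[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution

    def output_distribution(p):
        return [sum(p[x] * W[x][y] for x in range(n_in)) for y in range(n_out)]

    def divergence(x, q):
        # Kullback-Leibler divergence D(W[x] || q) in nats; zero terms skipped.
        return sum(W[x][y] * math.log(W[x][y] / q[y])
                   for y in range(n_out) if W[x][y] > 0.0)

    for _ in range(iterations):
        q = output_distribution(p)
        scores = [p[x] * math.exp(divergence(x, q)) if p[x] > 0.0 else 0.0
                  for x in range(n_in)]
        total = sum(scores)
        p = [s / total for s in scores]

    q = output_distribution(p)
    capacity_nats = sum(p[x] * divergence(x, q)
                        for x in range(n_in) if p[x] > 0.0)
    return capacity_nats / math.log(2.0), p


# Binary symmetric channel with crossover probability 0.1.
capacity, distribution = blahut_arimoto([[0.9, 0.1], [0.1, 0.9]])
print(capacity, distribution)
```

For a continuous, peak-constrained input such as that of the energy detector, the positions of the mass points also have to be optimized, which is where the particle method comes in.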

April 2012: Speech Enhancement Using Pre-Image Iterations (Christina Leitner, Franz Pernkopf)

In this work, we show how to de-noise speech in the complex spectral domain using pre-image iterations. The method is derived from kernel principal component analysis (kPCA). Instead of applying PCA in a high-dimensional feature space and then going back to the original input space by using a solution to the pre-image problem, only the pre-image step is applied for de-noising. We show that the de-noised audio sample is a convex combination of the noisy input data and that the resulting algorithm is closely related to the soft k-means algorithm. Compared to kPCA, this method reduces the computational costs while the audio quality is similar and speech quality measures do not degrade.
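
The convex-combination property can be illustrated with a simplified fixed-point iteration for a Gaussian kernel. The sketch makes several assumptions that are not taken from the work itself: all samples receive uniform coefficients, no regularization is applied, the data are real-valued vectors rather than complex spectral patches, and the names `pre_image_iterations`, `c`, and `steps` are hypothetical.

```python
import math


def pre_image_iterations(z0, data, c=1.0, steps=10):
    """Iterate z <- sum_i k(z, x_i) x_i / sum_i k(z, x_i) with a Gaussian kernel.

    The kernel is k(z, x) = exp(-||z - x||**2 / c). The weights are
    non-negative and sum to one, so every iterate is a convex combination
    of the data points.
    """
    z = list(z0)
    for _ in range(steps):
        weights = [math.exp(-sum((zi - xi) ** 2 for zi, xi in zip(z, x)) / c)
                   for x in data]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed; keep the current estimate
            break
        z = [sum(w * x[d] for w, x in zip(weights, data)) / total
             for d in range(len(z))]
    return z


# Hypothetical data: a tight cluster near the origin and one distant point.
data = [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.05], [5.0, 5.0]]
print(pre_image_iterations([0.3, 0.3], data, c=1.0, steps=10))
```

Each update is a kernel-weighted mean of the data, which is the same form as the centroid update in soft k-means with a single centroid.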

March 2012: A Comparison Between Ratio Detection and Threshold Comparison for GNSS Acquisition (Christian Vogel, Bernhard Geiger)

In this work, two widespread global navigation satellite system (GNSS) acquisition strategies are compared. The first strategy, threshold comparison (TC), bases its decision on comparing the energy within a cell of the partitioned search space to a threshold, while the second, ratio detection (RD), uses the ratio between the two largest cell energies. It is shown that TC outperforms RD in terms of receiver operating characteristics in many practically relevant cases. Moreover, despite the purported simplicity of the ratio detection method, its complexity is shown to be comparable to or even higher than that of threshold comparison with adaptive threshold setting.
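
The two decision rules can be sketched as follows, assuming the cell energies of the search space have already been computed. The function names, the thresholds, and the example energies are hypothetical, and the adaptive threshold setting mentioned above is not included.

```python
def threshold_comparison(cell_energies, threshold):
    """Declare acquisition if the largest cell energy exceeds a threshold.

    Returns (detected, index of the strongest cell).
    """
    peak = max(cell_energies)
    return peak > threshold, cell_energies.index(peak)


def ratio_detection(cell_energies, ratio_threshold):
    """Declare acquisition if the largest cell energy exceeds the second
    largest by more than the factor `ratio_threshold`.

    Returns (detected, index of the strongest cell).
    """
    if len(cell_energies) < 2:
        raise ValueError("ratio detection needs at least two cells")
    ordered = sorted(cell_energies, reverse=True)
    largest, second = ordered[0], ordered[1]
    # Written as a product to avoid dividing by a zero-energy cell.
    return largest > ratio_threshold * second, cell_energies.index(largest)


# Hypothetical cell energies with a clear peak in cell 2.
energies = [1.0, 1.2, 9.0, 1.1]
print(threshold_comparison(energies, threshold=5.0))
print(ratio_detection(energies, ratio_threshold=3.0))
```

Ratio detection needs the two largest energies, whereas threshold comparison needs only the maximum plus a suitably chosen threshold.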

February 2012: A Probabilistic Model-Based Approach for Multipitch Tracking of Speech (Michael Wohlmayr)

The fundamental frequency is an important characteristic of speech signals. Most energy of voiced speech utterances is carried by the harmonics, which are located at integer multiples of the fundamental frequency.
The task of multipitch tracking is to extract the fundamental frequencies from a mixture of simultaneous speakers. In this work, we investigate a model-based approach in which speaker-specific characteristics are learned beforehand. The availability of speaker-dependent (SD) models additionally makes it possible to assign each pitch estimate to its corresponding speaker.

The figure above shows an example of a speech mixture of two female speakers. Panel (a): Spectrogram of the speech mixture, together with reference pitch trajectories extracted from the individual speech recordings (black and blue lines). Note that the pitch trajectories of both speakers are located in the same frequency range and cross each other. In this situation, assigning pitch estimates to the corresponding speakers based on time-continuity constraints is hard or even impossible; additional consideration of speaker-specific spectral characteristics is necessary. Panel (b): Estimated pitch trajectories using speaker-dependent (SD) models. The color of the estimated pitch points indicates the assignment to a speaker (red x: speaker 1, blue o: speaker 2). Additionally, the reference trajectories are shown as black lines. A comparison with panel (a) reveals that the speaker assignment is correct most of the time. Panel (c): Estimated pitch trajectories using speaker-independent (SI) models. A comparison with panels (a) and (b) shows that the speaker assignment is inferior to that of the SD models.

January 2012: Coding Efficiency Optimization for Multilevel Pulse-Width Modulation (PWM) Based Switched-Mode Radio Frequency (RF) Transmitters (Shuli Chi)

In modern wireless communication systems, complex modulation techniques are employed for increased data rates and spectral efficiency. However, conventional radio frequency (RF) transmitters with linear power amplifier operation only provide moderate overall transmitter efficiency for complex modulated signals. Switched-mode power amplifiers (SMPAs) with appropriate baseband modulation techniques such as pulse-width modulation (PWM) are employed to increase the overall transmitter efficiency. One of the drawbacks of this technique is out-of-band power. This out-of-band power needs to be filtered out in order to fulfill the transmission spectrum requirements, which reduces the overall efficiency. A measure for the efficiency degradation of such pulsed transmitters is the coding efficiency. This work investigates optimization concepts for the coding efficiency of multilevel pulsed transmitters.

December 2011: Detection Results of Spectro-temporal Fragment-based Multiband Position-Pitch (MPoPi) Algorithm (Tania Habib, Harald Romsdorfer)

With increasingly powerful and affordable computational resources for digital signal processing and growing use of sensor arrays, acoustic source localization has become an interesting area of research. In contrast to traditional localization applications such as radar and sonar, speech source localization introduces additional challenges due to the wideband and non-stationary nature of speech signals, due to the unknown trajectories of the speakers and due to the effects of multipath propagation in enclosures.

November 2011: Performance Bounds for Multipath-aided Indoor Navigation and Tracking (MINT) (Klaus Witrisal, Paul Meissner)

Indoor positioning based on ultra-wideband radio signals remains a challenging problem, in particular due to errors induced by non-line-of-sight propagation conditions. The MINT (multipath-aided indoor navigation and tracking) approach exploits the geometry of deterministic multipath components (MPCs) in such situations. Reflected multipath components are accounted for by virtual signal sources, indicated as “SR” and “DR” in the figure. The figure shows the Cramér-Rao lower bound of the position error for this scenario.
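
The idea of a virtual signal source can be sketched with elementary geometry: mirroring a source across a flat wall yields a virtual source whose distance to the receiver equals the length of the reflected path. The sketch assumes a two-dimensional floor plan and an ideal specular reflection; the function names and coordinates are hypothetical.

```python
import math


def mirror_across_wall(point, wall_a, wall_b):
    """Mirror `point` across the infinite line through wall_a and wall_b."""
    px, py = point
    ax, ay = wall_a
    bx, by = wall_b
    dx, dy = bx - ax, by - ay
    # Foot of the perpendicular from the point onto the wall line.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    fx, fy = ax + t * dx, ay + t * dy
    return (2.0 * fx - px, 2.0 * fy - py)


def reflected_path_length(source, receiver, wall_a, wall_b):
    """Length of the single-bounce path, via the virtual source."""
    vx, vy = mirror_across_wall(source, wall_a, wall_b)
    return math.hypot(receiver[0] - vx, receiver[1] - vy)


# Hypothetical setup: a wall along the y-axis, source and receiver to its right.
source, receiver = (1.0, 1.0), (2.0, 1.0)
wall = ((0.0, 0.0), (0.0, 5.0))
print(mirror_across_wall(source, *wall))
print(reflected_path_length(source, receiver, *wall))
```

Mirroring a virtual source again across a second wall would give the virtual source of a second-order reflection. The sketch does not check whether the reflection point actually lies on the finite wall segment.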

October 2011: Sparse Consensus-based Distributed Field Estimation (Thomas Buchgraber)

In this work, a fully decentralized algorithm inspired by sparse Bayesian learning (SBL) is presented. It can be used for non-parametric sparse estimation of unknown spatial functions (spatial fields) with wireless sensor networks (WSNs). Such a field is represented as a weighted linear combination of fixed basis functions.

September 2011: Characterization of Wideband Backscatter Channels (Daniel Arnitz)

Backscatter systems have become more and more popular since radio-frequency identification (RFID) emerged a few years ago. Recent advances in short-range indoor backscatter localization, however, have shown that there is little to no information available on wideband backscatter channels despite the abundance of analyses available for single-channel links.

August 2011: Maximum Margin Bayesian Network Classifiers (Franz Pernkopf, Sebastian Tschiatschek, Michael Wohlmayr)

Classification is an important task in machine learning. It deals with assigning a given object to one of a number of different categories. To solve this task, we present a maximum margin parameter learning algorithm for Bayesian network classifiers that uses a conjugate gradient method for optimization. In contrast to previous approaches, we maintain the normalization constraints of the parameters of the Bayesian network during optimization, i.e., the probabilistic interpretation of the model is not lost. This makes it possible to handle missing features in discriminatively optimized Bayesian networks. This work focuses on the potential of the proposed method and on a comparison with existing work on maximum margin Bayesian networks.

July 2011: Correction of Linear Time-varying Systems by Means of Time-varying FIR Filters (Michael Soudan)

Linear time-varying systems are encountered in many technical areas, for example as a means of modeling communication channels or signal processing blocks. Typically, this time-varying behavior is undesired as it has a negative impact on the performance of consecutive blocks in the signal processing chain. This negative impact can be reduced by either preprocessing or postprocessing the signal with a time-varying correction filter. Methods for the design of these filters are the focus of this work.
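
A time-varying FIR filter differs from a conventional one only in that its tap vector changes with the time index. The sketch below applies such a filter and uses it as a postprocessing correction for the simplest possible time-varying system, a memoryless time-varying gain. The function name, the gain values, and the one-tap correction are assumptions made for this example; the design methods of this work are not reproduced.

```python
def time_varying_fir(x, coefficients):
    """Apply y[n] = sum_k h[n][k] * x[n - k], where coefficients[n] = h[n].

    Samples before the start of the signal are treated as zero.
    """
    if len(coefficients) != len(x):
        raise ValueError("one tap vector per input sample is required")
    y = []
    for n, taps in enumerate(coefficients):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y


# Hypothetical time-varying gain that distorts the signal.
signal = [1.0, 2.0, 3.0, 4.0]
gains = [1.0, 2.0, 4.0, 0.5]
distorted = time_varying_fir(signal, [[g] for g in gains])

# Postprocessing with a one-tap time-varying correction filter.
corrected = time_varying_fir(distorted, [[1.0 / g] for g in gains])
print(distorted)
print(corrected)
```

For systems with memory, each correction tap vector has more than one coefficient, and an exact inverse is generally not available with a finite number of taps.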

June 2011: Monaural Sound Localization (Anna Katharina Fuchs)

Many known sound localization algorithms are based on processing signals received by multiple, spatially separated sensors, e.g. microphone arrays. The advantages of single-channel sound source estimation are the lower costs for a single microphone and the possibility of developing very small gadgets. In this work we developed an accurate speaker localization strategy in the horizontal plane using the signal of only one microphone.

May 2011: Multipath-Aided UWB Indoor Localization using a single Base Station only (Paul Meissner)

Indoor localization systems face very challenging conditions, e.g., dense multipath scenarios resulting from propagation phenomena such as reflection and scattering. Recently, our group has proposed a series of robust and accurate tracking algorithms for an ultra-wideband radio-based localization concept that effectively makes use of reflected signal components. We reach centimeter-level accuracy at a high level of robustness.