Phase perturbation improves channel robustness for speech spoofing countermeasures

Speech spoofing countermeasures (CM) are vulnerable to channel mismatch. Depending on the communication protocol, the same audio utterance can be transmitted through different channels, such as MP3 compression, VoIP transmission, radio, and actual phone lines. This poses a unique challenge for speech spoofing countermeasures, which try to distinguish fake (computer-generated) speech from real (human-produced) speech.

For example, here is the same utterance in its original channel and after passing through a simulated telephone channel followed by lossy MP3 encoding.

Original channel

Telephone (simulated) + MP3 Encoding

In this paper, we propose to alleviate the channel robustness issue of state-of-the-art CM systems through phase perturbation in training. This proposal is developed based on two observations and our hypotheses for explaining the observations:

1. State-of-the-art CM systems are time-domain systems; we hypothesize that they rely on phase information to detect synthetic spoofing attacks.
2. Communication channels often employ lossy compression codecs that are designed to encode only magnitude information; we hypothesize that they alter the phase information in speech, making it difficult for phase-aware CM systems to generalize to unseen channels.

Based on these hypotheses, we propose to perturb the phase when training phase-aware CM systems. We believe that this will make such systems less reliant on phase information, hence more robust to channel variations.

We first investigate the effect of phase perturbation on the performance of state-of-the-art CM systems. To do this, we trained three state-of-the-art CM systems on the ASVspoof 2019 dataset and evaluated them on phase-perturbed and magnitude-perturbed versions of the test set. Example utterances are shown below, using LA_D_1000265.flac as an example.

Original

Phase-perturbed samples

We selected four phase perturbation settings. The spectrogram shown for each sample is the frequency-magnitude spectrogram, which remains unchanged across these samples; only the phase spectrogram is randomly perturbed.

Perturbing 0.5pi

Perturbing 1.0pi

Perturbing 1.5pi

Perturbing 2.0pi
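
The idea of perturbing phase while leaving magnitude untouched can be sketched in a few lines. This is only an illustration under assumptions: the page does not state the noise distribution, so the sketch assumes a uniform random offset in [0, max_shift) added to the phase of every time-frequency bin, and the function name perturb_phase and the random stand-in spectrogram are hypothetical.

```python
import numpy as np

def perturb_phase(spec, max_shift, rng):
    """Add a random offset to the phase of every bin, keeping magnitude fixed.

    Assumption: offsets are drawn uniformly from [0, max_shift).
    """
    magnitude = np.abs(spec)
    phase = np.angle(spec)
    offset = rng.uniform(0.0, max_shift, size=spec.shape)
    return magnitude * np.exp(1j * (phase + offset))

rng = np.random.default_rng(0)

# Stand-in for a complex STFT (frequency bins x frames); a real pipeline
# would compute this from the waveform with an STFT of its choice.
spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

perturbed = perturb_phase(spec, 0.5 * np.pi, rng)

# The magnitude spectrogram is unchanged up to floating-point error.
magnitude_error = float(np.max(np.abs(np.abs(perturbed) - np.abs(spec))))
```

In practice the perturbed spectrogram would be converted back to a waveform with an inverse STFT before being fed to a time-domain CM system.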

EER (%) of three CM systems evaluated on phase-perturbed test data.
Perturbation	Setting	RawNet-2	RawGAT-ST	AASIST
None	-	4.25	1.36	1.35
Phase	π/2	9.55	20.09	21.12
Phase	π	19.56	29.01	48.87
Phase	3π/2	27.02	31.57	56.58
Phase	2π	28.02	31.82	56.91
Phase	Pooled	21.04	28.12	45.87
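
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how it can be estimated from CM scores, not the official ASVspoof scoring code; it assumes that a higher score means "more likely bona fide", and the function name compute_eer and the synthetic scores are hypothetical.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping the threshold over all observed scores.

    Assumption: higher score = more likely bona fide.
    """
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection: bona fide utterance scored below the threshold.
    frr = np.array([np.mean(bonafide_scores < t) for t in thresholds])
    # False acceptance: spoofed utterance scored at or above the threshold.
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2.0)

# Synthetic scores purely for illustration.
rng = np.random.default_rng(0)
demo_eer = compute_eer(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
print(f"EER on synthetic scores: {100 * demo_eer:.2f}%")
```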

Magnitude-perturbed samples

As with phase, we selected five signal-to-noise ratio (SNR) settings and perturbed only the frequency-magnitude spectrogram, leaving the phase unchanged.

SNR 10dB

SNR 5dB

SNR 0dB

SNR -5dB

SNR -10dB
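
A magnitude-only perturbation at a target SNR could look roughly like the sketch below. The exact noise model is not described on this page, so this is an assumption: Gaussian noise is added to the magnitude spectrogram, scaled so that the ratio of average magnitude power to noise power matches the requested SNR, and negative results are clipped to zero. The name perturb_magnitude and the stand-in spectrogram are hypothetical.

```python
import numpy as np

def perturb_magnitude(spec, snr_db, rng):
    """Add noise to the magnitude spectrogram only, keeping phase fixed.

    Assumption: additive Gaussian noise on the magnitudes, clipped at zero.
    """
    magnitude = np.abs(spec)
    phase = np.angle(spec)
    signal_power = np.mean(magnitude ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(spec.shape) * np.sqrt(noise_power)
    noisy_magnitude = np.maximum(magnitude + noise, 0.0)  # magnitudes are non-negative
    return noisy_magnitude * np.exp(1j * phase)

rng = np.random.default_rng(0)

# Stand-in for a complex STFT (frequency bins x frames).
spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))

noisy_10 = perturb_magnitude(spec, 10.0, rng)    # mild perturbation
noisy_m10 = perturb_magnitude(spec, -10.0, rng)  # severe perturbation
```

Because of the clipping, the realized SNR at very low settings is somewhat higher than the nominal value; a different noise model would avoid this at the cost of more code.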

EER (%) of three CM systems evaluated on magnitude-perturbed test data.
Perturbation	Setting	RawNet-2	RawGAT-ST	AASIST
None	-	4.25	1.36	1.35
Magnitude	10 dB	24.78	9.74	11.21
Magnitude	5 dB	34.08	15.11	13.01
Magnitude	0 dB	43.04	21.19	19.9
Magnitude	-5 dB	45.25	32.47	30.5
Magnitude	-10 dB	45.48	37.27	35.16
Magnitude	Pooled	38.53	23.16	21.96

EER degradation in phase-perturbed and magnitude-perturbed settings compared to the baseline performance.

As the results show, time-domain systems are highly sensitive to phase perturbations. Furthermore, the systems that perform better on clean data degrade more on phase-perturbed data, which further suggests that time-domain CM systems rely on phase information to make decisions.

Communication channels typically employ lossy compression codecs, many of which focus on encoding only frequency-magnitude information, since human hearing is more sensitive to it. After transmission through such codecs, much of the phase information is corrupted, which degrades the performance of time-domain CM systems because they rely heavily on phase.

By perturbing the phase during training, we can reduce the CM systems' reliance on phase information to build more robust CM systems.

However, we expect that the perturbation should not be so strong that it completely removes the reliance on phase, since some useful phase information may survive the communication channel and remain available to time-domain CM systems. We hypothesize that a midway setting between no perturbation and perturbing all phases provides the best trade-off and therefore the best performance.

For evaluation, we use ASVspoof2021LA, the logical access (LA) sub-track of the ASVspoof2021 challenge. In this subset, the entire ASVspoof2019LA set, along with additional samples, is transmitted through seven communication channels, denoted C1 to C7.

C1 is the same channel as ASVspoof2019LA, while C2 to C7 are unseen channels. Amongst the unseen channels, C2 and C5 use the time-domain compression algorithms a-law and mu-law, while C4, C6, and C7 employ magnitude-based compression codecs, G.722, GSM, and OPUS. C3 differs from C2 by transmitting over a public switched telephone network, therefore introducing uncontrollable and unknown artifacts, such as data corruption during transit.
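
To illustrate what a time-domain compression algorithm does, here is a minimal sketch of mu-law companding using the textbook continuous formula with mu = 255. It is an assumption-laden illustration: deployed telephony codecs use a piecewise-linear approximation rather than this exact formula, and the function names are hypothetical. The point is that each sample is compressed independently in the time domain, with no explicit split into magnitude and phase.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress samples in [-1, 1] and quantize them to integer codes 0..mu."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1.0) / 2.0 * mu).astype(np.int64)

def mu_law_decode(codes, mu=255):
    """Map integer codes back to samples in [-1, 1] (inverse companding)."""
    compressed = 2.0 * np.asarray(codes, dtype=np.float64) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

# Round trip on a short sine wave: the reconstruction is close but not exact,
# because quantization to 256 levels discards information.
t = np.arange(160) / 8000.0
waveform = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
reconstructed = mu_law_decode(mu_law_encode(waveform))
```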

Evaluation results (EER, %) on ASVspoof2021LA. Rows give the perturbation setting used in training; columns C1 to C7 denote the different transmission codecs.
Training perturbation	Setting	C1	C2	C3	C4	C5	C6	C7	Pooled
None	-	4.68	5.87	14.39	5.75	5.44	7.66	10.26	9.91
Phase	π/2	4.49	6.18	8.68	5.18	5.80	6.35	8.20	7.33
Phase	π	6.72	7.00	7.34	6.41	6.89	6.85	7.30	7.31
Phase	3π/2	5.52	6.20	10.66	5.21	6.18	7.25	6.21	8.32
Phase	2π	5.68	6.37	9.63	5.42	6.34	7.14	6.39	7.73
Magnitude	10 dB	7.36	8.57	19.88	8.79	8.39	9.53	14.36	14.54
Magnitude	5 dB	10.40	11.46	30.87	11.96	11.35	14.45	19.09	17.80
Magnitude	0 dB	17.41	18.23	40.99	18.07	17.98	20.63	26.21	23.64
Magnitude	-5 dB	23.70	25.05	46.63	24.45	25.00	29.75	34.23	29.87
Magnitude	-10 dB	34.77	34.55	46.84	34.49	34.95	36.55	38.84	37.40

As expected, when magnitude is perturbed during training, performance on all unseen communication channels degrades, showing that removing the reliance on magnitude information harms CM performance. This suggests that CM systems benefit from the magnitude information preserved by lossy compression codecs.

At the same time, all phase-perturbed settings improve the pooled results, achieving more robust overall performance than the non-perturbed setting. This indicates that by being less sensitive to phase information, time-domain CM systems can better generalize to unseen communication channels. The best-performing setting shows a relative EER improvement of 26.2% over the no-perturbation baseline, without introducing any channel data during training.
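
The 26.2% figure can be checked directly from the pooled column of the table above, taking the first data row (pooled EER 9.91%) as the no-perturbation baseline and the lowest phase-perturbed pooled EER (7.31%) as the best setting. The variable names below are only for illustration.

```python
baseline_pooled_eer = 9.91  # pooled EER (%) without perturbation in training
best_pooled_eer = 7.31      # lowest pooled EER (%) among phase-perturbed settings

relative_improvement = (baseline_pooled_eer - best_pooled_eer) / baseline_pooled_eer * 100.0
print(f"{relative_improvement:.1f}%")  # prints 26.2%
```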

We also notice that the best-performing phase perturbation setting is π, in line with our hypothesis of a "midway" perturbation setting.

In this paper, we observed significant degradation of various state-of-the-art time-domain CM systems when evaluated on phase-perturbed speech utterances. This degradation may cause channel robustness issues, since many communication channels employ lossy codecs that only encode frequency-magnitude information while losing much phase information. We proposed to mitigate this issue by perturbing phase in training. Systematic evaluation on real-world channel-variant data verified that perturbing phase in training does significantly improve the channel robustness of a state-of-the-art time-domain CM system.

For future work, we plan to use this insight to design CM systems that strike a balance between modeling useful phase information and being less sensitive to channel-induced phase alteration.

Yongyi Zang, You Zhang, Zhiyao Duan
