Sleep staging using end-to-end deep learning model based on nocturnal sound for smartphones

May 25, 2022
Sleep 2022



Convenient sleep tracking with mobile devices such as smartphones is desirable for people who want an easy way to measure their sleep objectively. The objective of this study was to introduce a deep learning model for sound-based sleep staging using audio data recorded with smartphones during sleep.


Two different audio datasets were used. One (N = 1,154) was extracted from polysomnography (PSG) data and the other (N = 327) was recorded using a smartphone during PSG from independent subjects. The performance of sound-based sleep staging depends on the quality of the audio. In practical conditions (non-contact recording with smartphone microphones), breathing and body movement sounds during the night are so weak that the energy of these signals is sometimes smaller than that of ambient noise. The audio was converted into a Mel spectrogram to detect latent temporal-frequency patterns of breathing and body movement sounds and to distinguish them from ambient noise. The proposed neural network model consisted of two sub-models. The first sub-model extracted features from the Mel spectrogram of each 30-second epoch, and the second classified sleep stages through inter-epoch analysis of the extracted features.
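The two-stage data flow described above (per-epoch feature extraction followed by inter-epoch analysis) can be illustrated with a minimal sketch. Only the 30-second epoch length comes from the text. The function names, the toy amplitude feature standing in for the Mel spectrogram-based first sub-model, and the sliding context window standing in for the second sub-model's inter-epoch analysis are all hypothetical and are not the authors' implementation.

```python
EPOCH_SECONDS = 30  # epoch length stated in the text


def split_into_epochs(samples, sample_rate):
    """Cut a 1-D audio sequence into non-overlapping 30-second epochs.

    A trailing partial epoch is dropped (an assumption of this sketch).
    """
    epoch_len = EPOCH_SECONDS * sample_rate
    n_epochs = len(samples) // epoch_len
    return [samples[i * epoch_len:(i + 1) * epoch_len] for i in range(n_epochs)]


def toy_feature(epoch):
    """Hypothetical stand-in for sub-model 1: mean absolute amplitude."""
    return sum(abs(x) for x in epoch) / len(epoch)


def context_windows(epoch_features, half_width):
    """Hypothetical stand-in for the input to sub-model 2.

    For each epoch, gather the features of its neighbours so that a
    classifier could use inter-epoch context. Windows are clipped at
    the start and end of the night.
    """
    windows = []
    for i in range(len(epoch_features)):
        lo = max(0, i - half_width)
        hi = min(len(epoch_features), i + half_width + 1)
        windows.append(epoch_features[lo:hi])
    return windows


# Tiny synthetic example: 95 seconds of silence at a made-up 4 Hz rate.
sample_rate = 4
audio = [0.0] * (sample_rate * 95)

epochs = split_into_epochs(audio, sample_rate)      # 3 full epochs
features = [toy_feature(e) for e in epochs]         # one value per epoch
windows = context_windows(features, half_width=1)   # neighbours per epoch
```

In a real pipeline each epoch would be turned into a Mel spectrogram and passed through a learned feature extractor rather than the toy amplitude feature used here.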


Our model achieved 70% epoch-by-epoch agreement for 4-class (wake, light, deep, rapid eye movement [REM]) stage classification and robust performance across various signal-to-noise conditions. More precisely, the model was correct in 77% of wake, 73% of light, 46% of deep, and 66% of REM epochs. The model performance was not considerably affected by the presence of sleep apnea, but degradation was observed with severe periodic limb movement. External validation with the smartphone dataset also showed 68% epoch-by-epoch agreement. Compared with some commercially available sleep trackers such as Fitbit Alta HR (0.6325 in mean per-class sensitivity) and SleepScore Max (0.565 in mean per-class sensitivity), our model showed superior performance in both PSG audio (0.655 in mean per-class sensitivity) and smartphone audio (0.6525 in mean per-class sensitivity).
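As a short worked check, the PSG audio figure of 0.655 is consistent with taking the unweighted average of the four per-stage rates reported above. The sketch below assumes that "mean per-class sensitivity" means this unweighted (macro) average; the variable and function names are illustrative only.

```python
# Per-stage sensitivities for PSG audio, taken from the reported percentages.
per_class_sensitivity = {"wake": 0.77, "light": 0.73, "deep": 0.46, "rem": 0.66}


def mean_per_class_sensitivity(sensitivities):
    """Unweighted (macro) average of the per-class sensitivities."""
    return sum(sensitivities.values()) / len(sensitivities)


psg_audio_score = mean_per_class_sensitivity(per_class_sensitivity)
print(round(psg_audio_score, 4))  # 0.655
```

Because every stage counts equally in this average, the comparatively low deep-sleep sensitivity (46%) pulls the score below the 70% epoch-by-epoch agreement.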


To the best of our knowledge, this is the first end-to-end deep learning model, spanning Mel spectrogram-based feature extraction through sleep staging, that can work with audio data recorded in practical conditions. Our proposed deep learning model for sound-based sleep staging has the potential to be integrated into smartphone applications for reliable at-home sleep tracking.


Joonki Hong
Hai Tran
Jinhwan Jeong
Hyeryung Jang
In-Young Yoon
Jung Kyung Hong
Jeong-Whun Kim