Abstract
Audio compression is usually achieved with algorithms that exploit spectral properties of the given signal such as frequency or temporal masking. In this paper we propose to tackle such a problem from a different point of view, considering the time-frequency domain of an audio signal as an intensity map to be reconstructed via a data-driven approach. The compression stage removes some selected input values from the time-frequency representation of the original signal. Then, decompression works by reconstructing the missing samples as an image completion task. Our method is divided into two main parts: first, we analyse the feasibility of a data-driven audio reconstruction with missing samples in its time-frequency representation. To do so, we exploit an existing CNN model designed for depth completion, involving a sequence of sparse convolutions to deal with absent values. Second, we propose a method to select the values to be removed at compression stage, maximizing the perceived audio quality of the decompressed signal. In the experimental section we validate the proposed technique on some standard audio datasets and provide an extensive study on the quality of the reconstructed signal under different conditions.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
The most common functions are the Hamming, Hanning, and Blackman window. We observed no relevant difference in the choice of such function for our purposes.
References
Arık, S.Ö., Jun, H., Diamos, G.: Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Process. Lett. 26(1), 94–98 (2018)
Brandenburg, K., Stoll, G.: ISO/MPEG-1 audio: a generic standard for coding of high-quality digital audio. J. Audio Eng. Soc. 42, 780–792 (1994)
Gasparetto, A., et al.: Cross-dataset data augmentation for convolutional neural networks training. In: 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018, pp. 910–915 (2018). https://doi.org/10.1109/ICPR.2018.8545812
Ghido, F., Tabus, I.: Sparse modeling for lossless audio compression. IEEE Trans. Audio Speech Lang. Process. 21(1), 14–28 (2012)
Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 32, 236–243 (1984)
Hans, M., Schafer, R.W.: Lossless compression of digital audio. IEEE Signal Process. Mag. 18(4), 21–32 (2001)
Hanzo, L., Somerville, F.C.A., Woodard, J.: Voice and Audio Compression for Wireless Communications. Wiley, Hoboken (2008)
Harwath, D., Glass, J.: Deep multimodal semantic embeddings for speech and images. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 237–244. IEEE (2015)
Huang, Z., Fan, J., Cheng, S., Yi, S., Wang, X., Li, H.: HMS-Net: hierarchical multi-scale sparsity-invariant network for sparse depth completion. IEEE Trans. Image Process. 29, 3429–3441 (2019)
Jaritz, M., De Charette, R., Wirbel, E., Perrotton, X., Nashashibi, F.: Sparse and dense data with CNNs: depth completion and semantic segmentation. In: 2018 International Conference on 3D Vision (3DV), pp. 52–60. IEEE (2018)
Kanade, J., Sivakumar, B.: A literature survey on psychoacoustic models and wavelets in audio compression. Int. J. Adv. Res. Electron. Commun. Eng. (IJARECE) (2014)
Kankanahalli, S.: End-to-end optimized speech coding with deep neural networks. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (2018)
Kleijn, W.B., et al.: Generative speech coding with predictive variance regularization. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2021)
Lim, T.Y., Yeh, R.A., Xu, Y., Do, M.N., Hasegawa-Johnson, M.: Time-frequency networks for audio super-resolution. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 646–650. IEEE (2018)
Ma, F., Karaman, S.: Sparse-to-dense: depth prediction from sparse depth samples and a single image. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4796–4803. IEEE (2018)
Morishima, S., Harashima, H., Katayama, Y.: Speech coding based on a multi-layer neural network. In: IEEE International Conference on Communications (1990)
Pistellato, M., Albarelli, A., Bergamasco, F., Torsello, A.: Robust joint selection of camera orientations and feature projections over multiple views. In: 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 2016, pp. 3703–3708 (2016). https://doi.org/10.1109/ICPR.2016.7900210
Pistellato, M., Bergamasco, F., Albarelli, A., Torsello, A.: Dynamic optimal path selection for 3D triangulation with multiple cameras. In: Murino, V., Puppo, E. (eds.) ICIAP 2015. LNCS, vol. 9279, pp. 468–479. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23231-7_42
Pistellato, M., Bergamasco, F., Albarelli, A., Torsello, A.: Robust cylinder estimation in point clouds from pairwise axes similarities. In: Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - ICPRAM, pp. 640–647 (2019). https://doi.org/10.5220/0007401706400647
Pistellato, M., Bergamasco, F., Fatima, T., Torsello, A.: Deep demosaicing for polarimetric filter array cameras. IEEE Trans. Image Process. 31, 2017–2026 (2022). https://doi.org/10.1109/TIP.2022.3150296
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: 2017 international conference on 3D Vision (3DV) (2017)
Wang, W., Neumann, U.: Depth-aware CNN for RGB-D segmentation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 144–161. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_9
Williamson, D.S., Wang, D.: Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 25(7), 1492–1501 (2017)
Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., Tagliasacchi, M.: SoundStream: an end-to-end neural audio codec. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 495–507 (2021)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Scodeller, G., Pistellato, M., Bergamasco, F. (2023). Exploring Audio Compression as Image Completion in Time-Frequency Domain. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_37
Download citation
DOI: https://doi.org/10.1007/978-3-031-43153-1_37
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1
eBook Packages: Computer ScienceComputer Science (R0)