Exploring Audio Compression as Image Completion in Time-Frequency Domain

Scodeller, Giovanni; Pistellato, Mara; Bergamasco, Filippo

doi:10.1007/978-3-031-43153-1_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14234)

Included in the following conference series:

International Conference on Image Analysis and Processing

Abstract

Audio compression is usually achieved with algorithms that exploit spectral properties of the signal, such as frequency or temporal masking. In this paper we tackle the problem from a different point of view, treating the time-frequency representation of an audio signal as an intensity map to be reconstructed via a data-driven approach. The compression stage removes selected values from the time-frequency representation of the original signal; decompression then reconstructs the missing values as an image completion task. Our work is divided into two main parts. First, we analyse the feasibility of data-driven audio reconstruction when samples are missing from the time-frequency representation. To do so, we exploit an existing CNN model designed for depth completion, which uses a sequence of sparse convolutions to deal with absent values. Second, we propose a method to select the values to be removed at the compression stage so as to maximize the perceived audio quality of the decompressed signal. In the experimental section we validate the proposed technique on standard audio datasets and provide an extensive study of the quality of the reconstructed signal under different conditions.
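The idea of removing time-frequency values and then filling them in with sparse convolutions can be illustrated with a small sketch. It is not the model used in the paper: it assumes the sparsity-invariant formulation of Uhrig et al. (a masked convolution normalized by the number of observed inputs, with the validity mask propagated by a max over the same neighbourhood), uses fixed weights instead of trained ones, and drops bins at random as a placeholder for the perceptually driven selection the abstract describes. The names `drop_bins` and `sparse_conv2d` are hypothetical.

```python
import random


def drop_bins(spec, keep_ratio, seed=0):
    """Placeholder 'compression': keep each bin with probability keep_ratio.

    Assumption: random selection stands in for the paper's selection method.
    Returns the sparse grid (dropped bins set to 0) and the 0/1 validity mask.
    """
    rng = random.Random(seed)
    mask = [[1 if rng.random() < keep_ratio else 0 for _ in row] for row in spec]
    sparse = [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(spec, mask)]
    return sparse, mask


def sparse_conv2d(values, mask, weights, eps=1e-8):
    """One sparsity-invariant convolution with zero padding and stride 1.

    values : 2-D list of floats (content is ignored where mask == 0)
    mask   : 2-D list of 0/1, where 1 marks an observed time-frequency bin
    weights: k x k list of floats, k odd
    Returns (output, new_mask).
    """
    rows, cols = len(values), len(values[0])
    r = len(weights) // 2
    out = [[0.0] * cols for _ in range(rows)]
    new_mask = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc, seen = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < rows and 0 <= x < cols and mask[y][x]:
                        acc += weights[di + r][dj + r] * values[y][x]
                        seen += 1
            # Normalize by the count of observed inputs, not the kernel size.
            out[i][j] = acc / (seen + eps)
            # A bin becomes valid once any observed neighbour reaches it.
            new_mask[i][j] = 1 if seen else 0
    return out, new_mask


# Toy magnitude "spectrogram": 6 frequency bins by 8 time frames.
spec = [[float(f + t) for t in range(8)] for f in range(6)]
sparse, mask = drop_bins(spec, keep_ratio=0.5)
ones = [[1.0] * 3 for _ in range(3)]
filled, filled_mask = sparse_conv2d(sparse, mask, ones)
```

With all-ones weights each output is simply the mean of the observed neighbours, which is why the result does not depend on how many bins were dropped around a given position. A trained network would stack several such layers with learned weights.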

Notes

1. The most common window functions are the Hamming, Hann (often called Hanning), and Blackman windows. We observed no relevant difference among them for our purposes.
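For reference, the three windows named in this note can be sketched from their standard textbook definitions. The code below is only an illustration: it uses the symmetric form of each window, the frame length is arbitrary, and the helper names `make_window` and `apply_window` are hypothetical rather than taken from the paper.

```python
import math

# Cosine-sum coefficients (a0, a1, a2) of the standard definitions:
# w[n] = a0 - a1*cos(2*pi*n/(N-1)) + a2*cos(4*pi*n/(N-1))
_COEFFS = {
    "hann": (0.5, 0.5, 0.0),
    "hamming": (0.54, 0.46, 0.0),
    "blackman": (0.42, 0.5, 0.08),
}


def make_window(name, length):
    """Return a symmetric window of the given length as a list of floats."""
    if length < 2:
        raise ValueError("length must be at least 2")
    a0, a1, a2 = _COEFFS[name]
    step = 2.0 * math.pi / (length - 1)
    return [
        a0 - a1 * math.cos(step * n) + a2 * math.cos(2.0 * step * n)
        for n in range(length)
    ]


def apply_window(frame, window):
    """Multiply one signal frame by the window, sample by sample."""
    return [x * w for x, w in zip(frame, window)]


# Taper one short frame of a sine wave before a Fourier transform.
frame = [math.sin(2.0 * math.pi * 3 * n / 64) for n in range(64)]
tapered = apply_window(frame, make_window("hann", 64))
```

All three windows taper the frame towards its edges and differ mainly in how they trade main-lobe width against side-lobe level.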

Author information

Authors and Affiliations

DAIS, Università Ca’Foscari Venezia, 155, via Torino, Venezia, Italy
Giovanni Scodeller, Mara Pistellato & Filippo Bergamasco

Corresponding author

Correspondence to Mara Pistellato.

Editor information

Editors and Affiliations

University of Udine, Udine, Italy
Gian Luca Foresti
University of Udine, Udine, Italy
Andrea Fusiello
University of York, York, UK
Edwin Hancock

About this paper

Cite this paper

Scodeller, G., Pistellato, M., Bergamasco, F. (2023). Exploring Audio Compression as Image Completion in Time-Frequency Domain. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_37

DOI: https://doi.org/10.1007/978-3-031-43153-1_37
Published: 05 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1
eBook Packages: Computer Science, Computer Science (R0)
