Exploring Audio Compression as Image Completion in Time-Frequency Domain

  • Conference paper
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

Audio compression is usually achieved with algorithms that exploit spectral properties of the signal, such as frequency or temporal masking. In this paper we tackle the problem from a different point of view, treating the time-frequency representation of an audio signal as an intensity map to be reconstructed via a data-driven approach. The compression stage removes selected values from the time-frequency representation of the original signal; decompression then reconstructs the missing values as an image completion task. Our method has two main parts. First, we analyse the feasibility of data-driven reconstruction of an audio signal whose time-frequency representation has missing values. To do so, we exploit an existing CNN model designed for depth completion, which uses a sequence of sparse convolutions to deal with absent values. Second, we propose a method to select the values to be removed at the compression stage so as to maximize the perceived audio quality of the decompressed signal. In the experimental section we validate the proposed technique on standard audio datasets and provide an extensive study of the quality of the reconstructed signal under different conditions.
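The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the CNN-based completion is replaced here by a naive zero-fill placeholder, and the two selection rules (`random_mask` and `topk_mask`) are hypothetical stand-ins for the selection method the paper proposes. The frame size, hop size, test signal, and the choice to mask complex STFT coefficients are likewise assumptions made only for illustration.

```python
import numpy as np

N_FFT, HOP = 256, 64  # illustrative frame and hop sizes (assumed, not from the paper)

def stft(x):
    """Short-time Fourier transform with a Hann window; returns (frames, bins)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec):
    """Inverse STFT by weighted overlap-add, normalised by the summed squared window."""
    window = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * window
    out = np.zeros(N_FFT + HOP * (spec.shape[0] - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += window ** 2
    return out / np.maximum(norm, 1e-8)

def random_mask(shape, keep_ratio, rng):
    """Hypothetical selection rule: keep each value with probability keep_ratio."""
    return rng.random(shape) < keep_ratio

def topk_mask(spec, keep_ratio):
    """Hypothetical selection rule: keep the largest-magnitude values."""
    k = max(1, int(round(keep_ratio * spec.size)))
    threshold = np.partition(np.abs(spec).ravel(), -k)[-k]
    return np.abs(spec) >= threshold

def reconstruct(spec, mask):
    """Placeholder for the learned completion step: removed values are set to zero."""
    return istft(np.where(mask, spec, 0))

def snr_db(ref, est):
    """Signal-to-noise ratio in dB, ignoring the edges where few frames overlap."""
    core = slice(N_FFT, -N_FFT)
    noise = ref[core] - est[core]
    return 10 * np.log10(np.sum(ref[core] ** 2) / (np.sum(noise ** 2) + 1e-12))

# Synthetic test signal: two tones plus a little noise, one second at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
signal = (np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1320 * t)
          + 0.01 * rng.standard_normal(t.size))

spec = stft(signal)
ref = signal[:N_FFT + HOP * (spec.shape[0] - 1)]

# Keep 25% of the time-frequency values under each selection rule.
snr_random = snr_db(ref, reconstruct(spec, random_mask(spec.shape, 0.25, rng)))
snr_topk = snr_db(ref, reconstruct(spec, topk_mask(spec, 0.25)))
print(f"random selection: {snr_random:.1f} dB, largest-magnitude selection: {snr_topk:.1f} dB")
```

Even with the trivial zero-fill reconstruction, the choice of which values to keep changes the quality of the recovered signal, which is the motivation for treating value selection as a separate step. A learned completion model would replace `reconstruct`, and a perceptual quality measure would replace the plain SNR used here.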

Notes

  1. The most common functions are the Hamming, Hanning, and Blackman windows. We observed no relevant difference among these functions for our purposes.

Author information

Corresponding author

Correspondence to Mara Pistellato.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Scodeller, G., Pistellato, M., Bergamasco, F. (2023). Exploring Audio Compression as Image Completion in Time-Frequency Domain. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_37

  • DOI: https://doi.org/10.1007/978-3-031-43153-1_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43152-4

  • Online ISBN: 978-3-031-43153-1

  • eBook Packages: Computer Science, Computer Science (R0)
