DOI: 10.1145/3589250.3596145
research-article

Static Analysis of Data Transformations in Jupyter Notebooks

Published: 06 June 2023

ABSTRACT

Jupyter notebooks used to pre-process and polish raw data for data science and machine learning processes are challenging to analyze. Their data-centric code manipulates dataframes through calls to library functions with complex semantics, and the properties to track vary widely depending on the verification task. This paper presents a novel abstract domain that simplifies writing analyses for such programs by extracting from the notebook a unique CFG that contains all transformations applied to the data. Several properties can then be determined by analyzing this CFG, which is simpler than the original Python code. We present a first use case that exploits our analysis to infer the required shape of the dataframes manipulated by the notebook.
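To make the idea of inferring a required dataframe shape more concrete, the following is a minimal sketch, not the abstract domain presented in the paper. It assumes Python 3.9 or later, straight-line code, and a single dataframe variable; the function name `required_columns` and the sample snippet are hypothetical. It walks the Python syntax tree and reports the columns that are read before the code itself creates them, which the input data must therefore already contain.

```python
import ast


def required_columns(source: str, frame_name: str = "df") -> set[str]:
    """Return column names read from `frame_name` before the code defines them.

    Illustrative assumptions: only constant string subscripts such as
    df["col"] are recognised, statements are processed in order, and
    branches, loops, aliases, and library calls are not modelled.
    """
    required: set[str] = set()
    defined: set[str] = set()
    for stmt in ast.parse(source).body:
        reads: set[str] = set()
        writes: set[str] = set()
        for node in ast.walk(stmt):
            is_column_access = (
                isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == frame_name
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)
            )
            if not is_column_access:
                continue
            if isinstance(node.ctx, ast.Store):
                writes.add(node.slice.value)
            else:
                reads.add(node.slice.value)
        # A column is required if it is read before any earlier statement wrote it.
        required |= reads - defined
        defined |= writes
    return required


# Hypothetical notebook cell contents; the code is parsed, never executed.
notebook_code = (
    'df["total"] = df["price"] * df["qty"]\n'
    'df["tax"] = df["total"] * 0.2\n'
)
print(sorted(required_columns(notebook_code)))  # ['price', 'qty']
```

Here `total` is not reported because the first statement creates it before the second one reads it. A purely syntactic pass like this one ignores the semantics of library functions, which is the difficulty the abstract identifies.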


Published in

SOAP 2023: Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis
June 2023
70 pages
ISBN: 9798400701702
DOI: 10.1145/3589250

Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 11 of 11 submissions, 100%
