Static Analysis of Data Transformations in Jupyter Notebooks

Authors:
Luca Negrini (Corvallis, Italy)
Guruprerana Shabadi (École Polytechnique, France / Institut Polytechnique de Paris, France)
Caterina Urban (Inria Paris, France / ENS, France)

SOAP 2023: Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, June 2023, Pages 8–13. https://doi.org/10.1145/3589250.3596145

Published: 06 June 2023

ABSTRACT

Jupyter notebooks used to pre-process and polish raw data for data science and machine learning processes are challenging to analyze. Their data-centric code manipulates dataframes through calls to library functions with complex semantics, and the properties to track vary widely depending on the verification task. This paper presents a novel abstract domain that simplifies writing analyses for such programs by extracting from the notebook a single control-flow graph (CFG) that contains all transformations applied to the data. Several properties can then be determined by analyzing this CFG, which is simpler than the original Python code. We present a first use case that exploits our analysis to infer the required shape of the dataframes manipulated by the notebook.

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Software functional properties
      1. Formal methods
        Automated static analysis
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Abstraction
      2. Program analysis

Published in
SOAP 2023: Proceedings of the 12th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis
June 2023
70 pages
ISBN: 9798400701702
DOI: 10.1145/3589250
General Chairs: Pietro Ferrara (Ca' Foscari University of Venice, Italy), Liana Hadarean (AWS, USA)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Abstract Interpretation
Data Science
Jupyter Notebooks
Static Analysis
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 11 of 11 submissions, 100%