ABSTRACT
Jupyter notebooks used to pre-process and polish raw data for data science and machine learning processes are challenging to analyze. Their data-centric code manipulates dataframes through calls to library functions with complex semantics, and the properties to track over them vary widely depending on the verification task. This paper presents a novel abstract domain that simplifies writing analyses for such programs by extracting from the notebook a single CFG that contains all the transformations applied to the data. Several properties can then be determined by analyzing this CFG, which is simpler than the original Python code. We present a first use case that exploits our analysis to infer the required shape of the dataframes manipulated by the notebook.
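To give an intuition for the shape-inference use case, the following is a minimal sketch and not the abstract domain presented in the paper. It assumes the transformations have already been extracted as a straight-line sequence (rather than a full CFG), and the node types `ReadColumn`, `AssignColumn`, and `DropColumn` are hypothetical names introduced only for illustration. A backward walk over the sequence collects the columns that the input dataframe must provide.

```python
from dataclasses import dataclass


# Hypothetical transformation nodes; names are illustrative, not from the paper.
@dataclass(frozen=True)
class ReadColumn:
    """A use of an existing column, e.g. df["age"].mean()."""
    name: str


@dataclass(frozen=True)
class AssignColumn:
    """A column computed from others, e.g. df["bmi"] = f(df["weight"], df["height"])."""
    target: str
    sources: tuple


@dataclass(frozen=True)
class DropColumn:
    """A column removal, e.g. df.drop(columns=["id"])."""
    name: str


def required_input_columns(transformations):
    """Walk the sequence backwards and collect the columns the input must contain.

    Simplification: a read that occurs after a drop of the same column is not
    reported as an error; the column is just counted as required.
    """
    required = set()
    for step in reversed(transformations):
        if isinstance(step, ReadColumn):
            required.add(step.name)
        elif isinstance(step, AssignColumn):
            # The target is produced by this step, so the input need not supply it,
            # but every source must already exist before the step runs.
            required.discard(step.target)
            required.update(step.sources)
        elif isinstance(step, DropColumn):
            # Assumed semantics: dropping a missing column is an error,
            # so the column must exist before the drop.
            required.add(step.name)
    return required


pipeline = [
    AssignColumn("bmi", ("weight", "height")),
    DropColumn("id"),
    ReadColumn("bmi"),
    ReadColumn("age"),
]
print(sorted(required_input_columns(pipeline)))
```

In this example `bmi` is not required of the input because the pipeline itself creates it, while `weight`, `height`, `id`, and `age` are. Handling branches, loops, and row-level shape constraints is where a CFG and a proper abstract domain become necessary.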