Do containers improve the reproducibility of Machine Learning (ML) experiments? First, we need to define what it means to be reproducible.
Reproducibility
Reproducibility is a key component of the scientific method and is used to validate the results of an experiment. However, within science, there is no clear definition of reproducibility, and terms like repeatability and reproducibility are often used interchangeably. In The Fundamental Principles of Reproducibility (1), Odd Erik Gundersen surveyed nineteen papers on the subject of reproducibility. Based on his survey, most of the papers agree that for an experiment to be reproducible, independent researchers should be able to draw the same conclusions when performing the reproducibility experiment in their own independent laboratory.
Therefore, a reproducibility experiment must be conducted by independent researchers. If it is the original researcher that reruns their experiment and draws the same conclusion, then, based on the consensus of Gundersen’s survey, the conclusions are only repeatable, not reproducible.
Secondly, a reproducibility experiment should be conducted in an independent laboratory to test that the original experiment conclusions are not affected by any biases introduced by the original laboratory conditions. In the case of a machine learning (ML) experiment, the laboratory refers to the hardware and software environment that the experiment was conducted in.
After an independent researcher conducts a reproducibility experiment and draws their conclusions, they can determine whether the original experiment is reproducible or not reproducible, or whether the results are inconclusive. An inconclusive result occurs when the reproducibility experiment is conducted multiple times and the original conclusions are supported only some of the time. However, an experiment cannot be partially reproducible.
Variability Caused by the Laboratory
In the natural sciences, it is accepted that each laboratory has its own bias that will affect the precision of an experiment due to random errors that are introduced when running in each environment (2). Likewise, in ML experiments, the hardware and software environment (the laboratory) introduces random errors that cause ML experiments to produce different results, even when the experiments are run deterministically. On the software side, variations in how different ML frameworks implement algorithms, changes introduced when new versions of the software are released, and bugs in the software will cause results to vary when the software environment changes. On the hardware side, differences in floating-point precision and in the number of cores or threads on the processing unit will cause results to vary when the hardware environment changes.
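One concrete mechanism behind this hardware-dependent variation is that floating-point addition is not associative, so the order in which partial results are combined can change the answer. The sketch below is a minimal, standard-library illustration. The function names naive_sum and chunked_sum and the input values are assumptions made for this example, and the chunking only imitates how a parallel reduction across cores might split the work.

```python
def naive_sum(values):
    """Add values left to right with plain float addition."""
    total = 0.0
    for value in values:
        total += value
    return total


def chunked_sum(values, num_workers):
    """Imitate a parallel reduction: sum each chunk, then add the partial sums."""
    chunk_size = -(-len(values) // num_workers)  # ceiling division
    partials = [
        naive_sum(values[start:start + chunk_size])
        for start in range(0, len(values), chunk_size)
    ]
    return naive_sum(partials)


# Regrouping the same three numbers changes the last bit of the result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# Illustrative values chosen so that the rounding is easy to see.
values = [1e16, 1.0, -1e16, 1.0]
one_worker = chunked_sum(values, 1)
two_workers = chunked_sum(values, 2)
print(one_worker, two_workers)  # the two results differ
```

The same data and the same arithmetic give different totals purely because of how the work was divided, which is the kind of difference a change in core or thread count can introduce.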
Reproducibility versus Portability
For an independent researcher to conduct a reproducibility experiment using the original researcher’s code, the independent researcher has to construct their own laboratory to run the experiment. In the case of ML experiments, this means creating the hardware and software environment needed to run the reproducibility experiment. If the independent researcher is going to implement their own code to reproduce the experiment, then an existing hardware and software environment for ML experiments will most likely be sufficient. However, if the original researcher has published their code with their ML experiment, they can also publish their software environment, in the form of a container, to help the independent researcher conduct their reproducibility experiment.
Providing a container does not mean the original experiment is reproducible, but it does make the original experiment’s software environment portable. In computer science, software is said to be portable if it can, with reasonable effort, be made to run on computers apart from the one for which it was originally written (3).
When the original researcher provides their ML code and a container, it becomes easier for an independent researcher to create a software environment to run the reproducibility experiment, but this doesn’t improve the reproducibility of an ML experiment. First of all, it has been shown that the hardware environment (4, 5, 6, 7), the other half of the laboratory, can have a significant effect on the results of an ML experiment. It may be easy for an independent researcher to configure their hardware to run a specific software environment, but it may not be possible for them to gain access to the hardware used in the original experiment, due to either cost or availability. Additionally, having access to the same hardware and/or software environment may help repeat the results of the original experiment, but it ignores other important parts of a reproducible experiment, like verifying the analysis of the results and verifying the conclusion of the experiment based on that analysis. Both are important steps for verifying reproducibility based on the scientific method.
There is a downside to reusing the original researcher’s container when performing a reproducibility experiment: any bias introduced in the original experiment by the software environment will also be present in the reproducibility experiment. An independent researcher who uses a container from the original experiment has to be mindful that their results may be influenced by the same biases as the original experiment. The results of an experiment may be considered more generalizable when the experiment can be reproduced by an independent researcher in a different hardware and software environment instead of reusing the software environment of the original experiment.
Portability of Containers
Containers do not guarantee portability. Containers built for one CPU architecture will not work on an incompatible architecture: AMD64 containers will not run on ARM CPUs and vice versa. The official Python Docker container is published in seven different versions because of these architecture incompatibilities.
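A simple pre-flight check can catch this kind of mismatch before a container is pulled. The sketch below is only an illustration: the helper names normalize_arch and can_run_natively and the alias table are assumptions made for this example, not part of any container tool. It relies on platform.machine() from the Python standard library, which reports the host CPU architecture under platform-dependent names.

```python
import platform

# Different tools use different names for the same architecture.
# This alias table is an assumption that covers only two common families.
ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name):
    """Map an architecture name to a canonical lowercase form."""
    lowered = name.lower()
    return ARCH_ALIASES.get(lowered, lowered)


def can_run_natively(image_arch, host_arch=None):
    """Return True if the image architecture matches the host architecture."""
    host = host_arch if host_arch is not None else platform.machine()
    return normalize_arch(image_arch) == normalize_arch(host)


print(can_run_natively("amd64", host_arch="x86_64"))   # True
print(can_run_natively("amd64", host_arch="aarch64"))  # False
print(can_run_natively("arm64"))  # depends on the machine running this
```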
One way to deal with architecture incompatibilities is to not provide a pre-built container, but only include the container build definition file. However, this comes with challenges.
Providing just the container build definition file, like a Dockerfile, may not produce the same container when it is built at a later date. Due to stale links, software listed in the container build definition file may no longer be available for download and installation. However, an even bigger issue is the installation of different versions of the software dependencies. For example, a container build definition file may explicitly state that TensorFlow 2.12.0 will be installed, but if TensorFlow declares only a lower bound on a dependency such as numpy (for instance, any version greater than 1.12.0), the build installs the latest numpy release available at build time. Installing TensorFlow at different times will therefore likely install a different set of dependent packages. Software like TensorFlow can have several dozen dependencies, which may have their own dependencies, all with weak version requirements. Some package managers, like pip for Python, can freeze dependencies, but pinning exact versions for what could be hundreds of dependencies increases the chance that a container build will fail due to stale links.
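The trade-off between loose and pinned version requirements can be sketched with a toy resolver. Everything in this example is an assumption made for illustration: resolve is a heavily simplified stand-in for what a real installer does, and the two release lists are hypothetical snapshots of a package index taken at two different build dates.

```python
def parse_version(text):
    """Turn '1.23.5' into (1, 23, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def resolve(available, minimum=None, pinned=None):
    """Pick the version a simplified installer would choose."""
    if pinned is not None:
        if pinned not in available:
            raise LookupError(f"pinned version {pinned} is no longer available")
        return pinned
    candidates = [
        version for version in available
        if minimum is None or parse_version(version) >= parse_version(minimum)
    ]
    return max(candidates, key=parse_version)


# Hypothetical index contents at two build dates.
index_first_build = ["1.12.0", "1.22.4", "1.23.5"]
index_later_build = ["1.12.0", "1.22.4", "1.23.5", "1.24.3"]

# A lower bound alone resolves to a different version at each build.
loose_first = resolve(index_first_build, minimum="1.12.0")
loose_later = resolve(index_later_build, minimum="1.12.0")
print(loose_first, loose_later)  # 1.23.5 1.24.3

# A pin resolves to the same version, as long as that version still exists.
pinned_later = resolve(index_later_build, pinned="1.23.5")
print(pinned_later)  # 1.23.5

# If the pinned release disappears from the index, the build fails outright.
try:
    resolve(["1.24.3"], pinned="1.23.5")
except LookupError as error:
    print("build failed:", error)
```

The loose requirement keeps the build working but changes the environment silently, while the pin keeps the environment fixed but turns a missing release into a hard failure.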
Summary
In conclusion, containers can improve the portability of an ML experiment's software environment, but they do not necessarily improve the reproducibility of an ML experiment. Providing a container does not mean that the original experiment is reproducible, as it only makes the original experiment’s software environment portable. Additionally, containers do not guarantee portability across different architectures.
Therefore, containerizing an ML experiment is a valuable tool for portability but not reproducibility. To improve the reproducibility of an ML experiment, it is important to document the entire experiment in detail, including the hardware and software environment used to conduct the experiment.
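As one possible starting point for that documentation, the sketch below records a snapshot of the environment alongside an experiment's results. The function name capture_environment and the choice of fields are assumptions made for this example. It uses only the Python standard library, so it does not capture accelerator details such as GPU model or driver version, which would need to be recorded separately.

```python
import json
import os
import platform
from importlib import metadata


def capture_environment():
    """Collect basic hardware and software details of the current machine."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "logical_cpus": os.cpu_count(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = capture_environment()
report = json.dumps(snapshot, indent=2)
print(report[:200])  # preview; the full report would be saved with the results
```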
References
1. O. E. Gundersen, “The Fundamental Principles of Reproducibility,” arXiv:2011.10098 [cs], Nov. 2020, Accessed: Jan. 17, 2021. [Online]. Available: http://arxiv.org/abs/2011.10098
2. J. N. Miller and J. C. Miller, Statistics and chemometrics for analytical chemistry, 6. ed. Harlow: Prentice Hall, 2010.
3. P. J. Brown, “Software portability,” Encyclopedia of Computer Science, pp. 1633–1634, 2003.
4. O. E. Gundersen, S. Shamsaliei, and R. J. Isdahl, “Do machine learning platforms provide out-of-the-box reproducibility?,” Future Generation Computer Systems, vol. 126, pp. 34–47, Jan. 2022, doi: 10.1016/j.future.2021.06.014.
5. S.-Y. Hong et al., “An Evaluation of the Software System Dependency of a Global Atmospheric Model,” Monthly Weather Review, vol. 141, no. 11, pp. 4165–4172, Nov. 2013, doi: 10.1175/MWR-D-12-00352.1.
6. P. Nagarajan, G. Warnell, and P. Stone, “The Impact of Nondeterminism on Reproducibility in Deep Reinforcement Learning,” p. 10
7. K. Coakley, C. R. Kirkpatrick, and O. E. Gundersen, “Examining the Effect of Implementation Factors on Deep Learning Reproducibility”.