Researchers find that data leakage in scientific machine learning calls the credibility of published results into question
Machine learning has become a significant research tool, and its use is expanding quickly because of its efficiency: it helps researchers make predictions by finding patterns in their data.
A pair of researchers at Princeton University in New Jersey predict a reproducibility crisis in the use of machine learning in science. Although machine learning is being sold to researchers as a tool for their studies, many of those researchers lack proper training in applying it, so their results may well be flawed, and the tool's credibility suffers as a result.
Reproducibility means that others, given the same data and methods, can replicate an experiment and obtain similar results. Machine learning complicates this process: if researchers misuse the tool, their analysis is flawed and so are their results. Others then cannot reproduce the findings, and the errors in data analysis diminish the experiment's credibility.
When applied to areas such as health and criminal justice, errors in machine learning algorithms and flaws in data models pose real risks. A leading cause of such errors is data leakage, in which information the model should not have access to improperly influences its training. To address it, the researchers suggest that scientists include evidence in their manuscripts, following a template, to show that their models are free of each of eight types of leakage.
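To make the idea concrete, here is a minimal sketch (not from the Princeton template; the numbers are invented for illustration) of one canonical leakage mode: computing a preprocessing statistic, such as a mean for centering, on the full dataset before splitting it, so that information from the held-out test point leaks into the training data.

```python
# Illustrative sketch of one common form of data leakage:
# a preprocessing statistic (here, the mean used for centering)
# is computed on the full dataset before the train/test split,
# so the test point influences how the training data is transformed.

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # last value is the held-out test point
train, test = data[:4], data[4:]

# Leaky: the centering statistic sees the test point.
leaky_mean = sum(data) / len(data)       # 22.0, dominated by the test value
leaky_train = [x - leaky_mean for x in train]

# Correct: the statistic is fit on the training data only.
train_mean = sum(train) / len(train)     # 2.5
clean_train = [x - train_mean for x in train]

print(leaky_mean, train_mean)  # the two statistics differ sharply
```

The leaky pipeline produces training features that already encode information about the test set, which can inflate the model's apparent performance; fitting all preprocessing on the training split alone avoids this.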
Although machine learning is still new to many fields, researchers should act sooner rather than later, implementing steps to avoid data leakage and head off the looming crisis.
Ameena Pathan works in strategic marketing and content development at That's Nice, LLC and is a contributor at Pharma's Almanac. She specializes in professional social media marketing, pharmaceutical marketing, online content production, web design, copywriting, and project management. She holds an MBA in international business from the Miami Herbert Business School.