The current machine learning training and testing process is not rigorous enough to ensure that trained models will work in the real world. Training can produce many different models that all perform equally well when tested in lab settings and differ only in small, arbitrary ways. These differences stem from things like the random initial values assigned to a neural network's weights before training starts, the way training data is selected, and so on.
The problem becomes evident when the models are used in a real-world setting, where these small differences can lead to huge variation in performance. The problem is called underspecification, meaning that an observed effect can have many possible causes. It is not unique to machine learning, but it is more acute there than in other fields.
The authors of the paper suggest a few ways to mitigate this problem, including training multiple models at once and testing them on real-world data. For big companies like Google, this huge amount of work might be worth the effort, but for smaller companies it might not be feasible at all.
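To make the idea concrete, here is a minimal toy sketch of underspecification. It is not the paper's experiment. It assumes a two-feature linear model in which the second feature is an exact copy of the first in the lab data, so the training data cannot say how weight should be split between them. All names (`make_data`, `train`, `run_experiment`), the seeds, and the hyperparameters are hypothetical choices made for illustration. Several models are trained from different random starting points and then tested twice: once on lab-style data and once on shifted data where the two features are no longer tied together.

```python
import random


def make_data(n, rng, shifted=False):
    """Return (x1, x2, y) rows where the target y depends only on x1.

    In "lab" data x2 is an exact copy of x1, so the data cannot tell the
    two features apart. In shifted data x2 is drawn independently.
    """
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0) if shifted else x1
        rows.append((x1, x2, x1))
    return rows


def mse(w, rows):
    """Mean squared error of the linear model w[0]*x1 + w[1]*x2."""
    return sum((w[0] * x1 + w[1] * x2 - y) ** 2 for x1, x2, y in rows) / len(rows)


def train(seed, rows, steps=200, lr=0.5):
    """Full-batch gradient descent from a seed-dependent random start."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x1, x2, y in rows:
            err = w[0] * x1 + w[1] * x2 - y
            g0 += 2.0 * err * x1 / n
            g1 += 2.0 * err * x2 / n
        w[0] -= lr * g0
        w[1] -= lr * g1
    return w


def run_experiment(seeds=(0, 1, 2, 3, 4)):
    """Train one model per seed; report lab error and shifted error."""
    data_rng = random.Random(1234)  # assumed fixed seed so all models see the same data
    train_rows = make_data(200, data_rng)
    lab_rows = make_data(200, data_rng)
    shifted_rows = make_data(200, data_rng, shifted=True)
    results = []
    for seed in seeds:
        w = train(seed, train_rows)
        results.append((seed, mse(w, lab_rows), mse(w, shifted_rows)))
    return results


if __name__ == "__main__":
    for seed, lab, shifted in run_experiment():
        print(f"seed={seed}  lab MSE={lab:.2e}  shifted MSE={shifted:.2e}")
```

Under these assumptions, every seed should reach a near-zero lab error, while the shifted error depends on how each run happened to split the weight between the two features. Only the second test separates the models, which is the reason for checking multiple trained models against data that resembles deployment conditions.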