When computed 'in the wild' by the community, high-level music descriptors may not perform the way they did in the lab.
To what extent can we trust our automated data processing pipelines?
To what extent can we trust 'ground truth' in supervised machine learning to be a reliable oracle?