I should have chimed in on this discussion earlier, but let me now add some info as a Picard developer.
petitminion If an audio file is linked to a recording, and the recording to a release, but the audio file doesn't match the release, it's an MB DB error. And I don't think fw should handle those
On MB there is a difference between a recording ID and a track ID. The recordings do indeed ignore some differences that might be present on the track level. E.g. the same recording with varying silence at the end still is considered the same recording on MB. This e.g. affects the gapless album tracks animaldaydream is referring to: An album might be, for artistic reasons, a continuous, gapless piece of audio, where one track seamlessly goes into the next. But single tracks of that album might e.g. be added to compilations or best-of albums. And there they probably will get added with a short silence at the end to separate them from the next, now unrelated, track.
Likewise mastering differences are ignored, e.g. the CD and Vinyl release, even though mastered for different media, will share the same recordings.
[deleted] I have been made aware that AcoustID fingerprints are actually more sensitive to audio differences than I thought, that they can even pick up differences between a lossless audio file and its transcoded lossy versions.
The fingerprints will be slightly different between the lossless and lossy (or different quality lossy) files. But AcoustID as a whole is designed to ignore exactly these differences. If you have a lossless FLAC and create a lossy MP3 and AAC file from it then all those ideally will end up with the same AcoustID. Otherwise the whole thing would not work for its intended purpose of identifying the recording.
AcoustID is intentionally designed to ignore some differences. There are also other inaccuracies that can lead to different recordings still being identified as the same. AcoustID takes the fingerprint of the first 30 seconds of a recording plus the duration of the recording for identification. First, this means differences after the first 30 seconds are not picked up (this could, for example, ignore differences between those "clean lyrics" versions and the normal version). Second, duration differences up to a certain threshold (I'd need to look this up, but I think it's 20 or 30 seconds) are ignored and still assigned the same AcoustID.
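To illustrate why both effects can merge different recordings, here is a toy sketch in Python. It is not how the AcoustID server actually compares fingerprints. The `AudioFile` type, the `probably_same_recording` helper, the prefix length, the error threshold and the duration tolerance are all made up for illustration. The only thing taken from the real world is that a raw fingerprint is a sequence of 32-bit integers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AudioFile:
    name: str
    fingerprint: Sequence[int]  # raw fingerprint as 32-bit integers
    duration: float             # seconds


def bit_error_rate(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of differing bits over the overlapping part of two fingerprints."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    differing = sum(bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(a[:n], b[:n]))
    return differing / (32 * n)


def probably_same_recording(
    a: AudioFile,
    b: AudioFile,
    prefix_items: int = 4,            # hypothetical stand-in for "the first N seconds"
    max_error_rate: float = 0.1,      # hypothetical similarity threshold
    duration_tolerance: float = 20.0, # hypothetical, see the uncertainty noted above
) -> bool:
    # Durations that differ by more than the tolerance never match.
    if abs(a.duration - b.duration) > duration_tolerance:
        return False
    # Only the beginning of each fingerprint is compared, so anything
    # that differs later in the audio is invisible to this check.
    error = bit_error_rate(a.fingerprint[:prefix_items], b.fingerprint[:prefix_items])
    return error <= max_error_rate


# Same start, different ending (think "clean lyrics" edit), 3 seconds longer.
explicit = AudioFile("explicit.flac", [0b1010, 0b1100, 0b1111, 0b0001, 0xFFFF], 200.0)
clean = AudioFile("clean.mp3", [0b1010, 0b1100, 0b1111, 0b0001, 0x0000], 203.0)
# Same audio data but a full minute longer, e.g. an extended version.
extended = AudioFile("extended.flac", explicit.fingerprint, 260.0)

same_as_clean = probably_same_recording(explicit, clean)
same_as_extended = probably_same_recording(explicit, extended)
```

With these made-up values the clean edit is treated as the same recording (the difference sits after the compared prefix and the duration is within tolerance), while the extended version is rejected on duration alone.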
Some general thoughts from my side:
- You should not base file duplicates on the concept of MB recordings or AcoustID alone. These can be a strong indication of duplicates, but they do not necessarily mean files taken from different releases are universally interchangeable.
- You can use AcoustID to find matching recordings; after all, this is what AcoustID has been designed to do. As long as there is a sufficient number of submitted fingerprints and there are no mass wrong submissions, it is pretty good at that job. But for full identification of recording + release you need to take more information into account (existing metadata and duration is what you'll normally have). This should consider all files belonging to one release as a whole. Then you can link a file to a combined recording and release MBID.
- Fully automating this will cause errors (as pointed out above by rob and animaldaydreams) and is better accompanied by user interaction. At least there should maybe be a system where uncertain matches get put forward to the user to review and fix, and there should be a system for users to manually correct matching mistakes.
- There are also a lot of simple cases where there is no big ambiguity in matches. That's true for a lot of newer digital releases.
- I believe automatic matching could be better and there is much potential to improve the existing lookup in e.g. Picard. rob's recent ventures into this (see links in his post above) are pretty interesting, take a look at them. It's solely based on metadata for now, but it looks at all the files and tries to group and match them into MB releases in the best way. For a long time I have been dreaming about a combined lookup in Picard that uses both metadata and AcoustID fingerprints to get the best match (but so far I did not find the time to work on this). But in the end I'm convinced some user control is necessary.
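To make the "consider the release as a whole, and send uncertain matches to the user" idea a bit more concrete, here is a rough Python sketch. It is not what Picard or rob's code does. All names (`Track`, `Release`, `LocalFile`, `release_score`, `choose_release`) and all thresholds are hypothetical, and it assumes each file already has a recording ID from some earlier lookup.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Track:
    recording_id: str
    duration: float  # seconds, as listed on this particular release


@dataclass(frozen=True)
class Release:
    release_id: str
    tracks: Tuple[Track, ...]


@dataclass(frozen=True)
class LocalFile:
    path: str
    recording_id: str  # assumed to come from an earlier lookup, e.g. AcoustID
    duration: float


def release_score(files: Sequence[LocalFile], release: Release,
                  duration_tolerance: float = 2.0) -> float:
    """Share of files/tracks that pair up on recording ID and track duration."""
    unmatched = list(release.tracks)
    hits = 0
    for f in files:
        for t in unmatched:
            if (t.recording_id == f.recording_id
                    and abs(t.duration - f.duration) <= duration_tolerance):
                unmatched.remove(t)  # each track is used at most once
                hits += 1
                break
    return hits / max(len(files), len(release.tracks), 1)


def choose_release(files: Sequence[LocalFile], candidates: Sequence[Release],
                   accept: float = 0.9, margin: float = 0.2
                   ) -> Tuple[Optional[Release], str]:
    """Return (best candidate, "auto" or "review")."""
    ranked = sorted(((release_score(files, r), r) for r in candidates),
                    key=lambda pair: pair[0], reverse=True)
    if not ranked:
        return None, "review"
    best_score, best = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    # Only accept automatically when the winner is both good and clearly ahead.
    if best_score >= accept and best_score - runner_up >= margin:
        return best, "auto"
    return best, "review"


# A gapless album and a compilation sharing one recording, where the
# compilation version has some silence appended.
album = Release("album", (Track("rec-1", 180.0), Track("rec-2", 200.0)))
compilation = Release("compilation", (Track("rec-1", 183.0), Track("rec-9", 240.0)))

whole_album = [LocalFile("01.flac", "rec-1", 180.0), LocalFile("02.flac", "rec-2", 200.0)]
lone_file = [LocalFile("song.mp3", "rec-1", 181.5)]

album_result = choose_release(whole_album, [album, compilation])
lone_result = choose_release(lone_file, [album, compilation])
```

With both album files present, the album wins clearly and is accepted automatically. A single file whose duration fits both releases scores the same on each, so it ends up in the review pile instead of being silently assigned to one of them.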