Troi recommendation system

mayhem

Any news on this front? Is everything stopped just because of this?

petitminion

yes

We also have to discuss funding with nlnet, because they might not want to fund this if it's not decentralized enough for them. We will see o/

petitminion

Okay, so we had our meeting. We agreed we want to implement this with or without funding. We can also look into the typesense solution by running it in another container. This will be optional and could be disabled by the admin on instances without enough resources.

We could also use nmslib, since it would make this feature more accessible, but we would be dependent on them. I would wait and see whether they really update their packaging. In the meantime I'm going to look into integrating the typesense solution into our setup.

We also shared we are very happy you're so enthusiastic about this o/

petitminion

Okay, sorry for the delay :s
Typesense will shortly be implemented in our dev setup. Maybe we can update https://github.com/metabrainz/listenbrainz-content-resolver to support both index methods (nmslib and typesense)? That way we can work with typesense now and still use nmslib in the future if it gets updated.

petitminion

After looking more in depth into https://github.com/metabrainz/listenbrainz-content-resolver, I think it might be better to write a content resolver inside funkwhale directly, because the infrastructure and db are very different from mb's.

I have a running setup with typesense and a collection like this one: https://github.com/metabrainz/listenbrainz-server/blob/78e37b78ff9e1a246d24adb2a0a0108a1462b284/mbid_mapping/mapping/typesense_index.py

Now I'm stuck, because you suggested we could update the Troi library to only suggest tracks that are accessible locally. Is this really doable?
Or should I go the other way around and resolve the troi recommendations locally?

The second option puts the load on the funkwhale pod side, which I suppose is better for mb infrastructure.

mayhem

OK, great -- I'm glad we can use Typesense, since this is going to be easier than messing about with the other search approach. Phew.

As for integrating Troi into Funkwhale, I've learned a few things in the past few weeks and now think it would be better to not add content resolution to Troi (that is not its intended job, really). A better way forward is to have the FW content resolver read JSPF playlists and then resolve them to local files. This way troi does not need to be modified and it becomes location agnostic, meaning that we can embed it into FW or have it hosted someplace else. JSPF is well suited for this purpose and makes everything more interconnected and modular.

Which is why I started the listenbrainz-content-resolver project -- it has become clear that more people are interested in this project, so it follows the JSPF -> resolved JSPF/m3u format path.
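To make the "read JSPF playlists" step concrete, here is a minimal sketch of pulling the track metadata out of a JSPF document. It assumes the usual JSPF layout (a top-level `playlist` object with a `track` list whose entries carry `creator`, `title` and `identifier`). The function name `tracks_from_jspf`, the example artist and the all-zero recording ID are placeholders for illustration, not anything taken from Troi or Funkwhale.

```python
import json


def tracks_from_jspf(jspf_text):
    """Return (creator, title, identifier) tuples for every track in a JSPF playlist."""
    playlist = json.loads(jspf_text).get("playlist", {})
    tracks = []
    for track in playlist.get("track", []):
        identifier = track.get("identifier")
        # The identifier may be a single string or a list of strings; keep the first.
        if isinstance(identifier, list):
            identifier = identifier[0] if identifier else None
        tracks.append((track.get("creator"), track.get("title"), identifier))
    return tracks


# Placeholder playlist: the artist and the all-zero MBID are made up.
EXAMPLE_JSPF = """
{
  "playlist": {
    "title": "Example playlist",
    "track": [
      {
        "creator": "Example Artist",
        "title": "Inhale - Radio edit 1997",
        "identifier": "https://musicbrainz.org/recording/00000000-0000-0000-0000-000000000000"
      }
    ]
  }
}
"""

for creator, title, identifier in tracks_from_jspf(EXAMPLE_JSPF):
    print(creator, "|", title, "|", identifier)
```

Each tuple is what a local resolver would then try to match against the instance's own library.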

As for creating a content resolver in FW, I can totally see how the internal DB might make it better to have the resolver in FW, no problem. But, remember that the resolver has two distinct parts that are both critical:

Search for tracks (solved)
De-tune metadata. This component takes potentially dirty track/artist names and converts them to something clean. Example: "Inhale - Radio edit 1997" -> "Inhale"

Playlist metadata and funkwhale metadata may each contain this cruft, which complicates content resolution. Of the two components, the first one is pretty simple. When it's done, it's done. The latter is going to evolve over time as people make the detuning process better. If you don't use the LB library for this, you will have to continually improve/update this code as your users find tracks that are not matched.
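As a toy illustration of what detuning means, the sketch below strips a trailing "- ... edit/remix/version ..." suffix or a trailing bracketed note from a name. The function name `detune` and the pattern are assumptions made up for this example. They are not the LB implementation, which handles far more cases.

```python
import re

# Hypothetical cruft pattern (an assumption, not the LB rules):
#   " - Radio edit 1997"   -> dash followed by text containing a known keyword
#   " (Live)" / " [Demo]"  -> trailing parenthesised or bracketed note
_CRUFT = re.compile(
    r"\s*(?:"
    r"-\s+[^-]*\b(?:edit|remix|mix|version|remaster|remastered|live)\b[^-]*"
    r"|[\(\[][^\)\]]*[\)\]]"
    r")\s*$",
    re.IGNORECASE,
)


def detune(name):
    """Return the name with trailing cruft removed, or the name unchanged."""
    cleaned = _CRUFT.sub("", name).strip()
    # Never return an empty string: fall back to the original name.
    return cleaned or name


print(detune("Inhale - Radio edit 1997"))  # Inhale
print(detune("Blue - Green"))              # Blue - Green (no keyword, left alone)
```

Even this small pattern shows why the component keeps evolving: every new suffix style users run into needs another rule.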

What I can do is re-work the lb content resolver to expose the de-tuning engine with a minimal set of requirements -- quite possibly 1 or 2 python libraries (unidecode is one). This way you can include the library without adding a lot more stuff to install.

So, the lookup process then goes like this:

Build the typesense index with duplicate metadata records: one for the original metadata and one for a detuned version (if it can be detuned). Insert both (all?) records into the index. This ensures that the detuning happens on the FW side and on the inbound metadata side.
Lookup track with metadata as given. Success? Done.
No matches? Examine the given metadata and run the query artist and/or recording name through the detuner. If the detuning was successful, redo the lookup.
Still no luck? Bummer.
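
The steps above can be sketched roughly as follows. This is only an outline under loud assumptions: a plain dict stands in for the typesense index (so there is no fuzzy search here), `toy_detune` is a stand-in detuner that just cuts at the first " - ", and all function names are hypothetical.

```python
def toy_detune(text):
    # Stand-in detuner (assumption): drop everything after the first " - ".
    return text.split(" - ")[0].strip()


def _key(artist, title):
    # Case-insensitive exact key; a real search index would match fuzzily.
    return (artist.casefold(), title.casefold())


def build_index(tracks, detune=toy_detune):
    """Index each (track_id, artist, title) under its original and detuned metadata."""
    index = {}
    for track_id, artist, title in tracks:
        index.setdefault(_key(artist, title), track_id)
        # Second record for the detuned form; identical keys are simply kept once.
        index.setdefault(_key(detune(artist), detune(title)), track_id)
    return index


def lookup(index, artist, title, detune=toy_detune):
    """Look up the metadata as given, then retry once with detuned metadata."""
    hit = index.get(_key(artist, title))
    if hit is not None:
        return hit
    d_artist, d_title = detune(artist), detune(title)
    if (d_artist, d_title) != (artist, title):
        # Detuning changed something, so a second lookup is worth trying.
        return index.get(_key(d_artist, d_title))
    return None  # Still no luck.


library = [(1, "Example Artist", "Inhale - Radio edit 1997")]
index = build_index(library)
print(lookup(index, "Example Artist", "Inhale"))                  # 1
print(lookup(index, "Example Artist", "Inhale - Album version"))  # 1
print(lookup(index, "Example Artist", "Exhale"))                  # None
```

The first call succeeds thanks to the detuned record on the index side, the second thanks to detuning the query.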

How does all this sound?

petitminion

mayhem What I can do is re-work the lb content resolver to expose the de-tuning engine with a minimal set of requirements -- quite possibly 1 or 2 python libraries (unidecode is one). This way you can include the library without adding a lot more stuff to install

I'm not sure using the lb content resolver lib inside funkwhale is a good approach. The databases are different, and it would be easier for fw to have an approach that is not based on a file directory. Also, we wanted to have a jspf resolver for other things, so we need to write this code anyway. That's why I would implement this without the library.
If we want to use the lib, we either have to make it more agnostic or fork it and change a few things. But this also means work and maintenance.

mayhem it would be better to not add content resolution to Troi

That's what I thought too o/

mayhem So, the lookup process then goes like this:

Looks good! I wouldn't spend too much energy trying to match files that are wrongly tagged. I would prefer to focus on having a clean database rather than working with a dirty one, but I understand this can be useful anyway o/ and detuning is interesting o/

mayhem

petitminion Looks good! I wouldn't spend too much energy trying to match files that are wrongly tagged. I would prefer to focus on having a clean database rather than working with a dirty one, but I understand this can be useful anyway o/ and detuning is interesting o/

This is not about badly tagged data. I am talking about correctly tagged data.

But, go ahead, don't use it for now and just proceed to use Troi with JSPF output. When you realize how poorly that works, we can talk again. 🙂

petitminion

Do you need help working on the de-tuning thing?

mayhem

How strong are your regexp skills? 😃

petitminion

mayhem if you need any help let me know, I'm not very experienced with regex but I can manage o/

mayhem

I've finally had time to work on this, and a version with feature parity with what LB currently uses is now implemented in this new clean library:

https://github.com/metabrainz/listenbrainz-matching-tools

Let me know if you have questions on how to use it.

petitminion

mayhem amazing! thx! 🙂 Do you plan to use this on the mb side? I mean, will this be improved and maintained, or is that not yet known?

On my side, typesense and mb matching are implemented in fw. But I'm still working on the troi implementation (actually it's implemented, but to be efficient we need to improve the funkwhale radio system). I'm sorry I'm afk a lot these days, I hope to get an experimental version working soon enough!

mayhem

This will be used in lb-content-resolver and in listenbrainz-server and I expect other projects will pick it up as well, so yes, I plan to maintain it.

No rush on getting this working -- we've got lots on too!

petitminion

hello 🙂
We've been asking ourselves whether it's better to use postgres or redis to store matches. The main question is: how long do you think we should keep a match in the db before deleting it, in case the metadata changes?

mayhem

It would be ideal to invalidate the match when the metadata changes, but that could be a lot more work. But, using redis and then discarding all keys in redis when any metadata changes would be a start.
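
A minimal sketch of that "discard everything on any change" approach, using a plain dict as a stand-in for redis so it runs anywhere. The class and method names are hypothetical. With redis-py, the clear step would presumably map to something like `flushdb()` on a database dedicated to matches.

```python
class MatchCache:
    """Stores playlist-track -> local-track matches; a dict stands in for redis."""

    def __init__(self):
        self._matches = {}

    def get(self, artist, title):
        return self._matches.get((artist, title))

    def set(self, artist, title, track_id):
        self._matches[(artist, title)] = track_id

    def on_metadata_changed(self):
        # Coarse invalidation: any metadata change discards every cached match.
        self._matches.clear()


cache = MatchCache()
cache.set("Example Artist", "Inhale", 1)
print(cache.get("Example Artist", "Inhale"))  # 1
cache.on_metadata_changed()                   # e.g. called after a library import or tag edit
print(cache.get("Example Artist", "Inhale"))  # None
```

It throws away more than strictly necessary, but it can never serve a stale match for local metadata.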

petitminion

mayhem indeed o/ But what if the metadata changes on the musicbrainz side?

mayhem

If you want to be 100% correct, then install a MB instance and get notified when the metadata changes. That's even more work!

Realistically, I wouldn't bother storing the matches, honestly. Each lookup is already expensive and I would expect that not that many playlist resolutions are going to happen per hour/minute. And given that the greatest expense in looking up a match is loading all the data in the first place, repeating a few lookups that would otherwise have been cache hits won't be noticeable in the overall performance.

petitminion

Each lookup is already expensive

Lookup of what?

the greatest expense in looking up a match is loading all the data in the first place,

Loading what data?

Sorry, I didn't understand ^^ But I get that your point is to not cache the matches and to repeat the matching process at each playlist generation. I will test that o/
