Troi recommendation system

petitminion

@mayhem proposed this:

"Well, the proof of concept is the MBID Mapper that is in production on ListenBrainz. I am not proposing to write any amount of new code -- just slapping together bits that already exist, so proof of concept is pretty much the working thing.
Let me lay out the moving parts:

1. Typesense index -- this would require running a typesense container: https://typesense.org/
2. An indexer that builds an initial index of all the metadata and receives updates of tracks being added, removed or renamed from the funkwhale DB.
3. A troi component of some sort. Let's assume that we want to implement daily-jams in funkwhale. This component would need to include Troi and be called on a nightly basis to generate the playlist.
4. A modified daily-jams patch that downloads the recs from LB -- it would do the normal daily-jams stuff, but it would add an extra step of narrowing the recommended tracks down to only tracks available locally.
That's about it. This then makes troi a part of FW -- and troi patches should be able to run and generate playlists now. I might need to think about how the FW aspects of troi could be included in troi patches without having to make custom patches, but that is a minor design detail.

Does this make sense so far?
"

petitminion

The index is to link fw tracks to MBIDs? Do you have a link to the LB code?
I tried to use troi as a Python lib and it works well. We can easily create a task that uses the troi patches.
We could also sort the tracks on the fw side, but maybe doing a special fw patch is more efficient, idk.
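
As a rough illustration of what narrowing the recommendations down on the fw side could look like, here is a minimal sketch. The data shapes, the `local_index` dict and the `make_key` and `filter_to_local` helpers are assumptions made up for this example, not Troi or Funkwhale code, and it only does exact matching on case-folded artist and title.

```python
def make_key(artist: str, title: str) -> tuple[str, str]:
    """Build a case-insensitive lookup key (hypothetical helper)."""
    return (artist.casefold().strip(), title.casefold().strip())


def filter_to_local(recommended, local_index):
    """Keep only recommended tracks found in the local index, preserving order."""
    resolved = []
    for rec in recommended:
        track_id = local_index.get(make_key(rec["artist"], rec["title"]))
        if track_id is not None:
            resolved.append({**rec, "local_track_id": track_id})
    return resolved


# Assumed local library: lookup key -> local track id.
local_index = {
    make_key("Portishead", "Strangers"): 101,
    make_key("Queen", "We Will Rock You"): 102,
}

# Assumed shape of the recommendations coming back from LB.
recommended = [
    {"artist": "portishead", "title": "Strangers"},
    {"artist": "U2", "title": "Where the Streets Have No Name"},
    {"artist": "Queen", "title": "We Will Rock You"},
]

playlist = filter_to_local(recommended, local_index)
```

Exact matching like this breaks as soon as the metadata differs by a typo, which is why the rest of the thread is about fuzzy lookups.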

mayhem

This, but ported to FW: https://github.com/metabrainz/listenbrainz-server/blob/master/mbid_mapping/mapping/typesense_index.py
Great
There is a concept I've been pondering for a while and now might be the time for it. You know how troi makes a pipeline composed of Elements? I want to make it so that "local plugin" patches could be loaded by the daily-jams patch. So, if it finds a plugin in a certain directory, then it splices that pipeline into the daily-jams pipeline near the end and then executes the whole thing. That would allow daily-jams to be platform agnostic, and only one patch would be needed on the FW side to look up recordings and, if a match is found in the index, pass it along the pipeline. If not, drop the track and move on.

Thoughts?
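
The splicing idea can be sketched with a toy pipeline. The `Element`, `FetchRecommendations`, `LocalResolver` and `run_pipeline` names below are stand-ins invented for this example and are not Troi's actual classes or API; the sketch also assumes the local plugin is simply appended at the very end.

```python
class Element:
    """One pipeline stage: takes a list of recordings, returns a list."""

    def process(self, recordings):
        raise NotImplementedError


class FetchRecommendations(Element):
    """Stand-in for the platform-agnostic part of a patch."""

    def process(self, recordings):
        return recordings + [
            {"artist": "Portishead", "title": "Strangers"},
            {"artist": "U2", "title": "Where the Streets Have No Name"},
        ]


class LocalResolver(Element):
    """Stand-in for the local plugin: drop tracks missing from the index."""

    def __init__(self, index):
        self.index = index

    def process(self, recordings):
        return [r for r in recordings if (r["artist"], r["title"]) in self.index]


def run_pipeline(elements, local_plugins=()):
    """Run the main elements, then any local plugin elements spliced in last."""
    recordings = []
    for element in list(elements) + list(local_plugins):
        recordings = element.process(recordings)
    return recordings


index = {("Portishead", "Strangers")}
result = run_pipeline([FetchRecommendations()], local_plugins=[LocalResolver(index)])
```

With no local plugin, the same main pipeline runs unchanged, which is the platform-agnostic property described above.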

petitminion

mayhem
I think I don't understand.

mapping.canonical_musicbrainz_data is a table used to index artist_credit and recording_name to MBIDs? Then the typesense index is used to search for a match between the canonical table and the raw data (metadata not tagged with MBIDs)? So the match is only done by comparing the artist_credit, recording_name and release strings?

You are proposing to have a similar index in fw that maps fw tracks to MBIDs based on metadata strings? This way we don't actually touch the fw tracks' metadata and we have an index to build the recommendation system on, it seems great o/
I'm not sure we should use Typesense. We could build an index with django.

You want to import troi in funkwhale, then funkwhale calls troi.patch(local_patch), and the troi lib can handle local_patch at the end of the pipeline?

PS:
This is useful for understanding the mapping: https://github.com/metabrainz/listenbrainz-server/blob/e461b465f9bcb3496372affa468d5b5f31c0d120/docs/developers/mapping.rst

mayhem

mapping.canonical_musicbrainz_data is the entirety of the MusicBrainz data whittled down to only canonical recordings (the recordings that people think of as being "the recording" for a given track). Given that MB tracks multiple versions of everything this step is critical to build a mapper.

Funkwhale doesn't do that, it only has metadata about its tracks, so the funkwhale metadata can be used directly in its place. And yes, a metadata index where we don't actually use MBIDs at all. No need for collections to be tagged. No need to modify files.

petitminion I'm not sure we should use Typesense. We could build an index with django.

Que? Django is a web framework and Typesense is a text search engine. These are not interchangeable.

Typesense is the right tool for this job. Typo-tolerant, fast text search is exactly what is needed in this case. We can tune how many typos are allowed in a search (based on the size of the index -- larger index, longer search times). Typesense is the right tool here and I will develop this project using it. Should FW have issues with installing typesense, then you and the rest of the community can find some worse performing index to install in its place or to reinvent the wheel as you see fit. Just keep me out of these politics.

petitminion You want to import troi in funkwhale, then funkwhale calls troi.patch(local_patch), and the troi lib can handle local_patch at the end of the pipeline?

I looked at this in detail yesterday and I'll add a command line --post-process that can specify another patch to run as a post processing patch. In this case that would be the content resolver patch that makes calls to typesense. Using the post-process flag allows the main troi patches to run unmodified and have the content resolver tacked into the pipeline at the end, ensuring that only local tracks are recommended.

petitminion

mayhem And yes, a metadata index where we don't actually use MBIDs at all

So when troi checks whether the recordings are locally accessible, it will just make a query to see if the artist_credit, album name and recording title match some data in the fw index?

mayhem Typesense is the right tool here and I will develop this project using it. Should FW have issues with installing typesense, then you and the rest of the community can find some worse performing index to install in its place or to reinvent the wheel as you see fit. Just keep me out of these politics.

The discussion will certainly happen here, you are free to take part in it or not as you wish ofc.

I understand typesense is a good tool, but we are creating federated software here. It runs on a large variety of platforms, so we need to be very careful about increasing load, installation complexity, etc.
Idk what processing times we should try to reach, but if a PostgreSQL request through django is fast enough, it's maybe the better option. But I will let people who are more aware of our deployment methods enlighten us o/

mayhem I looked at this in detail yesterday and I'll add a command line --post-process

Perfect! I suppose we will use troi as a library in our backend, so we should make sure we can pass this as a function argument and not only as a command line one, but it should be included in the change anyway o/
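
One way to offer the same post-processing step both as a function argument and as a command line flag is sketched below. Everything here (`generate_playlist`, `daily_jams`, `content_resolver`, the `POST_PROCESSORS` registry) is hypothetical and only illustrates the shape of the idea; it is not how Troi implements `--post-process`.

```python
import argparse


def generate_playlist(patch, post_process=None):
    """Run `patch`, then optionally hand its output to `post_process`.

    Both arguments are plain callables here (an assumption for this sketch).
    """
    tracks = patch()
    if post_process is not None:
        tracks = post_process(tracks)
    return tracks


def daily_jams():
    """Stand-in for an unmodified main patch."""
    return [
        "portishead - strangers",
        "u2 - where the streets have no name",
        "queen - we will rock you",
    ]


def content_resolver(tracks):
    """Stand-in post-processing patch: keep only locally available tracks."""
    local = {"portishead - strangers", "queen - we will rock you"}
    return [t for t in tracks if t in local]


# Registry mapping names usable on the command line to callables.
POST_PROCESSORS = {"content-resolver": content_resolver}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical playlist generator")
    parser.add_argument("--post-process", choices=sorted(POST_PROCESSORS), default=None)
    args = parser.parse_args(argv)
    # No flag given: .get(None) returns None, so no post-processing happens.
    post = POST_PROCESSORS.get(args.post_process)
    return generate_playlist(daily_jams, post_process=post)
```

A backend task would call `generate_playlist(daily_jams, post_process=content_resolver)` directly, while the command line path goes through `main(["--post-process", "content-resolver"])`; both end up in the same function.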

mayhem

I've been talking about the content resolver here at FOSDEM and it's gotten more interest than I expected, making it clear that a solution other than typesense is needed. I'll try and find a simple python lib that can get the job done. Also, I think a standalone project might be a good idea for this library, so others can use it.

Overall, this doesn't change much, just adding another abstraction layer. I am certainly fired up to get this working soon.

petitminion

if you need any help you can ping me 🙂

mayhem

I repurposed an old stillborn project and have 80% of the needed code in place already. Index building and search is already working!
I do worry about the scalability of the solution -- do you have any numbers for how many tracks people typically have in their pod? And for the largest pods?
Hopefully I'll have some code for you to try today, or in the next few days.

petitminion

mayhem we have this https://network.funkwhale.audio/dashboards/d/overview/funkwhale-network-overview?orgId=1&refresh=2h

I just added track count o/ but this does not represent the amount of track metadata, only local tracks. The recommendation query will run on all the metadata, so we have to assume it is way more than the local track count. Taking the whole network's track count as a reference could be more accurate: 860k tracks, because we're going to enable a feature to crawl all the tracks of the network in one pod.

mayhem

OK, that makes sense. Our text indexing solution needs to be solid then. I've been playing with the Whoosh Python library, which is good, but buggy and it has been abandoned. Typesense is still the best thing out there. Maybe we'll have to make do with postgres text search -- the key issue is that we need fuzzy searching, not just exact searches.

But let me get the proof of concept working, then we can discuss what we should use.

mayhem

Ok, here is a first cut of this project:

https://github.com/metabrainz/listenbrainz-content-resolver

It will need some restructuring to be more generic, but right now it is a proof of concept that can take a LB generated JSPF file and resolve it to a local M3U file. Have a look and we'll work out next steps from here.
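
To make the JSPF to M3U step concrete, here is a small sketch under stated assumptions. It relies on the standard JSPF layout (a `playlist` object with a `track` list whose entries carry `creator` and `title`); the `resolve` callable, the `library` dict and the file path are hypothetical, and this is not the code from the listenbrainz-content-resolver repository.

```python
import json


def jspf_to_m3u(jspf_text, resolve):
    """Convert a JSPF playlist to M3U text, skipping tracks `resolve` cannot find.

    `resolve(artist, title)` is a hypothetical callable returning a local
    file path or None.
    """
    playlist = json.loads(jspf_text)["playlist"]
    lines = ["#EXTM3U"]
    for track in playlist.get("track", []):
        artist, title = track.get("creator", ""), track.get("title", "")
        path = resolve(artist, title)
        if path is None:
            continue  # not available locally: drop it
        # -1 marks the duration as unknown in extended M3U.
        lines.append(f"#EXTINF:-1,{artist} - {title}")
        lines.append(path)
    return "\n".join(lines) + "\n"


# Assumed local library, keyed on exact (artist, title).
library = {("Portishead", "Strangers"): "/music/portishead/strangers.flac"}

jspf = json.dumps({
    "playlist": {
        "title": "Daily Jams",
        "track": [
            {"creator": "Portishead", "title": "Strangers"},
            {"creator": "Queen", "title": "We Will Rock You"},
        ],
    }
})

m3u = jspf_to_m3u(jspf, lambda artist, title: library.get((artist, title)))
```

In a real resolver the `resolve` callable is where the fuzzy index lookup would go.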

petitminion

mayhem
If I understand correctly, the main interest of whoosh is Levenshtein distance. Considering whoosh is not maintained anymore, maybe using the trigram similarity that is implemented in django could do the trick?
According to the PostgreSQL docs, a b-tree index should be efficient for this kind of query. And django allows us to index the metadata we need; we could index it in the format we want: artist name - recording name, for example.

There is also the full text search option
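
For readers unfamiliar with trigram similarity, here is a pure-Python sketch of the idea, loosely modeled on what PostgreSQL's pg_trgm extension computes (shared trigrams divided by all distinct trigrams). It is a simplification: it pads the whole string rather than each word, and the `best_match` helper and the 0.3 threshold are choices made for this example. In PostgreSQL itself, the pg_trgm documentation describes GiST and GIN indexes, rather than b-tree, for accelerating similarity searches.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of 3-character windows of the padded, lower-cased text."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (a value from 0 to 1)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(query, rows, threshold=0.3):
    """Linear scan for the most similar row; None if nothing reaches threshold."""
    if not rows:
        return None
    score, row = max((trigram_similarity(query, r), r) for r in rows)
    return row if score >= threshold else None


rows = [
    "portishead strangers",
    "u2 where the streets have no name",
    "queen we will rock you",
]

match = best_match("portishead stranger", rows)
```

The linear scan is only for illustration; the point of a database index is to avoid comparing the query against every row.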

mayhem

Could work, not sure, we'll have to try. But, just to be clear, it isn't just the edit distance we need -- that is easy. But we need to search a body of data using that edit distance. Postgres fulltext search might do it as well.

This is where it would be useful for you to jump in and explore the various options that allow us to look up fuzzy matches from a table that we've generated.

petitminion

mayhem

mayhem we need to search a body of data using that edit distance

I don't understand that part :s Your method is to search for linking park - thumb as our query in a generated index with wrong data like lnikign prk - thmub. I don't understand why we don't make two queries instead of one: one for the artist and then another one for the recordings?

mayhem This is where it would be useful for you to jump in and explore the various options that allow us to look up fuzzy matches from a table that we've generated.

Do you have a table of recordings to look up for test purposes? E.g. fuzzy data linked to clean data?

mayhem

petitminion I don't understand that part :s

Let's say we have a table with lookup fields that hold the important data, lower-cased, with punctuation and white space removed (along with a file_path that lets us know where the file is). Something like:

portisheadstrangers
u2wherethestreetshavenoname
queenwewillrockyou

And let's assume that there are 100k rows of this, one for each track. We must now find a search solution that still finds the right row when we query for something that is not quite 100% accurate. In the case below, the "s" at the end is missing:

portisheadstranger

should find

portisheadstrangers

Calculating the edit distance between two strings simply compares two strings. What we need is to implement a fuzzy search in a DB table (or document index).
Does this make more sense?
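
The lookup table described above can be sketched in a few lines. The `lookup_key` and `fuzzy_lookup` helpers, the file paths and the 0.9 cutoff are assumptions for this example; the fuzzy part uses the standard library's `difflib.get_close_matches`, which scores with a similarity ratio rather than a true edit distance and scans every key, so it shows the behaviour wanted, not a solution that scales to 100k rows.

```python
import difflib


def lookup_key(artist: str, title: str) -> str:
    """Lower-case and drop everything that is not a letter or a digit."""
    return "".join(ch for ch in (artist + title).lower() if ch.isalnum())


# Assumed table: lookup key -> file_path.
table = {
    lookup_key("Portishead", "Strangers"): "/music/portishead/strangers.flac",
    lookup_key("U2", "Where the Streets Have No Name"): "/music/u2/streets.flac",
    lookup_key("Queen", "We Will Rock You"): "/music/queen/rock_you.flac",
}


def fuzzy_lookup(query: str, table: dict, cutoff: float = 0.9):
    """Return the file_path of the closest key, or None if nothing is close."""
    matches = difflib.get_close_matches(query, list(table), n=1, cutoff=cutoff)
    return table[matches[0]] if matches else None


path = fuzzy_lookup("portisheadstranger", table)  # query missing the final "s"
```

Comparing two strings is the easy part; the open question in the thread is which index lets this kind of query run quickly over the whole table.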

petitminion Do you have a table of recordings to look up for test purposes? E.g. fuzzy data linked to clean data?

How many GBs would you like? 🙂

mayhem

petitminion I don't understand why we don't make two queries instead of one: one for the artist and then another one for the recordings?

Because it is vastly easier and considerably faster this way.

petitminion

mayhem How many GBs would you like?

haha well not too much for sure ^^ And do you have performance statistics about using typesense ?

mayhem

petitminion And do you have performance statistics about using typesense ?

Considerably faster than postgres trigrams (or somesuch) for large data sets. The real question is: Is the funkwhale data we're going to throw at it in the right range of performance?
MetaBrainz needs fuzzy search with 20M+ recordings. Funkwhale will never need that, so could we set a performance goal of max 500,000 tracks in a pod and ensure that it still performs well?
I think this afternoon I will try making a PG table with 500k rows and see if we can get fuzzy text search to work with it. If that works, then we're free and clear for the FW solution.

But first lunch, now that I am back in the country of all the tasty noms. 🙂

mayhem

Oh, also to be clear, we're not just trying to find differences at the end of a string as in my example above. The differences could appear anywhere in the string. For instance, if we have:

u2wherethestreetshavenoname

and we search for:

u2wherestreetshavenoname

It should still find the track with the complete name (that is not missing the article).
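
A plain Levenshtein distance handles differences anywhere in the string, as this sketch shows. The `levenshtein` and `search` functions and the `max_distance` value are written for this example only; the search is a brute-force scan over all rows, so it demonstrates the matching rule rather than a fast index.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def search(query, rows, max_distance=3):
    """Return rows within max_distance edits of the query, closest first."""
    scored = sorted((levenshtein(query, row), row) for row in rows)
    return [row for dist, row in scored if dist <= max_distance]


rows = [
    "portisheadstrangers",
    "u2wherethestreetshavenoname",
    "queenwewillrockyou",
]

hits = search("u2wherestreetshavenoname", rows)
```

The query without the article is three deletions away from the stored key, so it is found with `max_distance=3`, while the other two rows are much further away.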