#metabrainz

      • reosarevok
        bitmap, yvanzo: we should try to find a solution for https://tickets.metabrainz.org/browse/MBS-2278 one of these days
      • BrainzBot
        MBS-2278: Sorting collections by artist should use the "sort name" of the artist
      • reosarevok
        (that's a lot of votes)
      • "using joined sort-names of the artists from the artist credit" is probably the best we can do since it makes no sense to add sort names to artist credits as a deeper concept
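As a rough illustration of the "joined sort-names" idea for MBS-2278, here is a minimal Python sketch. The `CreditedArtist` shape, the `artist_credit_sort_key` helper and the release titles are all hypothetical and do not reflect the MusicBrainz schema or server code; the sketch only shows how a sort key could be built by concatenating each credited artist's sort name with its join phrase.

```python
from dataclasses import dataclass


@dataclass
class CreditedArtist:
    """One entry of an artist credit (hypothetical shape, not the real schema)."""
    credited_name: str      # name as displayed on the release
    sort_name: str          # sort name of the underlying artist
    join_phrase: str = ""   # text following this artist, e.g. " & "


def artist_credit_sort_key(credit: list[CreditedArtist]) -> str:
    """Join the artists' sort names (keeping join phrases) and case-fold."""
    joined = "".join(a.sort_name + a.join_phrase for a in credit)
    return joined.casefold()


# Hypothetical collection: release title -> artist credit.
releases = {
    "Release A": [CreditedArtist("The Beatles", "Beatles, The")],
    "Release B": [
        CreditedArtist("David Bowie", "Bowie, David", " & "),
        CreditedArtist("Queen", "Queen"),
    ],
    "Release C": [CreditedArtist("ABBA", "ABBA")],
}

ordered = sorted(releases, key=lambda title: artist_credit_sort_key(releases[title]))
print(ordered)  # ['Release C', 'Release A', 'Release B']
```

Sorting by the displayed names instead would put "The Beatles" after "David Bowie", which is the behaviour the ticket complains about.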
      • yvanzo
        reosarevok, bitmap: On a related issue, I’m looking at replacing paging with react-window for collections.
      • reosarevok
        yvanzo: is that endless scrolling? Tell me it isn't endless scrolling
      • (that'd not solve any of the main issues of pagination, such as making things findable in one go, but would also make it impossible to just load multiple pages in different tabs to quickly check them all)
      • yvanzo
        it doesn't need to be endless scrolling (and collections are not endless either), and it can make things findable in one go with anchors.
      • reosarevok
        Findable meaning "ctrl + f for a name", I meant
      • ruaok
        moooin
      • Gazooo joined the channel
      • Matthew_ joined the channel
      • Matthew_
        Hello! I've been trying to build a full index in Solr, with limited success. I'm using mb-solr@v3.0 and sir@schema-24. I'm consistently seeing failures in the Solr log for artists and recordings. In the case of artists, I see the same error: "unknown field 'primary_alias'"; in the case of recordings, I consistently see "missing required field: name". What's odd is that if I check a failed artist on search.musicbrainz.org, it's present so
      • you've clearly successfully indexed it but it fails for me?
      • ruaok
        morning Matthew_. yvanzo is the particular person you need to speak with. most of the rest of us only have cursory knowledge of how the search stuff works, let alone how to debug it.
      • Matthew_
        Thanks ruaok.
      • pristine__
        ruaok: hi
      • yvanzo
        Hi Matthew_, I double-checked and our latest docker images seem to match these versions. Can you check mb-solr submodules are in sync in your clone?
      • Matthew_
        Thanks, yvanzo. Will take a look. On a related note, how long typically does a full reindex take for you and what is the server spec?
      • yvanzo
        You can compare mbsssss and mmd-schema submodules with https://github.com/metabrainz/mb-solr/tree/v3.0
      • Matthew_: I don’t know. SIR is still beta, and there is currently an issue where SIR doesn’t return even though reindexing is complete.
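The submodule comparison suggested above could be scripted. The following is a minimal Python sketch that parses the text printed by `git submodule status` and flags submodules that are not at an expected commit. The helper names, the sample output and the placeholder hashes are hypothetical; the real pinned commits have to be read from the v3.0 tree linked above.

```python
def parse_submodule_status(output: str) -> dict[str, tuple[str, str]]:
    """Parse `git submodule status` output into {path: (flag, sha)}.

    git prefixes each line with ' ' (in sync), '-' (not initialized),
    '+' (checked-out commit differs from the superproject) or 'U' (conflict).
    """
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        flag = line[0]
        if flag in "+-U":
            rest = line[1:]
        else:
            flag, rest = " ", line
        sha, path = rest.split()[:2]
        result[path] = (flag, sha)
    return result


def out_of_sync(output: str, expected: dict[str, str]) -> list[str]:
    """Return paths whose flag is not ' ' or whose sha lacks the expected prefix."""
    status = parse_submodule_status(output)
    problems = []
    for path, prefix in expected.items():
        flag, sha = status.get(path, ("?", ""))
        if flag != " " or not sha.startswith(prefix):
            problems.append(path)
    return problems


# Hypothetical output with placeholder hashes (not real commits).
sample = (
    " " + "a" * 40 + " mbsssss (heads/master)\n"
    "+" + "b" * 40 + " mmd-schema (heads/master)\n"
)
print(out_of_sync(sample, {"mbsssss": "aaaaaaa", "mmd-schema": "bbbbbbb"}))
```

In a real clone the input would come from running `git submodule status` in the mb-solr checkout.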
      • Matthew_
        Thanks, yvanzo. So how do you currently rebuild your indices?
      • yvanzo
        I never had to run a full reindex (or any reindex) on prod servers for now, still learning from previous devs. :)
      • Matthew_
        Fair enough! That might be a problem for us though. We require the ability to originate a full index on our slave implementation for the purposes of initial setup and also disaster recovery. Anyhow, I'll double check the dependencies / versions...
      • yvanzo
        Matthew_: I successfully built search indexes locally on sample data without issue about primary alias, will make further tests with full data.
      • Matthew_
        Thanks yvanzo. From what I've seen, the error appears pretty quickly once artists start being indexed. However, it's probably user error on my part and the deps are out of kilter. Will let you know for sure...
      • ruaok
        hi pristine__ !
      • I'm feeling a lot better (though not 100%) so I am slowly catching up.
      • pristine__
        that's good to hear :)
      • ruaok: there is good news.
      • ruaok
        I could use some of those. :)
      • CatQuest
        [07:54] <yvanzo> reosarevok, bitmap: On a related issue, I’m looking at replacing paging with react-window for collections.
      • NO please
      • pristine__
        We were able to reduce lookup time from 12 hours to around 2 hours.
      • I have made a few changes, the script is running on leader now, I will forward the HTML files to you in some time.
      • ruaok
        lookup meaning running the model?
      • pristine__
        You can compare the lookup time of the script I sent you yesterday and of the one I will send you.
      • no
      • ruaok
        training vs running.
      • pristine__
        ruaok: "look up": after predicting recommendations, we get the recording ids of the recommended songs, then we look up the relevant information (track name, artist name, etc.) corresponding to those recording ids.
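A minimal Python sketch of the "look up" step as described above: given the recording ids produced by the prediction step, fetch track and artist names from an in-memory mapping. The function name, the field names and the sample ids are hypothetical, and this is not necessarily how the actual script performs the lookup or how the speed-up was achieved.

```python
def lookup_metadata(recording_ids, metadata_by_id):
    """Return (found rows, missing ids) for the given recording ids."""
    found, missing = [], []
    for rid in recording_ids:
        row = metadata_by_id.get(rid)  # one dictionary lookup per id
        if row is None:
            missing.append(rid)
        else:
            found.append({"recording_id": rid, **row})
    return found, missing


# Hypothetical metadata keyed by recording id (placeholder ids).
metadata_by_id = {
    "rec-1": {"track_name": "Track One", "artist_name": "Artist A"},
    "rec-2": {"track_name": "Track Two", "artist_name": "Artist B"},
}

found, missing = lookup_metadata(["rec-2", "rec-9", "rec-1"], metadata_by_id)
print(found)
print(missing)
```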
      • ruaok
        oh, so the 10 hours didn't involve models at all?
      • pristine__
        no. 2 hours to train the model plus 12 hours to predict tracks and lookup
      • in these two hours, we are training 8 models.
      • and computing each model's RMSE
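For reference, RMSE is the square root of the mean squared difference between predicted and actual values, and the model with the lowest RMSE on held-out data is usually preferred. The Python sketch below computes it for two hypothetical candidate models on made-up numbers; the model names and values are illustrative only and are not taken from the actual training run.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equally long, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


# Made-up validation targets and predictions from two hypothetical models.
actual = [3.0, 1.0, 4.0, 2.0]
candidates = {
    "model-a": [2.5, 1.5, 3.5, 2.5],
    "model-b": [3.0, 1.0, 2.0, 2.0],
}

scores = {name: rmse(preds, actual) for name, preds in candidates.items()}
best = min(scores, key=scores.get)  # lowest error wins
print(scores, best)
```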
      • ruaok
        yep
      • ok, so here will be the proof of your work.
      • when I get the chance to look at the HTML files I will want to understand what "12 hours to predict tracks and lookup" means.
      • because those are two very distinct steps and we should know which step takes how long
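One way to report the two steps separately is to wrap each in its own timer. This is a minimal Python sketch; `predict_tracks` and `lookup_tracks` are hypothetical stand-ins for the real steps, and `timed` is an illustrative helper rather than anything from the actual scripts.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step, timings):
    """Record the wall-clock duration of the enclosed block under `step`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start


def predict_tracks():
    """Stand-in for the prediction step: returns placeholder recording ids."""
    return [f"rec-{i}" for i in range(1000)]


def lookup_tracks(recording_ids):
    """Stand-in for the lookup step: attaches placeholder metadata."""
    return [{"recording_id": rid, "track_name": rid.upper()} for rid in recording_ids]


timings = {}
with timed("predict", timings):
    ids = predict_tracks()
with timed("lookup", timings):
    rows = lookup_tracks(ids)

for step, seconds in timings.items():
    print(f"{step}: {seconds:.4f}s")
```

Writing both durations into the report would show which of the two steps accounts for most of the run time.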
      • pristine__
        did you read the HTML?
      • ruaok
        not yet, but I remain hopeful. :)
      • let me do that right now.
      • pristine__
        yeah, they will help you to understand (I hope)
      • okay
      • ruaok
        yes, looking much better. nicely done.
      • but, there are still some things to improve.
      • pristine__
        Okay
      • ruaok
        on the model training page... you have roughly three sections of data.
      • model info, explanations for model info and the table of models generated.
      • the most useful things are at the bottom of the page.
      • reference / explanation which we will need once or at least infrequently is near the top.
      • the table bottom/middle.
      • I think the last 5 lines should be near the top, perhaps in a concise table as well.
      • along with "Preprocessing of playcounts-dataframe takes 105.73s. Of the preprocessed data, approx. 66% (15081669) listens have been used as training data, 17% (3773882) listens have been used as validation data and 17% (3772169) listens have been used as test data. After preprocessing, training phase starts. From the models trained, the best one is selected to generate recommendations."
      • then the table of model trainings and finally the reference stuff.
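The split figures quoted above can be checked with a few lines of Python. The listen counts are the ones from the quoted report text; everything else is just arithmetic.

```python
# Listen counts as quoted from the model training report.
train, validation, test = 15_081_669, 3_773_882, 3_772_169
total = train + validation + test

shares = {
    "training": train / total,
    "validation": validation / total,
    "test": test / total,
}

print(f"total listens: {total}")
for name, share in shares.items():
    print(f"{name}: {100 * share:.2f}%")
```

The total is 22,627,720 listens, and the shares come out at about 66.65%, 16.68% and 16.67%, close to the approximate 66% / 17% / 17% given in the report.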
      • Matthew_
        yvanzo. I can confirm that I'm running mb-solr@v3.0, mbsssss@5e6153f, mmd-schema@40e2115, sir@d28c977
      • pristine__
        yeah, right.
      • ruaok
        this way, the most important stuff is near the top where we wish to see it and the less important stuff as we go down.
      • make sense?
      • Matthew_
        (Solr version 7.7.1)
      • ruaok
        but, it looks like you have all info relevant to this page now, which is good.
      • pristine__
        yeah. I put reference stuff at the top because it will be needed to understand the table. but yeah, makes sense :)
      • It is like a story, so I tried to put everything in order of the script.
      • ruaok
        ah, I see. not a bad approach to things, really.
      • yvanzo
        Matthew_: https://github.com/metabrainz/mb-solr/blob/v3.0... is Solr 7.5.0, but I don’t think it is related to your issue.
      • ruaok
        but we and our community are the target audience of the script and I know how we're going to look at it time and time again.
      • Matthew_
        Aye. It should be backwards compatible with a point release. We don't use the docker file - we build RPMs for deployment.
      • pristine__
        sure. I will bring important things to the top :)
      • ruaok
        the data collection page is great, btw. however, it gives us stats about a model that was trained, but no model ID.
      • pristine__
        there is.
      • " listenbrainz-recommendation-model-bf1155df-b926-45b9-a5dc-69938811dd73"
      • ruaok
        ok, I still haven't found it.
      • pristine__
        something like this.
      • ruaok
        I'm talking about the "data collection" page. I don't see that model ID on that page.
      • pristine__
        because at that time
      • we have not trained the model
      • we are just collecting data to be able to preprocess it :)
      • ruaok
        ohh, I see.
      • but, I can look at the two pages and not correlate them.
      • pristine__
        The three HTMLs correspond to three scripts that we use in the whole process. and they are in that order. there is a link at the bottom to go to the next.
      • did you read about the playcounts-df in the "data collection" HTML?
      • I mean playcounts-dataframe
      • ruaok
        hmm, ok.
      • but I see a problem.
      • pristine__
        yeah?
      • ruaok
        you're using dates to emit filenames.
      • there will be multiple runs on the same day.
      • and I suppose that the reports will be sufficiently linked if the reader can go bidirectionally.
      • pristine__
        yeah. I have used dates to name the HTML files; it should be changed.
      • ruaok
        back and forward, ya?
      • you might consider generating a UUID for a "run"
      • pristine__
        was thinking the same. thanks :)
      • ruaok
        data-collection-C98E3B93-EC03-482D-B4EC-31EDF86AB58E.html
      • recommendations-C98E3B93-EC03-482D-B4EC-31EDF86AB58E.html
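A minimal Python sketch of the per-run UUID idea, assuming one random UUID is generated per run and shared by all report files of that run. The "model-training" stem is an assumption (only the other two names appear in the examples above), and `neighbours` is a hypothetical helper for the back/forward links between reports.

```python
import uuid

# Reports in pipeline order. "model-training" is an assumed file stem.
REPORT_STEMS = ["data-collection", "model-training", "recommendations"]


def new_run_id() -> str:
    """Random UUID for one run, upper-cased to match the example filenames."""
    return str(uuid.uuid4()).upper()


def report_filenames(run_id: str) -> list[str]:
    """All report filenames for a run, in pipeline order."""
    return [f"{stem}-{run_id}.html" for stem in REPORT_STEMS]


def neighbours(filenames: list[str], index: int):
    """Return (previous, next) filenames for back/forward links, or None at the ends."""
    prev_name = filenames[index - 1] if index > 0 else None
    next_name = filenames[index + 1] if index + 1 < len(filenames) else None
    return prev_name, next_name


run_id = new_run_id()
files = report_filenames(run_id)
for i, name in enumerate(files):
    print(name, neighbours(files, i))
```

Because the id is random rather than date-based, several runs on the same day produce distinct file sets, and each report can link to its siblings by reusing the same id.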