#metabrainz

      • v6lur has quit
      • 2022-04-06 09654, 2022

      • dseomn_ joined the channel
      • 2022-04-06 09647, 2022

      • dseomn has quit
      • 2022-04-06 09647, 2022

      • dseomn_ is now known as dseomn
      • 2022-04-06 09638, 2022

      • saumon has quit
      • 2022-04-06 09624, 2022

      • saumon joined the channel
      • 2022-04-06 09658, 2022

      • adhawkins_ joined the channel
      • 2022-04-06 09623, 2022

      • adhawkins has quit
      • 2022-04-06 09639, 2022

      • adhawkins_ is now known as adhawkins
      • 2022-04-06 09641, 2022

      • Shubh joined the channel
      • 2022-04-06 09622, 2022

      • srinathkp joined the channel
      • 2022-04-06 09616, 2022

      • srinathkp has quit
      • 2022-04-06 09656, 2022

      • Xianyi joined the channel
      • 2022-04-06 09636, 2022

      • q3lont joined the channel
      • 2022-04-06 09645, 2022

      • BrainzGit
        [bookbrainz-site] 14tr1ten opened pull request #833 (03master…sort-options): Fix(languages): Sort language options after fast filter https://github.com/metabrainz/bookbrainz-site/pul…
      • 2022-04-06 09624, 2022

      • reosarevok
        yvanzo, bitmap: I think user-tags returning all + tags returning only upvoted makes sense, at the very least by default
      • 2022-04-06 09602, 2022

      • reosarevok
        We could add a param to request all tags inc. downvoted ones if someone actually requests it (maybe the app wants to also display downvoted tags so you can vote on them via the API or something)
      • 2022-04-06 09604, 2022

      • akshaaatt
        Hi aerozol! Thanks for the share. Will look into it :)
      • 2022-04-06 09656, 2022

      • odnes joined the channel
      • 2022-04-06 09643, 2022

      • q3lont has quit
      • 2022-04-06 09607, 2022

      • outsidecontext
        yvanzo, bitmap: the bug with user genres that was supposed to be fixed was about not returning genres you yourself had downvoted.
      • 2022-04-06 09647, 2022

      • outsidecontext
        the problem was that if you upvote a genre it counts as "your genre", but the same happened for genres you downvoted
      • 2022-04-06 09603, 2022

      • outsidecontext
        maybe this needs to be checked again if it is working as expected. your upvoted genres should all be included, even if downvoted by others
      • 2022-04-06 09631, 2022

      • outsidecontext
        but looking at the patch at https://github.com/metabrainz/musicbrainz-server/… I think it is doing the right thing
      • 2022-04-06 09640, 2022

      • outsidecontext
        also about downvoted tags (with zero or lower count) in the WS: picard doesn't need them and filters them out. but I would suggest not to change WS behavior here, some WS users might make use of it
      • 2022-04-06 09640, 2022

      • q3lont joined the channel
      • 2022-04-06 09625, 2022

      • DjSlash has quit
      • 2022-04-06 09658, 2022

      • cuanim joined the channel
      • 2022-04-06 09658, 2022

      • cuanim has quit
      • 2022-04-06 09658, 2022

      • cuanim joined the channel
      • 2022-04-06 09611, 2022

      • cuanim has quit
      • 2022-04-06 09650, 2022

      • mayhem
        moooin!
      • 2022-04-06 09655, 2022

      • mayhem wonders about https://readyset.io/blog/introducing-readyset
      • 2022-04-06 09602, 2022

      • mayhem
        sounds like magic
      • 2022-04-06 09632, 2022

      • akshaaatt
        moin!
      • 2022-04-06 09640, 2022

      • lucifer
        morning
      • 2022-04-06 09654, 2022

      • lucifer
        mayhem: that sounds like a generalization of cont. agg to me
      • 2022-04-06 09622, 2022

      • mayhem
        yeah, and we know how well magic like that works.
      • 2022-04-06 09623, 2022

      • lucifer
        hehe indeed. but our data wasn't truly time series either so that contributed to that bad experience to some extent as well.
      • 2022-04-06 09633, 2022

      • lucifer
        mayhem: when you have looked at the MLHD proposal, let me know i wanted to discuss some points about it.
      • 2022-04-06 09645, 2022

      • lucifer
        alastairp: i have updated https://github.com/metabrainz/critiquebrainz/pull… as well with tests.
      • 2022-04-06 09606, 2022

      • legoktm[m] has quit
      • 2022-04-06 09624, 2022

      • mayhem
        lucifer: I have. let me pull it up.
      • 2022-04-06 09634, 2022

      • lucifer
        ah cool.
      • 2022-04-06 09617, 2022

      • mayhem
        I love the flowchart in particular. I thought those were BS in like 1986, but someone still teaches that nonsense. :)
      • 2022-04-06 09652, 2022

      • mayhem
        there are some minor mistakes in the schema diagram, but not sure we care that much.
      • 2022-04-06 09609, 2022

      • lucifer
        yes makes sense. iiuc the proposal suggested this timeline: 1) experiment in python 2) production code in python 3) production code in spark. i don't think 2 is worth it. i'd suggest do 1, then do experiments in spark and then do 3. thoughts?
      • 2022-04-06 09655, 2022

      • mayhem
        agreed. but I fear that Jupyter -> spark is going to be a rather large step
      • 2022-04-06 09646, 2022

      • mayhem
        and this project proposal doesn't really cover the scope of my original hopes. I had hoped that we would do this project, eval the results and then make a decision: does this work? Or do we need to lookup the metadata for each and then run that through the mapping.
      • 2022-04-06 09609, 2022

      • mayhem
        I don't want to do this project blindly -- we need to take the simple steps and then see where we are.
      • 2022-04-06 09644, 2022

      • lucifer
        yes there will be a leap. we can set up a dev environment with spark to help and assist but unsure how we can help in other ways.
      • 2022-04-06 09612, 2022

      • lucifer
        right so how do we evaluate the results?
      • 2022-04-06 09647, 2022

      • mayhem
        I *think* the most pernicious problem that is inherent in the data set is the conflated artists and the mapping of metadata to MBIDs.
      • 2022-04-06 09610, 2022

      • mayhem
        I remember the artist "muse" as being singled out in this case -- I think param found some of these errors.
      • 2022-04-06 09643, 2022

      • DjSlash joined the channel
      • 2022-04-06 09651, 2022

      • mayhem
        thinking out loud, I think we need to pick a chunk or two and see if we can find users who listen to conflated artists. and then see how it deals with those users with those particular issues.
      • 2022-04-06 09627, 2022

      • lucifer
        uh sorry, i do not understand what you meant by conflated artists?
      • 2022-04-06 09654, 2022

      • mayhem
        the mapping that last.fm used considered artist and recording separately, I think.
      • 2022-04-06 09635, 2022

      • mayhem
        so, "muse" "dopest song ever", could result in a muse artist MBID who do not perform "dopest song ever".
      • 2022-04-06 09647, 2022
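
A minimal sketch of the "conflated artist" check described above, assuming a lookup from recording MBID to the artist MBIDs credited on it. The `RECORDING_ARTISTS` table, the placeholder MBID strings, and the `is_conflated` helper are all hypothetical stand-ins for illustration, not MLHD or MusicBrainz code.

```python
# Hypothetical stand-in for a MusicBrainz lookup: recording MBID -> set of
# artist MBIDs credited on that recording. Real data would come from MB.
RECORDING_ARTISTS = {
    "rec-1": {"artist-muse-uk"},
    "rec-2": {"artist-muse-other"},
}


def is_conflated(artist_mbid, recording_mbid, recording_artists=RECORDING_ARTISTS):
    """Return True when the artist MBID is not credited on the recording."""
    credited = recording_artists.get(recording_mbid)
    if credited is None:
        # Unknown recording: we cannot tell, so do not flag it.
        return False
    return artist_mbid not in credited


# Each listen is an (artist MBID, recording MBID) pair, mapped separately.
listens = [
    ("artist-muse-uk", "rec-1"),    # consistent pair
    ("artist-muse-uk", "rec-2"),    # artist and recording disagree
    ("artist-muse-uk", "rec-404"),  # recording unknown
]
flagged = [pair for pair in listens if is_conflated(*pair)]
```

Counting flagged pairs per user would be one way to find the users who listen to conflated artists.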

      • lucifer
        ah makes sense. thanks!
      • 2022-04-06 09621, 2022

      • mayhem
        I distrust the artist data, but I am not sure we can trust the recording data either. I suspect that we will nearly certainly find we can't trust that either.
      • 2022-04-06 09639, 2022

      • lucifer
        btw are there any existing research papers on the MLHD dataset or other publicly available resources on how someone used it?
      • 2022-04-06 09657, 2022

      • mayhem
        alastairp would know.
      • 2022-04-06 09616, 2022

      • mayhem
        now that I am thinking about it more, I think we may want to re-consider the goal of this project.
      • 2022-04-06 09634, 2022

      • mayhem
        I think getting to a working solution in spark at the end of summer is too lofty of a coal. perhaps.
      • 2022-04-06 09641, 2022

      • mayhem
        goal. not coal.
      • 2022-04-06 09624, 2022

      • mayhem
        so, rather than a Jupyter version, I would love to see a functioning python version that just uses PG. get to the first eval stage as fast as possible so we can do the eval.
      • 2022-04-06 09624, 2022

      • lucifer
        do we have the recording name and artist name available so could we run it through our mbid mapper? (i do not have the dataset at hand currently to check and confirm)
      • 2022-04-06 09642, 2022

      • mayhem
        no, the dataset only ever has MBIDs. no text at all.
      • 2022-04-06 09653, 2022

      • lucifer
        oh :(
      • 2022-04-06 09634, 2022

      • mayhem
        yeah. so two stages become apparent:
      • 2022-04-06 09605, 2022

      • mayhem
        Stage 1: What is described in the doc right now, but with no spark parts, only python/PG.
      • 2022-04-06 09619, 2022

      • mayhem
        Stage 2a: If that produces good results, move to spark.
      • 2022-04-06 09644, 2022

      • mayhem
        Stage 2b: If that does not produce good results, implement MBID mapping lookup in python.
      • 2022-04-06 09612, 2022

      • lucifer
        MBID mapping lookup on what? since we don't have text
      • 2022-04-06 09659, 2022

      • mayhem
        just look up the artist MBID and recording MBID in MB independently to get text.
      • 2022-04-06 09611, 2022

      • lucifer
        ah ok, makes sense
      • 2022-04-06 09612, 2022

      • mayhem
        then take that text and run it through the mapper.
      • 2022-04-06 09629, 2022
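
A rough sketch of the Stage 2b idea as described: look up the artist MBID and recording MBID independently to get text, then run that text through a mapper. The dictionaries below are hypothetical in-memory stand-ins for MusicBrainz and for the MBID mapper. The real mapper's interface is not shown in this log, so `run_mapper` is an assumption.

```python
# Hypothetical stand-ins for MusicBrainz lookups (MBID -> name).
ARTIST_NAMES = {"artist-a": "Muse"}
RECORDING_NAMES = {"rec-x": "Dopest Song Ever"}

# Hypothetical stand-in for the mapper index:
# (artist name, recording name) -> (artist MBID, recording MBID).
MAPPER_INDEX = {("muse", "dopest song ever"): ("artist-b", "rec-x")}


def lookup_text(artist_mbid, recording_mbid):
    """Resolve both MBIDs to text independently; None if either is unknown."""
    artist = ARTIST_NAMES.get(artist_mbid)
    recording = RECORDING_NAMES.get(recording_mbid)
    if artist is None or recording is None:
        return None
    return artist, recording


def run_mapper(artist_name, recording_name):
    """Toy mapper: case-insensitive exact match on the (artist, recording) pair."""
    return MAPPER_INDEX.get((artist_name.lower(), recording_name.lower()))


def remap(artist_mbid, recording_mbid):
    """MBIDs -> text -> mapper result, or None when any step fails."""
    text = lookup_text(artist_mbid, recording_mbid)
    if text is None:
        return None
    return run_mapper(*text)


result = remap("artist-a", "rec-x")
```

Here the mapper returns a different artist MBID than the input, which is the kind of disagreement the evaluation would look for.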

      • mayhem
        I am almost certain we'll need to go do this step.
      • 2022-04-06 09645, 2022

      • mayhem
        and it makes no sense for us to write spark code until we are certain of the approach of the project.
      • 2022-04-06 09601, 2022

      • alastairp
        hi
      • 2022-04-06 09604, 2022

      • mayhem
        moin
      • 2022-04-06 09618, 2022

      • mayhem
        curious to see what alastairp thinks about this convo.
      • 2022-04-06 09619, 2022

      • lucifer
        i think that would result in just running the mbids through canonical recording redirect table but yes for bad data results may differ.
      • 2022-04-06 09624, 2022

      • alastairp
        yeah, just reading it now
      • 2022-04-06 09655, 2022

      • mayhem
        lucifer: it would be really good to see both in action.
      • 2022-04-06 09605, 2022

      • lucifer
        sure sounds good
      • 2022-04-06 09615, 2022

      • commet joined the channel
      • 2022-04-06 09616, 2022

      • mayhem
        in fact, I wonder if I could find my own user ID in MLHD and then evaluate my own data through this.
      • 2022-04-06 09652, 2022

      • mayhem
        that would be relevant at least, but I am not sure I listen to enough conflated artists to suss it out.
      • 2022-04-06 09600, 2022

      • mayhem
        and how would I go about finding my user id? lol.
      • 2022-04-06 09610, 2022

      • commet
        hello
      • 2022-04-06 09623, 2022

      • reosarevok
        Is http://tickets.metabrainz.org/ failing for everyone or just me?
      • 2022-04-06 09625, 2022

      • mayhem
        hello commet
      • 2022-04-06 09633, 2022

      • commet
        what's the discussion?
      • 2022-04-06 09633, 2022

      • mayhem
        just you reosarevok
      • 2022-04-06 09633, 2022

      • lucifer
        the thing i see working against that plan is the huge amount of data so we may be unable to process it in python in a reasonable amount of time.
      • 2022-04-06 09658, 2022

      • mayhem
        we can only realistically ever work on one data file. perhaps two.
      • 2022-04-06 09659, 2022

      • reosarevok
        Hmm, ok, I'll restart the router :)
      • 2022-04-06 09616, 2022

      • mayhem
        processing the whole data set is simply not feasible in the summer.
      • 2022-04-06 09640, 2022

      • alastairp
        processing to clean it up, or processing to build some recommendation tool?
      • 2022-04-06 09642, 2022

      • mayhem
        if I could learn the right algorithm to use to fix the dataset as the sole result of this project, I would be pretty happy, honestly.
      • 2022-04-06 09647, 2022

      • lucifer
        yes so whatever results we get won't be complete. i guess if we choose files randomly we can hope to get a reasonably good sample.
      • 2022-04-06 09650, 2022
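
A small sketch of choosing data files at random, assuming the dataset is a list of file names. The file names and the `sample_files` helper are hypothetical. A fixed seed keeps the sample reproducible so the same files can be re-evaluated later.

```python
import random


def sample_files(filenames, k, seed=0):
    """Pick k distinct files with a seeded RNG so the choice is repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(filenames), k))


# Hypothetical file names; the real dataset layout may differ.
all_files = [f"chunk_{i:03d}.tar" for i in range(100)]
chosen = sample_files(all_files, k=2, seed=42)
```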

      • mayhem
        the former, alastairp
      • 2022-04-06 09626, 2022

      • mayhem
        I suppose we can ask gabriel, the creator of the dataset, if he knows a particular corner of it better.
      • 2022-04-06 09634, 2022

      • lucifer
        commet: we are discussing the MLHD dataset, ways to validate and process it, and so on.
      • 2022-04-06 09638, 2022

      • mayhem
        and help us with the eval of the results.
      • 2022-04-06 09628, 2022

      • commet
        27 million time stamped logs, wow
      • 2022-04-06 09632, 2022

      • commet
        had to look up what that data set was
      • 2022-04-06 09634, 2022

      • alastairp
        let me take a look at the stuff that he published
      • 2022-04-06 09645, 2022

      • mayhem
        27 *billion* no?
      • 2022-04-06 09651, 2022

      • commet
        billion, yes
      • 2022-04-06 09655, 2022

      • commet
        my bad
      • 2022-04-06 09600, 2022

      • commet
        misread that
      • 2022-04-06 09610, 2022

      • mayhem
        27 million and it wouldn't be worth doing as a project.
      • 2022-04-06 09651, 2022

      • commet
        sounds like a fun project
      • 2022-04-06 09632, 2022

      • alastairp
        here's the list of citations of the dataset, not sure exactly what part of it each one uses: https://scholar.google.ca/scholar?oi=bibs&hl=…
      • 2022-04-06 09637, 2022

      • alastairp
        we could do a very quick review of them
      • 2022-04-06 09648, 2022

      • lucifer
        oh the first one is interesting!
      • 2022-04-06 09659, 2022

      • lucifer
        "The music streaming sessions dataset" by spotify
      • 2022-04-06 09632, 2022

      • lucifer
        mayhem: unrelated to current topic but from the above, "Each session is defined to be a period of listening with no more than 60 seconds of inactivity between consecutive tracks." heh we were trying 30mins last time we were working on recording similarity.
      • 2022-04-06 09653, 2022

      • mayhem
        oh, interesting.
      • 2022-04-06 09604, 2022

      • mayhem
        our definition was very different.
      • 2022-04-06 09627, 2022

      • mayhem
        my definition was a window of activity. their focus is on duration of inactivity
      • 2022-04-06 09637, 2022

      • mayhem
        that's a good hint. :)
      • 2022-04-06 09643, 2022

      • lucifer
        ah indeed, makes sense
      • 2022-04-06 09646, 2022

      • reosarevok
        Sigh
      • 2022-04-06 09653, 2022

      • reosarevok
      • 2022-04-06 09655, 2022

      • reosarevok
        Por qué
      • 2022-04-06 09659, 2022

      • mayhem
        clearly I will need track lengths to use in the similarity stuff going forward.
      • 2022-04-06 09609, 2022
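
A minimal sketch of the inactivity-based session definition quoted above (no more than 60 seconds of inactivity between consecutive tracks). It assumes each listen carries a start timestamp and a track length in seconds, which is why track lengths are needed. The `split_sessions` helper and the sample data are hypothetical.

```python
def split_sessions(listens, max_gap=60):
    """Split listens into sessions by inactivity.

    listens: list of (start_ts, duration_s) tuples sorted by start_ts.
    A new session starts when the gap between the end of the previous
    track and the start of the next one exceeds max_gap seconds.
    """
    sessions = []
    current = []
    prev_end = None
    for start, duration in listens:
        if prev_end is not None and start - prev_end > max_gap:
            sessions.append(current)
            current = []
        current.append((start, duration))
        prev_end = start + duration
    if current:
        sessions.append(current)
    return sessions


# Two tracks back to back, a long pause, then two more tracks.
listens = [(0, 200), (210, 180), (1000, 240), (1250, 200)]
sessions = split_sessions(listens)
```

Without track lengths, only the gap between start times is available, which is closer to the fixed window-of-activity approach.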

      • commet
        the project that imaged the center of the galaxy did everything with python, there's probably some good takeaways from that project for working with very large data sets
      • 2022-04-06 09612, 2022

      • mayhem
        reo do a traceroute, not ping
      • 2022-04-06 09614, 2022

      • reosarevok
        meb.org itself works just fine
      • 2022-04-06 09615, 2022

      • lucifer
        but we could never figure out why it failed in spark :(
      • 2022-04-06 09625, 2022

      • reosarevok
        mayhem: oh, will check
      • 2022-04-06 09603, 2022

      • lucifer
        this paper seems to only refer to it as other datasets available, checking other refs now
      • 2022-04-06 09613, 2022

      • mayhem
        lucifer: mostly because we didn't finish that project. now I got the similarity stuff stable in python, now we can consider a move to spark.
      • 2022-04-06 09652, 2022

      • lucifer
        yup makes sense
      • 2022-04-06 09603, 2022

      • reosarevok
      • 2022-04-06 09607, 2022

      • reosarevok
        mayhem: ^
      • 2022-04-06 09640, 2022

      • mayhem
        zas, atj : what do you make of this traceroute?
      • 2022-04-06 09655, 2022

      • alastairp
        lucifer: I think that the sessions dataset is using spotify data, right?
      • 2022-04-06 09612, 2022

      • alastairp
        they probably just cited mlhd in terms of saying "oh hey, here's another big dataset of music stuff"
      • 2022-04-06 09618, 2022

      • lucifer
        alastairp: yes, it's a dataset of spotify streams.
      • 2022-04-06 09621, 2022

      • lucifer
        right
      • 2022-04-06 09624, 2022

      • mayhem
        reosarevok: try an SSH tunnel.
      • 2022-04-06 09652, 2022

      • lucifer
        that's probably another dataset we could look at in the future.
      • 2022-04-06 09655, 2022

      • mayhem
        ssh -L 8080:tickets.metabrainz.org:80 wolf.metabrainz.org
      • 2022-04-06 09610, 2022

      • mayhem
      • 2022-04-06 09623, 2022

      • mayhem
        might need to redo for https, but still.