yvanzo, bitmap: I think user-tags returning all + tags returning only upvoted makes sense, at the very least by default
2022-04-06 09602, 2022
reosarevok
We could add a param to request all tags inc. downvoted ones if someone actually requests it (maybe the app wants to also display downvoted tags so you can vote on them via the API or something)
2022-04-06 09604, 2022
akshaaatt
Hi aerozol! Thanks for the share. Will look into it :)
2022-04-06 09656, 2022
odnes joined the channel
2022-04-06 09643, 2022
q3lont has quit
2022-04-06 09607, 2022
outsidecontext
yvanzo, bitmap: the fix for the user genres bug was supposed to stop returning genres you yourself had downvoted.
2022-04-06 09647, 2022
outsidecontext
the problem was that if you upvote a genre it counts as "your genre", but the same happened for genres you downvoted
2022-04-06 09603, 2022
outsidecontext
maybe this needs to be checked again if it is working as expected. your upvoted genres should all be included, even if downvoted by others
also about downvoted tags (with zero or lower count) in the WS: picard doesn't need them and filters them out. but I would suggest not to change WS behavior here, some WS users might make use of it
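A minimal sketch of the client-side filtering described above: drop tags whose count is zero or lower, with an opt-in switch to keep them. The dict keys `name` and `count`, the function name `visible_tags`, and the `include_downvoted` parameter are assumptions for illustration, not Picard's actual code or a confirmed WS parameter.

```python
def visible_tags(tags, include_downvoted=False):
    """Return tags, dropping those with count <= 0 unless asked to keep them.

    `tags` is assumed to be a list of dicts with hypothetical
    `name` and `count` keys.
    """
    if include_downvoted:
        return list(tags)
    return [tag for tag in tags if tag["count"] > 0]


# Toy data, not real WS output.
tags = [
    {"name": "rock", "count": 3},
    {"name": "jazz", "count": 0},
    {"name": "pop", "count": -1},
]
upvoted_only = visible_tags(tags)
everything = visible_tags(tags, include_downvoted=True)
```

Filtering on the client keeps the WS response unchanged, so other WS users that rely on the downvoted tags are not affected.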
2022-04-06 09640, 2022
q3lont joined the channel
2022-04-06 09625, 2022
DjSlash has quit
2022-04-06 09658, 2022
cuanim joined the channel
2022-04-06 09658, 2022
cuanim has quit
2022-04-06 09658, 2022
cuanim joined the channel
2022-04-06 09611, 2022
cuanim has quit
2022-04-06 09650, 2022
mayhem
moooin!
2022-04-06 09655, 2022
mayhem wonders about https://readyset.io/blog/introducing-readyset
2022-04-06 09602, 2022
mayhem
sounds like magic
2022-04-06 09632, 2022
akshaaatt
moin!
2022-04-06 09640, 2022
lucifer
morning
2022-04-06 09654, 2022
lucifer
mayhem: that sounds like a generalization of cont. agg to me
2022-04-06 09622, 2022
mayhem
yeah, and we know how well magic like that works.
2022-04-06 09623, 2022
lucifer
hehe indeed. but our data wasn't truly time series either so that contributed to that bad experience to some extent as well.
2022-04-06 09633, 2022
lucifer
mayhem: when you have looked at the MLHD proposal, let me know i wanted to discuss some points about it.
I love the flowchart in particular. I thought those were BS in like 1986, but someone still teaches that nonsense. :)
2022-04-06 09652, 2022
mayhem
there are some minor mistakes in the schema diagram, but not sure we care that much.
2022-04-06 09609, 2022
lucifer
yes makes sense. iiuc the proposal suggested this timeline: 1) experiment in python 2) production code in python 3) production code in spark. i don't think 2 is worth it. i'd suggest do 1, then do experiments in spark and then do 3. thoughts?
2022-04-06 09655, 2022
mayhem
agreed. but I fear that Jupyter -> spark is going to be a rather large step
2022-04-06 09646, 2022
mayhem
and this project proposal doesn't really cover the scope of my original hopes. I had hoped that we would do this project, eval the results and then make a decision: does this work? Or do we need to look up the metadata for each and then run that through the mapping.
2022-04-06 09609, 2022
mayhem
I don't want to do this project blindly -- we need to take the simple steps and then see where we are.
2022-04-06 09644, 2022
lucifer
yes there will be a leap. we can set up a dev environment with spark to help and assist but unsure how we can help in other ways.
2022-04-06 09612, 2022
lucifer
right so how do we evaluate the results?
2022-04-06 09647, 2022
mayhem
I *think* the most pernicious problem that is inherent in the data set is the conflated artists and the mapping of metadata to MBIDs.
2022-04-06 09610, 2022
mayhem
I remember the artist "muse" as being singled out in this case -- I think param found some of these errors.
2022-04-06 09643, 2022
DjSlash joined the channel
2022-04-06 09651, 2022
mayhem
thinking out loud, I think we need to pick a chunk or two and see if we can find users who listen to conflated artists. and then see how it deals with those users with those particular issues.
2022-04-06 09627, 2022
lucifer
uh sorry, i do not understand what you meant by conflated artists?
2022-04-06 09654, 2022
mayhem
the mapping that last.fm used considered artist and recording separately, I think.
2022-04-06 09635, 2022
mayhem
so, "muse" "dopest song ever" could result in an MBID for a muse artist who does not perform "dopest song ever".
2022-04-06 09647, 2022
lucifer
ah makes sense. thanks!
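A rough sketch of how such conflated rows could be spotted, assuming each listen carries an artist MBID and a recording MBID, and that a lookup from recording MBID to its credited artist MBIDs is available. The names `find_conflated` and `recording_artists`, and the placeholder IDs, are hypothetical.

```python
def find_conflated(listens, recording_artists):
    """Flag listens whose artist MBID is not credited on the recording.

    listens: iterable of (artist_mbid, recording_mbid) pairs.
    recording_artists: dict mapping recording MBID -> set of credited
        artist MBIDs (a hypothetical lookup, e.g. built from MB data).
    Recordings missing from the lookup are skipped rather than flagged.
    """
    flagged = []
    for artist_mbid, recording_mbid in listens:
        credited = recording_artists.get(recording_mbid)
        if credited is not None and artist_mbid not in credited:
            flagged.append((artist_mbid, recording_mbid))
    return flagged


# Placeholder IDs standing in for real MBIDs.
recording_artists = {"rec-1": {"muse-uk"}, "rec-2": {"muse-other"}}
listens = [("muse-uk", "rec-1"), ("muse-uk", "rec-2"), ("muse-uk", "rec-3")]
conflated = find_conflated(listens, recording_artists)
```

Users with many flagged rows would be candidates for the per-user evaluation discussed here.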
2022-04-06 09621, 2022
mayhem
I distrust the artist data, but I am not sure we can trust the recording data either. I suspect that we will nearly certainly find we can't trust that either.
2022-04-06 09639, 2022
lucifer
btw are there any existing research papers on the MLHD dataset or other publicly available resources on how someone used it?
2022-04-06 09657, 2022
mayhem
alastairp would know.
2022-04-06 09616, 2022
mayhem
now that I am thinking about it more, I think we may want to re-consider the goal of this project.
2022-04-06 09634, 2022
mayhem
I think getting to a working solution in spark at the end of summer is too lofty of a coal. perhaps.
2022-04-06 09641, 2022
mayhem
goal. not coal.
2022-04-06 09624, 2022
mayhem
so, rather than a Jupyter version, I would love to see a functioning python version that just uses PG. get to the first eval stage as fast as possible so we can do the eval.
2022-04-06 09624, 2022
lucifer
do we have the recording name and artist name available so could we run it through our mbid mapper? (i do not have the dataset at hand currently to check and confirm)
2022-04-06 09642, 2022
mayhem
no, the dataset only ever has MBIDs. no text at all.
2022-04-06 09653, 2022
lucifer
oh :(
2022-04-06 09634, 2022
mayhem
yeah. so two stages become apparent:
2022-04-06 09605, 2022
mayhem
Stage 1: What is described in the doc right now, but with no spark parts, only python/PG.
2022-04-06 09619, 2022
mayhem
Stage 2a: If that produces good results, move to spark.
2022-04-06 09644, 2022
mayhem
Stage 2b: If that does not produce good results, implement MBID mapping lookup in python.
2022-04-06 09612, 2022
lucifer
MBID mapping lookup on what? since we don't have text
2022-04-06 09659, 2022
mayhem
just look up the artist MBID and recording MBID in MB independently to get text.
2022-04-06 09611, 2022
lucifer
ah ok, makes sense
2022-04-06 09612, 2022
mayhem
then take that text and run it through the mapper.
2022-04-06 09629, 2022
mayhem
I am almost certain we'll need to go do this step.
2022-04-06 09645, 2022
mayhem
and it makes no sense for us to write spark code until we are certain of the approach of the project.
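The Stage 2b idea above could be sketched roughly like this in plain python: look up the artist name and recording name independently from their MBIDs, then hand the text to the mapper. Everything here is an assumption for illustration: the in-memory dicts stand in for MB lookups (which would be PG queries in practice), and `toy_mapper` is a stand-in, not the real MBID mapper.

```python
def remap_listen(artist_mbid, recording_mbid, artist_names, recording_names, mapper):
    """Resolve both MBIDs to text independently, then run the text through a mapper.

    Returns the mapper's result, or None if either MBID cannot be resolved.
    """
    artist_name = artist_names.get(artist_mbid)
    recording_name = recording_names.get(recording_mbid)
    if artist_name is None or recording_name is None:
        return None
    return mapper(artist_name, recording_name)


# Hypothetical stand-ins for MB lookups and the mapper.
ARTIST_NAMES = {"artist-a": "Muse"}
RECORDING_NAMES = {"rec-1": "Dopest Song Ever"}
TOY_INDEX = {("muse", "dopest song ever"): "rec-1-canonical"}


def toy_mapper(artist_name, recording_name):
    """Case-insensitive exact match against a toy index."""
    return TOY_INDEX.get((artist_name.lower(), recording_name.lower()))


mapped = remap_listen("artist-a", "rec-1", ARTIST_NAMES, RECORDING_NAMES, toy_mapper)
unresolved = remap_listen("artist-x", "rec-1", ARTIST_NAMES, RECORDING_NAMES, toy_mapper)
```

Comparing `mapped` against the original recording MBID for a sample of listens would give one way to see how often the two approaches disagree.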
2022-04-06 09601, 2022
alastairp
hi
2022-04-06 09604, 2022
mayhem
moin
2022-04-06 09618, 2022
mayhem
curious to see what alastairp thinks about this convo.
2022-04-06 09619, 2022
lucifer
i think that would result in just running the mbids through canonical recording redirect table but yes for bad data results may differ.
2022-04-06 09624, 2022
alastairp
yeah, just reading it now
2022-04-06 09655, 2022
mayhem
lucifer: it would be really good to see both in action.
2022-04-06 09605, 2022
lucifer
sure sounds good
2022-04-06 09615, 2022
commet joined the channel
2022-04-06 09616, 2022
mayhem
in fact, I wonder if I could find my own user ID in MLHD and then evaluate my own data through this.
2022-04-06 09652, 2022
mayhem
that would be relevant at least, but I am not sure I listen to enough conflated artists to suss it out.
mayhem: unrelated to current topic but from the above, "Each session is defined to be a period of listening with no more than 60 seconds of inactivity between consecutive tracks." heh we were trying 30mins last time we were working on recording similarity.
2022-04-06 09653, 2022
mayhem
oh, interesting.
2022-04-06 09604, 2022
mayhem
our definition was very different.
2022-04-06 09627, 2022
mayhem
my definition was a window of activity. their focus is on duration of inactivity
clearly I will need track lengths to use in the similarity stuff going forward.
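A small sketch of the inactivity-based session definition quoted above, which shows why track lengths are needed: the gap is measured from the end of one track to the start of the next. The `(start_ts, duration_s)` tuple layout and the function name `split_sessions` are assumptions for illustration.

```python
def split_sessions(listens, max_gap=60):
    """Split listens into sessions by inactivity between consecutive tracks.

    listens: list of (start_ts, duration_s) tuples in seconds, sorted by start_ts.
    max_gap: maximum allowed seconds between the end of one track and the
        start of the next before a new session begins.
    """
    sessions = []
    current = []
    prev_end = None
    for start, duration in listens:
        if prev_end is not None and start - prev_end > max_gap:
            sessions.append(current)
            current = []
        current.append((start, duration))
        prev_end = start + duration
    if current:
        sessions.append(current)
    return sessions


# Toy data: 10 s gap after the first track, 610 s gap after the second.
listens = [(0, 200), (210, 180), (1000, 240)]
strict = split_sessions(listens, max_gap=60)
loose = split_sessions(listens, max_gap=30 * 60)
```

With the 60 second threshold the toy data splits into two sessions; with a 30 minute threshold it stays as one.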
2022-04-06 09609, 2022
commet
the project that imaged the center of the galaxy did everything with python, there's probably some good takeaways from the projects for working with very large data sets