#metabrainz

0:11 AM
v6lur has quit

2022-04-06 09654, 2022

2:19 AM
dseomn_ joined the channel

2022-04-06 09647, 2022

2:20 AM
dseomn has quit

2022-04-06 09647, 2022

2:20 AM
dseomn_ is now known as dseomn

2022-04-06 09638, 2022

2:28 AM
saumon has quit

2022-04-06 09624, 2022

2:52 AM
saumon joined the channel

2022-04-06 09658, 2022

3:14 AM
adhawkins_ joined the channel

2022-04-06 09623, 2022

3:15 AM
adhawkins has quit

2022-04-06 09639, 2022

3:15 AM
adhawkins_ is now known as adhawkins

2022-04-06 09641, 2022

3:35 AM
Shubh joined the channel

2022-04-06 09622, 2022

4:08 AM
srinathkp joined the channel

2022-04-06 09616, 2022

4:21 AM
srinathkp has quit

2022-04-06 09656, 2022

4:35 AM
Xianyi joined the channel

2022-04-06 09636, 2022

5:55 AM
q3lont joined the channel

2022-04-06 09645, 2022

6:07 AM
BrainzGit

[bookbrainz-site] tr1ten opened pull request #833 (master…sort-options): Fix(languages): Sort language options after fast filter https://github.com/metabrainz/bookbrainz-site/pul…

2022-04-06 09624, 2022

6:15 AM
reosarevok

yvanzo, bitmap: I think user-tags returning all + tags returning only upvoted makes sense, at the very least by default

2022-04-06 09602, 2022

6:16 AM
reosarevok

We could add a param to request all tags inc. downvoted ones if someone actually requests it (maybe the app wants to also display downvoted tags so you can vote on them via the API or something)

2022-04-06 09604, 2022

6:32 AM
akshaaatt

Hi aerozol! Thanks for the share. Will look into it :)

2022-04-06 09656, 2022

6:37 AM
odnes joined the channel

2022-04-06 09643, 2022

6:40 AM
q3lont has quit

2022-04-06 09607, 2022

7:00 AM
outsidecontext

yvanzo, bitmap: the fix for the user genres bug was supposed to stop returning genres you yourself had downvoted.

2022-04-06 09647, 2022

7:00 AM
outsidecontext

the problem was that if you upvote a genre it counts as "your genre", but the same happened for genres you downvoted

2022-04-06 09603, 2022

7:02 AM
outsidecontext

maybe this needs to be checked again if it is working as expected. your upvoted genres should all be included, even if downvoted by others

2022-04-06 09631, 2022

7:03 AM
outsidecontext

but looking at the patch at https://github.com/metabrainz/musicbrainz-server/… I think it is doing the right thing

2022-04-06 09640, 2022

7:09 AM
outsidecontext

also about downvoted tags (with zero or lower count) in the WS: picard doesn't need them and filters them out. but I would suggest not to change WS behavior here, some WS users might make use of it

2022-04-06 09640, 2022

7:57 AM
q3lont joined the channel

2022-04-06 09625, 2022

8:27 AM
DjSlash has quit

2022-04-06 09658, 2022

8:34 AM
cuanim joined the channel

2022-04-06 09658, 2022

8:34 AM
cuanim has quit

2022-04-06 09658, 2022

8:34 AM
cuanim joined the channel

2022-04-06 09611, 2022

8:36 AM
cuanim has quit

2022-04-06 09650, 2022

8:37 AM
mayhem

moooin!

2022-04-06 09655, 2022

8:37 AM
mayhem wonders about https://readyset.io/blog/introducing-readyset

2022-04-06 09602, 2022

8:38 AM
mayhem

sounds like magic

2022-04-06 09632, 2022

8:48 AM
akshaaatt

moin!

2022-04-06 09640, 2022

8:54 AM
lucifer

morning

2022-04-06 09654, 2022

8:54 AM
lucifer

mayhem: that sounds like a generalization of cont. agg to me

2022-04-06 09622, 2022

8:55 AM
mayhem

yeah, and we know how well magic like that works.

2022-04-06 09623, 2022

8:56 AM
lucifer

hehe indeed. but our data wasn't truly time series either so that contributed to that bad experience to some extent as well.

2022-04-06 09633, 2022

8:57 AM
lucifer

mayhem: when you have looked at the MLHD proposal, let me know. i wanted to discuss some points about it.

2022-04-06 09645, 2022

8:58 AM
lucifer

alastairp: i have updated https://github.com/metabrainz/critiquebrainz/pull… as well with tests.

2022-04-06 09606, 2022

9:00 AM
legoktm[m] has quit

2022-04-06 09624, 2022

9:00 AM
mayhem

lucifer: I have. let me pull it up.

2022-04-06 09634, 2022

9:00 AM
lucifer

ah cool.

2022-04-06 09617, 2022

9:01 AM
mayhem

I love the flowchart in particular. I thought those were BS in like 1986, but someone still teaches that nonsense. :)

2022-04-06 09652, 2022

9:01 AM
mayhem

there are some minor mistakes in the schema diagram, but not sure we care that much.

2022-04-06 09609, 2022

9:03 AM
lucifer

yes makes sense. iiuc the proposal suggested this timeline: 1) experiment in python 2) production code in python 3) production code in spark. i don't think 2 is worth it. i'd suggest do 1, then do experiments in spark and then do 3. thoughts?

2022-04-06 09655, 2022

9:03 AM
mayhem

agreed. but I fear that Jupyter -> spark is going to be a rather large step

2022-04-06 09646, 2022

9:05 AM
mayhem

and this project proposal doesn't really cover the scope of my original hopes. I had hoped that we would do this project, eval the results and then make a decision: does this work? Or do we need to look up the metadata for each and then run that through the mapping.

2022-04-06 09609, 2022

9:06 AM
mayhem

I don't want to do this project blindly -- we need to take the simple steps and then see where we are.

2022-04-06 09644, 2022

9:06 AM
lucifer

yes there will be a leap. we can set up a dev environment with spark to help, but unsure how we can help in other ways.

2022-04-06 09612, 2022

9:07 AM
lucifer

right so how do we evaluate the results?

2022-04-06 09647, 2022

9:08 AM
mayhem

I *think* the most pernicious problem that is inherent in the data set is the conflated artists and the mapping of metadata to MBIDs.

2022-04-06 09610, 2022

9:09 AM
mayhem

I remember the artist "muse" as being singled out in this case -- I think param found some of these errors.

2022-04-06 09643, 2022

9:10 AM
DjSlash joined the channel

2022-04-06 09651, 2022

9:11 AM
mayhem

thinking out loud, I think we need to pick a chunk or two and see if we can find users who listen to conflated artists. and then see how it deals with those users with those particular issues.

2022-04-06 09627, 2022

9:12 AM
lucifer

uh sorry, i do not understand what you meant by conflated artists?

2022-04-06 09654, 2022

9:12 AM
mayhem

the mapping that last.fm used considered artist and recording separately, I think.

2022-04-06 09635, 2022

9:13 AM
mayhem

so, "muse" "dopest song ever", could result in a muse artist MBID that does not perform "dopest song ever".

2022-04-06 09647, 2022
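One way to look for the conflation described above is to check whether a listen's artist MBID is actually credited on its recording MBID. The snippet below is only a minimal sketch: `recording_artists` is a hypothetical in-memory stand-in for a lookup against MusicBrainz data, and the IDs are placeholders, not real MBIDs.

```python
from typing import Dict, Iterable, List, Set, Tuple

# A listen is reduced to the two IDs discussed above (assumed layout).
Listen = Tuple[str, str]  # (artist_mbid, recording_mbid)


def find_conflated(
    listens: Iterable[Listen],
    recording_artists: Dict[str, Set[str]],
) -> List[Listen]:
    """Return listens whose artist MBID is not credited on the recording."""
    suspicious: List[Listen] = []
    for artist_mbid, recording_mbid in listens:
        credited = recording_artists.get(recording_mbid)
        if credited is None:
            # Unknown recording: nothing to compare against, so skip it.
            continue
        if artist_mbid not in credited:
            suspicious.append((artist_mbid, recording_mbid))
    return suspicious


# Placeholder data: two different artists that share the name "muse".
recording_artists = {"rec-dopest-song": {"artist-muse-b"}}
listens = [
    ("artist-muse-a", "rec-dopest-song"),  # wrong "muse" attached
    ("artist-muse-b", "rec-dopest-song"),  # consistent pair
]
flagged = find_conflated(listens, recording_artists)
```

Users with many flagged listens would be natural candidates for the manual evaluation mentioned in the discussion.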

9:13 AM
lucifer

ah makes sense. thanks!

2022-04-06 09621, 2022

9:14 AM
mayhem

I distrust the artist data, but I am not sure we can trust the recording data either. I suspect that we will nearly certainly find we can't trust that either.

2022-04-06 09639, 2022

9:14 AM
lucifer

btw are there any existing research papers on the MLHD dataset or other publicly available resources on how someone used it?

2022-04-06 09657, 2022

9:14 AM
mayhem

alastairp would know.

2022-04-06 09616, 2022

9:15 AM
mayhem

now that I am thinking about it more, I think we may want to re-consider the goal of this project.

2022-04-06 09634, 2022

9:15 AM
mayhem

I think getting to a working solution in spark at the end of summer is too lofty of a coal. perhaps.

2022-04-06 09641, 2022

9:15 AM
mayhem

goal. not coal.

2022-04-06 09624, 2022

9:16 AM
mayhem

so, rather than a Jupyter version, I would love to see a functioning python version that just uses PG. get to the first eval stage as fast as possible so we can do the eval.

2022-04-06 09624, 2022

9:16 AM
lucifer

do we have the recording name and artist name available so that we could run it through our mbid mapper? (i do not have the dataset at hand currently to check and confirm)

2022-04-06 09642, 2022

9:16 AM
mayhem

no, the dataset only ever has MBIDs. no text at all.

2022-04-06 09653, 2022

9:16 AM
lucifer

oh :(

2022-04-06 09634, 2022

9:17 AM
mayhem

yeah. so two stages become apparent:

2022-04-06 09605, 2022

9:18 AM
mayhem

Stage 1: What is described in the doc right now, but with no spark parts, only python/PG.

2022-04-06 09619, 2022

9:18 AM
mayhem

Stage 2a: If that produces good results, move to spark.

2022-04-06 09644, 2022

9:18 AM
mayhem

Stage 2b: If that does not produce good results, implement MBID mapping lookup in python.

2022-04-06 09612, 2022

9:19 AM
lucifer

MBID mapping lookup on what? since we don't have text

2022-04-06 09659, 2022

9:19 AM
mayhem

just look up the artist MBID and recording MBID in MB independently to get text.

2022-04-06 09611, 2022

9:20 AM
lucifer

ah ok, makes sense

2022-04-06 09612, 2022

9:20 AM
mayhem

then take that text and run it through the mapper.

2022-04-06 09629, 2022
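The Stage 2b idea (resolve each MBID to text independently, then feed that text to the mapper) could be sketched roughly as follows. Everything here is an assumption for illustration: `artist_names` and `recording_names` stand in for lookups in MB, `toy_mapper` stands in for the real MBID mapper, and the IDs are placeholders.

```python
from typing import Callable, Dict, Optional, Tuple

# (artist_mbid, recording_mbid) as returned by a mapper, or None if no match.
MappedPair = Optional[Tuple[str, str]]


def remap_listen(
    artist_mbid: str,
    recording_mbid: str,
    artist_names: Dict[str, str],
    recording_names: Dict[str, str],
    mapper: Callable[[str, str], MappedPair],
) -> MappedPair:
    """Look up both MBIDs independently to get text, then re-run the mapper."""
    artist_name = artist_names.get(artist_mbid)
    recording_name = recording_names.get(recording_mbid)
    if artist_name is None or recording_name is None:
        return None  # one of the MBIDs could not be resolved to text
    return mapper(artist_name, recording_name)


def toy_mapper(artist_name: str, recording_name: str) -> MappedPair:
    """Hypothetical mapper: exact match on lower-cased names."""
    known = {("muse", "dopest song ever"): ("artist-muse-b", "rec-dopest-song")}
    return known.get((artist_name.lower(), recording_name.lower()))


artist_names = {"artist-muse-a": "Muse", "artist-muse-b": "Muse"}
recording_names = {"rec-dopest-song": "Dopest Song Ever"}

original = ("artist-muse-a", "rec-dopest-song")
remapped = remap_listen(*original, artist_names, recording_names, toy_mapper)
changed = remapped is not None and remapped != original
```

Comparing `remapped` against the original pair gives a simple signal for how often the two approaches disagree on a sample.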

9:20 AM
mayhem

I am almost certain we'll need to go do this step.

2022-04-06 09645, 2022

9:20 AM
mayhem

and it makes no sense for us to write spark code until we are certain of the approach of the project.

2022-04-06 09601, 2022

9:21 AM
alastairp

hi

2022-04-06 09604, 2022

9:21 AM
mayhem

moin

2022-04-06 09618, 2022

9:21 AM
mayhem

curious to see what alastairp thinks about this convo.

2022-04-06 09619, 2022

9:21 AM
lucifer

i think that would result in just running the mbids through canonical recording redirect table but yes for bad data results may differ.

2022-04-06 09624, 2022

9:21 AM
alastairp

yeah, just reading it now

2022-04-06 09655, 2022

9:21 AM
mayhem

lucifer: it would be really good to see both in action.

2022-04-06 09605, 2022

9:22 AM
lucifer

sure sounds good

2022-04-06 09615, 2022

9:22 AM
commet joined the channel

2022-04-06 09616, 2022

9:22 AM
mayhem

in fact, I wonder if I could find my own user ID in MLHD and then evaluate my own data through this.

2022-04-06 09652, 2022

9:22 AM
mayhem

that would be relevant at least, but I am not sure I listen to enough conflated artists to suss it out.

2022-04-06 09600, 2022

9:23 AM
mayhem

and how would I go about finding my user id? lol.

2022-04-06 09610, 2022

9:23 AM
commet

hello

2022-04-06 09623, 2022

9:23 AM
reosarevok

Is http://tickets.metabrainz.org/ failing for everyone or just me?

2022-04-06 09625, 2022

9:23 AM
mayhem

hello commet

2022-04-06 09633, 2022

9:23 AM
commet

what's the discussion?

2022-04-06 09633, 2022

9:23 AM
mayhem

just you reosarevok

2022-04-06 09633, 2022

9:23 AM
lucifer

the thing i see working against that plan is the huge amount of data, so we may be unable to process it in python in a reasonable amount of time.

2022-04-06 09658, 2022

9:23 AM
mayhem

we can only realistically ever work on one data file. perhaps two.

2022-04-06 09659, 2022

9:23 AM
reosarevok

Hmm, ok, I'll restart the router :)

2022-04-06 09616, 2022

9:24 AM
mayhem

processing the whole data set is simply not feasible in the summer.

2022-04-06 09640, 2022

9:24 AM
alastairp

processing to clean it up, or processing to build some recommendation tool?

2022-04-06 09642, 2022

9:24 AM
mayhem

if I could learn the right algorithm to use to fix the dataset as the sole result of this project, I would be pretty happy, honestly.

2022-04-06 09647, 2022

9:24 AM
lucifer

yes so whatever results we get won't be complete. i guess if we choose files randomly we can hope to get a reasonably good sample.

2022-04-06 09650, 2022
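Picking files at random can be made reproducible with a fixed seed, so that everyone evaluates the same sample. A minimal sketch, assuming the dataset is a flat list of file paths (the file names below are made up):

```python
import random
from typing import List, Sequence


def sample_files(paths: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Return up to k paths chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)  # private generator; global state is untouched
    chosen = rng.sample(list(paths), min(k, len(paths)))
    return sorted(chosen)


# Hypothetical file names; the real dataset layout may differ.
all_files = [f"mlhd_part_{i:03d}.tar" for i in range(20)]
picked = sample_files(all_files, k=2, seed=42)
```

Running the function again with the same seed returns the same files, which keeps later comparisons between approaches honest.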

9:24 AM
mayhem

the former, alastairp

2022-04-06 09626, 2022

9:25 AM
mayhem

I suppose we can ask gabriel, the creator of the dataset, if he knows a particular corner of it better.

2022-04-06 09634, 2022

9:25 AM
lucifer

commet: we are discussing the MLHD dataset, ways to validate and process it, and so on.

2022-04-06 09638, 2022

9:25 AM
mayhem

and help us with the eval of the results.

2022-04-06 09628, 2022

9:26 AM
commet

27 million time stamped logs, wow

2022-04-06 09632, 2022

9:26 AM
commet

had to look up what that data set was

2022-04-06 09634, 2022

9:26 AM
alastairp

let me take a look at the stuff that he published

2022-04-06 09645, 2022

9:26 AM
mayhem

27 *billion* no?

2022-04-06 09651, 2022

9:26 AM
commet

billion, yes

2022-04-06 09655, 2022

9:26 AM
commet

my bad

2022-04-06 09600, 2022

9:27 AM
commet

misread that

2022-04-06 09610, 2022

9:27 AM
mayhem

27 million and it wouldn't be worth doing as a project.

2022-04-06 09651, 2022

9:31 AM
commet

sounds like a fun project

2022-04-06 09632, 2022

9:32 AM
alastairp

here's the list of citations of the dataset, not sure exactly what part of it each one uses: https://scholar.google.ca/scholar?oi=bibs&hl=…

2022-04-06 09637, 2022

9:32 AM
alastairp

we could do a very quick review of them

2022-04-06 09648, 2022

9:34 AM
lucifer

oh the first one is interesting!

2022-04-06 09659, 2022

9:34 AM
lucifer

"The music streaming sessions dataset" by spotify

2022-04-06 09632, 2022

9:38 AM
lucifer

mayhem: unrelated to current topic but from the above, "Each session is defined to be a period of listening with no more than 60 seconds of inactivity between consecutive tracks." heh we were trying 30mins last time we were working on recording similarity.

2022-04-06 09653, 2022

9:38 AM
mayhem

oh, interesting.

2022-04-06 09604, 2022

9:39 AM
mayhem

our definition was very different.

2022-04-06 09627, 2022

9:39 AM
mayhem

my definition was a window of activity. their focus is on duration of inactivity

2022-04-06 09637, 2022

9:39 AM
mayhem

that's a good hint. :)

2022-04-06 09643, 2022

9:39 AM
lucifer

ah indeed, makes sense

2022-04-06 09646, 2022

9:39 AM
reosarevok

Sigh

2022-04-06 09653, 2022

9:39 AM
reosarevok

https://www.irccloud.com/pastebin/AQc9KmCs/

2022-04-06 09655, 2022

9:39 AM
reosarevok

Why

2022-04-06 09659, 2022

9:39 AM
mayhem

clearly I will need track lengths to use in the similarity stuff going forward.

2022-04-06 09609, 2022
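An inactivity-based session split along the lines of the quoted definition might look like the sketch below. It assumes each listen carries a start timestamp and a track length in seconds, and it reads "inactivity" as the gap between the end of one track and the start of the next; both are assumptions, not something the quoted paper or the existing similarity code is confirmed to do.

```python
from typing import List, Optional, Tuple

# (start timestamp in seconds, track length in seconds) -- assumed layout
Listen = Tuple[int, int]


def split_sessions(listens: List[Listen], max_gap: int = 60) -> List[List[Listen]]:
    """Group listens into sessions separated by more than max_gap idle seconds."""
    sessions: List[List[Listen]] = []
    prev_end: Optional[int] = None
    for start, length in sorted(listens):
        # Idle time is measured from the end of the previous track.
        if prev_end is None or start - prev_end > max_gap:
            sessions.append([])
        sessions[-1].append((start, length))
        prev_end = start + length
    return sessions


listens = [(0, 200), (230, 180), (1000, 100)]
sessions = split_sessions(listens)
```

Here the second track starts 30 seconds after the first one ends, so both land in one session, while the third starts 590 seconds after the second ends and opens a new one. Without track lengths only the gap between start times is available, which is why the length matters for this definition.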

9:40 AM
commet

the project that imaged the center of the galaxy did everything with python, there are probably some good takeaways from that project for working with very large data sets

2022-04-06 09612, 2022

9:40 AM
mayhem

reo do a traceroute, not ping

2022-04-06 09614, 2022

9:40 AM
reosarevok

meb.org itself works just fine

2022-04-06 09615, 2022

9:40 AM
lucifer

but we could never figure out why it failed in spark :(

2022-04-06 09625, 2022

9:40 AM
reosarevok

mayhem: oh, will check

2022-04-06 09603, 2022

9:41 AM
lucifer

this paper seems to only mention it as another available dataset, checking other refs now

2022-04-06 09613, 2022

9:41 AM
mayhem

lucifer: mostly because we didn't finish that project. now I got the similarity stuff stable in python, now we can consider a move to spark.

2022-04-06 09652, 2022

9:41 AM
lucifer

yup makes sense

2022-04-06 09603, 2022

9:42 AM
reosarevok

https://www.irccloud.com/pastebin/hvB2uNSX/

2022-04-06 09607, 2022

9:42 AM
reosarevok

mayhem: ^

2022-04-06 09640, 2022

9:42 AM
mayhem

zas, atj : what do you make of this traceroute?

2022-04-06 09655, 2022

9:42 AM
alastairp

lucifer: I think that the sessions dataset is using spotify data, right?

2022-04-06 09612, 2022

9:43 AM
alastairp

they probably just cited mlhd in terms of saying "oh hey, here's another big dataset of music stuff"

2022-04-06 09618, 2022

9:43 AM
lucifer

alastairp: yes, it's a dataset of spotify streams.

2022-04-06 09621, 2022

9:43 AM
lucifer

right

2022-04-06 09624, 2022

9:43 AM
mayhem

reosarevok: try an SSH tunnel.

2022-04-06 09652, 2022

9:43 AM
lucifer

that's probably another dataset we could look at in future.

2022-04-06 09655, 2022

9:43 AM
mayhem

ssh -L 8080:tickets.metabrainz.org:80 wolf.metabrainz.org

2022-04-06 09610, 2022

9:44 AM
mayhem

then go to http://localhost:8080

2022-04-06 09623, 2022

9:44 AM
mayhem

might need to redo for https, but still.
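For the https case, the same `-L` forward would point at port 443 instead of 80. The sketch below just builds both command lines; the local port 8443 is an arbitrary choice for illustration, not something from the discussion.

```python
import shlex
from typing import List


def tunnel_command(
    local_port: int, target_host: str, target_port: int, jump_host: str
) -> List[str]:
    """Build an `ssh -L` command forwarding local_port to target via jump_host."""
    forward = f"{local_port}:{target_host}:{target_port}"
    return ["ssh", "-L", forward, jump_host]


# The http forward from the discussion.
http_cmd = tunnel_command(8080, "tickets.metabrainz.org", 80, "wolf.metabrainz.org")
# Assumed https variant; 8443 is a hypothetical local port.
https_cmd = tunnel_command(8443, "tickets.metabrainz.org", 443, "wolf.metabrainz.org")

print(shlex.join(http_cmd))
print(shlex.join(https_cmd))
```

With the https forward, the browser would be pointed at https://localhost:8443 and will most likely warn that the certificate does not match "localhost", since the certificate is issued for the real hostname.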