for me, I'd prefer just some parametric score of sorts.
now, what part of that do we need to work on, iliekcomputers ?
yvanzo
iliekcomputers: great, no need to know about SIR, at least you are used to Python :)
ruaok
but that score is literally the point of the collaborative filtering algorithm, no?
iliekcomputers
i am not sure yet, but a better metric is probably used in reality. i'll have to take a look.
ferbncode
Iliekcomputers: sure \o/
ruaok
ok, for now I'm just going to work with the idea that we have some score where higher is better.
iliekcomputers
ruaok: I was thinking that making it predict numbers from 1 to 1000s is harder than making it predict some other metric of likeability.
ruaok
which is not to say that we should make that into a playlist directly. I doubt that will turn out well.
ah, ok, now I understand.
ok, we clearly need to research what metrics are doable.
but this is where aidanlw17's work comes in.
if we say, pick the most played track of the last week, and then find CF recommended tracks that are similar, we can start constructing a playlist.
chaining along from track to track that is similar.
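(A sketch of that chaining idea, assuming a hypothetical get_similar_tracks() helper that returns CF neighbours best match first; no such helper exists in the codebase yet:)

    # Hypothetical sketch: grow a playlist by repeatedly hopping to the
    # most similar track that hasn't been used yet.
    def build_chain(seed_track, length, get_similar_tracks):
        playlist = [seed_track]
        seen = {seed_track}
        while len(playlist) < length:
            candidates = get_similar_tracks(playlist[-1], n=20)
            next_track = next((t for t in candidates if t not in seen), None)
            if next_track is None:
                break  # dead end: every neighbour is already in the playlist
            playlist.append(next_track)
            seen.add(next_track)
        return playlist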
pristine__
What other metric can we have apart from listen counts? The more I play a song, the more I like it.
aidanlw17
ruaok what is CF?
ruaok
pristine__: I am not sure. this is precisely what we need to learn.
iliekcomputers
aidanlw17: collaborative filtering.
aidanlw17
oh thank you!
iliekcomputers
pristine__: everything will need to be based on listen counts.
ruaok
and also, I want to reiterate the ONE GOAL I had that caused me to start MusicBrainz.
I wanted to pick a starting track and an ending track and give a duration.
iliekcomputers
the thing is that predicting listen counts (which have a large range, 1 to tens of thousands) is probably a harder problem than we need to solve.
ruaok
start with enter sandman from metallica and end up with orinoco flow from enya in 2 hours. GO.
Mr_Monkey
How long I've listened to a track probably also indicates my tastes, the ones that are cemented
pristine__
Exciting
ruaok
so, finding a line of similar tracks that go from one track/artist to another track/artist.
pristine__
Mr_Monkey: yup
ruaok
THIS, believe it or not, is why I started MusicBrainz. without MB, this is impossible.
iliekcomputers: what does the CF algorithm spit out currently as its ranking?
or is that a black box, based on the fact that we're shoving in listen counts?
iliekcomputers
ruaok: we give it listen counts and it tries to predict listen counts as a result.
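(Only as a sketch of that approach, not the actual pipeline — column names, file paths, and parameters below are placeholders — that amounts to Spark ALS trained on (user, recording, listen_count) triples:)

    # Sketch only: train ALS on raw listen counts and predict counts back.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("cf-sketch").getOrCreate()

    # assumed schema: user_id int, recording_id int, listen_count float
    listens = spark.read.parquet("listens.parquet")

    als = ALS(
        userCol="user_id",
        itemCol="recording_id",
        ratingCol="listen_count",  # raw counts in, predicted counts out
        rank=64,
    )
    model = als.fit(listens)

    # top 10 recordings per user, ranked by predicted listen count
    recommendations = model.recommendForAllUsers(10)

(For what it's worth, flipping implicitPrefs=True would make ALS predict a small-range preference/confidence score instead of a raw count, which is close to the smaller metric discussed below.)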
ruaok
where is the problem in that?
is it not doing a good job?
iliekcomputers
maybe i'm not able to explain my thoughts on this correctly.
let me do some research and come back with a good paragraph or two.
ruaok
ok, likely I am being dense too.
pristine__
It is doing a good job, maybe iliekcomputers wants a diff metric
Diff from listen count.
ruaok
but what you are saying is exactly what I've been understanding, so I am not understanding the crux of the problem you're raising.
well, if it ain't broke, don't fix it.
perhaps it is suitable for the first round.
iliekcomputers
hmm, yep.
ruaok
In reality I think we're going to do this challenge in the autumn and then realize "oh crap, we need this data set, that data set, this, that".
learning is the key goal of the challenge.
and then we start the cycle again.
and perhaps at the end of the second cycle we'll have something to be proud of.
ruaok is managing expectations
aidanlw17: any thoughts from you?
have you thought about how to extend your resultant data to artists?
pristine__
"managing expectations"....awww
ruaok
pristine__: yes, we're all working hard to get things done, but the reality is that the first pass is not going to be glorious.
if it teaches us how to do better, then I am 100% satisfied.
pristine__
Well said. Learning is the key :)
ruaok
ding.
pristine__
Dong
ruaok
iliekcomputers: the work we've done for shuffling user stats data back to hetzner.... can we use that to shove the recommendations from CF back to hetzner too?
aidanlw17
I'd like to really review the files from pristine__ and the CF project as a whole to get a better understanding of this recommendation work. One thought of mine is that alastairp and I currently will be using 12 separate metrics for track-track similarity, then near the end of the summer a goal is to bring these together into one track-track metric for overall similarity. I think in the end, a combination of this metric and pristine__'s
results would give a good dataset for recommendation.
iliekcomputers
ruaok: yes.
shouldn't be much work.
ruaok
that then begs the question: how do we handle new runs of the CF data?
do we keep X data sets and run a new one once a week?
iliekcomputers
that is what i was expecting.
ruaok
iliekcomputers: great. that will clearly be the next step for pristine__
iliekcomputers: <3
pristine__
hetzner?
ruaok
and then we can make playable lists on lb.org -- once we have that, then we're at a point when we can realistically see how the CF alg is performing.
pristine__
Lb-server?
ruaok
pristine__: yes.
pristine__
Oh. Okay.
iliekcomputers
hetzner == leader.listenbrainz
pristine__
I like the next step 😆
ruaok
and I guess there we ought to post-process it into recommendations of things that people have played, and recommendations for things that are new to users.
iliekcomputers: actually in this case I mean hetzner = lemmy
iliekcomputers
ooh
ambiguous. :P
aidanlw17
ruaok: In terms of artist-artist similarity, I think we need these two projects in combination - given that artists may also diverge greatly in the types of music they create, I don't anticipate that track-track similarity alone would provide strong artist-artist recommendations. When bringing in the listen counts from pristine__, I would be interested in seeing how artist-artist recommendation could change.
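(One naive way to bridge the two, purely as a sketch — the input mappings here are hypothetical, not existing LB/AB structures: average each artist's track vectors and compare artists in that space.)

    import numpy as np

    def artist_vectors(track_vectors, track_to_artist):
        # track_vectors: {track_id: np.ndarray}
        # track_to_artist: {track_id: artist_id}  (both assumed inputs)
        sums, counts = {}, {}
        for track, vec in track_vectors.items():
            artist = track_to_artist[track]
            sums[artist] = sums.get(artist, 0) + vec
            counts[artist] = counts.get(artist, 0) + 1
        return {artist: sums[artist] / counts[artist] for artist in sums}

    def artist_similarity(a, b, vectors):
        # cosine similarity between two artists' mean track vectors
        va, vb = vectors[a], vectors[b]
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

(Mean-pooling like this would wash out exactly the within-artist diversity mentioned above, so it is a baseline at best.)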
ruaok
aidanlw17: yes, and I think part of our challenge might be to pick different, better metrics that feed your algorithm.
perhaps we should make samples of track-track similarities available for public inspection asap too.
aidanlw17: I think that is spot on.
aidanlw17
Yeah. alastairp and I also were planning to make a public evaluation available for track-track similarity as soon as we have a working pipeline
ruaok
I'd like all of us to start thinking about how to accomplish the artist-artist data set from the LB and AB datasets.
aidanlw17: superb
ok, I think we all have a better understanding of next steps and more of the roadmap now, yes?
if something is unclear, ask now.
pristine__
Yes yes.
alastairp
iliekcomputers: thanks for starting the script. how's it going?
ruaok
iliekcomputers: I'd love to hear more about your reservations about the metric/ranking for CF when you come by them.
ruaok waves at alastairp
alastairp
hi. I'm just reading backlog, and cooking too
aidanlw17
I'll keep that in mind. Additionally, if you guys produce a metric from the collaborative filtering it might be possible to index that with annoy as we will do with the other metrics for track-track. Is that something you want to consider?
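(For reference, indexing CF vectors with annoy looks roughly like this; the dimension, tree count, and stand-in data are made up:)

    from annoy import AnnoyIndex
    import random

    dim = 64  # must match the CF embedding size; 64 is a placeholder
    item_vectors = {i: [random.gauss(0, 1) for _ in range(dim)]
                    for i in range(100)}  # stand-in data

    index = AnnoyIndex(dim, "angular")
    for i, vector in item_vectors.items():
        index.add_item(i, vector)
    index.build(10)  # 10 trees; more trees = better recall, bigger index
    index.save("cf_tracks.ann")

    # the 10 nearest neighbours of item 0 in the CF space
    print(index.get_nns_by_item(0, 10))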
iliekcomputers
ruaok: let me try to rephrase what i was saying.
ruaok
aidanlw17: that does sound interesting yes.
iliekcomputers
right now, we're trying to predict exactly how many times you would / should have listened to a particular song (say the strokes' last nite)
this value can range from one to tens of thousands.
ruaok
I am super eager to learn from what comes out of your project. pristine__ has done an excellent job doing that for me on the CF front.
iliekcomputers
so it is hard to predict.
ruaok
too granular?
iliekcomputers
when in reality, we probably do not need that number to that degree of accuracy.
pristine__
ruaok: thanks. Means a lot :)
ruaok
:)
iliekcomputers
a lesser range would probably work out as well (intuition, not sure)
ruaok
iliekcomputers: and the scale of the CF ranking? is that linear or non-linear?
aidanlw17
ruaok: I appreciate the excitement - I feel it too.
ruaok
well, mapping the giant range into something smaller is easy.
premature quantization might become a problem.
pristine__
Yes. We can probably normalize.
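(Concretely, the gentlest version of that could be a log transform, which squashes the 1-to-tens-of-thousands range into [0, 1] without binning anything — a sketch, not a decided approach:)

    import math

    def normalize_listen_count(count, max_count):
        # 0 listens -> 0.0, max_count listens -> 1.0, order preserved;
        # the long tail of huge counts is compressed rather than quantized.
        return math.log1p(count) / math.log1p(max_count)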
ruaok
normalizing makes sense to me. quantizing gives me hesitation.
I see how quantizing the data might be useful for other algs down the line, but for starters we may not want to do that.
alastairp, iliekcomputers : what script is that?
alastairp
I'm not surprised... the original method took about 10 minutes for me to do it on a slow machine with only 4m tracks, but that blocked the whole table. this one is better
ruaok: writing submission offsets to the ll table
ruaok
ah, yes.
alastairp
tomorrow we can deploy write offset on submit
ruaok
are submission offsets monotonically increasing numbers?
alastairp
yes
pristine__
I guess we should continue with the road map and pick up normalization sometime later.
ruaok
makes sense.
pristine__: yes.
alastairp
it's the same as we're currently using in the GET endpoint
ruaok
once we see the scores in the report (soon, I hope!) we can get our heads around this more.
alastairp
uuid/low-level?n=[offset]
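(For context, a 0-based monotonically increasing offset per recording can be backfilled with a window function. This is a sketch only; the lowlevel/gid/submission_offset names are assumptions about the schema, not the actual script:)

    import psycopg2

    QUERY = """
        UPDATE lowlevel ll
           SET submission_offset = numbered.n
          FROM (SELECT id,
                       row_number() OVER (PARTITION BY gid ORDER BY id) - 1 AS n
                  FROM lowlevel) AS numbered
         WHERE ll.id = numbered.id
    """

    with psycopg2.connect("dbname=acousticbrainz") as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY)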
iliekcomputers
hmm.
pristine__
By tomorrow ruaok :)
iliekcomputers
we should start merging PRs soon too.
ruaok
wooo
alastairp
iliekcomputers: when are you next available?
pristine__
iliekcomputers: can you look at PR#21?
ruaok
the stats PRs should be merged asap, IMHO.
iliekcomputers
alastairp: tomorrow works for me.
ruaok
my goal for today is to look at pristine's latest PR
(aside from boring nonprofit work)
pristine__
I will send you link, ruaok
alastairp
ok, good. perhaps then we can do the next PR on this offset stuff (if we do it early in the morning perhaps we can do the last part in the evening)
ruaok
#26 is on my list.
alastairp
and also we could take a look at the docker stuff that you were finishing up