for me, I'd prefer just some parametric score of sorts.
2019-05-27 14709, 2019
ruaok
now, what part of that do we need to work on, iliekcomputers ?
2019-05-27 14724, 2019
yvanzo
iliekcomputers: great, no need to know about SIR, at least you are used to Python :)
2019-05-27 14739, 2019
ruaok
but that score is literally the point of the collaborative filtering algorithm, no?
2019-05-27 14739, 2019
iliekcomputers
i am not sure yet, but a better metric is probably used in reality. i'll have to take a look.
2019-05-27 14756, 2019
ferbncode
iliekcomputers: sure \o/
2019-05-27 14719, 2019
ruaok
ok, for now I'm just going to work with the idea that we have some score where higher is better.
2019-05-27 14740, 2019
iliekcomputers
ruaok: I was thinking that making it predict numbers from 1 to 1000s is harder than making it predict some other metric of likeability.
2019-05-27 14742, 2019
ruaok
which is not to say that we should make that into a playlist directly. I doubt that will turn out well.
2019-05-27 14758, 2019
ruaok
ah, ok, now I understand.
2019-05-27 14720, 2019
ruaok
ok, we clearly need to research what metrics are doable.
2019-05-27 14729, 2019
ruaok
but this is where aidanlw17's work comes in.
2019-05-27 14711, 2019
Slurpee joined the channel
2019-05-27 14711, 2019
Slurpee has quit
2019-05-27 14711, 2019
Slurpee joined the channel
2019-05-27 14713, 2019
ruaok
if we say, pick the most played track of the last week, and then find CF recommended tracks that are similar, we can start constructing a playlist.
2019-05-27 14734, 2019
ruaok
chaining along from track to track that is similar.
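A minimal sketch of the chaining idea above, assuming a hypothetical `similar` mapping from each track to its similar tracks with a score where higher is better. The track names and scores are made up for illustration; neither the CF output format nor the similarity data was specified in this discussion.

```python
def chain_playlist(seed, similar, length):
    """Greedily walk from track to track, always taking the most
    similar track that is not already in the playlist."""
    playlist = [seed]
    visited = {seed}
    current = seed
    while len(playlist) < length:
        # Candidate next tracks: neighbours of the current track not yet used.
        candidates = [
            (score, track)
            for track, score in similar.get(current, {}).items()
            if track not in visited
        ]
        if not candidates:
            break  # dead end: nothing similar left to chain to
        _, current = max(candidates)
        playlist.append(current)
        visited.add(current)
    return playlist


# Hypothetical toy data: track -> {similar track: score}
similar = {
    "a": {"b": 0.9, "c": 0.4},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"d": 0.7},
    "d": {},
}
playlist = chain_playlist("a", similar, 4)
```

A greedy walk like this can hit a dead end or drift away from the seed, which is one reason the raw output probably should not become a playlist directly.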
2019-05-27 14747, 2019
pristine__
What other metric can we have apart from listen counts? The more I play a song, the more I like it.
2019-05-27 14701, 2019
aidanlw17
ruaok what is CF?
2019-05-27 14708, 2019
ruaok
pristine__: I am not sure. this is precisely what we need to learn.
2019-05-27 14710, 2019
iliekcomputers
aidanlw17: collaborative filtering.
2019-05-27 14718, 2019
aidanlw17
oh thank you!
2019-05-27 14722, 2019
iliekcomputers
pristine__: everything will need to be based on listen counts.
2019-05-27 14738, 2019
ruaok
and also, I want to reiterate the ONE GOAL I had that caused me to start MusicBrainz.
2019-05-27 14754, 2019
ruaok
I wanted to pick a starting track and an ending track and give a duration.
2019-05-27 14710, 2019
iliekcomputers
the thing is that predicting listen counts (which have a large range, 1 to tens of thousands) is probably a harder problem than we need to solve.
2019-05-27 14716, 2019
ruaok
start with enter sandman from metallica and end up with orinoco flow from enya in 2 hours. GO.
2019-05-27 14736, 2019
Mr_Monkey
The longest i've listened to a track for probably also indicates my tastes, those that are cemented
2019-05-27 14742, 2019
pristine__
Exciting
2019-05-27 14747, 2019
ruaok
so, finding a line of similar tracks that go from one track/artist to another track/artist.
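One way to picture "a line of similar tracks" is as a path through a graph whose edges connect similar tracks. The sketch below uses a plain breadth-first search as a stand-in technique, and it assumes a hypothetical `neighbours` mapping; the intermediate track names are placeholders, not real similarity data. It ignores the target duration entirely.

```python
from collections import deque


def find_path(start, end, neighbours):
    """Breadth-first search: return the shortest chain of similar
    tracks from start to end, or None if no chain exists."""
    queue = deque([start])
    came_from = {start: None}
    while queue:
        track = queue.popleft()
        if track == end:
            # Walk back through came_from to rebuild the chain.
            path = []
            while track is not None:
                path.append(track)
                track = came_from[track]
            return path[::-1]
        for nxt in neighbours.get(track, ()):
            if nxt not in came_from:
                came_from[nxt] = track
                queue.append(nxt)
    return None


# Hypothetical directed similarity graph with placeholder tracks.
neighbours = {
    "enter sandman": ["track_x"],
    "track_x": ["track_y", "track_z"],
    "track_y": ["orinoco flow"],
    "track_z": [],
}
path = find_path("enter sandman", "orinoco flow", neighbours)
```

Fitting the chain to a duration such as 2 hours would need track lengths and a search that prefers longer or shorter chains; that part is left open here.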
2019-05-27 14757, 2019
pristine__
Mr_Monkey: yup
2019-05-27 14704, 2019
ruaok
THIS, believe it or not, is why I started MusicBrainz. without MB, this is impossible.
2019-05-27 14737, 2019
ruaok
iliekcomputers: what does the CF algorithm spit out currently as its ranking?
2019-05-27 14749, 2019
ruaok
or is that a black box, based on the fact that we're shoving in listen counts?
2019-05-27 14755, 2019
iliekcomputers
ruaok: we give it listen counts and it tries to predict listen counts as a result.
2019-05-27 14709, 2019
ruaok
where is the problem in that?
2019-05-27 14715, 2019
ruaok
is it not doing a good job?
2019-05-27 14732, 2019
iliekcomputers
maybe i'm not able to explain my thoughts on this correctly.
2019-05-27 14744, 2019
iliekcomputers
let me do some research and come back with a good paragraph or two.
2019-05-27 14754, 2019
ruaok
ok, likely I am being dense too.
2019-05-27 14703, 2019
pristine__
It is doing a good job, maybe iliekcomputers wants a diff metric
2019-05-27 14708, 2019
pristine__
Diff from listen count.
2019-05-27 14718, 2019
ruaok
but what you are saying is exactly what I've been understanding, so I am not understanding the crux of the problem you're raising.
2019-05-27 14700, 2019
ruaok
well, if it ain't broke, don't fix it.
2019-05-27 14706, 2019
ruaok
perhaps it is suitable for the first round.
2019-05-27 14710, 2019
iliekcomputers
hmm, yep.
2019-05-27 14736, 2019
ruaok
In reality I think we're going to do this challenge in the autumn and then realize "oh crap, we need this data set, that data set, this, that".
2019-05-27 14742, 2019
ruaok
learning is the key goal of the challenge.
2019-05-27 14747, 2019
ruaok
and then we start the cycle again.
2019-05-27 14706, 2019
ruaok
and perhaps at the end of the second cycle we'll have something to be proud of.
2019-05-27 14713, 2019
ruaok is managing expectations
2019-05-27 14724, 2019
ruaok
aidanlw17: any thoughts from you?
2019-05-27 14742, 2019
ruaok
have you thought about how to extend your resultant data to artists?
2019-05-27 14745, 2019
pristine__
"managing expectations"....awww
2019-05-27 14710, 2019
ruaok
pristine__: yes, we're all working hard to get things done, but the reality is that the first pass is not going to be glorious.
2019-05-27 14727, 2019
ruaok
if it teaches us how to do better, then I am 100% satisfied.
2019-05-27 14709, 2019
pristine__
Well said. Learning is the key :)
2019-05-27 14714, 2019
ruaok
ding.
2019-05-27 14751, 2019
pristine__
Dong
2019-05-27 14754, 2019
ruaok
iliekcomputers: the work we've done for shuffling user stats data back to hetzner.... can we use that to shove the recommendations from CF back to hetzner too?
2019-05-27 14758, 2019
aidanlw17
I'd like to really review the files from pristine__ and the CF project as a whole to get a better understanding of this recommendation work. One thought of mine is that alastairp and I currently will be using 12 separate metrics for track-track similarity, then near the end of the summer a goal is to bring these together into one track-track metric for overall similarity. I think in the end, a combination of this metric and pristine__'s
2019-05-27 14758, 2019
aidanlw17
results would give a good dataset for recommendation.
2019-05-27 14706, 2019
iliekcomputers
ruaok: yes.
2019-05-27 14710, 2019
iliekcomputers
shouldn't be much work.
2019-05-27 14714, 2019
ruaok
that then begs the question: how do we handle new runs of the CF data?
2019-05-27 14727, 2019
ruaok
do we keep X data sets and run a new one once a week?
2019-05-27 14744, 2019
iliekcomputers
that is what i was expecting.
2019-05-27 14749, 2019
ruaok
iliekcomputers: great. that will clearly be the next step for pristine__
2019-05-27 14751, 2019
ruaok
iliekcomputers: <3
2019-05-27 14717, 2019
pristine__
hetzner?
2019-05-27 14728, 2019
ruaok
and then we can make playable lists on lb.org -- once we have that, then we're at a point when we can realistically see how the CF alg is performing.
2019-05-27 14741, 2019
pristine__
Lb-server?
2019-05-27 14745, 2019
ruaok
pristine__: yes.
2019-05-27 14754, 2019
pristine__
Oh. Okay.
2019-05-27 14756, 2019
iliekcomputers
hetzner == leader.listenbrainz
2019-05-27 14710, 2019
pristine__
I like the next step 😆
2019-05-27 14715, 2019
ruaok
and I guess there we ought to post-process it into recommendations of things that people have played and recommendations of things that are new to users.
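The split described above could be as simple as the sketch below, assuming the CF output is an ordered list of track identifiers and the user's history is available as a set. Both names and the identifiers are hypothetical.

```python
def split_recommendations(recommended, listened):
    """Split an ordered recommendation list into tracks the user has
    already played and tracks that are new to them, keeping the order."""
    played = [track for track in recommended if track in listened]
    new = [track for track in recommended if track not in listened]
    return played, new


# Hypothetical identifiers, ordered best recommendation first.
recommended = ["t1", "t2", "t3", "t4"]
listened = {"t2", "t4", "t9"}
played, new = split_recommendations(recommended, listened)
```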
2019-05-27 14729, 2019
ruaok
iliekcomputers: actually in this case I mean hetzner = lemmy
2019-05-27 14743, 2019
iliekcomputers
ooh
2019-05-27 14748, 2019
iliekcomputers
ambiguous. :P
2019-05-27 14750, 2019
aidanlw17
ruaok: In terms of artist-artist similarity, I think we need these two projects in combination - given that artists may also diverge greatly in the types of music they create, I don't anticipate that only track-track similarity would provide a strong recommendation artist-artist. When bringing in the listen counts from pristine__, I would be interested in seeing how artist-artist recommendation could change.
2019-05-27 14720, 2019
ruaok
aidanlw17: yes, and I think part of our challenge might be to pick different, better metrics that feed your algorithm.
2019-05-27 14751, 2019
ruaok
perhaps we should make samples of track-track similarities available for public inspection asap too.
2019-05-27 14720, 2019
ruaok
aidanlw17: I think that is spot on.
2019-05-27 14729, 2019
aidanlw17
Yeah. alastairp and I also were planning to make a public evaluation available for track-track similarity as soon as we have a working pipeline
2019-05-27 14742, 2019
ruaok
I'd like all of us to start thinking about how to accomplish the artist-artist data set from the LB and AB datasets.
2019-05-27 14750, 2019
ruaok
aidanlw17: superb
2019-05-27 14712, 2019
ruaok
ok, I think we all have a better understanding of next steps and more of the roadmap now, yes?
2019-05-27 14722, 2019
ruaok
if something is unclear, ask now.
2019-05-27 14725, 2019
pristine__
Yes yes.
2019-05-27 14740, 2019
alastairp
iliekcomputers: thanks for starting the script. how's it going?
2019-05-27 14745, 2019
ruaok
iliekcomputers: I'd love to hear more about your reservations about the metric/ranking for CF when you come by them.
2019-05-27 14755, 2019
ruaok waves at alastairp
2019-05-27 14709, 2019
alastairp
hi. I'm just reading backlog, and cooking too
2019-05-27 14710, 2019
aidanlw17
I'll keep that in mind. Additionally, if you guys produce a metric from the collaborative filtering it might be possible to index that with annoy as we will do with the other metrics for track-track. Is that something you want to consider?
2019-05-27 14713, 2019
iliekcomputers
ruaok: let me try to rephrase what i was saying.
2019-05-27 14743, 2019
ruaok
aidanlw17: that does sound interesting yes.
2019-05-27 14751, 2019
iliekcomputers
right now, we're trying to predict exactly how many times you would / should have listened to a particular song (say the strokes' last nite)
2019-05-27 14704, 2019
iliekcomputers
this value can range from one to tens of thousands.
2019-05-27 14708, 2019
ruaok
I am super eager to learn what comes from your project. pristine__ has done an excellent job doing that for me on the CF front.
2019-05-27 14712, 2019
iliekcomputers
so it is hard to predict.
2019-05-27 14726, 2019
ruaok
too granular?
2019-05-27 14727, 2019
iliekcomputers
when in reality, we probably do not need that number to that degree of accuracy.
2019-05-27 14735, 2019
pristine__
ruaok: thanks. Means a lot :)
2019-05-27 14739, 2019
ruaok
:)
2019-05-27 14752, 2019
iliekcomputers
a lesser range would probably work out as well (intuition, not sure)
2019-05-27 14759, 2019
ruaok
iliekcomputers: and the scale of the CF ranking? is that linear or non-linear?
2019-05-27 14716, 2019
aidanlw17
ruaok: I appreciate the excitement - I feel it too.
2019-05-27 14728, 2019
ruaok
well, mapping the giant range into something smaller is easy.
2019-05-27 14739, 2019
ruaok
premature quantization might become a problem.
2019-05-27 14742, 2019
pristine__
Yes. We can probably normalize.
2019-05-27 14759, 2019
ruaok
normalizing makes sense to me. quantizing gives me hesitation.
I see how quantizing the data might be useful for other algs down the line, but for starters we may not want to do that.
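To make the normalizing versus quantizing distinction concrete, here is a small sketch. Log scaling is only one possible normalization and is an assumption here, as are the function names, the bucket count, and the sample listen counts; nothing in this discussion fixes any of them.

```python
import math


def normalize(counts):
    """Log-scale listen counts and divide by the largest value, so that
    scores land in (0, 1] while the ordering is preserved.
    Assumes counts is non-empty and every count is at least 1."""
    top = math.log1p(max(counts.values()))
    return {track: math.log1p(n) / top for track, n in counts.items()}


def quantize(score, buckets=5):
    """Collapse a score in [0, 1] into one of `buckets` integer bins.
    Distinct scores that fall in the same bin become indistinguishable."""
    return min(int(score * buckets), buckets - 1)


# Hypothetical listen counts spanning the range mentioned above.
counts = {"a": 1, "b": 100, "c": 10000}
scores = normalize(counts)
bins = {track: quantize(score) for track, score in scores.items()}
```

Normalizing keeps every distinction between tracks, only on a smaller scale. Quantizing throws some of those distinctions away, which is the information loss that makes doing it too early a risk.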
2019-05-27 14709, 2019
ruaok
alastairp, iliekcomputers : what script is that?
2019-05-27 14713, 2019
alastairp
I'm not surprised... the original method took about 10 minutes for me to do it on a slow machine with only 4m tracks, but that blocked the whole table. this one is better
2019-05-27 14724, 2019
alastairp
ruaok: writing submission offsets to the ll table
2019-05-27 14733, 2019
ruaok
ah, yes.
2019-05-27 14745, 2019
alastairp
tomorrow we can deploy write offset on submit
2019-05-27 14748, 2019
ruaok
are submission offsets monotonically increasing numbers?
2019-05-27 14752, 2019
alastairp
yes
2019-05-27 14707, 2019
pristine__
I guess we should continue with the road map and pick on normalization sometime later.
2019-05-27 14708, 2019
ruaok
makes sense.
2019-05-27 14715, 2019
ruaok
pristine__: yes.
2019-05-27 14722, 2019
alastairp
it's the same as we're currently using in the GET endpoint
2019-05-27 14731, 2019
ruaok
once we see the scores in the report (soon, I hope!) we can get our heads around this more.
2019-05-27 14737, 2019
alastairp
uuid/low-level?n=[offset]
2019-05-27 14742, 2019
iliekcomputers
hmm.
2019-05-27 14747, 2019
pristine__
By tomorrow ruaok :)
2019-05-27 14750, 2019
iliekcomputers
we should start merging PRs soon too.
2019-05-27 14751, 2019
ruaok
wooo
2019-05-27 14758, 2019
alastairp
iliekcomputers: when are you next available?
2019-05-27 14705, 2019
pristine__
iliekcomputers: could you look at 21
2019-05-27 14707, 2019
ruaok
the stats PRs should be merged asap, IMHO.
2019-05-27 14709, 2019
iliekcomputers
alastairp: tomorrow works for me.
2019-05-27 14711, 2019
pristine__
Can*
2019-05-27 14720, 2019
pristine__
PR#21
2019-05-27 14722, 2019
ruaok
my goal for today is to look at pristine's latest PR
2019-05-27 14737, 2019
ruaok
(aside from boring nonprofit work)
2019-05-27 14748, 2019
pristine__
I will send you the link, ruaok
2019-05-27 14702, 2019
alastairp
ok, good. perhaps then we can do the next PR on this offset stuff (if we do it early in the morning perhaps we can do the last part in the evening)
2019-05-27 14712, 2019
ruaok
#26 is on my list.
2019-05-27 14716, 2019
alastairp
and also we could take a look at the docker stuff that you were finishing up