in order to keep the data in the cluster fresh, iliekcomputers is going to work on incremental LB data dumps
pristine__
rdswift: thanks
ruaok
rdswift: that reminds me, I need to respond to an old message of yours.
pristine__: the idea is that we can wake up the cluster at any time and then:
pristine__
incremental data dumps, what will that be like?
ruaok
1. load incremental data dumps that have been produced since the cluster last woke.
2. calculate whatever we need to. stats: train models, run CF models
rdswift doesn't know what response that might be.
3. Shut down the cluster
which basically means that you do not need to worry about data freshness right now.
that's something that iliekcomputers and I will work on.
pristine__
Okay. I just have too many thoughts whilst working. Lol
ruaok
and effectively we just need to create scripts that carry out a task once they are called.
doesn't matter when they are called.
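A minimal sketch of that "callable at any time" pattern, assuming a hypothetical checkpoint file and loader; none of these names are actual ListenBrainz code:

```python
import json
from pathlib import Path

STATE_FILE = Path("last_dump.json")  # hypothetical checkpoint location

def load_checkpoint():
    """Return the id of the last dump we imported, or 0 if none yet."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_dump_id"]
    return 0

def save_checkpoint(dump_id):
    STATE_FILE.write_text(json.dumps({"last_dump_id": dump_id}))

def run_incremental_import(available_dumps, import_dump):
    """Import every dump newer than the checkpoint, in order.
    Safe to call whenever the cluster wakes up; no-op if already current."""
    last = load_checkpoint()
    for dump_id in sorted(d for d in available_dumps if d > last):
        import_dump(dump_id)
        save_checkpoint(dump_id)  # advance only after a successful import
```

Because the checkpoint, not the clock, decides what gets loaded, it doesn't matter when the script runs.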
pristine__: good thoughts too. keep bringing them up.
rdswift: > ruaok, pristine__: I just had another thought regarding identifying artist-artist affinity. Similar to ruaok's number of times each artist pair appears on the same compilation album, how about the number of times each artist pair appears in a user's "owned music" collection? Chances are they would only own both if they actually liked both (or at least the tracks or albums on which they appear).
pristine__
Yeah, we should have independent scripts for that.
ruaok
rdswift: yes, that is also a good source of data.
however, I fear that there isn't much data AND we would need to get users' permission to "process" it as per GDPR.
which means it isn't an easy win that would drastically improve the data we have.
pristine__
I was just thinking, how are we gonna keep our AAR fresh and updated
rdswift
No response required. I was just brainstorming in case something triggered a better idea. Thanks though.
ruaok
I'm not saying we shouldn't do it, but we have lots of low-hanging fruit to tackle first.
pristine__
as more releases/recordings come out
ruaok
pristine__: that is nearly done.
I've got a little more work to do, but AAR can re-run on a weekly basis.
1. calculate a new table.
pristine__
and it may happen that an artist changes its affinity to other artists over time
wow
ruaok
2. in a transaction: drop old table, rename new table
3. commit
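The three steps above could be sketched like this. It's an illustration only: the table names are made up, sqlite stands in for whatever database LB actually uses, and the connection is assumed to be in autocommit mode so the explicit BEGIN controls the transaction:

```python
import sqlite3

def refresh_aar_table(conn):
    """Weekly AAR refresh: build a new table, then swap it in atomically.
    Table and column names are illustrative, not the real LB schema.
    Assumes conn is in autocommit mode (isolation_level=None)."""
    cur = conn.cursor()
    # 1. calculate a new table alongside the live one
    cur.execute("CREATE TABLE aar_new (artist_0 TEXT, artist_1 TEXT, count INTEGER)")
    # ... populate aar_new from fresh release/artist data here ...
    # 2. in a transaction: drop the old table, rename the new table
    cur.execute("BEGIN")
    cur.execute("DROP TABLE aar")
    cur.execute("ALTER TABLE aar_new RENAME TO aar")
    # 3. commit -- readers see either the old table or the new one, never neither
    cur.execute("COMMIT")
```

The point of doing the drop and rename inside one transaction is that queries against the live table name never observe a missing table mid-swap.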
pristine__: yes, it will. but those changes are going to move so slowly that weekly updates are quite sufficient.
right now I am moving fast and trying to build stuff that allows you to continue.
pristine__
I will someday try to understand the code for AAR. I was reading it one day and got stuck, but now I don't remember where.
weekly sounds good to me
ruaok
towards the end of the summer both you and I will need to spend time "finishing" things so that they are ready for deployment.
I can explain it.
it is actually fairly simple, really.
pristine__
yeah, mentor working as much as the student
<3
thanks :)
ruaok
first it runs a query that fetches the artists on each release and returns release/artist pairs.
pristine__
okay
ruaok
then, in memory, the python script creates a dict with artist-artist MBID pairs as the key.
every time that pair is encountered, the count is incremented.
that's really it.
the rest is the overhead to flush the data to a table, dropping counts less than 3.
... dropping counts <3
lol.
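The counting ruaok describes can be sketched as follows, assuming the release/artist pairs come out of the query described above (function and variable names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def count_artist_pairs(release_artists, min_count=3):
    """release_artists: iterable of (release_mbid, [artist_mbid, ...]) rows,
    as returned by the query that fetches the artists on each release.
    Returns {(artist_a, artist_b): count}, dropping counts below min_count."""
    counts = defaultdict(int)  # implied default value is 0
    for _release, artists in release_artists:
        # every unordered artist pair on the same release counts once;
        # sorting makes (a, b) and (b, a) the same key
        for pair in combinations(sorted(set(artists)), 2):
            counts[pair] += 1
    # the flush step drops counts < 3 before writing to the table
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

Releases with a single artist produce no pairs, and the defaultdict means a pair seen for the first time starts from 0 before being incremented.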
pristine__
default dict val is 0? to account for single artists in artist_credit?
ruaok
I think I've been staring at the screen for too long today.
pristine__
okay.
eyes pain?
ruaok
implied default value is 0, yes.
no, being silly.
brain can't really focus anymore.
pristine__
lol
Cool then, do we have anything else to discuss?
I will take care of new artists, empty dataframe, towards the end of the month, no?
new users*
ruaok
I don't. I just need to put my head down and work on the MSB mapping.
as you make progress.
pristine__
yeah. New users thing should be handled delicately. I had many thoughts on it today.