in order to keep the data in the cluster fresh, iliekcomputers is going to work on incremental LB data dumps
2019-06-12 16304, 2019
pristine__
rdswift: thanks
2019-06-12 16317, 2019
ruaok
rdswift: that reminds me, I need to respond to an old mesg of yours.
2019-06-12 16336, 2019
ruaok
pristine__: the idea is that we can wake up the cluster at any time and then:
2019-06-12 16336, 2019
pristine__
incremental data dumps, what will that be like?
2019-06-12 16351, 2019
ruaok
1. load incremental data dumps that have been produced since the cluster last woke.
2019-06-12 16314, 2019
ruaok
2. calculate whatever we need to: stats, train models, run CF models
2019-06-12 16318, 2019
rdswift doesn't know what response that might be.
2019-06-12 16325, 2019
ruaok
3. Shut down the cluster
2019-06-12 16342, 2019
ruaok
which basically means that you do not need to worry about data freshness right now.
2019-06-12 16359, 2019
ruaok
that's something that iliekcomputers and I will work on.
2019-06-12 16311, 2019
pristine__
Okay. I just have too many thoughts whilst working. Lol
2019-06-12 16329, 2019
ruaok
and effectively we just need to create scripts that carry out a task once they are called.
2019-06-12 16337, 2019
ruaok
doesn't matter when they are called.
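[Editor's sketch] A minimal illustration of the "run whenever it is called" idea: the script only needs to remember the last incremental dump it loaded, so waking up late just means more dumps to catch up on. The integer dump ids and the `load` and `compute` callbacks are hypothetical placeholders, not anything named in this conversation.

```python
def pending_dumps(available_dump_ids, last_loaded_id):
    """Return the ids of dumps produced after the last one we loaded, oldest first."""
    return sorted(i for i in available_dump_ids if i > last_loaded_id)


def run_once(available_dump_ids, last_loaded_id, load, compute):
    """Catch up on missed dumps, run the calculations, return the new bookmark.

    `load` and `compute` are caller-supplied functions (assumed for this sketch).
    """
    for dump_id in pending_dumps(available_dump_ids, last_loaded_id):
        load(dump_id)              # step 1: load each dump we have not seen yet
        last_loaded_id = dump_id
    compute()                      # step 2: stats, model training, and so on
    return last_loaded_id          # step 3 (shutting down) is left to the caller


if __name__ == "__main__":
    loaded = []
    bookmark = run_once([1, 2, 3, 4, 5], 3, loaded.append, lambda: None)
    print(loaded, bookmark)
```

Because the bookmark is the only state, running the script twice in a row is harmless: the second run finds nothing pending.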
2019-06-12 16343, 2019
ruaok
pristine__: good thoughts too. keep bringing them up.
2019-06-12 16302, 2019
ruaok
rdswift: > ruaok, pristine__: I just had another thought regarding identifying artist-artist affinity. Similar to ruaok's number of times each artist pair appears on the same compilation album, how about the number of times each artist pair appears in a user's "owned music" collection? Chances are they would only own both if they actually liked both (or at least the tracks or albums on which they appear).
2019-06-12 16304, 2019
pristine__
Yeah, we should have independent scripts for that.
2019-06-12 16326, 2019
ruaok
rdswift: yes, that is also a good source of data.
2019-06-12 16349, 2019
ruaok
however, I fear that there isn't much data AND we would need to get users' permission to "process" them as per GDPR.
2019-06-12 16309, 2019
ruaok
which means that it isn't an easy thing to do that will likely drastically improve the data we have.
2019-06-12 16321, 2019
pristine__
I was just thinking, how are we gonna keep our AAR fresh and updated
2019-06-12 16328, 2019
rdswift
No response required. I was just brainstorming in case something triggered a better idea. Thanks though.
2019-06-12 16331, 2019
ruaok
I'm not saying we shouldn't do it, but we have lots of low-hanging fruit first.
2019-06-12 16339, 2019
pristine__
as more releases/recordings come out
2019-06-12 16346, 2019
ruaok
pristine__: that is nearly done.
2019-06-12 16305, 2019
ruaok
I've got a little more work to do, but AAR can re-run on a weekly basis.
2019-06-12 16311, 2019
ruaok
1. calculate a new table.
2019-06-12 16326, 2019
pristine__
and it may happen that an artist changes its affinity to other artists in time
2019-06-12 16328, 2019
pristine__
wow
2019-06-12 16328, 2019
ruaok
2. in a transaction: drop old table, rename new table
2019-06-12 16331, 2019
ruaok
3. commit
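[Editor's sketch] The three steps above (build a new table, then drop the old one and rename the new one inside a single transaction, then commit) can be illustrated as follows. SQLite is used only so the sketch runs standalone; the chat does not say which database is involved, and the table and column names here are hypothetical.

```python
import sqlite3


def swap_in_new_table(conn, live="artist_artist_relations",
                      staged="artist_artist_relations_new"):
    """Replace `live` with `staged`; readers see either the old or the new table."""
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {live}")
        conn.execute(f"ALTER TABLE {staged} RENAME TO {live}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")   # on failure the old table is left untouched
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN/COMMIT statements above control the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE artist_artist_relations (a TEXT, b TEXT, pair_count INTEGER)")
conn.execute("INSERT INTO artist_artist_relations VALUES ('old-a', 'old-b', 3)")

# Step 1: calculate the new table alongside the live one.
conn.execute("CREATE TABLE artist_artist_relations_new (a TEXT, b TEXT, pair_count INTEGER)")
conn.execute("INSERT INTO artist_artist_relations_new VALUES ('new-a', 'new-b', 5)")

# Steps 2 and 3: swap and commit.
swap_in_new_table(conn)
rows = conn.execute("SELECT a, b, pair_count FROM artist_artist_relations").fetchall()
print(rows)
```

The slow part (calculating the new table) happens outside the transaction, so the swap itself is quick.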
2019-06-12 16355, 2019
ruaok
pristine__: yes, it will. but those changes are going to move so slowly that weekly updates are quite sufficient.
2019-06-12 16318, 2019
ruaok
right now I am moving fast and trying to build stuff that allows you to continue.
2019-06-12 16331, 2019
pristine__
I will someday try to understand the code for AAR. I was reading it one day and got stuck, but now I don't remember where.
2019-06-12 16337, 2019
pristine__
weekly sounds good to me
2019-06-12 16342, 2019
ruaok
towards the end of the summer both you and I will need to spend time "finishing" things so that they are ready for deployment.
2019-06-12 16352, 2019
ruaok
I can explain it.
2019-06-12 16358, 2019
ruaok
it is actually fairly simple, really.
2019-06-12 16314, 2019
pristine__
yeah, mentor working as much as the student
2019-06-12 16316, 2019
pristine__
<3
2019-06-12 16334, 2019
pristine__
thanks :)
2019-06-12 16337, 2019
ruaok
first it runs a query that fetches the artists that are on a release and returns release/artist pairs.
2019-06-12 16358, 2019
pristine__
okay
2019-06-12 16304, 2019
ruaok
then in memory the python script creates a dict with artist-artist MBID pairs as the key.
2019-06-12 16325, 2019
ruaok
every time that pair is encountered that count is incremented.
2019-06-12 16342, 2019
ruaok
that's really it.
2019-06-12 16358, 2019
ruaok
the rest is the overhead to flush the data to a table, dropping counts less than 3.
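[Editor's sketch] A minimal version of the counting step as described: take release/artist rows, count how often each artist pair shares a release, and drop pairs seen fewer than 3 times. This is an illustration under assumptions, not the actual AAR script; the function name, the row shape, and the short placeholder ids standing in for MBIDs are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MIN_COUNT = 3  # pairs with a count below this are dropped


def count_artist_pairs(release_artist_rows):
    """Count co-occurrences of artist pairs across releases.

    `release_artist_rows` is an iterable of (release_id, artist_id) tuples,
    standing in for the result of the query.
    """
    artists_by_release = defaultdict(set)
    for release_id, artist_id in release_artist_rows:
        artists_by_release[release_id].add(artist_id)

    pair_counts = defaultdict(int)  # missing pairs start at 0
    for artists in artists_by_release.values():
        # Sorting gives each pair one canonical key, so (A, B) and (B, A) match.
        # A release with a single artist produces no pairs at all.
        for pair in combinations(sorted(artists), 2):
            pair_counts[pair] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= MIN_COUNT}


rows_in = [
    ("r1", "A"), ("r1", "B"),
    ("r2", "A"), ("r2", "B"), ("r2", "C"),
    ("r3", "B"), ("r3", "A"),
    ("r4", "D"),
]
print(count_artist_pairs(rows_in))  # {('A', 'B'): 3}
```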
2019-06-12 16306, 2019
ruaok
... dropping counts <3
2019-06-12 16307, 2019
ruaok
lol.
2019-06-12 16321, 2019
pristine__
default dict val is 0? to account for single artists in artist_credit?
2019-06-12 16328, 2019
ruaok
I think I've been staring at the screen for too long today.
2019-06-12 16337, 2019
pristine__
okay.
2019-06-12 16341, 2019
pristine__
eyes pain?
2019-06-12 16342, 2019
ruaok
implied default value is 0, yes.
2019-06-12 16346, 2019
ruaok
no, being silly.
2019-06-12 16352, 2019
ruaok
brain can't really focus anymore.
2019-06-12 16357, 2019
pristine__
lol
2019-06-12 16311, 2019
pristine__
Cool then, do we have anything else to discuss?
2019-06-12 16332, 2019
pristine__
I will take care of new artists, empty dataframe, towards the end of month, no?
2019-06-12 16338, 2019
pristine__
new users*
2019-06-12 16341, 2019
ruaok
I don't. I just need to put my head down and work on the MSB mapping.
2019-06-12 16352, 2019
ruaok
as you make progress.
2019-06-12 16340, 2019
pristine__
yeah. New users thing should be handled delicately. I had many thoughts on it today.