#metabrainz

      • mayhem
        yes, we need to address this, but I am not concerned that this is the only problem.
      • lucifer
        the loop facilitates this. this query will be executed once for consecutive listens, then for listen and listen + 2, and so on.
      • so at any one time, all listens that reach the final group by are at the same distance.
      • mayhem needs a minute to visualize this
      • mayhem
        hmm, that paste isn't helping me.
      • the count(*) really confuses me here.
      • because count(*) counts the total number of rows output by symmetric_index, no?
      • lucifer
        it's grouped by both mbids
      • L58
      • mayhem
        yes.
      • that means that count(*) refers to the size of the current grouping?
      • lucifer
        yes
      • mayhem
        🤯
      • well, ok. that might explain a few difficulties I've had with other queries. lol
      • lucifer
        it counts the number of times (mbid1, mbid2) occurred for the current step,
      • mayhem nods
      • since the distance is constant for a given step size, i multiply it with the weight
      • mayhem
        I'm still not seeing how this gives the same result as the python code.
      • lucifer
        right so this completes 1 step.
      • mayhem
        yes, idx = 1
      • lucifer
        so now the process is repeated for other idx, and those have continuously decreasing weights.
      • L60 appends all of these together.
      • since mbid pairs can occur again in different step sizes, we need a final group by.
      • at L71, that happens and the HAVING clause removes the entries below the threshold.
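The stepwise scheme described above (a per-step GROUP BY, COUNT(*) times a decreasing weight, then a final GROUP BY with a HAVING threshold) can be sketched in plain Python. This is a hypothetical reconstruction from the conversation, not the actual Spark query; the function name, the weight formula, and the data shapes are all made up for illustration.

```python
from collections import Counter

def similar_recordings(listens_by_user, max_idx=2, threshold=2.0):
    """Hypothetical sketch of the stepwise similarity aggregation.

    For each step size idx, pair every listen with the listen idx
    positions later in the same user's history, group the resulting
    (mbid, mbid) pairs (the per-step GROUP BY), and score each group
    as COUNT(*) * weight, with a weight that decreases as idx grows.
    The final Counter plays the role of the last GROUP BY, and the
    threshold filter stands in for the HAVING clause.
    """
    scores = Counter()
    for mbids in listens_by_user.values():
        for idx in range(1, max_idx + 1):
            weight = 1.0 / idx  # made-up decreasing weight per step
            step_counts = Counter()
            for a, b in zip(mbids, mbids[idx:]):
                if a == b:  # the `if mbid_0 == mbid_1` exclusion
                    continue
                step_counts[tuple(sorted((a, b)))] += 1
            for pair, n in step_counts.items():
                scores[pair] += n * weight  # COUNT(*) * weight
    # final GROUP BY + HAVING: keep only pairs at or above the threshold
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

For example, `similar_recordings({"u1": ["a", "b", "a", "b"]})` scores the pair `("a", "b")` at 3.0 from step size 1 alone; step size 2 contributes nothing because both of its pairs are same-mbid.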
      • mayhem
        ok, I follow it now.
      • lucifer
        the COUNT(*) thing wasn't there in the original version. i added it later to avoid OOM when running on entire dataset.
      • the original version just assigned the weight to each row based on the current idx and then everything was grouped in the final query.
      • but all the recording similarity results we saw the other day had the COUNT(*) thing so I don't think it is the culprit.
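The equivalence being described is easy to check: assigning a constant weight to every row and summing is the same as counting the rows once and multiplying, so the early COUNT(*) aggregation changes memory use but not results. A trivial illustration with made-up numbers:

```python
# Three occurrences of the same (mbid1, mbid2) pair at one step size,
# with a made-up weight for that step.
rows = [("a", "b"), ("a", "b"), ("a", "b")]
weight = 0.5

per_row_total = sum(weight for _ in rows)   # original: weight on each row
aggregated_total = len(rows) * weight       # later: COUNT(*) * weight

assert per_row_total == aggregated_total == 1.5
```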
      • mayhem
        ok, I'm quite impressed with this work, but I fail to see why this should not produce the same results.
      • Inserted 795,000 rows.
      • Generated 13,037,431 pairs (before pruning)
      • ok, lets see.
      • I feel that with user_id rather than user_name, the python results have gotten worse.
      • sepultura 51
      • err, no.
      • lucifer
        those look similar to me but yes that is the difference between spark and gaga so i am tempted to point a finger at it.
      • mayhem
        wow. I'm stunned that my data went from ok to poor.
      • while making the user_name -> user_id change.
      • lucifer
        oh an example mbid?
      • mayhem
        that should change nothing at all!
      • the usual. it got worse.
      • let me recalculate the user_name version again to make sure I am not smoking something...
      • well, poor expression.
      • lucifer
        i see. the top ones remained the same but the lower ones changed?
      • "user_name -> user_id change" should change nothing but ...
      • mayhem
        not sure.
      • ah, there is also the `if mbid_0 == mbid_1:`
      • lucifer
        that should only exclude same mbids but yeah possible.
      • i don't see it on the github branch though? have you pushed the changes?
      • mayhem
        no.
      • I just stashed them, since I am rerunning the user_id, no-equal-check code and outputting the results to a different table so we can compare.
      • gonna be a moment.
      • lucifer
        👍
      • mayhem
        I think I like your idea of making a tiny data set and seeing exactly what comes out of both.
      • lucifer
        yeah it'll also help figure out flaws in the thought process.
      • mayhem
        your algorithm seems pretty good to me. nice to discover, in a real example, how to think in spark.
      • lucifer
        :D
      • mayhem
        I'm also falling in love with CTEs (WITH statements)
      • lucifer
        oh yeah those are great!
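For anyone following along who hasn't used them: a CTE just names an intermediate result so it can be referenced like a table, instead of nesting subqueries. A minimal, self-contained illustration (sqlite is used here only because it ships with Python; the table and column names are made up, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (user_id INTEGER, mbid TEXT)")
conn.executemany("INSERT INTO listen VALUES (?, ?)",
                 [(1, "a"), (1, "a"), (1, "b"), (2, "a")])

# WITH names the grouped subresult; the outer query then filters it
# by the computed count, as if it were an ordinary table.
rows = conn.execute("""
    WITH listen_counts AS (
        SELECT mbid, COUNT(*) AS cnt
          FROM listen
      GROUP BY mbid
    )
    SELECT mbid FROM listen_counts WHERE cnt > 1
""").fetchall()
```

Here "a" was listened to three times across users and "b" once, so `rows` comes back as `[("a",)]`.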
      • mayhem
        heh, I caught myself saying the other day: ... and insert that into whatever data store we choose...
      • what a stupid thing to say.
      • we always choose postgres.
      • always.
      • lucifer
        but wait till i open the timestamp PR and introduce you to the transactions mess!!
      • mayhem
        oy
      • lucifer
      • mayhem
        no way. it's turning into a nightmare over this tiny detail???
      • lucifer
        incomplete branch yet but basically in this particular scenario, listen_delete_metadata table holds the rows to be deleted from listen. We fetch the rows from it and delete stuff from listen. after that we clear up the rows in the listen_delete_metadata table.
      • however between the time rows are deleted from listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted in the latter. so we would have deleted rows from metadata table without deleting the actual listens.
      • mayhem
        but with transactional isolation that shouldn't be a big deal, or?
      • lucifer
        the default transaction isolation is READ_COMMITTED which we use
      • so T1 has this bulk delete in process; meanwhile T2, that insert, comes along and inserts a row, then commits. it's now visible to the remaining statements in T1.
      • mayhem
        fun, learned a new fact for the day.
      • better go to bed.
      • j/k
      • lucifer
        the first line here "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
      • which is what we want here.
      • mayhem
        and that probably locks the table or so?
      • or why are we not using that?
      • > Applications using this level must be prepared to retry transactions due to serialization failures.
      • lucifer
        it comes at a performance cost. i am not sure how much.
      • mayhem
        also shitty.
      • lucifer
        yeah
      • mayhem
        ok, lets brainstorm ways to overcome that.
      • lucifer
        i added the created column for that reason
      • record the time before starting transaction and at every step delete only rows before it
      • mayhem
        I've gone for a serial field myself.
      • I'd ...
      • lucifer
        oh yeah that should work too.
      • mayhem
        faster since the logic is guaranteed 1 cpu cycle
      • lucifer
        yup makes sense
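The snapshot guard described above (whether via a `created` timestamp or a serial id) can be sketched with sqlite standing in for Postgres. The table shapes here are simplified guesses, not the real schema; the point is only that both the listen delete and the metadata clean-up are restricted to rows at or below the snapshot, so a row inserted mid-run survives for the next run instead of being silently dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (listened_at INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE listen_delete_metadata (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    listened_at INTEGER)""")
conn.executemany("INSERT INTO listen VALUES (?)", [(10,), (20,), (30,)])
conn.executemany("INSERT INTO listen_delete_metadata (listened_at) VALUES (?)",
                 [(10,), (20,)])

# Record the high-water mark before doing any deletes.
(snapshot,) = conn.execute(
    "SELECT MAX(id) FROM listen_delete_metadata").fetchone()

# ...a concurrent transaction files a new delete request mid-run...
conn.execute("INSERT INTO listen_delete_metadata (listened_at) VALUES (30)")

# Both steps honour the snapshot, so the new request is untouched.
conn.execute("""DELETE FROM listen WHERE listened_at IN
                (SELECT listened_at FROM listen_delete_metadata
                  WHERE id <= ?)""", (snapshot,))
conn.execute("DELETE FROM listen_delete_metadata WHERE id <= ?", (snapshot,))

remaining_listens = [r[0] for r in conn.execute(
    "SELECT listened_at FROM listen ORDER BY listened_at")]
pending = conn.execute(
    "SELECT COUNT(*) FROM listen_delete_metadata").fetchone()[0]
```

After the run, the listen inserted into the queue mid-run (30) is still present in both tables, ready for the next pass.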
      • mayhem
        yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
      • lucifer
        i'll add comments to make it easier.
      • another way could have been to fetch everything into python and then delete rows one by one or in batches manually. but i think that'd be slower
      • it at least adds a round trip of the entire listen_delete table data.
      • mayhem
        agreed.
      • CTEs really will reduce the amount of python I write.
      • but..
      • I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
      • lucifer
        oh indeed.
      • mayhem
        we need to learn the differences between each of the python versions and then the difference between spark and python.
      • I get a feeling that we're overtraining the data.
      • too much data makes it noise.
      • lucifer
        yup makes sense
      • mayhem
        it would be nice if we could find another constraint to add into the mix.
      • I want to involve the similar users data somehow. but nothing has come from it.
      • lucifer
        hmm, unsure of any others we could add currently but i am wondering about removing 1. these current datasets have an implicit condition that the listens should be played in order and within a time frame.
      • whereas with CF it's: you listened to this, in any order. so how does this correlate to the CF stuff?
      • mayhem
        interesting approach.
      • the real problem is that we dont have both approaches to examine.
      • lucifer
        indeed
      • mayhem
        resultant data sets from both.. to examine.
      • lucifer
        both provide different types of data and CF is really opaque about the intermediate steps.
      • mayhem
        I think we should work to understand the variances in the current approach and get to something that we consider "best we can do with this alg and we understand it roughly".
      • lucifer
        another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
      • +1
      • mayhem
        then we should implement the CF based version and compare them side by side.
      • OH!
      • co-learning. that was the word, no?
      • lucifer
        transfer learning?
      • mayhem
        we can actually see where the algs agree and combine the results.
      • that is it!
      • combine.
      • lucifer
        yup makes sense. a lot of this is going to be experiments and seeing what works out the best.
      • kind of how picard's magic similarity weights for finding track matches work lol.
      • mayhem
        but, in the end the two algs have different fundamental statements.
      • I wonder if that code has changed much since I wrote that, lol.
      • lucifer
        CF ?
      • mayhem
        so, CF = if you've ever listened to a track, use how many times as input.
      • in ours the tracks have to be listened by the same person, but in proximity.
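The two input shapes contrasted here can be made concrete with one toy listen history (hypothetical data; a real CF implementation builds a full user-by-item matrix, only the counting idea is shown):

```python
from collections import Counter

history = ["a", "b", "a", "c"]  # one user's listens, in play order

# CF input: global playcounts per user; order and proximity ignored.
cf_input = Counter(history)

# Proximity input: only adjacent (step size 1) pairs from the same
# user's history, with same-mbid pairs excluded as in the query.
proximity_pairs = Counter(
    tuple(sorted(pair)) for pair in zip(history, history[1:])
    if pair[0] != pair[1]
)
```

For this history, CF sees `{"a": 2, "b": 1, "c": 1}` with no notion of order, while the proximity input sees `("a", "b")` twice and `("a", "c")` once but never pairs the two "a" listens with each other.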
      • monkey has quit
      • monkey joined the channel
      • lucifer
        there was actually an intro paper on CF i read, a TV show recommender. in that, they pruned the input dataset based on time order before feeding it to the CF model.
      • mayhem makes a note about genetic algorithms and playlist generation
      • lots of cool stuff still left to try! :D
      • mayhem
        yes.
      • and remember, our current goal is "not bad!".
      • lucifer
        indeed.
      • i am going to bed currently but will work on building the small dataset tomorrow. nn!
      • mayhem
        and once we have a full chain of everything, then we can work to improve. and hopefully will get help
      • lucifer
        +1
      • mayhem
        nn, I should get off the computer too.
      • sleep well!
      • lucifer
        you too
      • Mineo has quit