#metabrainz


      • mayhem
        yes, we need to address this, but I am not concerned that this is the only problem.
      • 2022-02-09 04005, 2022

      • lucifer
        the loop facilitates this. this query will be executed once for consecutive listens, then for listen and listen + 2, and so on.
      • 2022-02-09 04033, 2022
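A rough Python sketch of the stepped pairing being described here — for each step size idx, every listen is paired with the listen idx positions later in the same history (the function and variable names are invented for illustration, not taken from the actual query):

```python
def pairs_for_step(listens, idx):
    """Yield (mbid_a, mbid_b) for listens `idx` positions apart in play order."""
    for i in range(len(listens) - idx):
        yield listens[i], listens[i + idx]

history = ["a", "b", "a", "c"]
print(list(pairs_for_step(history, 1)))  # [('a', 'b'), ('b', 'a'), ('a', 'c')]
print(list(pairs_for_step(history, 2)))  # [('a', 'a'), ('b', 'c')]
```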

      • lucifer
        so at any one time, all listens that reach the final group by are at the same distance.
      • 2022-02-09 04019, 2022

      • mayhem needs a minute to visualize this
      • 2022-02-09 04041, 2022

      • lucifer
      • 2022-02-09 04042, 2022

      • mayhem
        hmm, that paste isn't helping me.
      • 2022-02-09 04004, 2022

      • mayhem
        the count(*) really confuses me here.
      • 2022-02-09 04012, 2022

      • mayhem
        because count(*) counts the total number of rows output by symmetric_index, no?
      • 2022-02-09 04030, 2022

      • lucifer
        its grouped by both mbids
      • 2022-02-09 04043, 2022

      • lucifer
        L58
      • 2022-02-09 04047, 2022

      • mayhem
        yes.
      • 2022-02-09 04008, 2022

      • mayhem
        that means that count(*) refers to the size of the current grouping?
      • 2022-02-09 04011, 2022

      • lucifer
        yes
      • 2022-02-09 04018, 2022

      • mayhem
        🤯
      • 2022-02-09 04047, 2022

      • mayhem
        well, ok. that might explain a few difficulties I've had with other queries. lol
      • 2022-02-09 04051, 2022

      • lucifer
        it counts the number of times (mbid1, mbid2) occurred for the current step,
      • 2022-02-09 04059, 2022
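A minimal SQLite demo (not the production query) of the point being made: COUNT(*) under a GROUP BY counts the rows of each (mbid_0, mbid_1) group, not the total number of rows in the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (mbid_0 TEXT, mbid_1 TEXT)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("a", "b"), ("a", "b"), ("a", "b"), ("b", "c")],
)
# COUNT(*) is evaluated per group, so each result row carries the size
# of its own (mbid_0, mbid_1) grouping.
rows = conn.execute(
    "SELECT mbid_0, mbid_1, COUNT(*) FROM pairs"
    " GROUP BY mbid_0, mbid_1 ORDER BY mbid_0, mbid_1"
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 1)]
```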

      • mayhem nods
      • 2022-02-09 04025, 2022

      • lucifer
        since the distance is constant for a given step size, i multiply it with the weight
      • 2022-02-09 04015, 2022

      • mayhem
        I'm still not seeing how this gives the same result as the python code.
      • 2022-02-09 04032, 2022

      • lucifer
        right so this completes 1 step.
      • 2022-02-09 04057, 2022

      • mayhem
        yes, idx = 1
      • 2022-02-09 04026, 2022

      • lucifer
        so now the process is repeated for other idx, and those have continuously decreasing weights.
      • 2022-02-09 04000, 2022

      • lucifer
        L60 appends all of these together.
      • 2022-02-09 04034, 2022

      • lucifer
        since mbid pairs can occur again in different step sizes, we need a final group by.
      • 2022-02-09 04007, 2022

      • lucifer
        at L71, that happens and the HAVING clause removes the entries below threshold.
      • 2022-02-09 04015, 2022
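The recombination being described can be sketched in Python — per-step counts are weighted, summed across steps for each mbid pair, and a HAVING-style threshold drops weak pairs (the counts, weights, and threshold here are made up for the example):

```python
from collections import Counter

per_step = {
    1: {("a", "b"): 3, ("b", "c"): 1},  # idx = 1, consecutive listens
    2: {("a", "b"): 2},                 # idx = 2, listens two apart
}
weights = {1: 1.0, 2: 0.5}  # weight decreases as the step size grows
threshold = 1.5

# final "group by": sum the weighted counts for each pair across all steps
totals = Counter()
for idx, counts in per_step.items():
    for pair, count in counts.items():
        totals[pair] += count * weights[idx]

# HAVING-style filter: keep only pairs at or above the threshold
result = {pair: score for pair, score in totals.items() if score >= threshold}
print(result)  # {('a', 'b'): 4.0}
```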

      • mayhem
        ok, I follow it now.
      • 2022-02-09 04053, 2022

      • lucifer
        the COUNT(*) thing wasn't there in the original version. i added it later to avoid OOM when running on the entire dataset.
      • 2022-02-09 04030, 2022

      • lucifer
        the original version just assigned the weight to each row based on the current idx and then everything was grouped in the final query.
      • 2022-02-09 04013, 2022

      • lucifer
        but all the recording similarity results we saw the other day had the COUNT(*) thing so I don't think it is the culprit.
      • 2022-02-09 04015, 2022

      • mayhem
        ok, I'm quite impressed with this work, but I fail to see why this should not produce the same results.
      • 2022-02-09 04042, 2022

      • mayhem
        Inserted 795,000 rows.
      • 2022-02-09 04057, 2022

      • mayhem
        Generated 13,037,431 pairs (before pruning)
      • 2022-02-09 04002, 2022

      • mayhem
        ok, lets see.
      • 2022-02-09 04019, 2022

      • mayhem
        I feel that with user_id rather than user_name, the python results have gotten worse.
      • 2022-02-09 04037, 2022

      • mayhem
        sepultura 51
      • 2022-02-09 04038, 2022

      • mayhem
        err, no.
      • 2022-02-09 04033, 2022

      • lucifer
        those look similar to me but yes that is the difference between spark and gaga so i am tempted to point a finger at it.
      • 2022-02-09 04037, 2022

      • mayhem
        wow. I'm stunned that my data went from ok to poro.
      • 2022-02-09 04038, 2022

      • mayhem
        poor.
      • 2022-02-09 04050, 2022

      • mayhem
        while making the user_name -> user_id change.
      • 2022-02-09 04055, 2022

      • lucifer
        oh an example mbid?
      • 2022-02-09 04057, 2022

      • mayhem
        which should change nothing at all!
      • 2022-02-09 04005, 2022

      • mayhem
      • 2022-02-09 04014, 2022

      • mayhem
        the usual. it got worse.
      • 2022-02-09 04044, 2022

      • mayhem
        let me recalculate the user_name version again to make sure I am not smoking something...
      • 2022-02-09 04051, 2022

      • mayhem
        well, poor expression.
      • 2022-02-09 04010, 2022

      • lucifer
        i see. the top ones remained the same but the lower ones changed?
      • 2022-02-09 04031, 2022

      • lucifer
        "user_name -> user_id change" should change nothing but ...
      • 2022-02-09 04047, 2022

      • mayhem
        not sure.
      • 2022-02-09 04000, 2022

      • mayhem
        ah, there is also the `if mbid_0 == mbid_1:`
      • 2022-02-09 04038, 2022

      • lucifer
        that should only exclude same mbids but yeah possible.
      • 2022-02-09 04057, 2022

      • lucifer
        i don't see it on the github branch though? have you pushed the changes?
      • 2022-02-09 04013, 2022

      • mayhem
        no.
      • 2022-02-09 04048, 2022

      • mayhem
        I just stashed them, since I am rerunning the user_id, no equal check code and outputting the results to a different table so we can compare.
      • 2022-02-09 04054, 2022

      • mayhem
        gonna be a moment.
      • 2022-02-09 04059, 2022

      • lucifer
        👍
      • 2022-02-09 04013, 2022

      • mayhem
        I think I like your idea of making a tiny data set and seeing exactly what comes out of both.
      • 2022-02-09 04040, 2022

      • lucifer
        yeah it'll also help figure out flaws in the thought process.
      • 2022-02-09 04019, 2022

      • mayhem
        your algorithm seems pretty good to me. nice to discover, in a real example, how to think in spark.
      • 2022-02-09 04059, 2022

      • lucifer
        :D
      • 2022-02-09 04011, 2022

      • mayhem
        I'm also falling in love with CTEs (WITH statements)
      • 2022-02-09 04025, 2022
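For reference, a tiny CTE (WITH statement) run through SQLite; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (user_id INTEGER, mbid TEXT)")
conn.executemany(
    "INSERT INTO listen VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "b"), (2, "a")],
)
# The WITH clause names an intermediate result that the main query can
# then select from, instead of nesting a subquery.
rows = conn.execute("""
    WITH listen_counts AS (
        SELECT mbid, COUNT(*) AS cnt
          FROM listen
         GROUP BY mbid
    )
    SELECT mbid FROM listen_counts WHERE cnt > 1
""").fetchall()
print(rows)  # [('a',)]
```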

      • lucifer
        oh yeah those are great!
      • 2022-02-09 04052, 2022

      • mayhem
        heh, I caught myself saying the other day: ... and insert that into whatever data store we choose...
      • 2022-02-09 04059, 2022

      • mayhem
        what a stupid thing to say.
      • 2022-02-09 04006, 2022

      • mayhem
        we always choose postgres.
      • 2022-02-09 04007, 2022

      • mayhem
        always.
      • 2022-02-09 04009, 2022

      • lucifer
        but wait till i open the timestamp PR and introduce you to the transactions mess!!
      • 2022-02-09 04039, 2022

      • mayhem
        oy
      • 2022-02-09 04049, 2022

      • lucifer
      • 2022-02-09 04034, 2022

      • mayhem
        no way. it's turning into a nightmare over this tiny detail???
      • 2022-02-09 04030, 2022

      • lucifer
        incomplete branch yet but basically in this particular scenario, listen_delete_metadata table holds the rows to be deleted from listen. We fetch the rows from it and delete stuff from listen. after that we clear up the rows in the listen_delete_metadata table.
      • 2022-02-09 04045, 2022

      • lucifer
        however between the time rows are deleted from listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted in the latter. so we would have deleted rows from metadata table without deleting the actual listens.
      • 2022-02-09 04054, 2022

      • mayhem
        but with transactional isolation that shouldn't be a big deal, or?
      • 2022-02-09 04010, 2022

      • lucifer
        the default transaction isolation is READ_COMMITTED, which is what we use
      • 2022-02-09 04056, 2022

      • lucifer
        so T1 is this bulk delete in progress; meanwhile T2, that insert, comes and inserts a row, then commits. it's now visible to the remaining statements in T1.
      • 2022-02-09 04023, 2022
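The race can be mimicked in plain Python (a toy model, not real Postgres behaviour): T1's cleanup step sees — and throws away — a delete request that T2 committed after T1 had already processed the queue:

```python
listen = {1, 2, 3}
listen_delete_metadata = [1, 2]  # listens queued for deletion

# T1: fetch the queued ids and delete those listens
queued = list(listen_delete_metadata)
listen -= set(queued)

# T2 commits a new delete request while T1 is still running; under
# READ COMMITTED, T1's later statements can see it.
listen_delete_metadata.append(3)

# T1: naive cleanup wipes the whole metadata table, losing the request
listen_delete_metadata.clear()

print(listen)                  # {3}  -- listen 3 was never deleted
print(listen_delete_metadata)  # []   -- but its delete request is gone
```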

      • lucifer
      • 2022-02-09 04037, 2022

      • mayhem
        fun, learned a new fact for the day.
      • 2022-02-09 04040, 2022

      • mayhem
        better go to bed.
      • 2022-02-09 04043, 2022

      • mayhem
        j/k
      • 2022-02-09 04046, 2022

      • lucifer
        the first line here "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
      • 2022-02-09 04031, 2022

      • lucifer
        which is what we want here.
      • 2022-02-09 04050, 2022

      • mayhem
        and that probably locks the table or so?
      • 2022-02-09 04056, 2022

      • mayhem
        or why are we not using that?
      • 2022-02-09 04014, 2022

      • mayhem
        > Applications using this level must be prepared to retry transactions due to serialization failures.
      • 2022-02-09 04015, 2022

      • lucifer
        it comes at a performance cost. i am not sure how much.
      • 2022-02-09 04016, 2022

      • mayhem
        also shitty.
      • 2022-02-09 04020, 2022

      • lucifer
        yeah
      • 2022-02-09 04027, 2022

      • mayhem
        ok, lets brainstorm ways to overcome that.
      • 2022-02-09 04038, 2022

      • lucifer
        i added the created column for that reason
      • 2022-02-09 04003, 2022

      • lucifer
      • 2022-02-09 04040, 2022

      • lucifer
        record the time before starting transaction and at every step delete only rows before it
      • 2022-02-09 04041, 2022
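A sketch of that fix with SQLite standing in for Postgres (schema and values are invented): record a cutoff before the pass starts and only clear metadata rows created before it, so a request that lands mid-pass survives for the next run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listen_delete_metadata (listen_id INTEGER, created INTEGER)"
)
conn.executemany(
    "INSERT INTO listen_delete_metadata VALUES (?, ?)",
    [(1, 100), (2, 101)],
)
cutoff = 150  # recorded before the delete pass begins

# ... the listens matching the queued rows are deleted here ...

# a new delete request lands while the pass is running
conn.execute("INSERT INTO listen_delete_metadata VALUES (3, 200)")

# cleanup only touches rows that existed when the pass started
conn.execute("DELETE FROM listen_delete_metadata WHERE created < ?", (cutoff,))
rows = conn.execute("SELECT listen_id FROM listen_delete_metadata").fetchall()
print(rows)  # [(3,)] -- the late request is kept for the next pass
```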

      • mayhem
        I've gone for a serial field myself.
      • 2022-02-09 04001, 2022

      • mayhem
        I'd ...
      • 2022-02-09 04010, 2022

      • lucifer
        oh yeah that should work too.
      • 2022-02-09 04026, 2022

      • mayhem
        faster since the logic is guaranteed 1 cpu cycle
      • 2022-02-09 04007, 2022

      • lucifer
        yup makes sense
      • 2022-02-09 04043, 2022

      • mayhem
        yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
      • 2022-02-09 04008, 2022

      • lucifer
        i'll add comments to make it easier.
      • 2022-02-09 04048, 2022

      • lucifer
        another way could have been fetch everything to python and then delete rows one by one or in batches manually. but i think that'd be slower
      • 2022-02-09 04008, 2022

      • lucifer
        it at least adds a round trip of the entire listen_delete table's data.
      • 2022-02-09 04009, 2022

      • mayhem
        agreed.
      • 2022-02-09 04018, 2022

      • mayhem
        CTEs really will reduce the amount of python I write.
      • 2022-02-09 04026, 2022

      • mayhem
        but..
      • 2022-02-09 04028, 2022

      • mayhem
        I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
      • 2022-02-09 04051, 2022

      • lucifer
        oh indeed.
      • 2022-02-09 04004, 2022

      • mayhem
        we need to learn the differences between each of the python versions and then the difference between spark and python.
      • 2022-02-09 04015, 2022

      • mayhem
        I get a feeling that we're overtraining the data.
      • 2022-02-09 04026, 2022

      • mayhem
        too much data makes it noise.
      • 2022-02-09 04031, 2022

      • lucifer
        yup makes sense
      • 2022-02-09 04044, 2022

      • mayhem
        it would be nice if we could find another constraint to add into the mix.
      • 2022-02-09 04042, 2022

      • mayhem
        I want to involve the similar users data somehow. but nothing has come from it.
      • 2022-02-09 04046, 2022

      • lucifer
        hmm, unsure of any others we could add currently but i am wondering about removing 1. these current datasets have an implicit condition that the listens should be listened to in order and within a time frame.
      • 2022-02-09 04036, 2022

      • lucifer
        whereas CF is you listened to this in any order. so how does this correlate to the CF stuff.
      • 2022-02-09 04000, 2022

      • mayhem
        interesting approach.
      • 2022-02-09 04018, 2022

      • mayhem
        the real problem is that we dont have both approaches to examine.
      • 2022-02-09 04027, 2022

      • lucifer
        indeed
      • 2022-02-09 04028, 2022

      • mayhem
        resultant data sets from both.. to examine.
      • 2022-02-09 04055, 2022

      • lucifer
        both provide different types of data and CF is really opaque about the intermediate steps.
      • 2022-02-09 04037, 2022

      • mayhem
        I think we should work to understand the variances in the current approach and get to something that we consider "the best we can do with this alg, and we understand it roughly".
      • 2022-02-09 04044, 2022

      • lucifer
        another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
      • 2022-02-09 04053, 2022

      • lucifer
        +1
      • 2022-02-09 04057, 2022

      • mayhem
        then we should implement the CF based version and compare them side by side.
      • 2022-02-09 04011, 2022

      • mayhem
        OH!
      • 2022-02-09 04022, 2022

      • mayhem
        co-learning. that was the word, no?
      • 2022-02-09 04035, 2022

      • lucifer
        transfer learning?
      • 2022-02-09 04040, 2022

      • mayhem
        we can actually see where the algs agree and combine the results.
      • 2022-02-09 04043, 2022

      • mayhem
        that is it!
      • 2022-02-09 04052, 2022

      • mayhem
        combine.
      • 2022-02-09 04019, 2022

      • lucifer
        yup makes sense. a lot of this is going to experiments and seeing what works out the best.
      • 2022-02-09 04002, 2022

      • lucifer
        kind of how picard's magic similarity weights for finding track matches work lol.
      • 2022-02-09 04018, 2022

      • mayhem
        but, in the end the two algs have different fundamental statements.
      • 2022-02-09 04030, 2022

      • mayhem
        I wonder if that code has changed much since I wrote that, lol.
      • 2022-02-09 04038, 2022

      • lucifer
        CF ?
      • 2022-02-09 04000, 2022

      • mayhem
        so, CF = if you've ever listened to a track, use how many times as input.
      • 2022-02-09 04016, 2022

      • mayhem
        in ours the tracks have to be listened to by the same person, and in proximity.
      • 2022-02-09 04040, 2022
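The contrast can be sketched roughly: CF-style input is unordered (user, track, play_count) triples, while this approach only pairs tracks the same person played close together in their listen history (purely illustrative):

```python
from collections import Counter

history = [("u1", "a"), ("u1", "b"), ("u1", "a")]  # (user, track) in play order

# CF input: unordered play counts per (user, track)
cf_input = Counter((user, track) for user, track in history)

# proximity input: adjacent plays by the same user, order preserved
proximity_pairs = [
    (t1, t2)
    for (u1, t1), (u2, t2) in zip(history, history[1:])
    if u1 == u2
]
print(dict(cf_input))    # {('u1', 'a'): 2, ('u1', 'b'): 1}
print(proximity_pairs)   # [('a', 'b'), ('b', 'a')]
```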

      • monkey has quit
      • 2022-02-09 04053, 2022

      • monkey joined the channel
      • 2022-02-09 04034, 2022

      • lucifer
        there was actually an intro paper on CF i read, TV show recommender. in that they pruned the input dataset based on the time order before feeding it to CF model.
      • 2022-02-09 04054, 2022

      • mayhem makes a note about genetic algorithms and playlist generation
      • 2022-02-09 04044, 2022

      • lucifer
        lots of cool stuff still left to try! :D
      • 2022-02-09 04000, 2022

      • mayhem
        yes.
      • 2022-02-09 04013, 2022

      • mayhem
        and remember, our current goal is "not bad!".
      • 2022-02-09 04019, 2022

      • lucifer
        indeed.
      • 2022-02-09 04034, 2022

      • lucifer
        i am going to bed currently but will work on building the small dataset tomorrow. nn!
      • 2022-02-09 04039, 2022

      • mayhem
        and once we have a full chain of everything, then we can work to improve. and hopefully we will get help
      • 2022-02-09 04048, 2022

      • lucifer
        +1
      • 2022-02-09 04051, 2022

      • mayhem
        nn, I should get off the computer too.
      • 2022-02-09 04055, 2022

      • mayhem
        sleep well!
      • 2022-02-09 04002, 2022

      • lucifer
        you too
      • 2022-02-09 04037, 2022

      • Mineo has quit