no way. it's turning into a nightmare over this tiny detail???
2022-02-09 04030, 2022
lucifer
the branch is incomplete yet, but basically in this particular scenario the listen_delete_metadata table holds the rows to be deleted from listen. we fetch the rows from it and delete the corresponding listens. after that we clear up the rows in the listen_delete_metadata table.
2022-02-09 04045, 2022
lucifer
however, between the time rows are deleted from the listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted into the latter. so we would have deleted rows from the metadata table without deleting the actual listens.
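A minimal sketch of that race, using an in-memory sqlite database as a stand-in for Postgres (table and column names here are simplified, not the actual ListenBrainz schema):

```python
import sqlite3

# Illustrative stand-ins for the real tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE listen (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE listen_delete_metadata (listen_id INTEGER)")

# Three listens exist; two are queued for deletion.
db.executemany("INSERT INTO listen (id) VALUES (?)", [(1,), (2,), (3,)])
db.executemany("INSERT INTO listen_delete_metadata VALUES (?)", [(1,), (2,)])

# Step 1: delete the queued listens.
queued = [r[0] for r in db.execute("SELECT listen_id FROM listen_delete_metadata")]
db.executemany("DELETE FROM listen WHERE id = ?", [(i,) for i in queued])

# Meanwhile, a concurrent transaction queues listen 3 and commits.
db.execute("INSERT INTO listen_delete_metadata VALUES (3)")

# Step 2: naive cleanup wipes the whole metadata table --
# the request to delete listen 3 is lost, and listen 3 survives.
db.execute("DELETE FROM listen_delete_metadata")

remaining = [r[0] for r in db.execute("SELECT id FROM listen")]
print(remaining)  # [3] is still present, with no record that it should go
```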
2022-02-09 04054, 2022
mayhem
but with transactional isolation that shouldn't be a big deal, or?
2022-02-09 04010, 2022
lucifer
the default transaction isolation level is READ COMMITTED, which is what we use
2022-02-09 04056, 2022
lucifer
so T1 is this bulk delete in progress; meanwhile T2, that insert, comes along, inserts a row, and commits. it's now visible to the remaining statements in T1.
the first line here: "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
2022-02-09 04031, 2022
lucifer
which is what we want here.
2022-02-09 04050, 2022
mayhem
and that probably locks the table or so?
2022-02-09 04056, 2022
mayhem
or why are we not using that?
2022-02-09 04014, 2022
mayhem
> Applications using this level must be prepared to retry transactions due to serialization failures.
2022-02-09 04015, 2022
lucifer
it comes at a performance cost. i am not sure how much.
record the time before starting the transaction, and at every step delete only rows from before that time
2022-02-09 04041, 2022
mayhem
I've gone for a serial field myself.
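The serial-field idea (and the timestamp variant above it) can be sketched like this: note the highest id in the queue table before deleting anything, then clean up only rows at or below that id, so concurrently inserted requests survive for the next run. Again sqlite with illustrative names, not the real schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE listen (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE listen_delete_metadata "
           "(id INTEGER PRIMARY KEY AUTOINCREMENT, listen_id INTEGER)")

db.executemany("INSERT INTO listen (id) VALUES (?)", [(1,), (2,), (3,)])
db.executemany("INSERT INTO listen_delete_metadata (listen_id) VALUES (?)",
               [(1,), (2,)])

# Record the highest serial id before starting the delete.
max_id = db.execute("SELECT MAX(id) FROM listen_delete_metadata").fetchone()[0]

# Delete only the listens referenced by rows up to that id.
queued = [r[0] for r in db.execute(
    "SELECT listen_id FROM listen_delete_metadata WHERE id <= ?", (max_id,))]
db.executemany("DELETE FROM listen WHERE id = ?", [(i,) for i in queued])

# A concurrent insert lands before cleanup...
db.execute("INSERT INTO listen_delete_metadata (listen_id) VALUES (3)")

# ...but cleanup only removes rows at or below the recorded id,
# so the new request is kept for the next run instead of being lost.
db.execute("DELETE FROM listen_delete_metadata WHERE id <= ?", (max_id,))

pending = [r[0] for r in db.execute("SELECT listen_id FROM listen_delete_metadata")]
print(pending)  # [3] -- still queued, nothing lost
```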
2022-02-09 04001, 2022
mayhem
I'd ...
2022-02-09 04010, 2022
lucifer
oh yeah that should work too.
2022-02-09 04026, 2022
mayhem
faster since the logic is guaranteed 1 cpu cycle
2022-02-09 04007, 2022
lucifer
yup makes sense
2022-02-09 04043, 2022
mayhem
yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
2022-02-09 04008, 2022
lucifer
i'll add comments to make it easier.
2022-02-09 04048, 2022
lucifer
another way could have been to fetch everything into python and then delete rows one by one or in batches manually. but i think that'd be slower
2022-02-09 04008, 2022
lucifer
it at least adds a round trip of the entire listen_delete table's data.
2022-02-09 04009, 2022
mayhem
agreed.
2022-02-09 04018, 2022
mayhem
CTEs really will reduce the amount of python I write.
2022-02-09 04026, 2022
mayhem
but..
2022-02-09 04028, 2022
mayhem
I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
2022-02-09 04051, 2022
lucifer
oh indeed.
2022-02-09 04004, 2022
mayhem
we need to learn the differences between each of the python versions and then the difference between spark and python.
2022-02-09 04015, 2022
mayhem
I get a feeling that we're overtraining the data.
2022-02-09 04026, 2022
mayhem
too much data makes it noise.
2022-02-09 04031, 2022
lucifer
yup makes sense
2022-02-09 04044, 2022
mayhem
it would be nice if we could find another constraint to add into the mix.
2022-02-09 04042, 2022
mayhem
I want to involve the similar users data somehow. but nothing has come from it.
2022-02-09 04046, 2022
lucifer
hmm, unsure of any others we could add currently, but i am wondering about removing one. these current datasets have an implicit condition that the listens should be played in order and within a time frame.
2022-02-09 04036, 2022
lucifer
whereas CF is: you listened to this, in any order. so how does this correlate to the CF stuff?
2022-02-09 04000, 2022
mayhem
interesting approach.
2022-02-09 04018, 2022
mayhem
the real problem is that we don't have both approaches to examine.
2022-02-09 04027, 2022
lucifer
indeed
2022-02-09 04028, 2022
mayhem
resultant data sets from both.. to examine.
2022-02-09 04055, 2022
lucifer
both provide different types of data and CF is really opaque about the intermediate steps.
2022-02-09 04037, 2022
mayhem
I think we should work to understand the variances in the current approach and get to something that we consider "best we can do with this alg, and we understand it roughly".
2022-02-09 04044, 2022
lucifer
another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
2022-02-09 04053, 2022
lucifer
+1
2022-02-09 04057, 2022
mayhem
then we should implement the CF based version and compare them side by side.
2022-02-09 04011, 2022
mayhem
OH!
2022-02-09 04022, 2022
mayhem
co-learning. that was the word, no?
2022-02-09 04035, 2022
lucifer
transfer learning?
2022-02-09 04040, 2022
mayhem
we can actually see where the algs agree and combine the results.
2022-02-09 04043, 2022
mayhem
that is it!
2022-02-09 04052, 2022
mayhem
combine.
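One simple way to "combine where the algs agree" (purely illustrative, not what was actually built): keep only the tracks both algorithms recommend, ordered by agreement strength.

```python
def combine_rankings(a, b):
    """Keep only tracks both algorithms recommend, ordered by the sum of
    their ranks in each list (lower combined rank = stronger agreement)."""
    rank_a = {track: i for i, track in enumerate(a)}
    rank_b = {track: i for i, track in enumerate(b)}
    common = set(rank_a) & set(rank_b)
    return sorted(common, key=lambda t: rank_a[t] + rank_b[t])

cf = ["t1", "t2", "t3", "t4"]           # CF-style recommendations
proximity = ["t3", "t1", "t5"]          # proximity/similarity recommendations
print(combine_rankings(cf, proximity))  # ['t1', 't3']
```

A weighted rank sum, or a fallback to union instead of intersection, would be natural variants to experiment with.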
2022-02-09 04019, 2022
lucifer
yup makes sense. a lot of this is going to be experiments and seeing what works out best.
2022-02-09 04002, 2022
lucifer
kind of how picard's magic similarity weights for finding track matches work lol.
2022-02-09 04018, 2022
mayhem
but, in the end the two algs have different fundamental statements.
2022-02-09 04030, 2022
mayhem
I wonder if that code has changed much since I wrote that, lol.
2022-02-09 04038, 2022
lucifer
CF ?
2022-02-09 04000, 2022
mayhem
so, CF = if you've ever listened to a track, use how many times as input.
2022-02-09 04016, 2022
mayhem
in ours the tracks have to be listened to by the same person, but in proximity.
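The difference between the two inputs can be sketched in plain Python (made-up listens and an assumed 30-minute window; the real pipelines are SQL/Spark):

```python
from collections import Counter
from itertools import combinations

# (user, track, timestamp-in-minutes) -- hypothetical listens
listens = [
    ("alice", "a", 0), ("alice", "b", 3), ("alice", "c", 500),
    ("bob",   "a", 0), ("bob",   "c", 2),
]

# CF-style input: per-user play counts; order and timing are ignored.
cf_input = Counter((user, track) for user, track, _ in listens)

# Proximity-style input: pairs of distinct tracks listened to by the
# same user within a time window (30 minutes, chosen arbitrarily here).
WINDOW = 30
pairs = Counter()
by_user = {}
for user, track, ts in listens:
    by_user.setdefault(user, []).append((ts, track))
for plays in by_user.values():
    plays.sort()
    for (t1, x), (t2, y) in combinations(plays, 2):
        if t2 - t1 <= WINDOW and x != y:
            pairs[tuple(sorted((x, y)))] += 1

print(cf_input[("alice", "c")])  # 1 -- CF counts it regardless of timing
print(pairs[("a", "b")])         # 1 -- alice played a and b close together
print(pairs[("a", "c")])         # 1 -- only bob's a and c were in proximity
```

alice's listen to "c" contributes to the CF input but not to the (a, c) proximity pair, which is exactly the implicit ordering/time-frame condition discussed above.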
2022-02-09 04040, 2022
monkey has quit
2022-02-09 04053, 2022
monkey joined the channel
2022-02-09 04034, 2022
lucifer
there was actually an intro paper on CF i read, a TV show recommender. in it they pruned the input dataset based on time order before feeding it to the CF model.
2022-02-09 04054, 2022
mayhem makes a note about genetic algorithms and playlist generation
2022-02-09 04044, 2022
lucifer
lots of cool stuff still left to try! :D
2022-02-09 04000, 2022
mayhem
yes.
2022-02-09 04013, 2022
mayhem
and remember, our current goal is "not bad!".
2022-02-09 04019, 2022
lucifer
indeed.
2022-02-09 04034, 2022
lucifer
i am going to bed currently but will work on building the small dataset tomorrow. nn!
2022-02-09 04039, 2022
mayhem
and once we have a full chain of everything, then we can work to improve. and hopefully we will get help