no way. it's turning into a nightmare over this tiny detail???
lucifer
incomplete branch yet, but basically in this particular scenario the listen_delete_metadata table holds the rows to be deleted from listen. we fetch those rows, delete the corresponding listens from listen, and after that clear up the rows in the listen_delete_metadata table.
however, between the time rows are deleted from the listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted into the latter. so we would end up deleting rows from the metadata table without ever deleting the listens they point to.
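the flow is roughly this shape (sketch, join columns assumed for illustration):
BEGIN;
-- 1. delete the flagged listens
DELETE FROM listen l
 USING listen_delete_metadata ldm
 WHERE l.user_id = ldm.user_id
   AND l.listened_at = ldm.listened_at;
-- <- a concurrent INSERT into listen_delete_metadata can commit here
-- 2. clear the metadata table, which also wipes any rows inserted after
--    step 1, even though their listens were never deleted
DELETE FROM listen_delete_metadata;
COMMIT;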
mayhem
but with transaction isolation that shouldn't be a big deal, should it?
lucifer
the default transaction isolation level is READ COMMITTED, which is what we use.
so T1 is this bulk delete in progress; meanwhile T2, the insert, comes along, inserts a row, and commits. it's now visible to the remaining statements in T1.
the first line here "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
which is what we want here.
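i.e. something like this (sketch, same join columns assumed):
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- every statement works off the snapshot taken at the first query, so
-- metadata rows inserted and committed by others after that point are
-- invisible to the cleanup DELETE as well
DELETE FROM listen l
 USING listen_delete_metadata ldm
 WHERE l.user_id = ldm.user_id
   AND l.listened_at = ldm.listened_at;
DELETE FROM listen_delete_metadata;
COMMIT;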
mayhem
and that probably locks the table or something?
or why are we not using that?
> Applications using this level must be prepared to retry transactions due to serialization failures.
lucifer
it comes at a performance cost. i am not sure how much.
record the time before starting the transaction, and at every step delete only rows created before it
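something like (sketch, assuming a created timestamptz column on listen_delete_metadata, with :cutoff recorded in the application before the transaction starts):
BEGIN;
-- only touch metadata rows that existed before the recorded cutoff
DELETE FROM listen l
 USING listen_delete_metadata ldm
 WHERE ldm.created < :cutoff
   AND l.user_id = ldm.user_id
   AND l.listened_at = ldm.listened_at;
-- rows inserted after the cutoff are left for the next run
DELETE FROM listen_delete_metadata WHERE created < :cutoff;
COMMIT;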
mayhem
I've gone for a serial field myself.
I'd ...
lucifer
oh yeah that should work too.
mayhem
faster since the logic is guaranteed 1 cpu cycle
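i.e. roughly (sketch, assuming an id serial primary key on listen_delete_metadata, with :max_id grabbed once up front):
SELECT max(id) FROM listen_delete_metadata;  -- capture :max_id before deleting
BEGIN;
DELETE FROM listen l
 USING listen_delete_metadata ldm
 WHERE ldm.id <= :max_id
   AND l.user_id = ldm.user_id
   AND l.listened_at = ldm.listened_at;
-- rows with a higher id were queued later and are left for the next run
DELETE FROM listen_delete_metadata WHERE id <= :max_id;
COMMIT;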
lucifer
yup makes sense
mayhem
yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
lucifer
i'll add comments to make it easier.
another way could have been to fetch everything into python and then delete rows one by one or in batches manually. but i think that'd be slower.
it at least adds a round trip of the entire listen_delete_metadata table's data.
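whereas keeping it server-side could be one statement, roughly (sketch, same assumed columns; the data-modifying CTE runs even though the outer query doesn't reference it):
WITH deleted_listens AS (
    DELETE FROM listen l
     USING listen_delete_metadata ldm
     WHERE ldm.id <= :max_id
       AND l.user_id = ldm.user_id
       AND l.listened_at = ldm.listened_at
    RETURNING ldm.id
)
DELETE FROM listen_delete_metadata
 WHERE id <= :max_id;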
mayhem
agreed.
CTEs really will reduce the amount of python I write.
but..
I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
lucifer
oh indeed.
mayhem
we need to learn the differences between each of the python versions and then the difference between spark and python.
I get a feeling that we're overtraining on the data.
too much data makes it noise.
lucifer
yup makes sense
mayhem
it would be nice if we could find another constraint to add into the mix.
I want to involve the similar users data somehow. but nothing has come from it.
lucifer
hmm, unsure of any others we could add currently, but i am wondering about removing one. these current datasets have an implicit condition that the listens were listened to in order and within a time frame.
whereas CF is just "you listened to this", in any order. so how does this correlate with the CF stuff?
mayhem
interesting approach.
the real problem is that we don't have both approaches to examine.
lucifer
indeed
mayhem
resultant data sets from both.. to examine.
lucifer
both provide different types of data and CF is really opaque about the intermediate steps.
mayhem
I think we should work to understand the variances in the current approach and get to something that we consider "the best we can do with this alg, and we understand it roughly".
lucifer
another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
+1
mayhem
then we should implement the CF based version and compare them side by side.
OH!
co-learning. that was the word, no?
lucifer
transfer learning?
mayhem
we can actually see where the algs agree and combine the results.
that is it!
combine.
lucifer
yup makes sense. a lot of this is going to be experiments and seeing what works out best.
kind of like how picard's magic similarity weights for finding track matches work lol.
mayhem
but, in the end the two algs have different fundamental statements.
I wonder if that code has changed much since I wrote that, lol.
lucifer
CF ?
mayhem
so, CF = if you've ever listened to a track, use how many times as input.
in ours the tracks have to be listened to by the same person, and in proximity.
lucifer
there was actually an intro paper on CF i read, about a TV show recommender. in that, they pruned the input dataset based on time order before feeding it to the CF model.
mayhem makes a note about genetic algorithms and playlist generation
lots of cool stuff still left to try! :D
mayhem
yes.
and remember, our current goal is "not bad!".
lucifer
indeed.
i am going to bed now but will work on building the small dataset tomorrow. nn!
mayhem
and once we have a full chain of everything, then we can work to improve. and hopefully we will get help.