#metabrainz

      • mayhem
        yes, we need to address this, but I am not concerned that this is the only problem.
      • lucifer
        the loop facilitates this. this query will be executed once for consecutive listens, then for listen and listen + 2, and so on.
      • so at any one time, all listens that reach the final group by are at the same distance.
      • mayhem needs a minute to visualize this
      • mayhem
        hmm, that paste isn't helping me.
      • the count(*) really confuses me here.
      • because count(*) counts the total number of rows output by symmetric_index, no?
      • lucifer
        it's grouped by both mbids
      • L58
      • mayhem
        yes.
      • that means that count(*) refers to the size of the current grouping?
      • lucifer
        yes
      • mayhem
        🤯
      • well, ok. that might explain a few difficulties I've had with other queries. lol
      • lucifer
        it counts the number of times (mbid1, mbid2) occurred for the current step,
      • mayhem nods
      • since the distance is constant for a given step size, i multiply it with the weight
      • mayhem
        I'm still not seeing how this gives the same result as the python code.
      • lucifer
        right so this completes 1 step.
      • mayhem
        yes, idx = 1
      • lucifer
        so now the process is repeated for other idx, and those have continuously decreasing weights.
      • L60 appends all of these together.
      • since mbid pairs can occur again in different step sizes, we need a final group by.
      • at L71, that happens and the HAVING clause removes the entries below the threshold.
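The stepwise scheme described above (a per-step GROUP BY, COUNT(*) times a decreasing weight, then a final GROUP BY with a HAVING threshold) can be sketched in plain Python. This is a hypothetical reconstruction from the conversation, not the actual Spark query; the function name, the weight formula, and the data shapes are all made up for illustration.

```python
from collections import Counter

def similar_recordings(listens_by_user, max_idx=2, threshold=2.0):
    """Hypothetical sketch of the stepwise similarity aggregation.

    For each step size idx, pair every listen with the listen idx
    positions later in the same user's history, group the resulting
    (mbid, mbid) pairs (the per-step GROUP BY), and score each group
    as COUNT(*) * weight, with a weight that decreases as idx grows.
    The final Counter plays the role of the last GROUP BY, and the
    threshold filter stands in for the HAVING clause.
    """
    scores = Counter()
    for mbids in listens_by_user.values():
        for idx in range(1, max_idx + 1):
            weight = 1.0 / idx  # made-up decreasing weight per step
            step_counts = Counter()
            for a, b in zip(mbids, mbids[idx:]):
                if a == b:  # the `if mbid_0 == mbid_1` exclusion
                    continue
                step_counts[tuple(sorted((a, b)))] += 1
            for pair, n in step_counts.items():
                scores[pair] += n * weight  # COUNT(*) * weight
    # final GROUP BY + HAVING: keep only pairs at or above the threshold
    return {pair: s for pair, s in scores.items() if s >= threshold}
```

For example, `similar_recordings({"u1": ["a", "b", "a", "b"]})` scores the pair `("a", "b")` at 3.0 from step size 1 alone; step size 2 contributes nothing because both of its pairs are same-mbid.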
      • mayhem
        ok, I follow it now.
      • lucifer
        the COUNT(*) thing wasn't there in the original version. i added it later to avoid OOM when running on entire dataset.
      • the original version just assigned the weight to each row based on the current idx and then everything was grouped in the final query.
      • but all the recording similarity results we saw the other day had the COUNT(*) thing so I don't think it is the culprit.
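The equivalence being described is easy to check: assigning a constant weight to every row and summing is the same as counting the rows once and multiplying, so the early COUNT(*) aggregation changes memory use but not results. A trivial illustration with made-up numbers:

```python
# Three occurrences of the same (mbid1, mbid2) pair at one step size,
# with a made-up weight for that step.
rows = [("a", "b"), ("a", "b"), ("a", "b")]
weight = 0.5

per_row_total = sum(weight for _ in rows)   # original: weight on each row
aggregated_total = len(rows) * weight       # later: COUNT(*) * weight

assert per_row_total == aggregated_total == 1.5
```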
      • mayhem
        ok, I'm quite impressed with this work, but I fail to see why this should not produce the same results.
      • Inserted 795,000 rows.
      • Generated 13,037,431 pairs (before pruning)
      • ok, lets see.
      • I feel that with user_id rather than user_name, the python results have gotten worse.
      • sepultura 51
      • err, no.
      • lucifer
        those look similar to me but yes that is the difference between spark and gaga so i am tempted to point a finger at it.
      • mayhem
        wow. I'm stunned that my data went from ok to poor.
      • while making the user_name -> user_id change.
      • lucifer
        oh an example mbid?
      • mayhem
        that should change nothing at all!
      • the usual. it got worse.
      • let me recalculate the user_name version again to make sure I am not smoking something...
      • well, poor expression.
      • lucifer
        i see. the top ones remained the same but the lower ones changed?
      • "user_name -> user_id change" should change nothing but ...
      • mayhem
        not sure.
      • ah, there is also the `if mbid_0 == mbid_1:`
      • lucifer
        that should only exclude same mbids but yeah possible.
      • i don't see it on the github branch though? have you pushed the changes?
      • mayhem
        no.
      • I just stashed them, since I am rerunning the user_id, no-equal-check code and outputting the results to a different table so we can compare.
      • gonna be a moment.
      • lucifer
        👍
      • mayhem
        I think I like your idea of making a tiny data set and seeing exactly what comes out of both.
      • lucifer
        yeah it'll also help figure out flaws in the thought process.
      • mayhem
        your algorithm seems pretty good to me. nice to discover, in a real example, how to think in spark.
      • lucifer
        :D
      • mayhem
        I'm also falling in love with CTEs (WITH statements)
      • lucifer
        oh yeah those are great!
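For anyone following along who hasn't used them: a CTE just names an intermediate result so it can be referenced like a table, instead of nesting subqueries. A minimal, self-contained illustration (sqlite is used here only because it ships with Python; the table and column names are made up, not the production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (user_id INTEGER, mbid TEXT)")
conn.executemany("INSERT INTO listen VALUES (?, ?)",
                 [(1, "a"), (1, "a"), (1, "b"), (2, "a")])

# WITH names the grouped subresult; the outer query then filters it
# by the computed count, as if it were an ordinary table.
rows = conn.execute("""
    WITH listen_counts AS (
        SELECT mbid, COUNT(*) AS cnt
          FROM listen
      GROUP BY mbid
    )
    SELECT mbid FROM listen_counts WHERE cnt > 1
""").fetchall()
```

Here "a" was listened to three times across users and "b" once, so `rows` comes back as `[("a",)]`.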
      • mayhem
        heh, I caught myself saying the other day: ... and insert that into whatever data store we choose...
      • what a stupid thing to say.
      • we always choose postgres.
      • always.
      • lucifer
        but wait till i open the timestamp PR and introduce you to the transactions mess!!
      • mayhem
        oy
      • lucifer
      • mayhem
        no way. it's turning into a nightmare over this tiny detail???
      • lucifer
        incomplete branch yet but basically in this particular scenario, listen_delete_metadata table holds the rows to be deleted from listen. We fetch the rows from it and delete stuff from listen. after that we clear up the rows in the listen_delete_metadata table.
      • however between the time rows are deleted from listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted in the latter. so we would have deleted rows from metadata table without deleting the actual listens.
      • mayhem
        but with transactional isolation that shouldn't be a big deal, or?
      • lucifer
        the default transaction isolation is READ_COMMITTED which we use
      • so T1 has this bulk delete in process; meanwhile T2, that insert, comes along and inserts a row, then commits. it's now visible to the remaining statements in T1.
      • mayhem
        fun, learned a new fact for the day.
      • better go to bed.
      • j/k
      • lucifer
        the first line here "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
      • which is what we want here.
      • mayhem
        and that probably locks the table or so?
      • or why are we not using that?
      • > Applications using this level must be prepared to retry transactions due to serialization failures.
      • lucifer
        it comes at a performance cost. i am not sure how much.
      • mayhem
        also shitty.
      • lucifer
        yeah
      • mayhem
        ok, lets brainstorm ways to overcome that.
      • lucifer
        i added the created column for that reason
      • record the time before starting transaction and at every step delete only rows before it
      • mayhem
        I've gone for a serial field myself.
      • I'd ...
      • lucifer
        oh yeah that should work too.
      • mayhem
        faster since the logic is guaranteed 1 cpu cycle
      • lucifer
        yup makes sense
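The snapshot guard described above (whether via a `created` timestamp or a serial id) can be sketched with sqlite standing in for Postgres. The table shapes here are simplified guesses, not the real schema; the point is only that both the listen delete and the metadata clean-up are restricted to rows at or below the snapshot, so a row inserted mid-run survives for the next run instead of being silently dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (listened_at INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE listen_delete_metadata (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    listened_at INTEGER)""")
conn.executemany("INSERT INTO listen VALUES (?)", [(10,), (20,), (30,)])
conn.executemany("INSERT INTO listen_delete_metadata (listened_at) VALUES (?)",
                 [(10,), (20,)])

# Record the high-water mark before doing any deletes.
(snapshot,) = conn.execute(
    "SELECT MAX(id) FROM listen_delete_metadata").fetchone()

# ...a concurrent transaction files a new delete request mid-run...
conn.execute("INSERT INTO listen_delete_metadata (listened_at) VALUES (30)")

# Both steps honour the snapshot, so the new request is untouched.
conn.execute("""DELETE FROM listen WHERE listened_at IN
                (SELECT listened_at FROM listen_delete_metadata
                  WHERE id <= ?)""", (snapshot,))
conn.execute("DELETE FROM listen_delete_metadata WHERE id <= ?", (snapshot,))

remaining_listens = [r[0] for r in conn.execute(
    "SELECT listened_at FROM listen ORDER BY listened_at")]
pending = conn.execute(
    "SELECT COUNT(*) FROM listen_delete_metadata").fetchone()[0]
```

After the run, the listen inserted into the queue mid-run (30) is still present in both tables, ready for the next pass.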
      • mayhem
        yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
      • lucifer
        i'll add comments to make it easier.
      • another way could have been to fetch everything into python and then delete rows one by one or in batches manually. but i think that'd be slower
      • it at least adds a round trip of the entire listen_delete table data.
      • mayhem
        agreed.
      • CTEs really will reduce the amount of python I write.
      • but..
      • I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
      • lucifer
        oh indeed.
      • mayhem
        we need to learn the differences between each of the python versions and then the difference between spark and python.
      • I get a feeling that we're overtraining the data.
      • too much data makes it noise.
      • lucifer
        yup makes sense
      • mayhem
        it would be nice if we could find another constraint to add into the mix.
      • I want to involve the similar users data somehow. but nothing has come from it.
      • lucifer
        hmm, unsure of any others we could add currently but i am wondering about removing 1. these current datasets have an implicit condition that the listens should be played in order and within a time frame.
      • whereas with CF it's: you listened to this, in any order. so how does this correlate to the CF stuff?
      • mayhem
        interesting approach.
      • the real problem is that we dont have both approaches to examine.
      • lucifer
        indeed
      • mayhem
        resultant data sets from both.. to examine.
      • lucifer
        both provide different types of data and CF is really opaque about the intermediate steps.
      • mayhem
        I think we should work to understand the variances in the current approach and get to something that we consider "best we can do with this alg and we understand it roughly".
      • lucifer
        another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
      • +1
      • mayhem
        then we should implement the CF based version and compare them side by side.
      • OH!
      • co-learning. that was the word, no?
      • lucifer
        transfer learning?
      • mayhem
        we can actually see where the algs agree and combine the results.
      • that is it!
      • combine.
      • lucifer
        yup makes sense. a lot of this is going to be experiments and seeing what works out the best.
      • kind of how picard's magic similarity weights for finding track matches work lol.
      • mayhem
        but, in the end the two algs have different fundamental statements.
      • I wonder if that code has changed much since I wrote that, lol.
      • lucifer
        CF ?
      • mayhem
        so, CF = if you've ever listened to a track, use how many times as input.
      • in ours the tracks have to be listened by the same person, but in proximity.
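The two input shapes contrasted here can be made concrete with one toy listen history (hypothetical data; a real CF implementation builds a full user-by-item matrix, only the counting idea is shown):

```python
from collections import Counter

history = ["a", "b", "a", "c"]  # one user's listens, in play order

# CF input: global playcounts per user; order and proximity ignored.
cf_input = Counter(history)

# Proximity input: only adjacent (step size 1) pairs from the same
# user's history, with same-mbid pairs excluded as in the query.
proximity_pairs = Counter(
    tuple(sorted(pair)) for pair in zip(history, history[1:])
    if pair[0] != pair[1]
)
```

For this history, CF sees `{"a": 2, "b": 1, "c": 1}` with no notion of order, while the proximity input sees `("a", "b")` twice and `("a", "c")` once but never pairs the two "a" listens with each other.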
      • monkey has quit
      • monkey joined the channel
      • lucifer
        there was actually an intro paper on CF i read, a TV show recommender. in that, they pruned the input dataset based on time order before feeding it to the CF model.
      • mayhem makes a note about genetic algorithms and playlist generation
      • lots of cool stuff still left to try! :D
      • mayhem
        yes.
      • and remember, our current goal is "not bad!".
      • lucifer
        indeed.
      • i am going to bed currently but will work on building the small dataset tomorrow. nn!
      • mayhem
        and once we have a full chain of everything, then we can work to improve. and hopefully will get help
      • lucifer
        +1
      • mayhem
        nn, I should get off the computer too.
      • sleep well!
      • lucifer
        you too
      • Mineo has quit