#metabrainz


      • mayhem
        yes, we need to address this, but I am not concerned that this is the only problem.
      • 2022-02-09 04005, 2022

      • lucifer
        the loop facilitates this. this query will be executed once for consecutive listens, then for listen and listen + 2, and so on.
      • 2022-02-09 04033, 2022
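A rough Python sketch of the stepped pairing being described here — for each step size idx, every listen is paired with the listen idx positions later in the same history (the function and variable names are invented for illustration, not taken from the actual query):

```python
def pairs_for_step(listens, idx):
    """Yield (mbid_a, mbid_b) for listens `idx` positions apart in play order."""
    for i in range(len(listens) - idx):
        yield listens[i], listens[i + idx]

history = ["a", "b", "a", "c"]
print(list(pairs_for_step(history, 1)))  # [('a', 'b'), ('b', 'a'), ('a', 'c')]
print(list(pairs_for_step(history, 2)))  # [('a', 'a'), ('b', 'c')]
```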

      • lucifer
        so at any one time, all listens that reach the final group by are at the same distance.
      • 2022-02-09 04019, 2022

      • mayhem needs a minute to visualize this
      • 2022-02-09 04041, 2022

      • lucifer
      • 2022-02-09 04042, 2022

      • mayhem
        hmm, that paste isn't helping me.
      • 2022-02-09 04004, 2022

      • mayhem
        the count(*) really confuses me here.
      • 2022-02-09 04012, 2022

      • mayhem
        because count(*) counts the total number of rows output by symmetric_index, no?
      • 2022-02-09 04030, 2022

      • lucifer
        its grouped by both mbids
      • 2022-02-09 04043, 2022

      • lucifer
        L58
      • 2022-02-09 04047, 2022

      • mayhem
        yes.
      • 2022-02-09 04008, 2022

      • mayhem
        that means that count(*) refers to the size of the current grouping?
      • 2022-02-09 04011, 2022

      • lucifer
        yes
      • 2022-02-09 04018, 2022

      • mayhem
        🤯
      • 2022-02-09 04047, 2022

      • mayhem
        well, ok. that might explain a few difficulties I've had with other queries. lol
      • 2022-02-09 04051, 2022

      • lucifer
        it counts the number of times (mbid1, mbid2) occurred for the current step,
      • 2022-02-09 04059, 2022
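A minimal SQLite demo (not the production query) of the point being made: COUNT(*) under a GROUP BY counts the rows of each (mbid_0, mbid_1) group, not the total number of rows in the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (mbid_0 TEXT, mbid_1 TEXT)")
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("a", "b"), ("a", "b"), ("a", "b"), ("b", "c")],
)
# COUNT(*) is evaluated per group, so each result row carries the size
# of its own (mbid_0, mbid_1) grouping.
rows = conn.execute(
    "SELECT mbid_0, mbid_1, COUNT(*) FROM pairs"
    " GROUP BY mbid_0, mbid_1 ORDER BY mbid_0, mbid_1"
).fetchall()
print(rows)  # [('a', 'b', 3), ('b', 'c', 1)]
```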

      • mayhem nods
      • 2022-02-09 04025, 2022

      • lucifer
        since the distance is constant for a given step size, i multiply it with the weight
      • 2022-02-09 04015, 2022

      • mayhem
        I'm still not seeing how this gives the same result as the python code.
      • 2022-02-09 04032, 2022

      • lucifer
        right so this completes 1 step.
      • 2022-02-09 04057, 2022

      • mayhem
        yes, idx = 1
      • 2022-02-09 04026, 2022

      • lucifer
        so now the process is repeated for other idx, and those have continuously decreasing weights.
      • 2022-02-09 04000, 2022

      • lucifer
        L60 appends all of these together.
      • 2022-02-09 04034, 2022

      • lucifer
        since mbid pairs can occur again in different step sizes, we need a final group by.
      • 2022-02-09 04007, 2022

      • lucifer
        at L71, that happens and the HAVING clause removes the entries below threshold.
      • 2022-02-09 04015, 2022
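The recombination being described can be sketched in Python — per-step counts are weighted, summed across steps for each mbid pair, and a HAVING-style threshold drops weak pairs (the counts, weights, and threshold here are made up for the example):

```python
from collections import Counter

per_step = {
    1: {("a", "b"): 3, ("b", "c"): 1},  # idx = 1, consecutive listens
    2: {("a", "b"): 2},                 # idx = 2, listens two apart
}
weights = {1: 1.0, 2: 0.5}  # weight decreases as the step size grows
threshold = 1.5

# final "group by": sum the weighted counts for each pair across all steps
totals = Counter()
for idx, counts in per_step.items():
    for pair, count in counts.items():
        totals[pair] += count * weights[idx]

# HAVING-style filter: keep only pairs at or above the threshold
result = {pair: score for pair, score in totals.items() if score >= threshold}
print(result)  # {('a', 'b'): 4.0}
```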

      • mayhem
        ok, I follow it now.
      • 2022-02-09 04053, 2022

      • lucifer
        the COUNT(*) thing wasn't there in the original version. i added it later to avoid OOM when running on the entire dataset.
      • 2022-02-09 04030, 2022

      • lucifer
        the original version just assigned the weight to each row based on the current idx and then everything was grouped in the final query.
      • 2022-02-09 04013, 2022

      • lucifer
        but all the recording similarity results we saw the other day had the COUNT(*) thing so I don't think it is the culprit.
      • 2022-02-09 04015, 2022

      • mayhem
        ok, I'm quite impressed with this work, but I fail to see why this should not produce the same results.
      • 2022-02-09 04042, 2022

      • mayhem
        Inserted 795,000 rows.
      • 2022-02-09 04057, 2022

      • mayhem
        Generated 13,037,431 pairs (before pruning)
      • 2022-02-09 04002, 2022

      • mayhem
        ok, lets see.
      • 2022-02-09 04019, 2022

      • mayhem
        I feel that with user_id rather than user_name, the python results have gotten worse.
      • 2022-02-09 04037, 2022

      • mayhem
        sepultura 51
      • 2022-02-09 04038, 2022

      • mayhem
        err, no.
      • 2022-02-09 04033, 2022

      • lucifer
        those look similar to me but yes that is the difference between spark and gaga so i am tempted to point a finger at it.
      • 2022-02-09 04037, 2022

      • mayhem
        wow. I'm stunned that my data went from ok to poro.
      • 2022-02-09 04038, 2022

      • mayhem
        poor.
      • 2022-02-09 04050, 2022

      • mayhem
        while making the user_name -> user_id change.
      • 2022-02-09 04055, 2022

      • lucifer
        oh an example mbid?
      • 2022-02-09 04057, 2022

      • mayhem
        which should change nothing at all!
      • 2022-02-09 04005, 2022

      • mayhem
      • 2022-02-09 04014, 2022

      • mayhem
        the usual. it got worse.
      • 2022-02-09 04044, 2022

      • mayhem
        let me recalculate the user_name version again to make sure I am not smoking something...
      • 2022-02-09 04051, 2022

      • mayhem
        well, poor expression.
      • 2022-02-09 04010, 2022

      • lucifer
        i see. the top ones remained the same but the lower ones changed?
      • 2022-02-09 04031, 2022

      • lucifer
        "user_name -> user_id change" should change nothing but ...
      • 2022-02-09 04047, 2022

      • mayhem
        not sure.
      • 2022-02-09 04000, 2022

      • mayhem
        ah, there is also the `if mbid_0 == mbid_1:`
      • 2022-02-09 04038, 2022

      • lucifer
        that should only exclude same mbids but yeah possible.
      • 2022-02-09 04057, 2022

      • lucifer
        i don't see it on the github branch though? have you pushed the changes?
      • 2022-02-09 04013, 2022

      • mayhem
        no.
      • 2022-02-09 04048, 2022

      • mayhem
        I just stashed them, since I am rerunning the user_id, no equal check code and outputting the results to a different table so we can compare.
      • 2022-02-09 04054, 2022

      • mayhem
        gonna be a moment.
      • 2022-02-09 04059, 2022

      • lucifer
        👍
      • 2022-02-09 04013, 2022

      • mayhem
        I think I like your idea of making a tiny data set and seeing exactly what comes out of both.
      • 2022-02-09 04040, 2022

      • lucifer
        yeah it'll also help figure out flaws in the thought process.
      • 2022-02-09 04019, 2022

      • mayhem
        your algorithm seems pretty good to me. nice to discover, in a real example, how to think in spark.
      • 2022-02-09 04059, 2022

      • lucifer
        :D
      • 2022-02-09 04011, 2022

      • mayhem
        I'm also falling in love with CTEs (WITH statements)
      • 2022-02-09 04025, 2022
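For reference, a tiny CTE (WITH statement) run through SQLite; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (user_id INTEGER, mbid TEXT)")
conn.executemany(
    "INSERT INTO listen VALUES (?, ?)",
    [(1, "a"), (1, "a"), (1, "b"), (2, "a")],
)
# The WITH clause names an intermediate result that the main query can
# then select from, instead of nesting a subquery.
rows = conn.execute("""
    WITH listen_counts AS (
        SELECT mbid, COUNT(*) AS cnt
          FROM listen
         GROUP BY mbid
    )
    SELECT mbid FROM listen_counts WHERE cnt > 1
""").fetchall()
print(rows)  # [('a',)]
```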

      • lucifer
        oh yeah those are great!
      • 2022-02-09 04052, 2022

      • mayhem
        heh, I caught myself saying the other day: ... and insert that into whatever data store we choose...
      • 2022-02-09 04059, 2022

      • mayhem
        what a stupid thing to say.
      • 2022-02-09 04006, 2022

      • mayhem
        we always choose postgres.
      • 2022-02-09 04007, 2022

      • mayhem
        always.
      • 2022-02-09 04009, 2022

      • lucifer
        but wait till i open the timestamp PR and introduce you to the transactions mess!!
      • 2022-02-09 04039, 2022

      • mayhem
        oy
      • 2022-02-09 04049, 2022

      • lucifer
      • 2022-02-09 04034, 2022

      • mayhem
        no way. it's turning into a nightmare over this tiny detail???
      • 2022-02-09 04030, 2022

      • lucifer
        incomplete branch yet but basically in this particular scenario, listen_delete_metadata table holds the rows to be deleted from listen. We fetch the rows from it and delete stuff from listen. after that we clear up the rows in the listen_delete_metadata table.
      • 2022-02-09 04045, 2022

      • lucifer
        however between the time rows are deleted from listen table and the time we get to clear up the listen_delete_metadata table, new rows may have been inserted in the latter. so we would have deleted rows from metadata table without deleting the actual listens.
      • 2022-02-09 04054, 2022

      • mayhem
        but with transactional isolation that shouldn't be a big deal, or?
      • 2022-02-09 04010, 2022

      • lucifer
        the default transaction isolation is READ_COMMITTED, which is what we use
      • 2022-02-09 04056, 2022

      • lucifer
        so T1 is this bulk delete in progress; meanwhile T2, that insert, comes and inserts a row, then commits. it's now visible to the remaining statements in T1.
      • 2022-02-09 04023, 2022
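The race can be mimicked in plain Python (a toy model, not real Postgres behaviour): T1's cleanup step sees — and throws away — a delete request that T2 committed after T1 had already processed the queue:

```python
listen = {1, 2, 3}
listen_delete_metadata = [1, 2]  # listens queued for deletion

# T1: fetch the queued ids and delete those listens
queued = list(listen_delete_metadata)
listen -= set(queued)

# T2 commits a new delete request while T1 is still running; under
# READ COMMITTED, T1's later statements can see it.
listen_delete_metadata.append(3)

# T1: naive cleanup wipes the whole metadata table, losing the request
listen_delete_metadata.clear()

print(listen)                  # {3}  -- listen 3 was never deleted
print(listen_delete_metadata)  # []   -- but its delete request is gone
```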

      • lucifer
      • 2022-02-09 04037, 2022

      • mayhem
        fun, learned a new fact for the day.
      • 2022-02-09 04040, 2022

      • mayhem
        better go to bed.
      • 2022-02-09 04043, 2022

      • mayhem
        j/k
      • 2022-02-09 04046, 2022

      • lucifer
        the first line here "The Repeatable Read isolation level only sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions."
      • 2022-02-09 04031, 2022

      • lucifer
        which is what we want here.
      • 2022-02-09 04050, 2022

      • mayhem
        and that probably locks the table or so?
      • 2022-02-09 04056, 2022

      • mayhem
        or why are we not using that?
      • 2022-02-09 04014, 2022

      • mayhem
        > Applications using this level must be prepared to retry transactions due to serialization failures.
      • 2022-02-09 04015, 2022

      • lucifer
        it comes at a performance cost. i am not sure how much.
      • 2022-02-09 04016, 2022

      • mayhem
        also shitty.
      • 2022-02-09 04020, 2022

      • lucifer
        yeah
      • 2022-02-09 04027, 2022

      • mayhem
        ok, lets brainstorm ways to overcome that.
      • 2022-02-09 04038, 2022

      • lucifer
        i added the created column for that reason
      • 2022-02-09 04003, 2022

      • lucifer
      • 2022-02-09 04040, 2022

      • lucifer
        record the time before starting transaction and at every step delete only rows before it
      • 2022-02-09 04041, 2022
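A sketch of that fix with SQLite standing in for Postgres (schema and values are invented): record a cutoff before the pass starts and only clear metadata rows created before it, so a request that lands mid-pass survives for the next run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listen_delete_metadata (listen_id INTEGER, created INTEGER)"
)
conn.executemany(
    "INSERT INTO listen_delete_metadata VALUES (?, ?)",
    [(1, 100), (2, 101)],
)
cutoff = 150  # recorded before the delete pass begins

# ... the listens matching the queued rows are deleted here ...

# a new delete request lands while the pass is running
conn.execute("INSERT INTO listen_delete_metadata VALUES (3, 200)")

# cleanup only touches rows that existed when the pass started
conn.execute("DELETE FROM listen_delete_metadata WHERE created < ?", (cutoff,))
rows = conn.execute("SELECT listen_id FROM listen_delete_metadata").fetchall()
print(rows)  # [(3,)] -- the late request is kept for the next pass
```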

      • mayhem
        I've gone for a serial field myself.
      • 2022-02-09 04001, 2022

      • mayhem
        I'd ...
      • 2022-02-09 04010, 2022

      • lucifer
        oh yeah that should work too.
      • 2022-02-09 04026, 2022

      • mayhem
        faster since the logic is guaranteed 1 cpu cycle
      • 2022-02-09 04007, 2022

      • lucifer
        yup makes sense
      • 2022-02-09 04043, 2022

      • mayhem
        yeah, ok reading those statements that is really not too horrid. unpleasant, but not horrid.
      • 2022-02-09 04008, 2022

      • lucifer
        i'll add comments to make it easier.
      • 2022-02-09 04048, 2022

      • lucifer
        another way could have been fetch everything to python and then delete rows one by one or in batches manually. but i think that'd be slower
      • 2022-02-09 04008, 2022

      • lucifer
        it at least adds a round trip of the entire listen_delete table's data.
      • 2022-02-09 04009, 2022

      • mayhem
        agreed.
      • 2022-02-09 04018, 2022

      • mayhem
        CTEs really will reduce the amount of python I write.
      • 2022-02-09 04026, 2022

      • mayhem
        but..
      • 2022-02-09 04028, 2022

      • mayhem
        I'm very curious to get to the bottom of the similarities stuff -- learn how it really behaves.
      • 2022-02-09 04051, 2022

      • lucifer
        oh indeed.
      • 2022-02-09 04004, 2022

      • mayhem
        we need to learn the differences between each of the python versions and then the difference between spark and python.
      • 2022-02-09 04015, 2022

      • mayhem
        I get a feeling that we're overtraining the data.
      • 2022-02-09 04026, 2022

      • mayhem
        too much data makes it noise.
      • 2022-02-09 04031, 2022

      • lucifer
        yup makes sense
      • 2022-02-09 04044, 2022

      • mayhem
        it would be nice if we could find another constraint to add into the mix.
      • 2022-02-09 04042, 2022

      • mayhem
        I want to involve the similar users data somehow. but nothing has come from it.
      • 2022-02-09 04046, 2022

      • lucifer
        hmm, unsure of any others we could add currently but i am wondering about removing 1. these current datasets have an implicit condition that the listens should be listened to in order and within a time frame.
      • 2022-02-09 04036, 2022

      • lucifer
        whereas CF is you listened to this in any order. so how does this correlate to the CF stuff.
      • 2022-02-09 04000, 2022

      • mayhem
        interesting approach.
      • 2022-02-09 04018, 2022

      • mayhem
        the real problem is that we dont have both approaches to examine.
      • 2022-02-09 04027, 2022

      • lucifer
        indeed
      • 2022-02-09 04028, 2022

      • mayhem
        resultant data sets from both.. to examine.
      • 2022-02-09 04055, 2022

      • lucifer
        both provide different types of data and CF is really opaque about the intermediate steps.
      • 2022-02-09 04037, 2022

      • mayhem
        I think we should work to understand the variances in the current approach and get to something that we consider "the best we can do with this alg, and we understand it roughly".
      • 2022-02-09 04044, 2022

      • lucifer
        another issue with CF is that all of it is highly mathematical, whereas this is mostly SQL so still understandable :)
      • 2022-02-09 04053, 2022

      • lucifer
        +1
      • 2022-02-09 04057, 2022

      • mayhem
        then we should implement the CF based version and compare them side by side.
      • 2022-02-09 04011, 2022

      • mayhem
        OH!
      • 2022-02-09 04022, 2022

      • mayhem
        co-learning. that was the word, no?
      • 2022-02-09 04035, 2022

      • lucifer
        transfer learning?
      • 2022-02-09 04040, 2022

      • mayhem
        we can actually see where the algs agree and combine the results.
      • 2022-02-09 04043, 2022

      • mayhem
        that is it!
      • 2022-02-09 04052, 2022

      • mayhem
        combine.
      • 2022-02-09 04019, 2022

      • lucifer
        yup makes sense. a lot of this is going to experiments and seeing what works out the best.
      • 2022-02-09 04002, 2022

      • lucifer
        kind of how picard's magic similarity weights for finding track matches work lol.
      • 2022-02-09 04018, 2022

      • mayhem
        but, in the end the two algs have different fundamental statements.
      • 2022-02-09 04030, 2022

      • mayhem
        I wonder if that code has changed much since I wrote that, lol.
      • 2022-02-09 04038, 2022

      • lucifer
        CF ?
      • 2022-02-09 04000, 2022

      • mayhem
        so, CF = if you've ever listened to a track, use how many times as input.
      • 2022-02-09 04016, 2022

      • mayhem
        in ours the tracks have to be listened to by the same person, and in proximity.
      • 2022-02-09 04040, 2022
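The contrast can be sketched roughly: CF-style input is unordered (user, track, play_count) triples, while this approach only pairs tracks the same person played close together in their listen history (purely illustrative):

```python
from collections import Counter

history = [("u1", "a"), ("u1", "b"), ("u1", "a")]  # (user, track) in play order

# CF input: unordered play counts per (user, track)
cf_input = Counter((user, track) for user, track in history)

# proximity input: adjacent plays by the same user, order preserved
proximity_pairs = [
    (t1, t2)
    for (u1, t1), (u2, t2) in zip(history, history[1:])
    if u1 == u2
]
print(dict(cf_input))    # {('u1', 'a'): 2, ('u1', 'b'): 1}
print(proximity_pairs)   # [('a', 'b'), ('b', 'a')]
```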

      • monkey has quit
      • 2022-02-09 04053, 2022

      • monkey joined the channel
      • 2022-02-09 04034, 2022

      • lucifer
        there was actually an intro paper on CF i read, TV show recommender. in that they pruned the input dataset based on the time order before feeding it to CF model.
      • 2022-02-09 04054, 2022

      • mayhem makes a note about genetic algorithms and playlist generation
      • 2022-02-09 04044, 2022

      • lucifer
        lots of cool stuff still left to try! :D
      • 2022-02-09 04000, 2022

      • mayhem
        yes.
      • 2022-02-09 04013, 2022

      • mayhem
        and remember, our current goal is "not bad!".
      • 2022-02-09 04019, 2022

      • lucifer
        indeed.
      • 2022-02-09 04034, 2022

      • lucifer
        i am going to bed currently but will work on building the small dataset tomorrow. nn!
      • 2022-02-09 04039, 2022

      • mayhem
        and once we have a full chain of everything, then we can work to improve. and hopefully we will get help
      • 2022-02-09 04048, 2022

      • lucifer
        +1
      • 2022-02-09 04051, 2022

      • mayhem
        nn, I should get off the computer too.
      • 2022-02-09 04055, 2022

      • mayhem
        sleep well!
      • 2022-02-09 04002, 2022

      • lucifer
        you too
      • 2022-02-09 04037, 2022

      • Mineo has quit