#metabrainz

      • ruaok
      • 2017-04-19 10950, 2017

      • ruaok
        the influx schema bits are here:
      • 2017-04-19 10951, 2017

      • ruaok
      • 2017-04-19 10907, 2017

      • ruaok
        any feedback would be appreciated.
      • 2017-04-19 10925, 2017

      • ruaok
        I used to have only one measurement called "listen" which turned out to be a bad idea.
      • 2017-04-19 10944, 2017

      • ruaok
        each user has their own measurement now.
      • 2017-04-19 10922, 2017

      • zas
        why was it a bad idea?
      • 2017-04-19 10943, 2017

      • ruaok
        too many points in one measurement and the load on lemmy shot to 100.
      • 2017-04-19 10958, 2017

      • ruaok
        spread it across many more measurements and influx is much happier.
      • 2017-04-19 10905, 2017

      • zas
        yup, makes sense
      • 2017-04-19 10937, 2017

      • ruaok
        that was hard to see when first using influx. like other non relational data stores, it takes a while to change your thinking.
      • 2017-04-19 10959, 2017

      • zas
        well, in theory, using user_name as a tag is ok; using separate measurements for each user is obviously more complex when it comes to summing up listens
      • 2017-04-19 10937, 2017

      • ruaok
        yeah, but summing up listens is old skool thinking. adding a new measurement to keep track of them is key.
      • 2017-04-19 10951, 2017

      • ruaok
        write more, don't update/delete
      • 2017-04-19 10956, 2017

      • zas
        yes, much faster approach
      • 2017-04-19 10927, 2017
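
A minimal sketch of the layout ruaok describes, written as InfluxDB line protocol strings: each listen goes into the user's own measurement, and a running total is appended to a separate measurement instead of being summed or updated in place. The field name `track_name`, the measurement `listen_count`, and the helper names are assumptions for illustration; the actual ListenBrainz schema is not shown in this log.

```python
def escape_measurement(name):
    # Line protocol: commas and spaces in a measurement name are backslash-escaped.
    return name.replace(",", "\\,").replace(" ", "\\ ")


def escape_tag(value):
    # Tag keys and values additionally need "=" escaped.
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_string_field(value):
    # String field values are double-quoted; backslashes and quotes are escaped.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def listen_point(user_name, track_name, listened_at):
    """One listen, written to the user's own measurement (hypothetical field name)."""
    return "{} track_name={} {}".format(
        escape_measurement(user_name),
        escape_string_field(track_name),
        listened_at * 10**9,  # seconds to nanoseconds, the default precision
    )


def count_point(user_name, total, written_at):
    """Append-only running total in a separate, hypothetical measurement."""
    return "listen_count,user_name={} total={}i {}".format(
        escape_tag(user_name), total, written_at * 10**9
    )


print(listen_point("rob k", "Some Track", 1492598400))
print(count_point("rob k", 1234, 1492598400))
```

Writing a new count point on every batch keeps the store append-only, which is the "write more, don't update/delete" idea from the conversation.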

      • bochecha has quit
      • 2017-04-19 10924, 2017

      • zag has quit
      • 2017-04-19 10918, 2017

      • d4rkie has quit
      • 2017-04-19 10945, 2017

      • D4RK-PH0ENiX joined the channel
      • 2017-04-19 10915, 2017

      • samj1912 joined the channel
      • 2017-04-19 10944, 2017

      • ruaok wonders if logging into jira is something he could track as a fitness activity in Google Fit.
      • 2017-04-19 10912, 2017

      • agentsim joined the channel
      • 2017-04-19 10949, 2017

      • agentsim has quit
      • 2017-04-19 10927, 2017

      • SothoTalKer joined the channel
      • 2017-04-19 10902, 2017

      • Freso
        Did anyone ever hear the complaint that "MB wants too much personal data during registration" before?
      • 2017-04-19 10913, 2017

      • Quesito hopes ruaok knows how unreliable the data is from wearables....
      • 2017-04-19 10920, 2017

      • SothoTalker_ has quit
      • 2017-04-19 10954, 2017

      • Quesito
        Freso: that has to be a cop out for I'm lazy....
      • 2017-04-19 10955, 2017

      • ruaok
        Quesito: yep, it is a suggestion at best.
      • 2017-04-19 10911, 2017

      • Quesito
        ;)
      • 2017-04-19 10925, 2017

      • SothoTalker_ joined the channel
      • 2017-04-19 10942, 2017

      • ruaok
        yeah, pure BS, Freso.
      • 2017-04-19 10917, 2017

      • SothoTalKer has quit
      • 2017-04-19 10947, 2017

      • alastairp
        Freso: kind of interesting that those releases are in mb at all!
      • 2017-04-19 10938, 2017

      • Freso
        alastairp: :)
      • 2017-04-19 10947, 2017

      • reosarevok
        Freso: I guess some people might somehow think that all the stuff in the profile is mandatory?
      • 2017-04-19 10957, 2017

      • reosarevok
        I know a bunch of artists are confused about all the fields in the Add Artist page for example because they assume every single one is mandatory and they don't know what IPI is or something
      • 2017-04-19 10921, 2017

      • Freso
        Yeah.
      • 2017-04-19 10927, 2017

      • reosarevok
        (to be fair, "bolded label means mandatory" is neither obvious nor explained)
      • 2017-04-19 10948, 2017

      • Freso
        Will IPIs and ISNIs be moved to attributes with the schema change? Or will that have to be done later down the line?
      • 2017-04-19 10932, 2017

      • ruaok
        crap. the metabrainz press corps realized that chocolate should be present in the office at all times.
      • 2017-04-19 10942, 2017

      • ruaok
        anyone need chocolate?? come on down!
      • 2017-04-19 10932, 2017

      • naught101 joined the channel
      • 2017-04-19 10954, 2017

      • Rotab
        y...yes
      • 2017-04-19 10931, 2017

      • alastairp
        ruaok: LB-154, interesting
      • 2017-04-19 10931, 2017

      • BrainzBot
        LB-154: Use sets rather than dicts to unique timestamps https://tickets.metabrainz.org/browse/LB-154
      • 2017-04-19 10946, 2017

      • alastairp
        I thought that dict keys hash exactly the same as sets?
      • 2017-04-19 10913, 2017

      • ruaok
        functionally they are the same, but I agree with zas that sets are faster for this.
      • 2017-04-19 10922, 2017

      • alastairp
        If not, I've learned something
      • 2017-04-19 10928, 2017

      • ruaok
        from what I know.
      • 2017-04-19 10944, 2017

      • ruaok
        easy to find out. let me file this paper work and hack up a test.
      • 2017-04-19 10911, 2017

      • ruaok makes a calendar entry for 2019
      • 2017-04-19 10903, 2017

      • ruaok
      • 2017-04-19 10907, 2017

      • ruaok
        not at all what I expected.
      • 2017-04-19 10912, 2017

      • ruaok
        did I do that right?
      • 2017-04-19 10919, 2017

      • ruaok
        1.3257920742
      • 2017-04-19 10919, 2017

      • ruaok
        3.98488116264
      • 2017-04-19 10952, 2017

      • ruaok
        on 10M rows, now with verifying that the same results are generated.
      • 2017-04-19 10945, 2017

      • colbydray joined the channel
      • 2017-04-19 10949, 2017

      • alastairp
      • 2017-04-19 10933, 2017

      • Gentlecat
      • 2017-04-19 10909, 2017

      • alastairp
        sets.Set is very deprecated. I'm not sure if it has the same behavior as 'set'
      • 2017-04-19 10913, 2017

      • alastairp
        Interesting, I don't see an entry for set.add
      • 2017-04-19 10922, 2017

      • alastairp
        But there is for dict insert
      • 2017-04-19 10911, 2017

      • Gentlecat
      • 2017-04-19 10913, 2017

      • ruaok
        really, measuring this differently isn't going to change anything for a simple case like this.
      • 2017-04-19 10921, 2017

      • Gentlecat
        it's constant
      • 2017-04-19 10921, 2017

      • alastairp
        Yeah, I thought that was the case
      • 2017-04-19 10947, 2017

      • alastairp
        I wonder what the speed of appending to a list then calling set(list) is
      • 2017-04-19 10954, 2017

      • Gentlecat
      • 2017-04-19 10930, 2017

      • ruaok
        ok, now using set (not Set):
      • 2017-04-19 10935, 2017

      • ruaok
        1.37781596184
      • 2017-04-19 10935, 2017

      • ruaok
        1.85512280464
      • 2017-04-19 10940, 2017

      • Gentlecat
        however much time it takes to go through a list, I imagine
      • 2017-04-19 10944, 2017

      • ruaok
        sets are still slower.
      • 2017-04-19 10910, 2017

      • ruaok
        well, both do it the same way, so that doesn't factor into determining the difference between the two.
      • 2017-04-19 10925, 2017
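
The test script itself is not preserved in this log. A rough sketch of the kind of comparison being discussed, assuming integer timestamps with many duplicates, could look like this; the function names and the sample size are illustrative, and the timings will vary with Python version and hardware.

```python
import random
import timeit


def unique_with_dict(timestamps):
    # Unique the values by using them as dict keys.
    seen = {}
    for ts in timestamps:
        seen[ts] = True
    return seen


def unique_with_set(timestamps):
    # Unique the values by adding them to a built-in set.
    seen = set()
    for ts in timestamps:
        seen.add(ts)
    return seen


def time_both(timestamps, repeat=3):
    """Return (dict_seconds, set_seconds), best of `repeat` runs each."""
    dict_time = min(timeit.repeat(lambda: unique_with_dict(timestamps),
                                  number=1, repeat=repeat))
    set_time = min(timeit.repeat(lambda: unique_with_set(timestamps),
                                 number=1, repeat=repeat))
    return dict_time, set_time


# Hypothetical input: 200,000 timestamps drawn from a small range.
sample = [random.randint(0, 50000) for _ in range(200000)]

# Verify that both approaches produce the same unique values before timing.
assert set(unique_with_dict(sample)) == unique_with_set(sample)

dict_seconds, set_seconds = time_both(sample)
print("dict: %.4fs  set: %.4fs" % (dict_seconds, set_seconds))
```

Taking the best of several runs reduces noise from other processes, which matters when the two timings are as close as the ones quoted above.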

      • alastairp
        Wow, that's really interesting
      • 2017-04-19 10939, 2017

      • alastairp
        That the set type is so much faster than the class
      • 2017-04-19 10959, 2017

      • alastairp
        Which version of python?
      • 2017-04-19 10910, 2017

      • ruaok
      • 2017-04-19 10915, 2017

      • ruaok
        hipster classic, of course.
      • 2017-04-19 10917, 2017

      • ruaok
        2.2!
      • 2017-04-19 10918, 2017

      • D4RK-PH0ENiX has quit
      • 2017-04-19 10946, 2017

      • alastairp
        This is where a nerd will decompile those two statements and see exactly how many operations each is doing :)
      • 2017-04-19 10913, 2017

      • ruaok
        thought about it. didn't care enough.
      • 2017-04-19 10919, 2017

      • ruaok
        checking 2 vs 3 though
      • 2017-04-19 10943, 2017

      • alastairp
        Cool
      • 2017-04-19 10901, 2017

      • ruaok
        O_O
      • 2017-04-19 10943, 2017

      • ruaok
      • 2017-04-19 10946, 2017

      • ruaok
        clear as mud. :)
      • 2017-04-19 10949, 2017

      • alastairp
        How many items are we considering?
      • 2017-04-19 10955, 2017

      • alastairp
        In lb
      • 2017-04-19 10900, 2017

      • ruaok
        dicts got faster, sets got slower.
      • 2017-04-19 10924, 2017

      • ruaok
        for that piece of code, we're talking about a batch which is limited to the size of a block of listens.
      • 2017-04-19 10926, 2017

      • ruaok
        sub 100.
      • 2017-04-19 10933, 2017

      • alastairp
        Hmm
      • 2017-04-19 10937, 2017

      • ruaok
        ie. not worth talking about for the most part.
      • 2017-04-19 10941, 2017

      • alastairp
        Yeah
      • 2017-04-19 10910, 2017

      • alastairp
        Sounds like premature optimisation... My recommendation would be to optimise for readability not speed
      • 2017-04-19 10923, 2017

      • ruaok
        the bigger problem we have is that we do too much serializing/deserializing the incoming listens. and passing over them too many times to sanity check them.
      • 2017-04-19 10940, 2017

      • alastairp
        Especially considering the two timestamp conversions that you're doing in the same loop
      • 2017-04-19 10945, 2017

      • ruaok
        well, I've done a pile of stress tests to see what is going to be a problem.
      • 2017-04-19 10901, 2017

      • alastairp
        Yeah, we wondered about that when we first did it
      • 2017-04-19 10908, 2017

      • alastairp
        Even with ujson?
      • 2017-04-19 10915, 2017

      • ruaok
        the influx schema was a problem, but is much better now.
      • 2017-04-19 10935, 2017

      • ruaok
        yes, even with ujson. we're just doing too much stuff that needs streamlining.
      • 2017-04-19 10900, 2017

      • ruaok
        to the point where, when importing listens, the sanity checking of the incoming data is the major bottleneck now.
      • 2017-04-19 10944, 2017

      • ruaok
        which is why I stuffed tons of fake listens directly into rabbitmq to find problems.
      • 2017-04-19 10902, 2017

      • alastairp
        Right
      • 2017-04-19 10918, 2017

      • ruaok
        I'm now happy that we'll avoid the obvious walls and i have other, less pressing issues on my radar.
      • 2017-04-19 10943, 2017

      • ruaok
        but my focus still remains data integrity, which I'm almost happy with.
      • 2017-04-19 10956, 2017

      • alastairp
        This is things like 'validate_listens'?
      • 2017-04-19 10908, 2017

      • alastairp
        Or is it bigger than that?
      • 2017-04-19 10914, 2017

      • ruaok
        I need to run a test and make sure that all data that enters correctly ends up at BQ.
      • 2017-04-19 10921, 2017

      • ruaok
        yeah, mostly all that.
      • 2017-04-19 10929, 2017

      • alastairp
        I'll see if I can put it through a profiler to see where the bad parts are
      • 2017-04-19 10938, 2017

      • alastairp
        That's not very difficult
      • 2017-04-19 10942, 2017

      • ruaok
        it is too early for a profiler.
      • 2017-04-19 10948, 2017

      • ruaok
        there is just stupid shit going on. :)
      • 2017-04-19 10913, 2017

      • ruaok
        we can find problems by inspection, much like shooting fish in a barrel.
      • 2017-04-19 10926, 2017

      • ruaok
        but, this is easy to fix as opposed to getting the schema wrong.
      • 2017-04-19 10928, 2017

      • alastairp
        Sure. So if you can point out a stupid thing, we can fix it. But the profiler will tell us the same thing without having to wait for you
      • 2017-04-19 10952, 2017

      • alastairp
        We can inspect for loops and things
      • 2017-04-19 10909, 2017

      • ruaok
        sure, if you want to do more premature optimization, go for it. :) :)
      • 2017-04-19 10945, 2017

      • ruaok
        I think in passing the listens around we do more than one conversion of the data and that is dumb.
      • 2017-04-19 10956, 2017

      • Gentlecat
        profiler actually helps you figure out if something is premature or not
      • 2017-04-19 10922, 2017

      • ruaok
        do premature optimization with a profiler to determine if optimization is premature?
      • 2017-04-19 10925, 2017

      • ruaok
        I like it, very meta. :)
      • 2017-04-19 10933, 2017

      • ruaok
        and recursive.
      • 2017-04-19 10908, 2017

      • ruaok
        in any case, I've got much more significant issues to tackle for the time being, so I am going to focus on those
      • 2017-04-19 10919, 2017

      • alastairp
        Just had a look at validate. There are not many loops :(
      • 2017-04-19 10920, 2017

      • ruaok
        I know what is slow and that is sufficient for now.
      • 2017-04-19 10940, 2017

      • ruaok
        I think the slowness comes one level up.
      • 2017-04-19 10941, 2017

      • alastairp
        Uuid validation. Perhaps that takes some time
      • 2017-04-19 10945, 2017

      • alastairp
        Ah, ok
      • 2017-04-19 10951, 2017

      • ruaok
        parse for validate, then store.
      • 2017-04-19 10957, 2017

      • D4RK-PH0ENiX joined the channel
      • 2017-04-19 10959, 2017

      • ruaok
        parse/convert again for dedup
      • 2017-04-19 10916, 2017
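
One way to picture the alternative to "parse for validate, then parse again for dedup" is a single pass that parses each payload once and reuses the parsed object for both steps. This is only a sketch: `validate_listen`, `process_batch`, the required fields, and the dedup key are all hypothetical and do not reflect the actual ListenBrainz code.

```python
import json


def validate_listen(listen):
    """Hypothetical sanity check on an already-parsed listen."""
    if not isinstance(listen.get("listened_at"), int):
        raise ValueError("listened_at must be an integer timestamp")
    if not listen.get("track_name"):
        raise ValueError("track_name is required")


def process_batch(raw_listens):
    """Parse each payload once, then validate and dedup the parsed dict."""
    seen = set()
    unique = []
    for raw in raw_listens:
        listen = json.loads(raw)  # the only parse in the whole flow
        validate_listen(listen)
        key = (listen["listened_at"], listen["track_name"])  # assumed dedup key
        if key not in seen:
            seen.add(key)
            unique.append(listen)
    return unique


batch = [
    '{"listened_at": 1492598400, "track_name": "a"}',
    '{"listened_at": 1492598400, "track_name": "a"}',
    '{"listened_at": 1492598460, "track_name": "b"}',
]
print(len(process_batch(batch)))
```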

      • Gentlecat
        if you don't know for sure what's slow then it's just guesswork
      • 2017-04-19 10917, 2017

      • alastairp
        Right, now I see where you're coming from
      • 2017-04-19 10937, 2017

      • alastairp
        That's definitely the right place to start, then
      • 2017-04-19 10937, 2017

      • Gentlecat
        and there are tools to help with that, so we should use them
      • 2017-04-19 10938, 2017
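
As an illustration of the tooling Gentlecat refers to, the standard library's `cProfile` and `pstats` modules can show where time goes in a call tree. The workload below is a made-up stand-in that deliberately parses every payload twice; `validate`, `dedup`, `process_batch`, and `profile_call` are hypothetical names, not ListenBrainz functions.

```python
import cProfile
import io
import json
import pstats


def validate(raw_listens):
    for raw in raw_listens:
        json.loads(raw)  # first parse, result thrown away


def dedup(raw_listens):
    seen = {}
    for raw in raw_listens:
        listen = json.loads(raw)  # second parse of the same payload
        seen[listen["listened_at"]] = listen
    return list(seen.values())


def process_batch(raw_listens):
    validate(raw_listens)
    return dedup(raw_listens)


def profile_call(func, *args):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


batch = [json.dumps({"listened_at": i % 50, "track_name": "t"}) for i in range(5000)]
result, report = profile_call(process_batch, batch)
print(report)
```

In a report like this, the call count for `json.loads` is twice the number of listens, which is the kind of code-flow issue (rather than a slow individual statement) that the discussion is about.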

      • ruaok
        we need to review the code flow, not individual statements.
      • 2017-04-19 10911, 2017

      • Gentlecat
        http://www.brendangregg.com/flamegraphs.html would actually help with code flow
      • 2017-04-19 10926, 2017

      • Gentlecat
        if I understand what you mean by "code flow" correctly
      • 2017-04-19 10932, 2017

      • D4RK-PH0ENiX has quit
      • 2017-04-19 10939, 2017

      • D4RK-PH0ENiX joined the channel
      • 2017-04-19 10938, 2017

      • zas
        ruaok: making a loop to add to a set() isn't the best idea when you can just pass the iterable as the set() argument: s = set(); for x in r: s.add(x) vs s = set(r)
      • 2017-04-19 10957, 2017

      • ruaok
        zas: sure, but for this test it doesn't matter.
      • 2017-04-19 10957, 2017

      • zas
        but i don't know if it is actually faster, or by how much
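
The open question here, and alastairp's earlier one about appending to a list and then calling set(list), can be checked with a small timing sketch. The helper names and the input size are assumptions; results depend on the Python version and the data.

```python
import timeit


def loop_add(items):
    # Explicit loop, one set.add() call per item.
    s = set()
    for x in items:
        s.add(x)
    return s


def from_iterable(items):
    # Hand the whole iterable to the set constructor.
    return set(items)


def list_then_set(items):
    # Append to a list first, then convert once at the end.
    collected = []
    for x in items:
        collected.append(x)
    return set(collected)


def best_time(func, items, repeat=3):
    """Best wall-clock time of `repeat` single runs of func(items)."""
    return min(timeit.repeat(lambda: func(items), number=1, repeat=repeat))


# Hypothetical input: every value appears twice.
items = list(range(100000)) * 2

# All three variants must agree before their timings mean anything.
assert loop_add(items) == from_iterable(items) == list_then_set(items)

for func in (loop_add, from_iterable, list_then_set):
    print("%-14s %.4fs" % (func.__name__, best_time(func, items)))
```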