#metabrainz

      • agatzk has quit
      • agatzk joined the channel
      • chinmay joined the channel
      • hackerman2 joined the channel
      • hackerman has quit
      • hackerman2 is now known as hackerman
      • yellowhatpro
        yussir (ง •_•)ง
      • chinmay has quit
      • chinmay joined the channel
      • ROpdebee has quit
      • revi has quit
      • ijc has quit
      • KassOtsimine has quit
      • BenOckmore has quit
      • lucifer has quit
      • pprkut has quit
      • mruszczyk has quit
      • mayhem has quit
      • ijc joined the channel
      • pprkut joined the channel
      • BenOckmore joined the channel
      • mayhem joined the channel
      • lucifer joined the channel
      • KassOtsimine joined the channel
      • mruszczyk joined the channel
      • ROpdebee joined the channel
      • KassOtsimine has quit
      • KassOtsimine joined the channel
      • revi joined the channel
      • revi has quit
      • revi joined the channel
      • param has quit
      • FichteFoll has quit
      • param joined the channel
      • FichteFoll joined the channel
      • texke has quit
      • texke joined the channel
      • Shubh has quit
      • Shubh joined the channel
      • nbin has quit
      • MRiddickW has quit
      • MRiddickW joined the channel
      • nbin joined the channel
      • Etua joined the channel
      • mayhem
        monkey: so, the good news is that the mb metadata cache now fetches release tags.
      • the bad news is that no one cares about release tags. we should care about release group tags. 🤦‍♂️
      • lucifer: last night when I was going to bed I thought about a problem we have in LB.
      • namely writing loads of stats and other data into PG each day.
      • we're pushing PG quite hard to insert all this data and we only have a few users now.
      • this will become a bottleneck for us soon.
      • BrainzGit
        [sir] ta264 opened pull request #132 (master…v3): V3 https://github.com/metabrainz/sir/pull/132
      • [sir] ta264 closed pull request #132 (master…v3): V3 https://github.com/metabrainz/sir/pull/132
      • mayhem
        and I think we're using PG incorrectly here -- I think for the most part we never query the data in any way except to serve the data.
      • I wonder if we should write one very large file with all of the results in JSON or compressed JSON, grouped by user and then have a very small python module resolve a user name to a chunk on disk. then have nginx serve the data from a memory mapped file.
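A minimal sketch of the single-file-plus-index idea described above, assuming hypothetical names and paths (this is not existing ListenBrainz code): every user's JSON blob is appended to one big file, a small index maps each username to its start offset and chunk size, and a tiny resolver reads the chunk back through a memory-mapped view.

```python
# Hypothetical sketch of the "one big file + per-user index" idea; all names
# and paths are made up for illustration, not taken from ListenBrainz.
import json
import mmap


def write_stats_file(stats_by_user, data_path, index_path):
    """Concatenate each user's JSON blob into one file and record offsets."""
    index = {}
    with open(data_path, "wb") as f:
        for user, stats in stats_by_user.items():
            blob = json.dumps(stats).encode("utf-8")
            index[user] = (f.tell(), len(blob))  # (start offset, chunk size)
            f.write(blob)
    with open(index_path, "w") as f:
        json.dump(index, f)


def read_user_stats(user, data_path, index_path):
    """Resolve a username to its chunk and read it from a memory-mapped view."""
    with open(index_path) as f:
        offset, size = json.load(f)[user]
    with open(data_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return json.loads(mm[offset:offset + size])
```

A daily rebuild could simply write a fresh data/index pair and rename it into place, which fits the "some users briefly see day-old stats" trade-off discussed further down.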
      • Etua has quit
      • alastairp
        morning
      • mayhem
        moin
      • alastairp
        that sounds similar to the issues we ended up having with AB
      • mayhem
        yes, same pattern.
      • alastairp
        if it's just a block of data that you store somewhere and serve directly back to people, static json sounds good
      • mayhem
        I am not 100% sure if we never query the data for anything, lucifer would know better.
      • alastairp
        right
      • mayhem
        but if we don't we should move to that system.
      • because it seems that every week we want to store more data.
      • and if we have a sudden explosion of users, bam, we're stuck
      • lucifer
        i agree with the PG issue. for fixing it though, i'd prefer looking into a ready made JSON data store.
      • we only query for calculating the artist map stats.
      • mayhem
        I think investigating JSON data store is a pretty good idea.
      • but I have this feeling that if we're going to replace 100% of each group of data each day, a data store is not a good way to go. it will always have a cost greater than simply writing a file to disk.
      • if we were replacing only a subset of the data each day, then absolutely yes a json data store.
      • lucifer
        i think that is what stuff like mongo does anyway. store it as json files and internally keep an index of which file is which record etc.
      • mayhem shudders when hearing mongo.
      • not proposing we use mongo :)
      • mayhem
        does mongo still lose data?
      • ok, good. :) phew.
      • but if the datastore is set up for that use case, then great.
      • lucifer
        my understanding is that document dbs/json data stores work like the small python module you described. we can try some and if one works then fine, else we can try to write our own.
      • mayhem
        ok, sounds good.
      • lucifer
        (another option can be to create stats incrementally which might be possible but needs further exploration spark side)
      • mayhem
        well, that means incrementally updating large amounts of data. Not a good pattern for big data.
      • I think we will get more done in a total recalculation and replacement daily.
      • lucifer
        i see, makes sense.
      • alastairp
        and just throwing some ideas around - are these stats already stored in spark/hdfs somehow?
      • Etua joined the channel
      • lucifer
        they aren't currently but could be, and an api could be exposed on top of spark. fwiw, there exist tools like Apache Pinot, for example, which are capable of reading off spark and exposing realtime user-visible analytics.
      • i haven't explored those yet but the premise is really interesting and it might be useful to look into such tools in the future.
      • so that you could expose an internal api from spark to serve stats and have LB backend call it and forward the response.
      • Etua has quit
      • mayhem
        I was wondering about that, but I kinda dismissed it since the column major representation would have to get converted to row major and that sounds like a royal pain.
      • lucifer
        as long as we don't have to do it manually, it'd be fine i think
      • but spark doesn't mandate column major or row major, you could just store it as json files directly.
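For what it's worth, a minimal PySpark sketch of writing stats out of Spark as plain JSON, partitioned by user; the paths and the user_name column are assumptions for illustration, not the actual LB job:

```python
# Hypothetical PySpark sketch: dump already-computed stats as JSON files,
# one directory per user. Paths and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-json-export").getOrCreate()

# assume the computed stats live in a DataFrame with a user_name column
stats_df = spark.read.parquet("/data/stats/listening_activity")

(stats_df
    .repartition("user_name")
    .write
    .mode("overwrite")
    .partitionBy("user_name")
    .json("/data/stats_json/listening_activity"))
```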
      • mayhem
        oh, very good. then that might be a very good solution.
      • atj
        you replace the entire dataset every day?
      • BrainzGit
        [bookbrainz-site] tr1ten opened pull request #844 (master…achievement-progress): Feat(Achievement): Show achievement progress https://github.com/metabrainz/bookbrainz-site/p...
      • mayhem
        atj: yes
      • atj
        does the update have to be atomic?
      • mayhem
        it is easier to replace everything once a day than to work out the differences between days.
      • not in this case, no.
      • atj
        is the data indexed by a key, like a username or id?
      • mayhem
        there is no downside to having one user be served day old stats and another user be served fresh stats for a short window of time.
      • username.
      • lucifer
        the table is like (username, stat_type, stat_range, last_update, stat_json)
      • mayhem
        but there are/will be other tables as well, organized differently, but always with a primary key, likely to be user.
      • atj
        so I'm lacking a lot of context here so this might be totally wrong/unsuitable/stupid, but what immediately came to mind was a prefix based directory hierarchy, based on the hash of an ID
      • mayhem
        atj: that is exactly the direction we were going for, or at least me. so you're pretty much spot on.
      • except that filesystems don't like large numbers of files.
      • atj
        well, they don't like large numbers of files in a single directory
      • mayhem
        so my idea was to write to a single file and have a user index: user, start offset, chunk size.
      • atj
        which is I think partly why this pattern came about
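A minimal sketch of the prefix-based layout atj is describing, assuming the username is hashed and two hex characters are used per directory level (names and paths here are hypothetical):

```python
# Hypothetical sketch of a prefix-based directory hierarchy: hash the username
# and use the first hex characters as directory levels, so no single directory
# ends up with a huge number of files (256 dirs per level, 65536 leaf dirs).
import hashlib
from pathlib import Path


def stats_path(base_dir, user_name):
    digest = hashlib.sha1(user_name.encode("utf-8")).hexdigest()
    # e.g. a digest starting with "abcd..." maps to <base>/ab/cd/<user>.json
    return Path(base_dir) / digest[:2] / digest[2:4] / f"{user_name}.json"


path = stats_path("/data/stats", "some_user")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text('{"example": "stats blob"}')
```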
      • mayhem
        true, but even large numbers of files distributed into a well-organized filesystem tree are always going to incur more overhead than writing one massive file.
      • there is massive overhead in creating dirs and opening and closing files.
      • all that can be skipped if we just open a file once, write it, then close it.
      • but we may not need to do that -- we may be able to write the results in HDFS on our cluster and serve them from there.
      • lucifer
        i don't know about the overheads but this one file could be rather huge, think 30-40G.
      • atj
        I agree with the overheads of managing lots of files, but managing huge files has its own issues
      • mayhem
        lucifer: oh, that would actually make the cluster part of our infrastructure that must be up at all times. perhaps not such a good idea after all.
      • and I am not suggesting that we write ALL stats into one file.
      • but each class into one file, so that we have 10-15 large files.
      • last month stats one file, last year stats another, all time yet another
      • atj
        also, a directory structure makes sharding pretty trivial, if it comes to that
      • mayhem
        good point.
      • atj
        are you ever likely to want to open that 30G JSON file in vim to check something? :)
      • mayhem
        no, but I am also not suggesting 30G JSON files. perhaps a GB or so max.
      • atj
        ah, ok
      • anyway, just some thoughts that came to mind after reading your discussion
      • mayhem
        thanks for your input!
      • Etua joined the channel
      • atj
        mmap makes me nervous, but I think that's just because I've never properly understood how it works in practice :)
      • mayhem thanks his cranky comp sci prof for making him implement a virtual memory scheme in a project for computer architectures 2
      • monkey
        mayhem:
      • Woops. Good and bad about the tags
      • mayhem
        software architorture, more like.
      • monkey
        We did talk about that issue: assembling tags from releases and rgs
      • atj
        heh
      • mayhem
        monkey: I'll fix up another PR this afternoon, I hope.
      • assembling them is pretty easy. we just need to figure out how to make the UI clear that the user is tagging a release group and not the release.
      • I'll need to send along the release group id as well.
      • lucifer
        mayhem: yeah indeed, i am not suggesting we do that right now. HDFS is probably slower than the filesystem and yes, uptime is another issue. maybe in the future, when stuff is more stable and we have a tool at hand to make it go faster.
      • Etua has quit
      • mayhem
        lucifer: agreed. let's start by looking at JSON doc stores.
      • hehe, I am suggesting this knowing, for the first time, that "aw fuck it, just use PG as always" is not an acceptable outcome
      • texke has quit
      • lucifer
        hehe lol :D
      • i am trying to find json data stores and half of my results are about storing json in PG lol
      • mayhem
        pretty soon the entire world will be stored in PG.
      • lucifer
        🤞
      • mayhem
        I'm just so glad that people are finally shutting up about mysql.
      • and sqlite has many more installs than mysql. heh.