#metabrainz


      • bukwurm
        Everything related to infrastructure works. Producers and consumers work with multiprocessing and the associated async tasks.
      • 2018-06-03 15409, 2018

      • bukwurm
        I have added a detailed README to each module wherever I could, so that it is easy to understand.
      • 2018-06-03 15421, 2018

      • bukwurm
        I have added JSDoc comments everywhere possible.
      • 2018-06-03 15442, 2018

      • bukwurm
        Flow check is a future goal.
      • 2018-06-03 15409, 2018

      • bukwurm
        Roughly, this happens:
      • 2018-06-03 15447, 2018

      • bukwurm
        Producer master process kicks off and receives arguments (names of the files to read and process).
      • 2018-06-03 15418, 2018

      • bukwurm
        Divides the args into as many chunks as there are workers.
      • 2018-06-03 15431, 2018

      • bukwurm
        Forks a worker
      • 2018-06-03 15459, 2018

      • bukwurm
        Worker initialises a connection with rmq (each process must have its own single connection).
      • 2018-06-03 15441, 2018

      • bukwurm
        *forks configured number of workers
      • 2018-06-03 15455, 2018

      • Dr-Flay joined the channel
      • 2018-06-03 15436, 2018

      • bukwurm
        Each worker runs a configured number of instanceFunctions (functions that each run and process one file)
      • 2018-06-03 15404, 2018

      • bukwurm
        instanceFunctions do all the processing and return a promise containing results
      • 2018-06-03 15450, 2018

      • bukwurm
        Upon completion of processing of all the args, the worker aggregates the results from each instanceFunction and sends the result to the master process. It is then terminated.
      • 2018-06-03 15417, 2018

      • bukwurm
        The master in turn aggregates all the results from each worker process and then is terminated.
      • 2018-06-03 15428, 2018
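
A minimal sketch of the producer flow described above, offered as an illustration rather than the actual asyncCluster code. All function names here (`chunkArgs`, `runWorker`, `runMaster`) are hypothetical, and the shape of an instance function's result (`{records: n}`) is an assumption. To stay self-contained, the workers are simulated with promises inside one process; a real implementation would presumably fork separate processes (for example with Node's `cluster` module) and pass results back to the master as messages.

```javascript
// Hypothetical names throughout; this is not the asyncCluster module's API.

// Distribute the file names round-robin into exactly `workerCount` chunks.
function chunkArgs(args, workerCount) {
  const chunks = Array.from({length: workerCount}, () => []);
  args.forEach((arg, index) => chunks[index % workerCount].push(arg));
  return chunks;
}

// Stand-in for one worker process: run the instance function on every file
// in its chunk, then aggregate the per-file results into one total.
async function runWorker(files, instanceFunction) {
  const results = await Promise.all(files.map((file) => instanceFunction(file)));
  return results.reduce((sum, result) => sum + result.records, 0);
}

// Stand-in for the master process: split the args, start one worker per
// chunk, and aggregate what each worker reports back.
async function runMaster(files, workerCount, instanceFunction) {
  const chunks = chunkArgs(files, workerCount);
  const totals = await Promise.all(
    chunks.map((chunk) => runWorker(chunk, instanceFunction))
  );
  return totals.reduce((sum, total) => sum + total, 0);
}
```

The round-robin split is one choice among several; it guarantees the number of chunks equals the number of workers even when the file count does not divide evenly.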

      • LordSputnik
        OK, and you've finalized the formats for the data stored in the queue?
      • 2018-06-03 15456, 2018

      • bukwurm
        LordSputnik: That's WIP. It's specific to OL records, and I am working on it.
      • 2018-06-03 15415, 2018

      • bukwurm
        The thing is, now if we want to add LoC dumps import.
      • 2018-06-03 15439, 2018

      • bukwurm
        All we need is to write an instance function for a single dump
      • 2018-06-03 15443, 2018

      • bukwurm
        And that's it.
      • 2018-06-03 15405, 2018

      • bukwurm
        Or for any other data source for that matter.
      • 2018-06-03 15427, 2018

      • bukwurm
        The asyncCluster module would automatically make it run in parallel.
      • 2018-06-03 15454, 2018
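
To illustrate why a new data source only needs a new instance function, here is a hedged sketch of a generic runner that accepts any instance function and keeps a configured number of them in flight. `runWithLimit`, `importOpenLibraryFile`, and `importLocFile` are hypothetical names, and the instance functions are stubs: real ones would parse the dump file they are given.

```javascript
// Hypothetical helper: run `instanceFunction` over `files`, keeping at most
// `limit` calls in flight at once. Resolves with results in input order.
async function runWithLimit(files, limit, instanceFunction) {
  const results = new Array(files.length);
  let next = 0; // index of the next file nobody has picked up yet

  // Each lane repeatedly takes the next unprocessed file until none remain.
  async function lane() {
    while (next < files.length) {
      const index = next++;
      results[index] = await instanceFunction(files[index]);
    }
  }

  const laneCount = Math.min(limit, files.length);
  await Promise.all(Array.from({length: laneCount}, () => lane()));
  return results;
}

// Stub instance functions for two sources (assumed shapes, for illustration).
const importOpenLibraryFile = async (file) => ({source: 'OL', file});
const importLocFile = async (file) => ({source: 'LoC', file});
```

Under this design, switching from OL to LoC means passing `importLocFile` instead of `importOpenLibraryFile`; the runner itself does not change.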

      • LordSputnik
        OK, that's good
      • 2018-06-03 15420, 2018

      • LordSputnik
        I'll probably be able to comment a lot more once I've been through the PRs
      • 2018-06-03 15436, 2018

      • bukwurm
        Something similar happens with the consumer process, just that it runs forever (it's supposed to listen to the queue continuously)
      • 2018-06-03 15448, 2018

      • bukwurm
        It only gets terminated using Ctrl+C
      • 2018-06-03 15427, 2018

      • LordSputnik
        Did you use winston in the end?
      • 2018-06-03 15441, 2018

      • bukwurm
        LordSputnik: Yes I settled on it.
      • 2018-06-03 15450, 2018

      • LordSputnik
        It looks fine to me, so good choice
      • 2018-06-03 15406, 2018

      • LordSputnik
        I will probably migrate the rest of the code to use it
      • 2018-06-03 15414, 2018

      • bukwurm
        Even if we don't use transports, it's still vastly nicer to use than the others.
      • 2018-06-03 15419, 2018

      • bukwurm
        LordSputnik: Great!
      • 2018-06-03 15443, 2018

      • bukwurm
        That's it for now.
      • 2018-06-03 15449, 2018

      • LordSputnik
        So do the import objects actually get created in the database now?
      • 2018-06-03 15428, 2018

      • bukwurm
        LordSputnik: No. That would require validators to be written, and finalising of the bb-data modules
      • 2018-06-03 15441, 2018

      • bukwurm
        Right now I am just logging the output.
      • 2018-06-03 15428, 2018

      • bukwurm
        I will try to finalise bb-data modules within two days.
      • 2018-06-03 15448, 2018

      • LordSputnik
        OK, so what's the next few steps in the plan?
      • 2018-06-03 15459, 2018

      • LordSputnik
        What will you be working on next week?
      • 2018-06-03 15405, 2018

      • bukwurm
        1. Finalise bb-data modules
      • 2018-06-03 15426, 2018

      • bukwurm
        2. Design data object to be produced by the producers for the OL dumps
      • 2018-06-03 15442, 2018

      • bukwurm
        3. Write validators for the incoming data
      • 2018-06-03 15459, 2018

      • bukwurm
        4. Connect bb-data modules and the consumer process
      • 2018-06-03 15421, 2018

      • bukwurm
        I will ignore the batch processing for now and look to implement one record import plan
      • 2018-06-03 15446, 2018

      • bukwurm
        Maybe next week we can work on optimising it
      • 2018-06-03 15446, 2018

      • LordSputnik
        OK, that sounds like a good set of goals
      • 2018-06-03 15407, 2018

      • LordSputnik
        With that in place you should have everything in the pipeline set up to do a basic import of the OL dumps
      • 2018-06-03 15410, 2018

      • LordSputnik
        Right?
      • 2018-06-03 15418, 2018

      • bukwurm
        LordSputnik: Yeah
      • 2018-06-03 15427, 2018

      • bukwurm
        That would mean full import completion
      • 2018-06-03 15414, 2018

      • LordSputnik
        So with those 4 objectives, maybe we can aim to get the whole set of OL dumps imported or starting to import by next meeting, to test this out?
      • 2018-06-03 15415, 2018

      • bukwurm
        I was also thinking of limiting the number of messages passed to consumers if the number of unacknowledged messages exceeds a limit
      • 2018-06-03 15444, 2018

      • LordSputnik
        Limit the number of messages in the queue at any time?
      • 2018-06-03 15408, 2018

      • bukwurm
        LordSputnik: No, the queue is durable and persistent and can handle a large amount of data
      • 2018-06-03 15430, 2018

      • bukwurm
        However, if consumer is connected to the queue
      • 2018-06-03 15443, 2018

      • bukwurm
        It will immediately receive any new message
      • 2018-06-03 15458, 2018

      • bukwurm
        This can potentially lead to buffer overflow
      • 2018-06-03 15424, 2018

      • bukwurm
        If the consumer is slow to deal with older messages and new ones come in real quick
      • 2018-06-03 15427, 2018

      • bukwurm
        One technique provided is to stop rmq sending further messages to a consumer once it has a given number of earlier messages still unacknowledged.
      • 2018-06-03 15401, 2018

      • bukwurm
        That way the queue will hold the messages till the consumer is ready to deal with them
      • 2018-06-03 15419, 2018

      • bukwurm
        It's a single option in the config, no biggie.
      • 2018-06-03 15435, 2018
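
In RabbitMQ terms the setting described above is the consumer prefetch count. The sketch below is an in-memory model of that behaviour, not client code for a real broker: `PrefetchQueue` and its methods are hypothetical names used only to show how the queue holds messages back until the consumer acknowledges earlier ones.

```javascript
// In-memory model of a prefetch window; not RabbitMQ client code.
class PrefetchQueue {
  constructor(prefetch, onDeliver) {
    this.prefetch = prefetch;   // max messages delivered but not yet acked
    this.onDeliver = onDeliver; // consumer callback
    this.pending = [];          // messages still held by the queue
    this.unacked = 0;           // delivered messages awaiting an ack
  }

  // Producer side: enqueue, then deliver if the window has room.
  publish(message) {
    this.pending.push(message);
    this.dispatch();
  }

  // Consumer side: acknowledge one delivered message, freeing a slot.
  ack() {
    if (this.unacked === 0) return;
    this.unacked -= 1;
    this.dispatch();
  }

  // Deliver held messages while the unacked count is below the limit.
  dispatch() {
    while (this.unacked < this.prefetch && this.pending.length > 0) {
      this.unacked += 1;
      this.onDeliver(this.pending.shift());
    }
  }
}
```

With a prefetch of 2, publishing four messages delivers only the first two; the third is delivered after the consumer acknowledges one of them, and the rest stay in the queue.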

      • bukwurm
        > maybe we can aim to get the whole set of OL dumps imported or starting to import by next meeting, to test this out?
      • 2018-06-03 15440, 2018

      • bukwurm
        Definitely!
      • 2018-06-03 15412, 2018

      • LordSputnik
        OK, but there has to be a limit on the queue too, because we'll have limited memory. Is there a fixed amount of memory RMQ will use? What does the producer do if RMQ rejects the message?
      • 2018-06-03 15453, 2018

      • bukwurm
        I have run producers on 500MB data dump split three ways, and it takes roughly 40 seconds on my laptop with a whole lot of stuff running.
      • 2018-06-03 15425, 2018

      • bukwurm
        LordSputnik: The producer will crash if the RMQ connection is refused.
      • 2018-06-03 15408, 2018

      • bukwurm
        I think we can limit the memory used by RMQ, but a better way would be
      • 2018-06-03 15426, 2018

      • bukwurm
        To feed only the amount of dump we are sure it can handle wholly at a time
      • 2018-06-03 15400, 2018

      • bukwurm
        Repeat the entire importing process one by one till we finish the entire import.
      • 2018-06-03 15429, 2018

      • LordSputnik
        OK, it would be good to have that limit to be dynamic, with RMQ and the producer talking to determine whether too much memory is in use
      • 2018-06-03 15439, 2018

      • LordSputnik
        That way we can easily port between servers
      • 2018-06-03 15451, 2018

      • central3 joined the channel
      • 2018-06-03 15455, 2018

      • bukwurm
        I am not sure how much disk memory we have at hand. Any ballpark?
      • 2018-06-03 15422, 2018

      • LordSputnik
        Hard disk space?
      • 2018-06-03 15431, 2018

      • bukwurm
        Yeah
      • 2018-06-03 15441, 2018

      • LordSputnik
        I don't actually know that
      • 2018-06-03 15455, 2018

      • LordSputnik
        ruaok or zas would probably know or be able to find out from the Google dashboard
      • 2018-06-03 15455, 2018

      • bukwurm
        2.5+ GB of authors, 500+ MB of works
      • 2018-06-03 15406, 2018

      • bukwurm
        But 27+ GB of editions
      • 2018-06-03 15408, 2018

      • bukwurm
        LordSputnik: I'll see how RMQ can be configured to limit its memory.
      • 2018-06-03 15441, 2018

      • bukwurm
        Although I am unsure if the producer can be made to 'talk' with the RMQ. I'll look up the documentation.
      • 2018-06-03 15401, 2018

      • bukwurm
        If there's an error, we can catch and log it.
      • 2018-06-03 15417, 2018

      • central3
        are the entity data structures returned by lookup & browse web service requests identical? e.g. do they have the same properties?
      • 2018-06-03 15403, 2018

      • bukwurm
        LordSputnik: Ok, gotta go for dinner now. See you on Tuesday, if it's ok with you? :)
      • 2018-06-03 15433, 2018

      • LordSputnik
        bukwurm: RMQ must return some sort of response to indicate whether the message was accepted, right? We can configure a memory limit, probably, and then pause sending messages based on the response
      • 2018-06-03 15436, 2018
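
The idea of pausing the producer based on the broker's response can be sketched as a backpressure loop. This is a simulation under stated assumptions, not the project's code and not a real RabbitMQ client: `BoundedBroker` is a hypothetical stand-in whose `publish` returns `false` when its buffer is full and which emits a `'drain'` event when space frees up, in the style of Node's writable streams. How a real broker signals memory pressure would need to be checked in the RabbitMQ documentation.

```javascript
const {EventEmitter, once} = require('events');

// Hypothetical stand-in for a broker connection with a bounded buffer.
class BoundedBroker extends EventEmitter {
  constructor(capacity) {
    super();
    this.capacity = capacity;
    this.buffer = [];
  }

  // Accepts the message. Returns false once the buffer has reached
  // capacity, signalling that the caller should wait for 'drain'.
  publish(message) {
    this.buffer.push(message);
    return this.buffer.length < this.capacity;
  }

  // Simulates the consuming side taking one message off the buffer.
  consume() {
    const message = this.buffer.shift();
    if (this.buffer.length < this.capacity) this.emit('drain');
    return message;
  }
}

// Publish messages in order, pausing whenever the broker reports it is full.
// Resolves with the number of times the producer had to pause.
async function sendAll(broker, messages) {
  let pauses = 0;
  for (const message of messages) {
    if (!broker.publish(message)) {
      pauses += 1;
      await once(broker, 'drain');
    }
  }
  return pauses;
}
```

Because the pause is driven by the broker's own signal rather than a fixed batch size, the same producer would adapt to whatever limit is configured on a given server.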

      • LordSputnik
        OK, that's fine
      • 2018-06-03 15440, 2018

      • LordSputnik
        Good meeting :)
      • 2018-06-03 15459, 2018

      • LordSputnik
        I'll start reviewing those PRs. Please could you update the -sql PR on Monday?
      • 2018-06-03 15440, 2018

      • yvanzo
        central3: They both follow the same schema, but search doesn’t support the inc parameter used to include additional properties.
      • 2018-06-03 15411, 2018

      • yvanzo
        central3: About your previous question (null != ''), see https://chatlogs.metabrainz.org/brainzbot/metabra…
      • 2018-06-03 15424, 2018

      • central3
        yvanzo: is there any documentation that compares them (across browse & lookup)? i'm going through the process manually as i build rust-lang types for everything, but it is time consuming. especially since the json properties are in random order! :)
      • 2018-06-03 15409, 2018

      • central3
        hmm i wonder if the null vs "" rule is intentionally applied throughout all the data. the barcode example makes sense, but for say the disambiguation property, it seems less clear
      • 2018-06-03 15402, 2018

      • bukwurm
        LordSputnik: Sorry, forgot about that PR. Tomorrow definitely.
      • 2018-06-03 15418, 2018

      • yvanzo
      • 2018-06-03 15439, 2018

      • yvanzo
        And for searches, there are examples for each entity type in https://musicbrainz.org/doc/Development/XML_Web_S…
      • 2018-06-03 15423, 2018

      • central3
        yvanzo: yep, this is what i'm going by
      • 2018-06-03 15434, 2018

      • yvanzo
        central3: sorry, I couldn’t find the exact doc you’re looking for :/
      • 2018-06-03 15416, 2018

      • central3
        i don't think the doc exists, it would not be of much use to most anyway
      • 2018-06-03 15458, 2018

      • central3
        i'm basically trying to identify all of the variations on say the Artist entity structure. e.g. as returned by lookup, browse, or as referenced by an ArtistCredit. there seem to be 3 separate forms for most entities (lookup, browse, and "referenced"/simple)
      • 2018-06-03 15409, 2018

      • central3
        e.g. the recordings prop exists on an Artist in a lookup, but not in browse. it would be kind of nice to have a strict doc of the structures, when going through and modelling them as struct/types
      • 2018-06-03 15441, 2018

      • central3
        but the docs as is are already quite helpful in any case
      • 2018-06-03 15403, 2018

      • yvanzo
        central3: disambiguation cannot be null.
      • 2018-06-03 15403, 2018

      • yvanzo
        WS/2 browse & lookup are definitely missing detailed examples. :(
      • 2018-06-03 15421, 2018

      • yvanzo
        central3: Recordings are only available from lookup since it is a subquery with inc parameter (which is documented), or am I missing smt?
      • 2018-06-03 15406, 2018

      • central3
        yvanzo: yeah that is correct. there is no inc value for recordings when browsing artists
      • 2018-06-03 15414, 2018

      • central3 has quit
      • 2018-06-03 15400, 2018

      • bukwurm has quit
      • 2018-06-03 15432, 2018

      • samj1912
      • 2018-06-03 15453, 2018

      • samj1912
        Might help to figure out the xml ws schema
      • 2018-06-03 15411, 2018

      • rsh7 has quit
      • 2018-06-03 15448, 2018

      • dragonzeron
        when does google summer of code start
      • 2018-06-03 15459, 2018

      • Freso
        dragonzeron: We're in the middle of it.
      • 2018-06-03 15404, 2018

      • TOPIC: MetaBrainz Community and Development channel | MusicBrainz non-development: #musicbrainz | GSoC https://goo.gl/7jsjG2 | Meeting agenda: Reviews
      • 2018-06-03 15422, 2018

      • Freso
      • 2018-06-03 15428, 2018

      • dragonzeron
        ok
      • 2018-06-03 15403, 2018

      • dragonzeron
        cool
      • 2018-06-03 15423, 2018

      • dragonzeron
        I've been staying up until 4 am to work on MusicBrainz
      • 2018-06-03 15438, 2018

      • reosarevok
        I've done that in the past, but I'd recommend not doing it too often. Sleep is important!
      • 2018-06-03 15448, 2018

      • reosarevok learned that by getting old :p
      • 2018-06-03 15423, 2018

      • SothoTalKer
        you're not even 30
      • 2018-06-03 15449, 2018

      • SothoTalKer
        reosarevok: you're a funny coordinator, very nice (:
      • 2018-06-03 15409, 2018

      • reosarevok
        I am 30!
      • 2018-06-03 15426, 2018

      • Freso
        Speaking of going to sleep…
      • 2018-06-03 15435, 2018

      • yvanzo
        reosarevok: old man
      • 2018-06-03 15420, 2018

      • SothoTalKer
        reosarevok: barely 30.
      • 2018-06-03 15453, 2018

      • SothoTalKer
        Birthday: May 12, 1988
      • 2018-06-03 15453, 2018

      • dragonzeron
        I am 18
      • 2018-06-03 15416, 2018

      • dragonzeron
        I would be going to the club but since I don't have someone to take me yet I am at home being productive
      • 2018-06-03 15426, 2018

      • yvanzo
        GeneralDiscourse: PG10 is not supported for now, but we are definitely going to move to it.
      • 2018-06-03 15435, 2018

      • rsh7 joined the channel
      • 2018-06-03 15433, 2018

      • central3 joined the channel
      • 2018-06-03 15418, 2018

      • Gazooo has quit
      • 2018-06-03 15428, 2018

      • Gazooo joined the channel
      • 2018-06-03 15411, 2018

      • rsh7 has quit
      • 2018-06-03 15448, 2018

      • D4RK-PH0ENiX has quit
      • 2018-06-03 15416, 2018

      • D4RK-PH0ENiX joined the channel
      • 2018-06-03 15418, 2018

      • D4RK-PH0ENiX has quit
      • 2018-06-03 15403, 2018

      • GeneralDiscourse
        yvanzo: I'm going to try to use in-tree Perl modules, writing an ebuild for the deps. That's solved a few of the errors. The shell rc file was amended and sourced initially, too. I'll keep you posted.
      • 2018-06-03 15411, 2018

      • flormynight joined the channel