Everything related to the infrastructure works. Producers and consumers run with multiprocessing and their associated async tasks.
2018-06-03 15409, 2018
bukwurm
I have added a detailed readme to the module wherever I could, so that it becomes easy to understand.
2018-06-03 15421, 2018
bukwurm
I have added JSDoc everywhere possible.
2018-06-03 15442, 2018
bukwurm
Flow type checking is a future goal.
2018-06-03 15409, 2018
bukwurm
Roughly, this happens:
2018-06-03 15447, 2018
bukwurm
The producer master process kicks off and receives its arguments (file names to be read and processed).
2018-06-03 15418, 2018
bukwurm
It divides the args into chunks equal to the number of workers.
2018-06-03 15431, 2018
bukwurm
Forks a worker
2018-06-03 15459, 2018
bukwurm
The worker initialises a connection with RMQ (each process must have its own single connection).
2018-06-03 15441, 2018
bukwurm
*forks the configured number of workers
2018-06-03 15455, 2018
Dr-Flay joined the channel
2018-06-03 15436, 2018
bukwurm
Each worker runs a configured number of instanceFunctions (functions which each process one file)
2018-06-03 15404, 2018
bukwurm
The instanceFunctions do all the processing and return a promise containing the results
2018-06-03 15450, 2018
bukwurm
Upon completion of processing of all the args, the worker aggregates the results from each instanceFunction and sends the result to the master process. It is then terminated.
2018-06-03 15417, 2018
bukwurm
The master in turn aggregates all the results from each worker process and then is terminated.
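A minimal sketch of that master/worker flow with Node's cluster module; chunkify, NUM_WORKERS and the stubbed instanceFunction are made-up names here, and the real asyncCluster code will differ in the details:

    const cluster = require('cluster');

    const NUM_WORKERS = 4;  // would come from the config in the real module

    // Stub: processes one file and resolves with its result.
    const instanceFunction = async file => ({file, ok: true});

    // Split the file list into one chunk per worker.
    function chunkify(args, count) {
        const chunks = Array.from({length: count}, () => []);
        args.forEach((arg, i) => chunks[i % count].push(arg));
        return chunks;
    }

    if (cluster.isMaster) {
        const chunks = chunkify(process.argv.slice(2), NUM_WORKERS);
        const results = [];
        let remaining = chunks.length;

        chunks.forEach(chunk => {
            const worker = cluster.fork();
            worker.send({files: chunk});
            // Collect the aggregated result each worker sends back.
            worker.on('message', msg => results.push(msg.results));
        });

        cluster.on('exit', () => {
            remaining -= 1;
            if (remaining === 0) {
                // The master aggregates everything and then terminates.
                console.log('All workers done:', results);
            }
        });
    } else {
        process.on('message', async ({files}) => {
            // Each worker would open its own single RMQ connection here,
            // run one instanceFunction per file, then report back and exit.
            const results = await Promise.all(files.map(instanceFunction));
            process.send({results});
            process.exit(0);
        });
    }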
2018-06-03 15428, 2018
LordSputnik
OK, and you've finalized the formats for the data stored in the queue?
2018-06-03 15456, 2018
bukwurm
LordSputnik: That's WIP. It's specific to OL records, and I am working on it.
2018-06-03 15415, 2018
bukwurm
The thing is, say we now want to add an import for the LoC dumps.
2018-06-03 15439, 2018
bukwurm
All we need is to write an instance function for a single dump
2018-06-03 15443, 2018
bukwurm
And that's it.
2018-06-03 15405, 2018
bukwurm
Or for any other data source for that matter.
2018-06-03 15427, 2018
bukwurm
The asyncCluster module would automatically make it run in parallel.
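As an illustration, a hypothetical instance function for such a dump, assuming one JSON record per line; the real one would map each record to the common import format and hand it to the queue:

    const fs = require('fs');
    const readline = require('readline');

    // Hypothetical instance function for a new data source: it gets one dump
    // file, does all the per-file processing, and resolves with a summary.
    function locInstanceFunction(dumpFile) {
        return new Promise((resolve, reject) => {
            const input = fs.createReadStream(dumpFile);
            input.on('error', reject);

            const lines = readline.createInterface({input});
            let processed = 0;

            lines.on('line', line => {
                const record = JSON.parse(line);
                // ...map `record` to the common import format and queue it...
                processed += 1;
            });
            lines.on('close', () => resolve({dumpFile, processed}));
        });
    }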
2018-06-03 15454, 2018
LordSputnik
OK, that's good
2018-06-03 15420, 2018
LordSputnik
I'll probably be able to comment a lot more once I've been through the PRs
2018-06-03 15436, 2018
bukwurm
Something similar happens with the consumer process, just that it runs forever (it's supposed to listen to the queue continuously)
2018-06-03 15448, 2018
bukwurm
It only gets terminated using Ctrl+C
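Roughly what that looks like with amqplib; the URL and queue name are placeholders, and the real consumer wires the handler to bb-data instead of a bare ack:

    const amqp = require('amqplib');

    async function runConsumer() {
        const connection = await amqp.connect('amqp://localhost');
        const channel = await connection.createChannel();
        await channel.assertQueue('import', {durable: true});

        channel.consume('import', msg => {
            // ...process the record, then acknowledge it...
            channel.ack(msg);
        });

        // The consumer listens forever; Ctrl+C is the only way out.
        process.on('SIGINT', async () => {
            await channel.close();
            await connection.close();
            process.exit(0);
        });
    }

    runConsumer();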
2018-06-03 15427, 2018
LordSputnik
Did you use winston in the end?
2018-06-03 15441, 2018
bukwurm
LordSputnik: Yes I settled on it.
2018-06-03 15450, 2018
LordSputnik
It looks fine to me, so good choice
2018-06-03 15406, 2018
LordSputnik
I will probably migrate the rest of the code to use it
2018-06-03 15414, 2018
bukwurm
Even if we don't use transports, it's still vastly superior to the alternatives.
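For reference, a minimal winston 3 style setup of the kind being discussed; the transport and the logged fields are just examples, not the final config:

    const winston = require('winston');

    const logger = winston.createLogger({
        level: 'info',
        format: winston.format.simple(),
        transports: [new winston.transports.Console()]
    });

    // Structured metadata travels with the message, whatever transport is used.
    logger.info('Processed file', {file: 'ol_dump_works.txt', records: 1000});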
2018-06-03 15419, 2018
bukwurm
LordSputnik: Great!
2018-06-03 15443, 2018
bukwurm
That's it for now.
2018-06-03 15449, 2018
LordSputnik
So do the import objects actually get created in the database now?
2018-06-03 15428, 2018
bukwurm
LordSputnik: No. That would require validators to be written, and the bb-data modules to be finalised.
2018-06-03 15441, 2018
bukwurm
Right now I am just logging the output.
2018-06-03 15428, 2018
bukwurm
I will try to finalise bb-data modules within two days.
2018-06-03 15448, 2018
LordSputnik
OK, so what's the next few steps in the plan?
2018-06-03 15459, 2018
LordSputnik
What will you be working on next week?
2018-06-03 15405, 2018
bukwurm
1. Finalise bb-data modules
2018-06-03 15426, 2018
bukwurm
2. Design data object to be produced by the producers for the OL dumps
2018-06-03 15442, 2018
bukwurm
3. Write validators for the incoming data
2018-06-03 15459, 2018
bukwurm
4. Connect bb-data modules and the consumer process
2018-06-03 15421, 2018
bukwurm
I will ignore batch processing for now and look to implement a one-record import plan
2018-06-03 15446, 2018
bukwurm
Maybe next week we can work on optimising it
2018-06-03 15446, 2018
LordSputnik
OK, that sounds like a good set of goals
2018-06-03 15407, 2018
LordSputnik
With that in place you should have everything in the pipeline set up to do a basic import of the OL dumps
2018-06-03 15410, 2018
LordSputnik
Right?
2018-06-03 15418, 2018
bukwurm
LordSputnik: Yeah
2018-06-03 15427, 2018
bukwurm
That would mean full import completion
2018-06-03 15414, 2018
LordSputnik
So with those 4 objectives, maybe we can aim to get the whole set of OL dumps imported or starting to import by next meeting, to test this out?
2018-06-03 15415, 2018
bukwurm
I was also thinking of limiting the number of messages passed off to consumers if the number of unacknowledged messages exceeds a limit
2018-06-03 15444, 2018
LordSputnik
Limit the number of messages in the queue at any time?
2018-06-03 15408, 2018
bukwurm
LordSputnik: No, the queue is durable and persistent and can handle large amounts of data
2018-06-03 15430, 2018
bukwurm
However, if a consumer is connected to the queue
2018-06-03 15443, 2018
bukwurm
It will immediately receive any new message
2018-06-03 15458, 2018
bukwurm
This can potentially lead to buffer overflow
2018-06-03 15424, 2018
bukwurm
If the consumer is slow to deal with older messages and new ones come in real quick
2018-06-03 15427, 2018
bukwurm
One technique provided is to limit the number of messages RMQ sends to the consumer when it has more than a given number of previously delivered messages still unacknowledged.
2018-06-03 15401, 2018
bukwurm
That way the queue will hold the messages till the consumer is ready to deal with them
2018-06-03 15419, 2018
bukwurm
It's a single option in the config, no biggie.
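In amqplib terms that option corresponds to the channel prefetch count; 50 below is a made-up value:

    // Apply the per-consumer limit before starting to consume: RMQ will stop
    // delivering once `count` messages are outstanding (sent but not acked).
    async function applyPrefetch(channel, count = 50) {
        await channel.prefetch(count);
    }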
2018-06-03 15435, 2018
bukwurm
> maybe we can aim to get the whole set of OL dumps imported or starting to import by next meeting, to test this out?
2018-06-03 15440, 2018
bukwurm
Definitely!
2018-06-03 15412, 2018
LordSputnik
OK, but there has to be a limit on the queue too, because we'll have limited memory. Is there a fixed amount of memory RMQ will use? What does the producer do if RMQ rejects the message?
2018-06-03 15453, 2018
bukwurm
I have run the producers on a 500 MB data dump split three ways, and it takes roughly 40 seconds on my laptop with a whole lot of other stuff running.
2018-06-03 15425, 2018
bukwurm
LordSputnik: The producer will crash if the RMQ connection is refused.
2018-06-03 15408, 2018
bukwurm
I think we can limit the memory used by RMQ, but a better way would be
2018-06-03 15426, 2018
bukwurm
To feed it only as much of the dump as we are sure it can handle wholly at a time
2018-06-03 15400, 2018
bukwurm
And repeat the importing process, chunk by chunk, till we finish the entire import.
2018-06-03 15429, 2018
LordSputnik
OK, it would be good to have that limit to be dynamic, with RMQ and the producer talking to determine whether too much memory is in use
2018-06-03 15439, 2018
LordSputnik
That way we can easily port between servers
2018-06-03 15451, 2018
central3 joined the channel
2018-06-03 15455, 2018
bukwurm
I am not sure how much disk memory we have at hand. Any ballpark?
2018-06-03 15422, 2018
LordSputnik
Hard disk space?
2018-06-03 15431, 2018
bukwurm
Yeah
2018-06-03 15441, 2018
LordSputnik
I don't actually know that
2018-06-03 15455, 2018
LordSputnik
ruaok or zas would probably know or be able to find out from the Google dashboard
2018-06-03 15455, 2018
bukwurm
2.5+ GB of authors, 500+ MB of works
2018-06-03 15406, 2018
bukwurm
But 27+ GB of editions
2018-06-03 15408, 2018
bukwurm
LordSputnik: I'll see how RMQ can be configured to limit its memory.
2018-06-03 15441, 2018
bukwurm
Although I am unsure if the producer can be made to 'talk' with the RMQ. I'll look up the documentation.
2018-06-03 15401, 2018
bukwurm
If there's an error, we can catch and log it.
2018-06-03 15417, 2018
central3
are the entity data structures returned by lookup & browse web service requests identical? e.g. do they have the same properties?
2018-06-03 15403, 2018
bukwurm
LordSputnik: Ok, gotta go for dinner now. See you on Tuesday, if it's ok with you? :)
2018-06-03 15433, 2018
LordSputnik
bukwurm: RMQ must return some sort of response to indicate whether the message was accepted, right? We can configure a memory limit, probably, and then pause sending messages based on the response
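One way that response-based back-pressure could look with an amqplib confirm channel; the URL, queue name and payload are placeholders, and the connection would normally be reused rather than opened per message:

    const amqp = require('amqplib');

    async function publishWithBackPressure(record) {
        const connection = await amqp.connect('amqp://localhost');
        const channel = await connection.createConfirmChannel();
        await channel.assertQueue('import', {durable: true});

        const ok = channel.sendToQueue(
            'import',
            Buffer.from(JSON.stringify(record)),
            {persistent: true}
        );
        if (!ok) {
            // The channel's write buffer is full; pause until it drains.
            await new Promise(resolve => channel.once('drain', resolve));
        }

        // Resolves once the broker confirms the message, rejects if it nacks,
        // so the producer knows whether the message was actually accepted.
        await channel.waitForConfirms();

        await connection.close();
    }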
2018-06-03 15436, 2018
LordSputnik
OK, that's fine
2018-06-03 15440, 2018
LordSputnik
Good meeting :)
2018-06-03 15459, 2018
LordSputnik
I'll start reviewing those PRs. Please could you update the -sql PR on Monday?
2018-06-03 15440, 2018
yvanzo
central3: They both follow the same schema, but search doesn’t support the inc parameter used to include additional properties.
yvanzo: is there any documentation that compares them (across browse & lookup)? i'm going through the process manually as i build rust-lang types for everything, but it is time consuming. especially since the json properties are in random order! :)
2018-06-03 15409, 2018
central3
hmm i wonder if the null vs "" rule is intentionally applied throughout all the data. the barcode example makes sense, but for say the disambiguation property, it seems less clear
2018-06-03 15402, 2018
bukwurm
LordSputnik: Sorry, forgot about that PR. Tomorrow definitely.
central3: sorry, I couldn’t find the exact doc you’re looking for :/
2018-06-03 15416, 2018
central3
i don't think the doc exists, it would not be of much use to most anyway
2018-06-03 15458, 2018
central3
i'm basically trying to identify all of the variations on say the Artist entity structure. e.g. as returned by lookup, browse, or as referenced by an ArtistCredit. there seem to be 3 separate forms for most entities (lookup, browse, and "referenced"/simple)
2018-06-03 15409, 2018
central3
e.g. the recordings prop exists on an Artist in a lookup, but not in browse. it would be kind of nice to have a strict doc of the structures, when going through and modelling them as struct/types
2018-06-03 15441, 2018
central3
but the docs as is are already quite helpful in any case
2018-06-03 15403, 2018
yvanzo
central3: disambiguation cannot be null.
2018-06-03 15403, 2018
yvanzo
WS/2 browse & lookup are definitely missing detailed examples. :(
2018-06-03 15421, 2018
yvanzo
central3: Recordings are only available from lookup since it is a subquery with inc parameter (which is documented), or am I missing smt?
2018-06-03 15406, 2018
central3
yvanzo: yeah that is correct. there is no inc value for recordings when browsing artists
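For the record, the difference in plain WS/2 URLs (the MBIDs below are placeholders):

    const mbid = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx';

    // Lookup: inc subqueries are allowed, so recordings can be pulled in.
    const lookupUrl =
        `https://musicbrainz.org/ws/2/artist/${mbid}?inc=recordings&fmt=json`;

    // Browse: lists artists linked to another entity; no inc=recordings here.
    const browseUrl =
        `https://musicbrainz.org/ws/2/artist?area=${mbid}&fmt=json`;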
I've been staying up to 4 am to work on MusicBrainz
2018-06-03 15438, 2018
reosarevok
I've done that in the past, but I'd recommend not doing it too often. Sleep is important!
2018-06-03 15448, 2018
reosarevok learned that by getting old :p
2018-06-03 15423, 2018
SothoTalKer
you're not even 30
2018-06-03 15449, 2018
SothoTalKer
reosarevok: you're a funny coordinator, very nice (:
2018-06-03 15409, 2018
reosarevok
I am 30!
2018-06-03 15426, 2018
Freso
Speaking of going to sleep…
2018-06-03 15435, 2018
yvanzo
reosarevok: old man
2018-06-03 15420, 2018
SothoTalKer
reosarevok: barely 30.
2018-06-03 15453, 2018
SothoTalKer
Birthday: May 12, 1988
2018-06-03 15453, 2018
dragonzeron
I am 18
2018-06-03 15416, 2018
dragonzeron
I would be going to the club, but since I don't have someone to take me yet, I am at home being productive
2018-06-03 15426, 2018
yvanzo
GeneralDiscourse: PG10 is not supported for now, but we are definitely going to move to it.
2018-06-03 15435, 2018
rsh7 joined the channel
2018-06-03 15433, 2018
central3 joined the channel
2018-06-03 15418, 2018
Gazooo has quit
2018-06-03 15428, 2018
Gazooo joined the channel
2018-06-03 15411, 2018
rsh7 has quit
2018-06-03 15448, 2018
D4RK-PH0ENiX has quit
2018-06-03 15416, 2018
D4RK-PH0ENiX joined the channel
2018-06-03 15418, 2018
D4RK-PH0ENiX has quit
2018-06-03 15403, 2018
GeneralDiscourse
yvanzo: I'm going to try and use in-tree Perl modules, writing an ebuild for the deps. That's solved a few of the errors. The shell rc file was amended and sourced initially, too. I'll keep you posted.