yes, that happens currently with the PG setup, but that is not the behaviour with the proposed couchdb setup.
alastairp
mm, now I'm a bit confused as to how this works with the couchdb setup
lucifer
artists_this_month_Jan - [A's stats, B's stats]
A hasn't submitted in Feb, spark only generates B's stats as there are no listens for A
artists_this_month_Feb - [B's stats]
once Feb database is finished inserting, end message comes from spark and deletes Jan's database.
alastairp
ah. there are only 2 databases present while the insert is happening into the new one?
lucifer
so when we query for A now in Feb, we get no stats.
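to make that concrete, the lookup only ever hits the current period's database, roughly like this (just a sketch against couchdb's HTTP API, the names and URL are made up):

    import requests

    COUCHDB_URL = "http://localhost:5984"  # assumed local couchdb

    def get_stats(user_id: int, database: str):
        """Fetch a user's stats document from the given per-period database."""
        resp = requests.get(f"{COUCHDB_URL}/{database}/{user_id}")
        if resp.status_code == 404:
            return None  # e.g. user A in artists_this_month_Feb
        resp.raise_for_status()
        return resp.json()

so once artists_this_month_Jan is deleted, a lookup for A against artists_this_month_Feb returns nothing.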
alastairp
and then the old one is deleted?
lucifer
yes right
alastairp
I understood that there were always 2
ok, now that makes sense, thanks
lucifer
ah sorry.
alastairp
np
ok, dumps?
lucifer
sure
so we export stats biweekly as part of full dumps.
we'll have to add a new json dump for stats since PG dumps don't fit this.
the issue i am currently stuck on is how to coordinate the dump process with the insert process.
alastairp
so that we don't dump a half-inserted table?
lucifer
yes, that's 1 possibility. the other is that we're dumping the full database but the end message arrives and deletes it while we haven't finished exporting.
alastairp
mmm, right
lucifer
first option i have in mind is this:
retrieve all user ids from the databases. look up each user's stats using the usual lookup: some stats come from today's database, others from yesterday's database.
this works, but a possible drawback is that it may be slower, and the dump might have stats for users from different days.
one user's "this week" stat is actually from this week, but another's is from last week because their stat for this week hasn't been generated yet.
but again, this is the status quo, so i am only concerned about the speed issue.
note that there are 2 dbs here because i am thinking of the bad case where the export and insert times conflict.
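something along these lines (rough sketch, db names and shapes are illustrative; the real dump would page through _all_docs in batches):

    import json
    import requests

    COUCHDB_URL = "http://localhost:5984"

    def dump_stat(today_db: str, yesterday_db: str, out):
        # collect every user id that has a document in either database
        user_ids = set()
        for db in (today_db, yesterday_db):
            resp = requests.get(f"{COUCHDB_URL}/{db}/_all_docs")
            resp.raise_for_status()
            user_ids.update(row["id"] for row in resp.json()["rows"])

        # prefer today's stats, fall back to yesterday's (same as the normal lookup)
        for user_id in sorted(user_ids):
            for db in (today_db, yesterday_db):
                resp = requests.get(f"{COUCHDB_URL}/{db}/{user_id}")
                if resp.status_code == 200:
                    out.write(json.dumps(resp.json()) + "\n")
                    break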
alastairp
yes, right
you know, I'm thinking again that both of these problems (insertion and dumps) can probably be solved cleanly by putting the current db name into postgres...
it seems like we're creating all sorts of workarounds because we decided that we didn't want to do it
lucifer
yes, but there's another issue which isn't solved by putting the database name in postgres
alastairp
well, two out of three aint bad
lucifer
say we stored yesterday's database name in postgres, and a new database is created for today. database insertion completes, then it goes on to delete yesterday's database.
we aren't done exporting yesterday's database but it went away.
we could probably build in a retry and start exporting the new one again, and this time we know it won't error because there's no way it takes more than 1 day to export
alastairp
yeah, we could also have a flag which says "this db is being dumped", and if so don't delete it
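something like this, assuming a hypothetical couchdb_databases table in postgres with a dumping flag (table and column names are made up):

    import psycopg2
    import requests

    COUCHDB_URL = "http://localhost:5984"

    def delete_database_unless_dumping(conn, database: str) -> bool:
        """Delete an old couchdb database unless a dump has flagged it as in use."""
        with conn.cursor() as cur:
            cur.execute(
                "SELECT dumping FROM couchdb_databases WHERE database_name = %s",
                (database,),
            )
            row = cur.fetchone()
            if row and row[0]:
                # a dump is still reading this database, leave it for a later cleanup
                return False
        requests.delete(f"{COUCHDB_URL}/{database}")
        return True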
will you do the dump by retrieving batches, or will you get it all at once?
lucifer
alternatively we could do a SELECT FOR UPDATE lock which blocks the insert process from updating the database's name in PG
batches.
alastairp
will the spark writer add multiple types of stats in a single run?
lucifer
umm, actually the lock may not work, but yes, the flag sounds good.
It's always 1) Start message 2) Stats for a particular type and range 3) End message
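roughly this shape, per stat type and range (message fields and type names here are illustrative, not the actual payloads):

    import requests

    COUCHDB_URL = "http://localhost:5984"

    def handle_message(message: dict):
        database = message["database"]  # e.g. "artists_this_month_20220702"
        if message["type"] == "start":
            # create today's database for this stat
            requests.put(f"{COUCHDB_URL}/{database}")
        elif message["type"] == "data":
            # insert a batch of per-user stat documents
            requests.post(f"{COUCHDB_URL}/{database}/_bulk_docs",
                          json={"docs": message["data"]})
        elif message["type"] == "end":
            # insertion finished: this is where older databases for the stat
            # would be deleted (or skipped, per the discussion below)
            pass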
alastairp
not sure if we want it blocking after creating 1 type, and then having to wait for dumps to finish before it progresses onto the next
ah, so in the case of a block, it'd happen only in response to the end message for a particular type?
lucifer
yes
alastairp
and we'd also have to decide whose responsibility it is to delete the old database if we find ourselves in this situation
lucifer
yes, we have the option to 1) block the spark reader till the export for that stat is done, then it can delete 2) leave the database as it is; it gets deleted the next day 3) add a cron job.
i like 2 most fwiw.
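with 2, the end message handler just deletes every older database for the stat, so anything skipped yesterday gets picked up today (prefix handling here is illustrative):

    import requests

    COUCHDB_URL = "http://localhost:5984"

    def delete_older_databases(current_db: str):
        # e.g. "artists_this_month_20220702" -> prefix "artists_this_month"
        prefix = current_db.rsplit("_", 1)[0]
        all_dbs = requests.get(f"{COUCHDB_URL}/_all_dbs").json()
        for db in all_dbs:
            if db.startswith(prefix + "_") and db < current_db:
                requests.delete(f"{COUCHDB_URL}/{db}")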
alastairp
yeah, 2 sounds good
lucifer
there's another way to do this if we want to avoid storing state in PG: write a file named, say, LOCKED to the database. the spark reader checks for this file in the couchdb database before deleting; if it's there, it moves on and the cleanup happens the next day.
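i.e. the dump writes a sentinel document into the database and the deleter skips databases that still have it, something like this (the doc id "LOCKED" and the rest are assumptions):

    import requests

    COUCHDB_URL = "http://localhost:5984"

    def lock_for_dump(database: str):
        # the dump process marks the database before it starts exporting
        requests.put(f"{COUCHDB_URL}/{database}/LOCKED", json={})

    def unlock_after_dump(database: str):
        doc = requests.get(f"{COUCHDB_URL}/{database}/LOCKED")
        if doc.status_code == 200:
            requests.delete(f"{COUCHDB_URL}/{database}/LOCKED",
                            params={"rev": doc.json()["_rev"]})

    def safe_to_delete(database: str) -> bool:
        # HEAD returns 200 if the LOCKED document exists, 404 otherwise
        return requests.head(f"{COUCHDB_URL}/{database}/LOCKED").status_code == 404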
alastairp
if we don't want to use postgres, that also sounds fine
lucifer
this one is easier to implement currently, so I'll try this out first; if i am stuck then i will try out the PG impl.
thanks!
alastairp
no problem. looking forward to seeing how it turns out
monkey: thanks for the clarification. I'll see what other pattern can be used to match the consistency of the existing code.
Some context: I am working on implementing filters for my SoC project, which includes some array manipulation in a `Filters` component and sharing it with the `ReleaseCard` component. I think I will need some data sharing when I work on a `Timeline` component later.
I am kind of stuck on implementing filters and I should have asked for help before instead of mindlessly trying out things until something works.
[musicbrainz-server] 14reosarevok opened pull request #2594 (03master…MBS-12515): MBS-12515: Check _gid_redirect table exists before trying to use it https://github.com/metabrainz/musicbrainz-serve...
[musicbrainz-server] 14reosarevok merged pull request #2594 (03master…MBS-12515): MBS-12515: Check _gid_redirect table exists before trying to use it https://github.com/metabrainz/musicbrainz-serve...