mmm. that's an interesting question which is kinda difficult to answer
2022-07-22 20357, 2022
Pratha-Fish
alastairp: That one's in the making to :)
2022-07-22 20301, 2022
Pratha-Fish
*too
2022-07-22 20313, 2022
alastairp
in fact, if you have a process that has a file open (e.g. you did `fp = open("/path/to/file")`) and you still haven't closed it
2022-07-22 20316, 2022
alastairp
then it's actually not deleted
2022-07-22 20326, 2022
alastairp
linux only deletes a file once all open file handles are closed
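A minimal sketch of the behaviour described above, assuming a POSIX system. The helper name `read_after_unlink` is made up for illustration. It deletes a temporary file while a handle is still open and then reads the data back through that handle.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Delete a file while it is still open, then read it through the handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as fp:
        fp.write(payload)
        fp.flush()
        os.unlink(path)                      # the name disappears from the directory
        still_listed = os.path.exists(path)  # the path no longer resolves
        fp.seek(0)
        data = fp.read()                     # the open handle still reaches the data
    # Only now, with the last handle closed, can the kernel free the space.
    return still_listed, data


still_listed, data = read_after_unlink(b"not gone yet")
print(still_listed, data)
```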
2022-07-22 20334, 2022
mayhem
alastairp: gràcies!
2022-07-22 20348, 2022
alastairp
so... maybe you have it open somewhere? there's a way of poking about to find a reference to it
2022-07-22 20309, 2022
alastairp
I've definitely used this to recover files before
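One way of "poking about" on Linux is the `/proc/<pid>/fd/` directory, which exposes every open handle of a process. The sketch below is an illustration, not necessarily the method used here. It works on the current process only; for another process you would substitute its PID and need permission to read its `/proc` entries. The function name and file names are hypothetical.

```python
import os
import shutil
import tempfile


def recover_open_file(fp, dest_path):
    """Copy the contents behind an open handle to dest_path via /proc (Linux only)."""
    proc_path = f"/proc/{os.getpid()}/fd/{fp.fileno()}"
    with open(proc_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


recovered = None
if os.path.isdir("/proc/self/fd"):  # skip quietly on systems without /proc
    work_dir = tempfile.mkdtemp()
    victim = os.path.join(work_dir, "notes.txt")
    fp = open(victim, "w+b")
    fp.write(b"important data")
    fp.flush()
    os.unlink(victim)               # "deleted", but fp is still open
    rescue = os.path.join(work_dir, "rescued.txt")
    recover_open_file(fp, rescue)
    fp.close()
    with open(rescue, "rb") as f:
        recovered = f.read()
```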
2022-07-22 20318, 2022
Pratha-Fish
alastairp: Apparently VScode killed all instances before deleting it ⚰️
2022-07-22 20342, 2022
alastairp
it's also why (for example) if you delete all of your apache log files because you're running out of disk space, but don't restart apache, you find that your disk space hasn't actually been freed up
2022-07-22 20349, 2022
alastairp
(this one is from personal experience)
2022-07-22 20340, 2022
alastairp
and it's why apache has a mode where you can send it a signal (SIGHUP) and it'll close and re-open all of its file handles, so when you use a tool such as logrotate to manage your logfiles you can tell apache to give up the filehandle so that you can delete them
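A toy sketch of the close-and-reopen idea, assuming a POSIX system. This is not how Apache is implemented; `ReopenableLog` and the file names are hypothetical. The demo calls `reopen()` directly, which is what the SIGHUP handler would do.

```python
import os
import signal
import tempfile


class ReopenableLog:
    """A log writer that can drop its file handle and open the path again."""

    def __init__(self, path):
        self.path = path
        self.fp = open(path, "a")

    def write(self, line):
        self.fp.write(line + "\n")
        self.fp.flush()

    def reopen(self, signum=None, frame=None):
        # Usable both directly and as a signal handler.
        self.fp.close()
        self.fp = open(self.path, "a")


log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "access.log")
log = ReopenableLog(log_path)

if hasattr(signal, "SIGHUP"):
    try:
        signal.signal(signal.SIGHUP, log.reopen)
    except ValueError:
        pass  # handlers can only be installed from the main thread

log.write("before rotation")
os.rename(log_path, log_path + ".1")     # what a rotation tool does
log.write("still going to the old file")  # the handle follows the renamed file
log.reopen()                              # what the SIGHUP handler does
log.write("after reopen")
log.fp.close()

with open(log_path + ".1") as f:
    rotated = f.read().splitlines()
with open(log_path) as f:
    fresh = f.read().splitlines()
```

After the reopen, nothing holds the rotated file open any more, so deleting it actually frees the disk space.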
2022-07-22 20303, 2022
Pratha-Fish
Wow that one's interesting
2022-07-22 20349, 2022
Pratha-Fish
By Apache do you mean Apache web server or a general trait amongst Apache products (spark, etc?)
2022-07-22 20321, 2022
skelly37
outsidecontext, zas, rdswift: Not-proper exit finally solved with just one line!
yvanzo, bitmap: CASCADE seems like the least bad option to me, but that's a schema change - all good fixes seem like schema changes unless I'm missing something
2022-07-22 20328, 2022
reosarevok is also old, apparently
2022-07-22 20325, 2022
alastairp
reosarevok: it happens to the best of us
2022-07-22 20332, 2022
akshaaatt
Hi yellowhatpro! It’s fine that you thought about the animations, but as you said, we shiuld proceed with the mockup for now.
2022-07-22 20339, 2022
akshaaatt
Should*
2022-07-22 20335, 2022
akshaaatt
Hi aerozol! That sounds nice but we need effort in getting the profile page first. The listening now feature sounds cool, but don’t know how complex it would be to implement.
2022-07-22 20302, 2022
Lotheric has quit
2022-07-22 20343, 2022
Lotheric joined the channel
2022-07-22 20346, 2022
Lotheric has quit
2022-07-22 20341, 2022
Lotheric joined the channel
2022-07-22 20304, 2022
skelly37
zas, outsidecontext: I've changed my recent PR a bit, made it as little "dirty" as possible by storing info about unexpected removal in Pipe and then checking it after we capture the tagger.run()'s exit code to determine if we can exit nicely.
2022-07-22 20341, 2022
skelly37 has quit
2022-07-22 20304, 2022
Hotrod2k joined the channel
2022-07-22 20324, 2022
Hotrod2k
Hello Looking for an irc channel to download books?
2022-07-22 20324, 2022
Hotrod2k
#bookbrainz
2022-07-22 20311, 2022
alastairp
Hotrod2k: no
2022-07-22 20332, 2022
Hotrod2k has quit
2022-07-22 20331, 2022
yuzie has quit
2022-07-22 20312, 2022
lucifer
mayhem: 👍
2022-07-22 20324, 2022
lucifer
alastairp: hi! around?
2022-07-22 20347, 2022
alastairp
lucifer: I am
2022-07-22 20309, 2022
lucifer
alastairp: brainstorming about couchdb dumps and wanted to discuss the same.
2022-07-22 20313, 2022
alastairp
sure
2022-07-22 20344, 2022
lucifer
we generate stats from spark daily. once the stats have been generated we send a start message to LB, the reader creates a database to store the stat named as {stat_type}_{range}_YYYYMMDD (usually today's date).
2022-07-22 20316, 2022
lucifer
then we send all the stats and insert them in this database. finally we send an end message to mark the end of stats insertion.
2022-07-22 20337, 2022
lucifer
LB side goes on to delete older database of that particular stat and range.
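The start / insert / end cycle described above can be sketched with an in-memory stand-in for the CouchDB server. `FakeStatsStore` and its method names are hypothetical; only the `{stat_type}_{range}_YYYYMMDD` naming comes from the discussion.

```python
from datetime import date


def database_name(stat_type, stat_range, day):
    """Build the per-run database name, e.g. artists_week_20220722."""
    return f"{stat_type}_{stat_range}_{day:%Y%m%d}"


class FakeStatsStore:
    """In-memory stand-in: database name -> {user_id: stat document}."""

    def __init__(self):
        self.databases = {}

    def handle_start(self, stat_type, stat_range, day):
        name = database_name(stat_type, stat_range, day)
        self.databases.setdefault(name, {})
        return name

    def handle_data(self, name, docs):
        for doc in docs:
            self.databases[name][doc["user_id"]] = doc

    def handle_end(self, stat_type, stat_range, day):
        # Drop every older database for the same stat type and range.
        current = database_name(stat_type, stat_range, day)
        prefix = f"{stat_type}_{stat_range}_"
        for name in list(self.databases):
            if name.startswith(prefix) and name < current:
                del self.databases[name]


store = FakeStatsStore()
first = store.handle_start("artists", "week", date(2022, 7, 22))
store.handle_data(first, [{"user_id": 1, "top": "old"}])
store.handle_end("artists", "week", date(2022, 7, 22))

second = store.handle_start("artists", "week", date(2022, 7, 23))
during_insert = sorted(store.databases)   # both databases exist while inserting
store.handle_data(second, [{"user_id": 1, "top": "new"}])
store.handle_end("artists", "week", date(2022, 7, 23))
after_end = sorted(store.databases)       # only the newest one survives
```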
2022-07-22 20343, 2022
alastairp
the spark reader does this?
2022-07-22 20346, 2022
lucifer
yes
2022-07-22 20353, 2022
alastairp
ok
2022-07-22 20303, 2022
lucifer
we do not have a way to mark databases as ready, so when the user wants to view a stat we query the databases whose names start with {stat_type}_{range}, in descending order of the date suffix.
2022-07-22 20321, 2022
alastairp
oh, interesting
2022-07-22 20331, 2022
lucifer
say we have artists_week_20220722 and artists_week_20220723
2022-07-22 20348, 2022
alastairp
can you rename databases?
2022-07-22 20350, 2022
lucifer
check in the 23 one, if user's data not found check in 22, still not found then 204.
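A sketch of that newest-to-oldest lookup, using plain dicts in place of CouchDB databases. The function name `fetch_stat` and the document shape are assumptions for illustration.

```python
def fetch_stat(databases, stat_type, stat_range, user_id):
    """Try the newest database first, fall back to older ones, else 204."""
    prefix = f"{stat_type}_{stat_range}_"
    # The YYYYMMDD suffix sorts correctly as a string, so reverse order is newest first.
    candidates = sorted((n for n in databases if n.startswith(prefix)), reverse=True)
    for name in candidates:
        doc = databases[name].get(user_id)
        if doc is not None:
            return 200, doc
    return 204, None


databases = {
    "artists_week_20220722": {1: {"top": "old"}, 2: {"top": "old"}},
    "artists_week_20220723": {1: {"top": "new"}},  # user 2 not inserted yet
}
found_new = fetch_stat(databases, "artists", "week", 1)
found_old = fetch_stat(databases, "artists", "week", 2)
missing = fetch_stat(databases, "artists", "week", 3)
```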
2022-07-22 20313, 2022
lucifer
no. (future versions of couchdb may support it though)
2022-07-22 20321, 2022
alastairp
boo
2022-07-22 20352, 2022
alastairp
what's a "database" in couchdb?
2022-07-22 20358, 2022
alastairp
does it match a pg database, or a pg table?
2022-07-22 20341, 2022
lucifer
i'd consider it a table. with each row as 1 document.
most of the relevant code to interact with couchdb is located here.
2022-07-22 20323, 2022
alastairp
so just to confirm - your current question is to know how we can decide if a couchdb database is ready to be queried?
2022-07-22 20302, 2022
lucifer
ah no, we can't decide it afaik so we just try latest to oldest.
2022-07-22 20350, 2022
alastairp
did we discuss having a separate piece of data indicating the db to check? (e.g. in redis or postgres)
2022-07-22 20300, 2022
lucifer
it should be 2 http queries at most, today's database or yesterday's. we delete daily so 2 databases at most exist at any time (when the latest one is currently not ready).
2022-07-22 20301, 2022
alastairp
I recall us talking about it but can't remember what we decided
2022-07-22 20332, 2022
lucifer
yes we discussed that, iirc we decided in favor of the current impl.
2022-07-22 20332, 2022
alastairp
is this 2 queries per [requesting a user's stats]?
2022-07-22 20339, 2022
lucifer
yes
2022-07-22 20305, 2022
lucifer
2 HTTP queries to be specific.
2022-07-22 20316, 2022
alastairp
so it's possible that for 1 user they might have something in today's db, but for another user it's not yet been inserted and so it goes to yesterday's?
2022-07-22 20326, 2022
lucifer
yes
2022-07-22 20337, 2022
lucifer
that happens currently as well fwiw.
2022-07-22 20341, 2022
alastairp
right
2022-07-22 20359, 2022
lucifer
also some stats may be for today but others old for the same user
2022-07-22 20320, 2022
alastairp
I'm just thinking through a few cases: is it possible that a user will have stats for yesterday, but even after they're fully computed and inserted into the db there are none for today?
2022-07-22 20314, 2022
lucifer
yes. that can happen in 2 cases. the user deleted their account or they didn't submit any listens for current range.
2022-07-22 20333, 2022
alastairp
yeah, no listens for current range is what I was thinking of
2022-07-22 20334, 2022
lucifer
however, in case 1 the user lookup in db fails, so couchdb is never queried.
2022-07-22 20342, 2022
alastairp
right
2022-07-22 20328, 2022
lucifer
case 2, it's possible on days like the 1st of the year, when the last year stat changes years.
2022-07-22 20359, 2022
lucifer
this is also the current behavior of the pg tables.
2022-07-22 20318, 2022
alastairp
you know what, I was just wondering about going back to our original discussion about saving the "current" table somewhere. and that's actually 2 queries anyway
2022-07-22 20328, 2022
Pratha-Fish
alastairp: I had to leave the mapper script on hold today due to some network issues. I'll try to get it done over the weekend though, so if there are any instructions you'd like to give me in advance, please go ahead :)
2022-07-22 20343, 2022
alastairp
Pratha-Fish: hi, no problem. I don't think I have anything extra that I need to add
2022-07-22 20347, 2022
lucifer
yeah indeed.
2022-07-22 20352, 2022
alastairp
let's talk either over the weekend or on monday
2022-07-22 20302, 2022
Pratha-Fish
alastairp: Alright :)
2022-07-22 20313, 2022
alastairp
lucifer: can couchdb's bulk insert methods help us?
2022-07-22 20320, 2022
lucifer
i guess redis may be faster than http but not very reliable. PG probably similar to http.
2022-07-22 20331, 2022
alastairp
(how large is a stats run? is it a risk for us to insert everything at once?)
2022-07-22 20343, 2022
lucifer
yes, already using those but batches of 10-25.
2022-07-22 20310, 2022
lucifer
yeah can't insert all in 1 go because each stat type X range combination ranges from 40 MB - 1.5 G.
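A sketch of the batching idea. CouchDB's `_bulk_docs` endpoint accepts a JSON body of the form `{"docs": [...]}`; the helper names and the sample documents are hypothetical, and no HTTP request is made here.

```python
import json
from itertools import islice


def batched(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_payloads(docs, size=25):
    """Yield one JSON request body per batch, in the _bulk_docs shape."""
    for chunk in batched(docs, size):
        yield json.dumps({"docs": chunk})


docs = [{"_id": str(user_id), "count": user_id} for user_id in range(60)]
payloads = list(bulk_payloads(docs, size=25))
batch_sizes = [len(json.loads(p)["docs"]) for p in payloads]
```

Each payload would be sent as its own POST, so memory use is bounded by the batch size rather than by the 40 MB to 1.5 GB total.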
2022-07-22 20314, 2022
alastairp
right. not all at once
2022-07-22 20315, 2022
alastairp
mmhm
2022-07-22 20356, 2022
alastairp
approximately how long does it take to store all stats?
2022-07-22 20317, 2022
lucifer
i'll need to check but >4 hrs iirc.
2022-07-22 20331, 2022
lucifer
this includes time to generate those as well.
2022-07-22 20323, 2022
lucifer
in the long run, it'll probably become infeasible to generate all this data daily, but i may be wrong. we'll see when that time nears.
2022-07-22 20352, 2022
alastairp
yeah, right. at some point in time it might be better to add in a flag for the current db too
2022-07-22 20320, 2022
alastairp
but at the moment, if we already have this behaviour in postgres, and if it'll seamlessly work as new data gets added, I don't see a huge problem with it
2022-07-22 20322, 2022
lucifer
sorry, not sure which flag you meant?
2022-07-22 20317, 2022
alastairp
oh, a setting in postgres or redis which says what the current database is
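A sketch of what such a setting could look like, with a plain dict standing in for redis or postgres. `CurrentDatabasePointer` and its methods are hypothetical. The pointer is flipped only once a run has finished inserting, so readers never see a half-filled database.

```python
class CurrentDatabasePointer:
    """Remembers which database is ready for each (stat_type, range) pair."""

    def __init__(self):
        self.current = {}  # stand-in for a key in redis or a row in postgres

    def mark_ready(self, stat_type, stat_range, db_name):
        self.current[(stat_type, stat_range)] = db_name

    def lookup(self, databases, stat_type, stat_range, user_id):
        name = self.current.get((stat_type, stat_range))
        if name is None:
            return 204, None
        doc = databases.get(name, {}).get(user_id)
        return (200, doc) if doc is not None else (204, None)


databases = {
    "artists_week_20220722": {1: {"top": "old"}},
    "artists_week_20220723": {},  # today's run has started but is still empty
}
pointer = CurrentDatabasePointer()
pointer.mark_ready("artists", "week", "artists_week_20220722")
before = pointer.lookup(databases, "artists", "week", 1)

databases["artists_week_20220723"][1] = {"top": "new"}
pointer.mark_ready("artists", "week", "artists_week_20220723")  # on the end message
after = pointer.lookup(databases, "artists", "week", 1)
```

Note that this is still two round trips per request: one to read the pointer and one to fetch the document.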
2022-07-22 20300, 2022
lucifer
ah ok, i was referring to the inserting part becoming infeasible, but sure, we can try redis/pg later.
2022-07-22 20315, 2022
lucifer
there's 1 difference from current behaviour. say you suddenly stopped submitting listens, the pg tables are never cleared so old stats remain, but in couchdb we delete old stats daily so outdated ones will no longer remain.
2022-07-22 20322, 2022
alastairp
oh yes, that may also be a problem in the future
2022-07-22 20350, 2022
lucifer
maybe we can discuss some options for that at summit.
2022-07-22 20358, 2022
alastairp
but we have different couchdb tables for each stat range type, right?
2022-07-22 20305, 2022
lucifer
right
2022-07-22 20317, 2022
alastairp
so if I stop submitting, in 1 day I'll stop having daily stats, in 1 week I'll stop having weekly stats
2022-07-22 20324, 2022
lucifer
right
2022-07-22 20328, 2022
alastairp
and my yearly stats won't stop showing until the next time we compute yearly ones?
2022-07-22 20340, 2022
alastairp
so that sounds like an improvement over the current setup?
2022-07-22 20342, 2022
lucifer
we compute all stats daily.
2022-07-22 20331, 2022
lucifer
but all of the year's listens are considered for yearly stats, so if you stop submitting for a year then that stat will go away
2022-07-22 20308, 2022
alastairp
yes, right. I'm following
2022-07-22 20329, 2022
lucifer
in the current setup, you'd continue seeing the last year's stats
2022-07-22 20300, 2022
lucifer
it's probably more accurate indeed.
2022-07-22 20355, 2022
lucifer
any other suggestions about insert/fetch process? if not let's move to dumps.
2022-07-22 20300, 2022
alastairp
oh, mmm
2022-07-22 20325, 2022
alastairp
can you give me some examples with date boundaries for seeing last [period]'s stats?
2022-07-22 20338, 2022
alastairp
e.g. imagine if I have stats for Jan 1-Jan 31
2022-07-22 20345, 2022
alastairp
and then I stop listening all of Feb
2022-07-22 20352, 2022
alastairp
then during Feb I'll see Jan's stats?
2022-07-22 20359, 2022
lucifer
yes
2022-07-22 20316, 2022
alastairp
and a user who is actively listening... will see month-to-date stats, or still Jan?
2022-07-22 20355, 2022
lucifer
we have 2 monthly ranges. Last Month is full last month in this case, always Jan. the other range is This Month which is to-date.
2022-07-22 20334, 2022
lucifer
so for This Month, user who submits in Feb will see Feb stats whereas the not submitting one will continue to see Jan.
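The two monthly ranges can be sketched as date boundaries. This assumes inclusive calendar dates and ignores timezones; the real stats code may define the boundaries differently, and the function names are made up.

```python
from datetime import date, timedelta


def this_month_range(today):
    """Month-to-date: from the 1st of the current month up to today."""
    return today.replace(day=1), today


def last_month_range(today):
    """The full previous calendar month."""
    first_of_this_month = today.replace(day=1)
    last_of_previous = first_of_this_month - timedelta(days=1)
    return last_of_previous.replace(day=1), last_of_previous


today = date(2022, 2, 10)
this_month = this_month_range(today)
last_month = last_month_range(today)
```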
2022-07-22 20307, 2022
alastairp
does a stat document include what range it's for?
2022-07-22 20334, 2022
lucifer
yes, in the current setup it's a column. in the couchdb setup, the database name contains it.
2022-07-22 20345, 2022
alastairp
if we're able to identify it (after we retrieve it from storage), it probably makes sense to say "you have no stats for the current month"
2022-07-22 20350, 2022
alastairp
oh
2022-07-22 20302, 2022
alastairp
but we don't know if this is because there are none, or if it's not been inserted yet
2022-07-22 20306, 2022
alastairp
back to this again
2022-07-22 20348, 2022
lucifer
we'll search in the older db and if there are none there either then say no stats.
2022-07-22 20353, 2022
alastairp
so it does feel a bit weird to say "month-to-date" and show Jan when we're in Feb
2022-07-22 20304, 2022
lucifer
indeed
2022-07-22 20355, 2022
alastairp
but as you said, this is the same as current behaviour?
2022-07-22 20317, 2022
lucifer
this is the current behaviour because we never delete the old stats. this happens because we only insert in stats table in PG. when spark sends new stats existing stats are overwritten. if no stat for this month is sent by spark then the old one remains.
2022-07-22 20331, 2022
alastairp
right
2022-07-22 20349, 2022
lucifer
in the couchdb setup, we only keep the stats that spark sent this time and get rid of the old db entirely. so the old stat goes away.
2022-07-22 20316, 2022
alastairp
I can see a possibility that we might want to set it up somehow so that "current x" stats always show the current stats, but for now I think it's OK to leave as-is
2022-07-22 20318, 2022
lucifer
so month-to-date does not show anything in Feb for the user who submitted nothing in Feb.
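The difference between the two setups comes down to upsert versus replace. A sketch with dicts keyed by user, where the function names and values are illustrative only:

```python
def apply_run_postgres_style(table, run):
    """Upsert: users in the new run are overwritten, everyone else is kept."""
    updated = dict(table)
    updated.update(run)
    return updated


def apply_run_couchdb_style(old_database, run):
    """Replace: the new database holds only what this run sent."""
    return dict(run)


january = {"alice": "Jan stats", "bob": "Jan stats"}
february_run = {"alice": "Feb stats"}  # bob submitted nothing in Feb

pg_view = apply_run_postgres_style(january, february_run)
couch_view = apply_run_couchdb_style(january, february_run)
```

In the upsert version bob keeps seeing January under "this month"; in the replace version he has no entry at all.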
2022-07-22 20317, 2022
lucifer
iiuc, then the couchdb setup already does that "current x" always showing the current stats?
the way I was understanding it, the user who hasn't listened to anything in feb will show "month of jan" when showing month-to-date stats, even in feb?