0:38 AM
Nyanko-sensei has quit
0:39 AM
Nyanko-sensei joined the channel
2:37 AM
yokel has quit
2:44 AM
yokel joined the channel
4:17 AM
Nyanko-sensei has quit
4:28 AM
Nyanko-sensei joined the channel
4:40 AM
Nyanko-sensei has quit
4:45 AM
Nyanko-sensei joined the channel
5:03 AM
Nyanko-sensei has quit
5:13 AM
Nyanko-sensei joined the channel
5:57 AM
AmandeeKumar joined the channel
6:01 AM
AmandeeKumar is now known as AmandeepKumar
6:09 AM
Nyanko-sensei has quit
6:17 AM
Nyanko-sensei joined the channel
6:36 AM
AmandeepKumar has quit
6:44 AM
_lucifer has quit
6:44 AM
_lucifer joined the channel
6:45 AM
revi has quit
6:47 AM
D4RK-PH0_ has quit
6:48 AM
revi joined the channel
7:20 AM
AmandeeKumar joined the channel
7:27 AM
AmandeeKumar has quit
7:29 AM
yvanzo
mo'in'
8:03 AM
sumedh joined the channel
8:21 AM
rdswift has quit
8:27 AM
rdswift joined the channel
8:34 AM
sumedh has quit
8:53 AM
Nyanko-sensei has quit
8:58 AM
Nyanko-sensei joined the channel
9:25 AM
ruaok
mo'in!
9:26 AM
yvanzo: zas: shall we put some thought behind fixing trille today?
9:26 AM
yvanzo
ruaok: is listenbrainz using rabbitmq too?
9:27 AM
ruaok
yes
9:29 AM
yvanzo
MB uses it to update search indexes; it should preferably not be stopped, or search indexes won't stay up-to-date.
9:29 AM
I don't think this is what takes the most resources on trille, but we could move this queue to PostgreSQL.
9:30 AM
ruaok
LB is less sensitive. if a user cannot submit a listen, clients must re-try. so, restarts are ok.
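For context, listen submission is a plain HTTP POST that clients can safely retry, which is why a short broker outage only delays ingestion. A minimal client-side sketch (payload shape per the public ListenBrainz API docs; the token is a placeholder):

    import time
    import requests

    LB_SUBMIT_URL = "https://api.listenbrainz.org/1/submit-listens"
    TOKEN = "<user token>"  # placeholder, not a real token

    payload = {
        "listen_type": "single",
        "payload": [{
            "listened_at": int(time.time()),
            "track_metadata": {"artist_name": "Example Artist",
                               "track_name": "Example Track"},
        }],
    }

    # Retry with exponential backoff: a temporary outage (e.g. a rabbitmq restart
    # behind the API) only delays the submission instead of losing it.
    for attempt in range(5):
        try:
            resp = requests.post(LB_SUBMIT_URL, json=payload,
                                 headers={"Authorization": f"Token {TOKEN}"},
                                 timeout=10)
            if resp.status_code == 200:
                break
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)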
9:32 AM
something is amiss. load on trille keeps growing, but traffic in rabbitmq hasn't grown.
9:34 AM
yvanzo
IMHO, CB probably has a lot of room to reduce its resource footprint.
9:34 AM
ruaok
Does CB use RMQ, or is CB on trille as well?
9:35 AM
yvanzo
it is on trille as well
9:35 AM
ruaok
have we ascertained whether the resource hog is CB or RMQ?
9:37 AM
telegraf is the top process on trille? that feels odd to me.
9:41 AM
yvanzo
zas: Is it possible to monitor trille’s containers from grafana more closely, for example using cadvisor?
9:41 AM
zas
well, we already have reports for containers on trille
9:41 AM
ruaok
you read my mind, we need to have a % usage per container graph....
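A rough way to get such per-container CPU percentages, as a sketch using the Docker SDK for Python; the formula mirrors what docker stats computes, and exact field availability varies by Docker version:

    import docker  # pip install docker

    client = docker.from_env()

    def cpu_percent(stats):
        """Approximate the CPU % that `docker stats` shows from one stats sample."""
        cpu, precpu = stats["cpu_stats"], stats["precpu_stats"]
        cpu_delta = cpu["cpu_usage"]["total_usage"] - precpu["cpu_usage"]["total_usage"]
        sys_delta = cpu.get("system_cpu_usage", 0) - precpu.get("system_cpu_usage", 0)
        ncpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage", [])) or 1
        return (cpu_delta / sys_delta) * ncpus * 100.0 if sys_delta > 0 else 0.0

    for c in client.containers.list():
        sample = c.stats(stream=False)  # one blocking sample per container
        print(f"{c.name:40s} {cpu_percent(sample):6.2f}%")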
9:42 AM
zas
9:42 AM
ruaok
zas: got link?
9:42 AM
heh
9:42 AM
yvanzo
9:42 AM
zas
ignore empty graphs, scroll down
9:43 AM
guys, we already have 2 suspects: rabbitmq and critiquebrainz-redis
9:43 AM
first one is known to eat cpu for nothing in certain cases
9:44 AM
second one is actually showing huge write-to-disk spikes
9:44 AM
ruaok
I just don't see rabbitmq as the culprit. but I see redis. those peaks in BlkIo are worrying.
9:44 AM
zas
yes^^
9:44 AM
it writes far too much data
9:45 AM
yesterday I reduced the traffic share of trille's mbs to almost nothing, so only a few queries go to it, and even with that, we still have very slow queries (read: seconds instead of milliseconds)
9:45 AM
ruaok
9:45 AM
zas
on some ws queries (usually < 100ms) we can reach > 10s
9:45 AM
ruaok
it looks like we need to investigate wtf is happening here.
9:46 AM
what did we do on 10/9, for instance?
9:48 AM
yvanzo
9:49 AM
ruaok
yvanzo: thank you. that helps.
9:49 AM
so, rabbitmq is not the problem. agreed?
9:50 AM
yvanzo
+1
9:51 AM
ruaok
9:51 AM
that coincides with a CB release.
9:51 AM
and presumably a CB container restart.
9:52 AM
though the release on 10.26 didn't cause the same drop. perhaps redis was not restarted then?
9:52 AM
zas, yvanzo: has redis been restarted recently?
9:53 AM
yvanzo
last time 2 months ago
9:53 AM
zas
this instance of redis doesn't run with --appendonly=yes, like most instances, so it doesn't use aof
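Without AOF, large periodic write bursts typically come from RDB snapshots (BGSAVE) triggered by the save thresholds; a quick redis-py sketch to confirm how this instance is actually configured (host and port are assumptions):

    import redis  # pip install redis

    r = redis.Redis(host="localhost", port=6379)  # assumed address of the CB redis instance

    print(r.config_get("appendonly"))   # e.g. {'appendonly': 'no'}
    print(r.config_get("save"))         # RDB snapshot thresholds
    persistence = r.info("persistence")
    print(persistence.get("aof_enabled"),
          persistence.get("rdb_changes_since_last_save"),
          persistence.get("rdb_last_bgsave_status"))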
9:54 AM
ruaok
any objections to restarting it to see what happens to the graph?
9:54 AM
zas
but imho beam.smp cannot be excluded yet
9:54 AM
ruaok
I could see a situation where CB is keeping a list in redis that keeps growing. and it is written over and over again.
9:54 AM
a bug, for sure.
9:54 AM
zas: I didn't. I'm trying to exclude one clear troublemaker to get another data point.
9:55 AM
_lucifer: ping
9:56 AM
zas
process 2273 (beam.smp) is writing a lot
9:56 AM
ruaok
can you tell where the data goes, zas?
9:56 AM
zas
wait
9:57 AM
I caught redis write ops
9:57 AM
ruaok
beam is deffo the highest disk user. didn't CB add more telegraf logging? could it be overdoing it?
9:57 AM
zas
it goes up to 50mb/s
9:57 AM
ruaok
yes, that is why I am focusing on redis.
9:57 AM
zas
while beam.smp doesn't go over 400kb/s
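Those per-process rates can be double-checked straight from /proc/<pid>/io (Linux only, and reading another user's process requires root); a small sampling sketch using the PID mentioned above:

    import time

    PID = 2273  # beam.smp, per the discussion above

    def write_bytes(pid):
        """Cumulative bytes this process has caused to be written to disk."""
        with open(f"/proc/{pid}/io") as f:
            for line in f:
                if line.startswith("write_bytes:"):
                    return int(line.split()[1])
        return 0

    prev = write_bytes(PID)
    while True:
        time.sleep(5)
        cur = write_bytes(PID)
        print(f"{(cur - prev) / 5 / 1024:.1f} kB/s written")
        prev = cur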
9:57 AM
ruaok
beam.smp is an issue too, but it's less spiky.
9:58 AM
zas
yes, so it's unlikely to cause the huge delays we see
9:58 AM
ruaok
I'm going to restart redis, ok?
9:58 AM
yvanzo
It seems to be due to CB's usage of redis.
9:58 AM
zas
not sure restarting it will help, but you can try
9:59 AM
ruaok
it might drop the traffic back to 0 and then start growing again. but that would clearly indicate CB is doing something bad.
10:00 AM
yvanzo
we will probably see the same graph as in September.
10:00 AM
_lucifer
ruaok: pong
10:00 AM
yvanzo
At least, CPU/Mem usage should be lower at first.
10:00 AM
ruaok
peaks are 5-6 mins apart.
10:00 AM
hi _lucifer !
10:01 AM
can you please read the scrollback for the last 20 minutes?
10:01 AM
_lucifer
sure
10:01 AM
ruaok
we're seeing strange redis use coming from CB.
10:01 AM
I'm curious if redis use in CB has recently changed. is there anything that gets processed every 5 minutes or so?
10:02 AM
10:03 AM
zas, yvanzo: as expected, the disk io for redis has dropped to nothing.
10:04 AM
beam.smp too, which might suggest that whatever the redis bug is, it might be logging info to telegraf.
10:05 AM
Gazooo7949440 has quit
10:05 AM
_lucifer
there were a couple of bug fixes regarding that (a key mismatch), but nothing comes to mind that happens at a regular interval
10:06 AM
ruaok
ok, the regular interval might be a redis behaviour due to increased use.
10:06 AM
_lucifer
all data served by CB is cached
10:06 AM
Gazooo7949440 joined the channel
10:06 AM
ruaok
what is weird is that we are seeing a lot of data being written to redis, but very little read. that is the exact opposite of what it should be.
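One way to quantify that imbalance is redis's per-command counters and hit/miss stats; a sketch with redis-py (connection details are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed CB redis address

    cmds = r.info("commandstats")   # cumulative per-command counters since startup
    sets = cmds.get("cmdstat_set", {}).get("calls", 0)
    gets = cmds.get("cmdstat_get", {}).get("calls", 0)
    stats = r.info("stats")
    hits, misses = stats.get("keyspace_hits", 0), stats.get("keyspace_misses", 0)
    print(f"SET calls: {sets}, GET calls: {gets}")
    print(f"cache hit rate: {hits / ((hits + misses) or 1):.1%}")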
10:07 AM
_lucifer
yeah right
10:07 AM
can we view a sample of the latest read/writes?
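A live sample like that can be captured with redis's MONITOR command; a brief sketch with redis-py (connection details are assumptions), kept short because MONITOR is expensive on a busy instance:

    import itertools
    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed CB redis address

    # MONITOR streams every command the server processes; sample briefly only.
    with r.monitor() as m:
        for event in itertools.islice(m.listen(), 200):
            print(event["time"], event["command"])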
10:07 AM
ruaok
so, the primary traffic for CB comes from MB hitting its API.
10:08 AM
/ws/1/review/?limit=1&offset=0&release_group=ee9b6cad-ee58-3529-81ba-cc204769459c&sort=rating
is the endpoint that gets all the traffic. can you please go over the entire code chain of this endpoint and examine its redis use in great detail, to see if we can find somewhere redis might be used incorrectly?
10:09 AM
_lucifer
sure, i'll do that
10:09 AM
ruaok
_lucifer: that is what I am hoping to do next. what is being written and read. let me dig
10:10 AM
wow.
10:11 AM
an incredible number of new keys are being generated in redis.
10:11 AM
_lucifer
my preliminary guess is that, if there is no bug, different MB entities are being queried (different release groups are being viewed, but the same page is not viewed frequently), so the writes are frequent but the reads are not.
10:12 AM
ruaok
6 keys a second are being created. that would be the problem.
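That creation rate is easy to confirm directly by sampling DBSIZE; a small sketch (connection details are assumptions):

    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed CB redis address

    prev = r.dbsize()
    for _ in range(12):
        time.sleep(10)
        cur = r.dbsize()
        print(f"{(cur - prev) / 10:.1f} net new keys/s (total {cur})")
        prev = cur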
10:15 AM
there must be a problem with the page cache.
10:16 AM
worst case, each page fetched from MB would cause 1 fetch and 1 write to redis.
10:16 AM
but I would expect some cache hits, so there should be fewer writes than reads.
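For reference, the cache-aside pattern being described looks roughly like this; a generic sketch using redis-py directly (hypothetical names, not the actual brainzutils or CB code). With a near-zero hit rate, every request degenerates into one missed GET plus one SET, which matches the write-heavy pattern above:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed

    def get_reviews_cached(release_group_mbid, fetch_from_db):
        """Hypothetical cache-aside lookup, not CB's real function."""
        key = f"reviews:{release_group_mbid}"
        cached = r.get(key)                          # 1 read per request
        if cached is not None:
            return json.loads(cached)                # hit: no write
        result = fetch_from_db(release_group_mbid)
        r.set(key, json.dumps(result), ex=3600)      # miss: 1 write per request
        return result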
10:16 AM
do you know where the cache keys are generated in CB, _lucifer?
10:16 AM
_lucifer
yes a sec
10:18 AM
10:19 AM
ruaok
was caching in brainzutils changed recently?
10:20 AM
_lucifer
no, doesn't seem so, the last commit to brainzutils cache was 2 years ago
10:20 AM
10:20 AM
ruaok
_lucifer: could you do me a quick favor? can you disable caching from that function and make a small PR?
10:21 AM
then we can deploy that and observe.
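The temporary change being requested amounts to short-circuiting that lookup, e.g. behind a flag; a tiny hypothetical sketch reusing get_reviews_cached from the sketch above:

    CACHE_ENABLED = False  # temporarily disabled so we can observe redis load

    def get_reviews(release_group_mbid, fetch_from_db):
        if CACHE_ENABLED:
            return get_reviews_cached(release_group_mbid, fetch_from_db)
        return fetch_from_db(release_group_mbid)  # always hit the database while measuring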
10:21 AM
_lucifer
sure, on it
10:21 AM
ruaok
because right now caching is creating load problems, not solving them.
10:21 AM
thx
10:36 AM
BrainzGit
10:36 AM
ruaok
thx
10:42 AM
_lucifer
1 test is failing but that is expected.
10:43 AM
ruaok
agreed.
10:43 AM
let me see about deploying.
10:44 AM
BrainzGit
10:54 AM
_lucifer
I think MB only requires the review text and review ratings, but CB provides additional entity data which MB already has. The number of reviews is much smaller than the number of entities. So for most entities, we are just caching entity data which is not useful to MB. I can add an MB mode to cache only the review text and ratings. I think that would reduce the cache writes to a large extent.
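A rough sketch of that idea (field names are illustrative, not CB's actual schema): cache only the fields MB renders instead of the full entity payload:

    def to_mb_cache_entry(review):
        """Keep only what MB renders; drop entity data MB already has.
        'review' is a hypothetical dict mirroring a CB review record."""
        return {
            "id": review["id"],
            "text": review["text"],
            "rating": review["rating"],
            "last_updated": review["last_updated"],
        }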
10:55 AM
yvanzo
+1
10:55 AM
ruaok
oh, that is interesting. maybe make a separate endpoint for that?
10:56 AM
alastairp
hello. reading backlog
10:56 AM
need help with CB release?
10:57 AM
ruaok
hopefully not. 🤞