IRC Logs for #metabrainz | MetaBrainz Chatlogs

0:38 AM
Nyanko-sensei has quit
0:39 AM
Nyanko-sensei joined the channel
2:37 AM
yokel has quit
2:44 AM
yokel joined the channel
4:17 AM
Nyanko-sensei has quit
4:28 AM
Nyanko-sensei joined the channel
4:40 AM
Nyanko-sensei has quit
4:45 AM
Nyanko-sensei joined the channel
5:03 AM
Nyanko-sensei has quit
5:13 AM
Nyanko-sensei joined the channel
5:57 AM
AmandeeKumar joined the channel
6:01 AM
AmandeeKumar is now known as AmandeepKumar
6:09 AM
Nyanko-sensei has quit
6:17 AM
Nyanko-sensei joined the channel
6:36 AM
AmandeepKumar has quit
6:44 AM
_lucifer has quit
6:44 AM
_lucifer joined the channel
6:45 AM
revi has quit
6:47 AM
D4RK-PH0_ has quit
6:48 AM
revi joined the channel
7:20 AM
AmandeeKumar joined the channel
7:27 AM
AmandeeKumar has quit
7:29 AM
yvanzo

mo’’in’
8:03 AM
sumedh joined the channel
8:21 AM
rdswift has quit
8:27 AM
rdswift joined the channel
8:34 AM
sumedh has quit
8:53 AM
Nyanko-sensei has quit
8:58 AM
Nyanko-sensei joined the channel
9:25 AM
ruaok

mo'in!
9:26 AM
yvanzo: zas: shall we put some thought behind fixing trille today?
9:26 AM
yvanzo

ruaok: is listenbrainaz using rabbitmq too?
9:27 AM
ruaok

yes
9:29 AM
yvanzo

MB uses it to update search indexes, it should preferably not be stopped or search indexes won’t be up-to-date.
9:29 AM
I don’t think that this is what take the most resources on trille, but we could move this queue to PostgreSQL.
9:30 AM
ruaok

LB is less sensitive. if a user cannot submit a listen, clients must re-try. so, restarts are ok.
9:32 AM
https://stats.metabrainz.org/d/000000025/all-lo...
9:32 AM
something is a miss. load on trille keeps growing, traffic in rabbitmq hasn't grown.
9:34 AM
yvanzo

IMHO, CB has probably a lot of margin for reducing its footprint on resources.
9:34 AM
ruaok

Does CB use RMQ or CB is on trille as well?
9:35 AM
yvanzo

it is on trille as well
9:35 AM
ruaok

have we ascertained if the resource hog is CB or RMQ?
9:37 AM
teletgraf is the top process on trille? that feels odd to me.
9:41 AM
yvanzo

zas: Is it possible to monitor trille’s containers from grafana more closely, for example using cadvisor?
9:41 AM
zas

well, we already have reports for containers on trille
9:41 AM
ruaok

you read my mind, we need to a have a % usage per container graph....
9:42 AM
zas

https://stats.metabrainz.org/d/000000051/hetzne...
9:42 AM
ruaok

zas: got link?
9:42 AM
heh
9:42 AM
yvanzo

https://stats.metabrainz.org/d/000000051/hetzne...
9:42 AM
zas

ignore empty graphs, scroll down
9:43 AM
guys, we already have 2 suspects: rabbitmq and critiquebrainz-redis
9:43 AM
first one is known to eat cpu for nothing in certain cases
9:44 AM
second one is actually having huge write to disk spikes
9:44 AM
ruaok

I just dont see rabbitmq as the culprit. but I see redis. those peaks in BlkIo are worrying.
9:44 AM
zas

yes^^
9:44 AM
it writes far too much data
9:45 AM
yesterday I reduced share of trille mbs to almost nothing, so only few queries goes to it, and even with that, we still have very slow queries (read: seconds instead of milliseconds)
9:45 AM
ruaok

https://stats.metabrainz.org/d/000000051/hetzne...
9:45 AM
zas

on some ws queries (usually < 100ms) we can reach > 10s
9:45 AM
ruaok

that looks like we need to investigate wtf is happening here.
9:46 AM
what did we do on 10/9, for instance?
9:48 AM
yvanzo

with less empty graphs: https://stats.metabrainz.org/d/000000051/hetzne...
9:49 AM
ruaok

yvanzo: thank you. that helps.
9:49 AM
so, rabbitmq is not the problem. agreed?
9:50 AM
yvanzo

+1
9:51 AM
ruaok

https://github.com/metabrainz/critiquebrainz/re...
9:51 AM
that coincides with a CB release.
9:51 AM
and presumably a CB container restart.
9:52 AM
though the release on 10.26 didn't cause the same drop. perhaps redis was not restarted then?
9:52 AM
zas,yvanzo has redis been restarted recently?
9:53 AM
yvanzo

last time 2 months ago
9:53 AM
zas

this instance of redis doesn't run with --appendonly=yes, like most instances, so it doesn't use aof
9:54 AM
ruaok

any objections to restarting it to see what happens to the graph?
9:54 AM
zas

but imho beam.smp cannot be excluded yet
9:54 AM
ruaok

I could see a situation where CB is keeping a list in redis that keeps growing. and it is written over and over again.
9:54 AM
a bug, for sure.
9:54 AM
zas: I didn't. I'm trying to exclude on clear trouble maker to get another data point.
9:55 AM
*one
9:55 AM
_lucifer: ping
9:56 AM
zas

2273 process (beam.smp) is writing a lot
9:56 AM
ruaok

can you tell where the data goes, zas?
9:56 AM
zas

wait
9:57 AM
I catched redis write ops
9:57 AM
ruaok

beam is deffo the highest disk user. didn't CB add more telegraf logging? could it be overdoing it?
9:57 AM
zas

it goes up to 50mb/s
9:57 AM
ruaok

yes, that is why I am focusing on redis.
9:57 AM
zas

while beam.smp doesn't go over 400kb/s
9:57 AM
ruaok

beam.smp is an issue too, but its less spikey.
9:58 AM
zas

yes, so unlikely to cause huge delays we see
9:58 AM
ruaok

I'm going to restart redis, ok?
9:58 AM
yvanzo

It seems due to be CB usage of redis.
9:58 AM
zas

not sure restarting it will help, but you can try
9:59 AM
ruaok

it might drop the traffic back to 0 and then start growing again. but that would clearly indicate CB is doing something bad.
10:00 AM
yvanzo

we will probably the same graph as from September.
10:00 AM
_lucifer

ruaok: pong
10:00 AM
yvanzo

At least, it should be lower CPU/Mem usage first.
10:00 AM
ruaok

peaks are 5-6 mins apart.
10:00 AM
hi _lucifer !
10:01 AM
can you please follow the scroll back for the last 20 minutes?
10:01 AM
_lucifer

sure
10:01 AM
ruaok

we're seeing strange redis use coming from CB.
10:01 AM
I'm curious if redis use in CB has recently changed. is there anything that gets processed every 5 minutes or so?
10:02 AM
it is causing massive disk io spikes: https://stats.metabrainz.org/d/000000051/hetzne...
10:03 AM
zas, yvanzo : as expected the disk io for redis has dropped to nothing.
10:04 AM
beam.smp too. which might suggest that whatever the redis bug is it might be logging info to telegraf.
10:05 AM
Gazooo7949440 has quit
10:05 AM
_lucifer

there were a couple of bug fixes regarding that, a key mismatch but nothing comes to mind that happens at a regular interval
10:06 AM
ruaok

ok, the regular interval might be a redis behaviour due to increased use.
10:06 AM
_lucifer

all data served by CB is cached
10:06 AM
Gazooo7949440 joined the channel
10:06 AM
ruaok

what is weird is that we are seeing a lot of data being written to redis, but with very little read. that is fully upside-down of what it should be.
10:07 AM
_lucifer

yeah right
10:07 AM
can we like a sample of the latest read/writes?
10:07 AM
ruaok

so, the primary traffic for CB comes from MB hitting its API.
10:08 AM
/ws/1/review/?limit=1&offset=0&release_group=ee9b6cad-ee58-3529-81ba-cc204769459c&sort=rating
10:08 AM
_lucifer

*view a sample
10:08 AM
ruaok

is the endpoint that gets all the traffic. can you please review the entire code chain of this endpoint and review the redis use in great detail to see if we can find something where redis might be used incorrectly?
10:09 AM
_lucifer

sure, i'll do that
10:09 AM
ruaok

_lucifer: that is what I am hoping to do next. what is being written and read. let me dig
10:10 AM
wow.
10:11 AM
an incredible number of new keys are being generated in redis.
10:11 AM
_lucifer

my preliminary guess is that if there is no bug, then different MB entities are being queried (different release groups being viewed but the same page is not viewed frequently) but they are not viewed that often. hence, the writes are frequent but the reads are not.
10:12 AM
ruaok

6 keys a second are being created. that would be the problem.
10:15 AM
there must be a problem with the page cache.
10:16 AM
worst case, each page fetched from MB would cause 1 fetch and 1 write to redis.
10:16 AM
but I would expect some cache hits, so there should be fewer writes than reads.
10:16 AM
do you know where the cache keys are generated in CB, _lucifer?
10:16 AM
_lucifer

yes a sec
10:18 AM
ruaok: https://github.com/metabrainz/critiquebrainz/bl...
10:19 AM
ruaok

was caching in brainzutils changed recently?
10:20 AM
_lucifer

no doesn't seem so, the last commit to brainzutils cache was 2 years agp
10:20 AM
https://github.com/metabrainz/brainzutils-pytho...
10:20 AM
ruaok

_lucifer: could you do me a quick favor? can you disable caching from that function and make a small PR?
10:21 AM
then we can deploy that and observe.
10:21 AM
_lucifer

sure, on it
10:21 AM
ruaok

because right now caching is creating load problems , not solving them.
10:21 AM
thx
10:36 AM
BrainzGit

[critiquebrainz] amCap1712 opened pull request #337 (master…disable-cache): Disable release-group caching https://github.com/metabrainz/critiquebrainz/pu...
10:36 AM
ruaok

thx
10:42 AM
_lucifer

1 test is failing but that is expected.
10:43 AM
ruaok

agreed.
10:43 AM
let me see about deploying.
10:44 AM
BrainzGit

[critiquebrainz] mayhem merged pull request #337 (master…disable-cache): Disable release-group caching https://github.com/metabrainz/critiquebrainz/pu...
10:54 AM
_lucifer

I think MB only requires the review text and review ratings but CB provides additional entity data which MB already has. The number of reviews is much less than the number of entities. So for most entities, we are just caching the entity data which MB is not useful to MB. I can add a MB mode to cache only the review text and ratings. I think that would reduce the cache writes to a large extent.
10:55 AM
yvanzo

+1
10:55 AM
ruaok

oh, that is interesting. maybe make a separate endpoint for that?
10:56 AM
alastairp

hello. reading backlog
10:56 AM
need help with CB release?
10:57 AM
ruaok

hopefully not. 🤞