We also had a lot of transient zombie processes on rakim; it seems to be related to the sir-solr9-prod container. I just restarted the containers there to be sure
There seems to be some instability on the SolrCloud side: the number of threads is pretty high on 6 of the 8 nodes
and response times were quite slow for a while on certain nodes
I'll restart the solr nodes with a very high number of threads one by one. I just did solr1 and it seems to be back to normal
yvanzo[m]
What's the name of the dashboard?
zas[m]
SolrCloud 9
restarting solr nodes seems to work: the number of threads goes down from 3k+ to ~650 after the restart (which is the usual number, as before the incident). Not sure what happened though.
I restarted 1, 2, 3, 4 already; the first two took a long time (like 5 minutes), but now they restart much faster. I'm restarting 5 atm, and 7 is next (and last).
6 & 8 were running as normal
weird, 1 & 2 took ~5 minutes to restart, 3 & 4 ~1 minute, and 5 & 7 just a few seconds
All nodes are now on par regarding cpu/mem/threads
Let's see if we still get those 504s
yvanzo[m]
resolved so far
please give me access to the SolrCloud 9 dashboard when you have time
zas[m]
Yes, everything's back to normal after I restarted the last node