hibiscuskazeneko: in fact, there was an issue with the mb-solr-6 node
2018-10-12 28510, 2018
zas
i had to restart solr process
2018-10-12 28548, 2018
zas
the server was shut down at 6:33 UTC, and came back online 10 minutes later. i suppose that's due to a Hetzner maintenance task, but i can't find anything about it, so i asked them. The issue was caused by the solr process, which didn't recover well and, for some reason, was returning 500 errors. i'll have to tune haproxy health checks to take this case into account
2018-10-12 28549, 2018
zas
basically it was online, but generating errors, which disappeared after a simple restart of the process, which rejoined the solr cloud without issue.
2018-10-12 28508, 2018
hibiscuskazeneko has quit
2018-10-12 28511, 2018
zas
solr was erroring with "o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down." on every request, filled the logs with that, the exact cause is yet to be determined.
2018-10-12 28507, 2018
hibiscuskazeneko joined the channel
2018-10-12 28530, 2018
zas
ok, i modified the health checks so haproxy checks with an actual query, and if it returns anything but 200 it declares the node unhealthy. It should prevent sending errors to users (but it doesn't address the core issue, which is still undetermined)
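[editor's note] The haproxy configuration itself is not shown in this log. As a minimal sketch of the idea described above (run an actual query, treat anything but 200 as unhealthy), assuming a hypothetical `node_is_healthy` helper and a query URL of your choosing:

```python
import urllib.error
import urllib.request


def node_is_healthy(query_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True only if a real query against the node answers HTTP 200.

    `query_url` is a placeholder for whatever search query you probe with;
    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(query_url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # The node answered, but with an error status such as 500:
        # "online, but generating errors".
        return False
    except OSError:
        # Connection refused, timeout, DNS failure: the node is unreachable.
        return False
```

The point of the sketch is that a node which accepts connections but answers 500 is reported as unhealthy, which a plain port or ping check would miss.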
2018-10-12 28555, 2018
zas
another issue, maybe related: the sir-beta container on queen spawned a lot of python -m sir amqp_watch processes
it has something to do with adding support for new languages
2018-10-12 28528, 2018
hibiscuskazeneko has quit
2018-10-12 28550, 2018
zas
ignore telegram alerts, just restarted grafana server
2018-10-12 28532, 2018
ruaok
mooooin.
2018-10-12 28559, 2018
ruaok
hmmm. starbucks coffee to get the day started. what could possibly go 'rong?
2018-10-12 28513, 2018
zas
good morning SF
2018-10-12 28504, 2018
rdswift joined the channel
2018-10-12 28535, 2018
Freso
Morning. :)
2018-10-12 28515, 2018
ruaok
zas: what server was taken out by hetzner?
2018-10-12 28538, 2018
zas
solr-cloud-6 VM, but it was somehow planned, i didn't get the info, because Robot notices for cloud machines wasn't checked (i subscribed before hetzner cloud was a thing)
2018-10-12 28558, 2018
zas
that's the drawback i was talking about during the summit
2018-10-12 28506, 2018
ruaok
are you getting the notifications now?
2018-10-12 28517, 2018
zas
i should
2018-10-12 28546, 2018
zas
for some reason, solr on this node didn't restart properly, i couldn't reproduce the issue (i tried hard)
2018-10-12 28536, 2018
ruaok
I noticed a message about solr not being available on #musicbrainz, but that it fixed itself.
2018-10-12 28537, 2018
zas
but the core problem was haproxy not taking out this node, because it was healthy enough according to its checks
2018-10-12 28543, 2018
ruaok
I guess those two were related.
2018-10-12 28553, 2018
ruaok
ah!
2018-10-12 28502, 2018
zas
25% of search requests failed for 2 hours
2018-10-12 28531, 2018
zas
i fixed health checks, everything is back to normal
2018-10-12 28558, 2018
ruaok
what was wrong with the checks?
2018-10-12 28507, 2018
ruaok
alive, but not answering queries?
2018-10-12 28511, 2018
zas
well, solr was answering.... 500
2018-10-12 28520, 2018
zas
yup
2018-10-12 28538, 2018
zas
i changed it to check 200 on actual query
2018-10-12 28557, 2018
ruaok
I didn't see a nagios notification for it. are those nodes nagios monitored?
2018-10-12 28528, 2018
zas
nope
2018-10-12 28553, 2018
ruaok
why not?
2018-10-12 28538, 2018
zas
not done yet
2018-10-12 28548, 2018
zas
and nagios checks are a pain to maintain
2018-10-12 28517, 2018
zas
plus this problem wouldn't have been noticed by a standard nagios check
2018-10-12 28507, 2018
zas
alerts based on grafana/influxdb are there to fill gaps left by nagios
2018-10-12 28506, 2018
ruaok
I'm glad we don't have to pay for hetzner's chaos monkey services.
2018-10-12 28516, 2018
ruaok
but it sure helps us find SPoFs. :)
2018-10-12 28522, 2018
zas
yup :)
2018-10-12 28546, 2018
zas
btw, there were 100+ alerts on telegram
2018-10-12 28500, 2018
zas
i noticed as soon as i woke up
2018-10-12 28513, 2018
ruaok
ah, yes that massive spam firehose that monkey and I ignore.
2018-10-12 28555, 2018
ruaok
do we need to make yet another one that filters 99% and just sends a "yo, shit is on fire, go look at the other one"?
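[editor's note] A rough sketch of the "filter 99% and send one message" idea, assuming alerts arrive as a list of alert names collected over some time window. The function name `digest`, the threshold, and the message wording are all hypothetical and not part of any existing setup mentioned here:

```python
from collections import Counter


def digest(alerts, min_alerts=10, top=3):
    """Collapse a burst of alert names into one summary line.

    Returns None when fewer than `min_alerts` fired in the window,
    so quiet periods produce no message at all.
    """
    if len(alerts) < min_alerts:
        return None
    counts = Counter(alerts)
    worst = ", ".join(f"{name} x{n}" for name, n in counts.most_common(top))
    return f"{len(alerts)} alerts in this window, go look at the main channel (top: {worst})"
```

For example, `digest(["solr 5xx"] * 8 + ["latency"] * 4)` yields a single line naming the two noisiest alerts, while a window with three alerts yields nothing.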
2018-10-12 28533, 2018
ruaok
sheraton peeps: stuffing two coffee bags into the machine in one go makes a reliable cup of coffee.
2018-10-12 28508, 2018
ruaok
also sunnyvale peeps: I am planning on getting a lyft to breakfast, a weed shop and then trader joes, the eccentric food shop, then back to the hotel. anyone game?
2018-10-12 28544, 2018
outsidecontext has quit
2018-10-12 28500, 2018
zas
wait for me, booking a flight ;)
2018-10-12 28526, 2018
zas
ruaok: the alert system in grafana is quite new, but it improves with each release, i guess things will improve over time regarding spammy alerts (and i keep refining them too)
2018-10-12 28500, 2018
zas
it doesn't yet have a "flapping" status like nagios has
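[editor's note] For readers unfamiliar with the term: a check is "flapping" when its state changes too often over recent results. The following is a simplified, unweighted sketch of that idea, not Nagios's actual algorithm; the function names and the 50% threshold are assumptions for illustration:

```python
def percent_state_change(states):
    """Percentage of consecutive check results whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(states, high_threshold=50.0):
    """Treat a check as flapping when its recent states change too often."""
    return percent_state_change(states) >= high_threshold
```

An alerting layer could use such a flag to send one "flapping" notice instead of a notification for every ok/alerting transition.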
2018-10-12 28556, 2018
Freso
ruaok: I'm up for going to Trader Joe's. Just finishing up breakfast and have no interest in weed shops though. 🙈
2018-10-12 28517, 2018
ruaok
can't pick and choose.
2018-10-12 28535, 2018
ruaok
the new route is in and out burger, trader joes, weed, back to the hotel.
2018-10-12 28514, 2018
iliekcomputers
When? I just got up, can join in a half hour.
2018-10-12 28547, 2018
ruaok
10:45 in the lobby then. :)
2018-10-12 28527, 2018
iliekcomputers
Woo, cool. :)
2018-10-12 28512, 2018
yvanzo
will be there too :)
2018-10-12 28503, 2018
Freso
If you're willing to swing by Aloft and pick me up, I'm down as well. Otherwise, I'll see you tonight.
2018-10-12 28546, 2018
iliekcomputers
I'm in the lobby
2018-10-12 28526, 2018
UmkaDK_ joined the channel
2018-10-12 28554, 2018
UmkaDK has quit
2018-10-12 28559, 2018
ruaok
If you really want to, Freso. Two places are of no interest to you, but hey.
2018-10-12 28526, 2018
ruaok
If you really want to, we can swing by in about 10 mins.
2018-10-12 28531, 2018
Freso
I'd like to meet up with you guys and hang out, activity in question is secondary. I'll be outside Aloft in 10. :)