#metabrainz

      • to81 joined the channel
      • 2017-06-04 15519, 2017

      • d4rkie has quit
      • 2017-06-04 15549, 2017

      • to81 has quit
      • 2017-06-04 15522, 2017

      • CatQuest
        oh ho, so various obviously spam things are being blocked as the logs are coming up?
      • 2017-06-04 15522, 2017

      • CatQuest
        !zas
      • 2017-06-04 15540, 2017

      • CatQuest
        hm
      • 2017-06-04 15540, 2017

      • CatQuest
        !m zas
      • 2017-06-04 15540, 2017

      • BrainzBot
        You're doing good work, zas!
      • 2017-06-04 15559, 2017

      • hibiscuskazeneko has quit
      • 2017-06-04 15534, 2017

      • agentsim joined the channel
      • 2017-06-04 15503, 2017

      • arbenina_ has quit
      • 2017-06-04 15535, 2017

      • CatQuest
        wtf is crazy webcrawler
      • 2017-06-04 15514, 2017

      • CatQuest
        should i report users who are obviously spam in the list?
      • 2017-06-04 15509, 2017

      • drsaunders joined the channel
      • 2017-06-04 15552, 2017

      • SothoTalKer
        CatQuest: i guess reporting the top users in the list could be reported if they are spammers
      • 2017-06-04 15550, 2017

      • SothoTalKer
        yeah
      • 2017-06-04 15558, 2017

      • SothoTalKer
        whatever i wanted to say there
      • 2017-06-04 15505, 2017

      • CatQuest
        i also think that maybe freso et al. are going through them anyway so it's probably not important to do it
      • 2017-06-04 15554, 2017

      • SothoTalKer
        many of those will be purged once the spam-domain email deletion is in place i guess
      • 2017-06-04 15545, 2017

      • Slurpee joined the channel
      • 2017-06-04 15550, 2017

      • D4RK-PH0ENiX joined the channel
      • 2017-06-04 15520, 2017

      • agentsim has quit
      • 2017-06-04 15551, 2017

      • CatQuest
        indeed
      • 2017-06-04 15501, 2017

      • to81 joined the channel
      • 2017-06-04 15527, 2017

      • to81 has quit
      • 2017-06-04 15505, 2017

      • samj1912 joined the channel
      • 2017-06-04 15548, 2017

      • to81 joined the channel
      • 2017-06-04 15517, 2017

      • agentsim joined the channel
      • 2017-06-04 15549, 2017

      • to81 has quit
      • 2017-06-04 15514, 2017

      • Freso
        CatQuest: Yeah; no reason to report spammers based on that report/page. They will hopefully get dealt with in an automated fashion and may be useful for data gathering until then.
      • 2017-06-04 15526, 2017

      • CatQuest
        :D
      • 2017-06-04 15529, 2017

      • to81 joined the channel
      • 2017-06-04 15520, 2017

      • to81 has quit
      • 2017-06-04 15528, 2017

      • hibiscuskazeneko joined the channel
      • 2017-06-04 15534, 2017

      • agentsim has quit
      • 2017-06-04 15548, 2017

      • agentsim joined the channel
      • 2017-06-04 15509, 2017

      • github joined the channel
      • 2017-06-04 15509, 2017

      • github
        [musicbrainz-server] zas closed pull request #519: Disallow more stuff in robots.txt and use Crawl-delay option (master...master) https://git.io/vH2dT
      • 2017-06-04 15509, 2017

      • github has left the channel
      • 2017-06-04 15559, 2017

      • github joined the channel
      • 2017-06-04 15559, 2017

      • github
        [musicbrainz-server] zas opened pull request #520: Update robots.txt (production...robots) https://git.io/vHaU1
      • 2017-06-04 15559, 2017

      • github has left the channel
      • 2017-06-04 15543, 2017

      • github joined the channel
      • 2017-06-04 15543, 2017

      • github
        [musicbrainz-server] mwiencek closed pull request #520: Update robots.txt (production...robots) https://git.io/vHaU1
      • 2017-06-04 15544, 2017

      • github has left the channel
      • 2017-06-04 15502, 2017

      • SothoTalKer
        ohh nice :)
      • 2017-06-04 15519, 2017

      • to81 joined the channel
      • 2017-06-04 15520, 2017

      • to81 has quit
      • 2017-06-04 15534, 2017

      • to81 joined the channel
      • 2017-06-04 15538, 2017

      • ruaok
        zas: can we please try and graph new users/hour in grafana?
      • 2017-06-04 15513, 2017

      • zas
        ruaok: where do i get the data ?
      • 2017-06-04 15555, 2017

      • ruaok
        I think we would need to have musicbrainz-server add a point to influx.
      • 2017-06-04 15507, 2017

      • ruaok
        username and IP address it was registered from.
      • 2017-06-04 15536, 2017

      • zas
        I asked for this some time ago, a bunch of metrics in json format, so we can use json telegraf plugin
      • 2017-06-04 15524, 2017

      • zas
        btw, yesterday 67% of website queries (excluding ws/search/static) were from search engine bots
      • 2017-06-04 15522, 2017

      • zas
      • 2017-06-04 15547, 2017

      • zas
        which had a Crawl-delay and disallow some urls + some bots
      • 2017-06-04 15541, 2017
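
A minimal sketch of how a well-behaved crawler would read the kind of rules zas describes (a Crawl-delay plus Disallow entries for some URLs and some bots). The robots.txt content, paths, delay value, and bot names below are made up for illustration and are not the actual MusicBrainz file; only Python's standard `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; paths, delay and bot names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
Disallow: /ws/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds a generic crawler is asked to wait between requests.
delay = parser.crawl_delay("SomeCrawler")

# Ordinary pages stay crawlable, listed paths do not.
allowed_artist = parser.can_fetch("SomeCrawler", "https://example.org/artist/123")
allowed_search = parser.can_fetch("SomeCrawler", "https://example.org/search?query=x")

# A bot named in its own block is excluded from everything.
allowed_badbot = parser.can_fetch("BadBot", "https://example.org/artist/123")
```

This only models cooperative crawlers; bots that ignore robots.txt still have to be blocked some other way, which is what the IP blocks mentioned in the channel are for.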

      • zas
        i also blocked some IPs, mostly chinese bots intentionally ignoring robots.txt
      • 2017-06-04 15519, 2017

      • SothoTalKer
        i guess we will see some different stats over the next few weeks then
      • 2017-06-04 15546, 2017

      • zas
        bitmap: how hard would it be to add a json output for a bunch of mb related metrics (like statistics we have already), see https://github.com/influxdata/telegraf/tree/maste…
      • 2017-06-04 15554, 2017

      • ruaok
        zas: on the metrics front, please open a ticket for that.
      • 2017-06-04 15515, 2017

      • ruaok
        we need to start collecting these stats and when the count ticks up, we need to start blocking IPs.
      • 2017-06-04 15527, 2017

      • ruaok
        until we come up with a more effective captcha...
      • 2017-06-04 15515, 2017

      • bitmap
        zas: depends on what kind of stats, but probably not hard
      • 2017-06-04 15510, 2017

      • bitmap
        and what kind of authentication it needs
      • 2017-06-04 15517, 2017

      • hibiscuskazeneko has quit
      • 2017-06-04 15506, 2017

      • zas
      • 2017-06-04 15506, 2017

      • BrainzBot
        MBS-9364: Provide some metrics through a json end point
      • 2017-06-04 15537, 2017

      • zas
        no auth, just provide count of entities every minute (or more if too heavy)
      • 2017-06-04 15528, 2017
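
A rough sketch of what such a metrics document could look like: a flat JSON object of entity counts that a collector polls periodically. The function name, field names, and numbers are all hypothetical; nothing here reflects how musicbrainz-server actually implements MBS-9364.

```python
import json
import time


def build_metrics(counts, now=None):
    """Return a flat JSON document of entity counts plus a Unix timestamp.

    `counts` maps a (hypothetical) entity name to its current count.
    """
    payload = {"timestamp": int(time.time() if now is None else now)}
    for name, value in counts.items():
        payload[f"{name}_count"] = int(value)
    return json.dumps(payload, sort_keys=True)


# Made-up numbers, for illustration only.
example = build_metrics({"artist": 1200, "release": 3400, "editor": 56}, now=0)
```

Keeping the object flat and numeric means a collector can treat each key as a field without extra configuration, and a per-hour rate such as new users/hour can be derived from successive counts on the graphing side.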

      • SothoTalKer
        quite nice that MB is crawled by an anti piracy bot: musobot/1.0 :D
      • 2017-06-04 15545, 2017

      • bitmap
      • 2017-06-04 15528, 2017

      • reosarevok sighs
      • 2017-06-04 15502, 2017

      • SothoTalKer
        i bet it does not obey robots.txt :)
      • 2017-06-04 15536, 2017

      • reosarevok
        I thought the hard part of converting the data from this encyclopedia into wiki articles would be to convert the text into wiki-style text
      • 2017-06-04 15557, 2017

      • reosarevok
        Turns out just getting the text into a form that can be programmatically dealt with is a huge pita
      • 2017-06-04 15515, 2017

      • SothoTalKer
        this?
      • 2017-06-04 15537, 2017

      • reosarevok
      • 2017-06-04 15538, 2017

      • reosarevok
        This
      • 2017-06-04 15544, 2017

      • reosarevok
        The Encyclopedia of Estonian Scientists :p
      • 2017-06-04 15521, 2017

      • reosarevok
        Turns out manipulating 1500 pages of text is annoying :p
      • 2017-06-04 15525, 2017

      • SothoTalKer
        well, 1913 for the IV-version o.o
      • 2017-06-04 15534, 2017

      • reosarevok
        volume :)
      • 2017-06-04 15549, 2017

      • SothoTalKer
        that, too :D
      • 2017-06-04 15554, 2017

      • reosarevok
        Yes, there's apparently a lot of Estonian scientists :D
      • 2017-06-04 15516, 2017

      • reosarevok switches the approach to "if given an entry, deal with it" first, before trying to actually split the entries
      • 2017-06-04 15508, 2017

      • SothoTalKer
        the question is: are all of those scientists notable enough for wikipedia :)
      • 2017-06-04 15503, 2017

      • reosarevok
        For the Estonian one? Yeah
      • 2017-06-04 15524, 2017

      • reosarevok
        (it was Wikimedia Estonia who was looking into doing this actually)
      • 2017-06-04 15525, 2017

      • Leftmost
        Are there any decent IRC clients for OSX that don't require money?
      • 2017-06-04 15502, 2017

      • ruaok
        if you want, I can add you to the team irccloud account
      • 2017-06-04 15536, 2017

      • Leftmost
        Is that in any way a pain in the ass?
      • 2017-06-04 15559, 2017

      • Leo_Verto[m]
        zas: by the way, do you know what 46.229.171.212 is? all the other top IPs seem to be the big search engine crawlers but that one seems to be a CDN and hosting service located in the netherlands
      • 2017-06-04 15527, 2017

      • ruaok
        Leftmost: no more than you are a pain in the ass. :)
      • 2017-06-04 15558, 2017

      • zas
        Leo_Verto[m]: 46.229.171.212 was just doing same query over and over, with spammy stuff appended to the url, now blocked
      • 2017-06-04 15520, 2017

      • Leftmost
        So it's a huge pain in the ass. :)
      • 2017-06-04 15538, 2017

      • ruaok
        yep, but I am willing to endure it nonetheless.
      • 2017-06-04 15515, 2017

      • ruaok
        pm me the email address you want to use for this
      • 2017-06-04 15523, 2017

      • Leo_Verto[m]
        zas: abuse@datawebglobalgroup.com might be interested in that then
      • 2017-06-04 15545, 2017

      • regagain joined the channel
      • 2017-06-04 15559, 2017

      • regagain
        Hello everybody!
      • 2017-06-04 15508, 2017

      • Leo_Verto[m]
        Hey!
      • 2017-06-04 15515, 2017

      • Mineo joined the channel
      • 2017-06-04 15547, 2017

      • CatQuest
        hi rega! \o
      • 2017-06-04 15538, 2017

      • Leftmost_ joined the channel
      • 2017-06-04 15521, 2017

      • Leftmost has quit
      • 2017-06-04 15521, 2017

      • Leftmost_ is now known as Leftmost
      • 2017-06-04 15539, 2017

      • hibiscuskazeneko joined the channel
      • 2017-06-04 15501, 2017

      • to81 has quit
      • 2017-06-04 15511, 2017

      • rahulr has quit
      • 2017-06-04 15511, 2017

      • ferbncode has quit
      • 2017-06-04 15512, 2017

      • iliekcomputers has quit
      • 2017-06-04 15550, 2017

      • ferbncode joined the channel
      • 2017-06-04 15553, 2017

      • iliekcomputers joined the channel
      • 2017-06-04 15506, 2017

      • to81 joined the channel
      • 2017-06-04 15508, 2017

      • to81 has quit
      • 2017-06-04 15513, 2017

      • samj1912 has quit
      • 2017-06-04 15541, 2017

      • arbenina joined the channel
      • 2017-06-04 15512, 2017

      • iliekcomputers
        ruaok: will you have the time to work on LB tomorrow? I was hoping to get the python3 PR merged :)
      • 2017-06-04 15535, 2017

      • ruaok
        yes. you're near the top of the list.
      • 2017-06-04 15548, 2017

      • ruaok
        (even though it is a bank holiday here)
      • 2017-06-04 15505, 2017

      • iliekcomputers
        Awesome! Thanks. :)
      • 2017-06-04 15526, 2017

      • kyan joined the channel
      • 2017-06-04 15511, 2017

      • zas
        ruaok: i'm proceeding to block ws abusers, mainly IPs doing repeated queries ending with 403s (bad UA mostly, will not complain), i blocked top 500 IPs at network level: https://stats.metabrainz.org/dashboard/db/mbstats…
      • 2017-06-04 15540, 2017
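
One way to pick out such a block list, sketched under the assumption that access-log entries are already parsed into (IP, status) pairs: count the 403 responses per client and take the worst offenders. The record format, function name, and addresses are hypothetical, and this is not necessarily how zas produced the list.

```python
from collections import Counter


def top_403_ips(log_records, limit):
    """Return up to `limit` client IPs ordered by number of 403 responses."""
    counts = Counter(ip for ip, status in log_records if status == 403)
    return [ip for ip, _ in counts.most_common(limit)]


# Made-up records using documentation-range addresses.
records = [
    ("192.0.2.1", 403),
    ("192.0.2.1", 403),
    ("192.0.2.2", 200),
    ("192.0.2.3", 403),
    ("192.0.2.1", 403),
]
worst = top_403_ips(records, 2)
```

The resulting addresses could then be fed to whatever network-level block is in use.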

      • hibiscuskazeneko has quit
      • 2017-06-04 15549, 2017

      • lazka has quit
      • 2017-06-04 15534, 2017

      • antgel joined the channel
      • 2017-06-04 15523, 2017

      • arbenina has quit
      • 2017-06-04 15552, 2017

      • arbenina joined the channel
      • 2017-06-04 15514, 2017

      • hibiscuskazeneko joined the channel
      • 2017-06-04 15534, 2017

      • SothoTalKer
        zas: how can you actually get a 403?
      • 2017-06-04 15524, 2017

      • zas
        using the ws without a proper ua string to start with
      • 2017-06-04 15516, 2017
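
For script authors, a small sketch of setting an identifying User-Agent on a web service request with Python's standard library. The application name, contact address, and query are placeholders, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Placeholder identity; a real client would use its own name, version and contact.
USER_AGENT = "ExampleTagger/0.1 ( contact@example.org )"


def build_ws_request(url):
    """Build a request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT,
                                 "Accept": "application/json"})


# Illustrative query URL; nothing is fetched at this point.
req = build_ws_request("https://musicbrainz.org/ws/2/artist?query=example&fmt=json")
```

Passing `req` to `urllib.request.urlopen` would send it with that header instead of the library default.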

      • zas
        those i blocked continue hammering the ws, even when getting not a single positive response...
      • 2017-06-04 15516, 2017

      • SothoTalKer
        ah. lucky me that my script had a reasonable default UA :)
      • 2017-06-04 15545, 2017

      • reosarevok
        zas: how much do we care once they're blocked?
      • 2017-06-04 15558, 2017

      • reosarevok
        They still consume *something* to get the response, right?
      • 2017-06-04 15505, 2017

      • reosarevok
        Even if the response is fuck off?
      • 2017-06-04 15534, 2017

      • zas
        Well, since they are blocked at IP level, using ipset, that's very low overhead compared to blocking at http level
      • 2017-06-04 15500, 2017

      • reosarevok
        Ok, so just laughing at them is ok then?
      • 2017-06-04 15541, 2017

      • lengtche joined the channel
      • 2017-06-04 15536, 2017

      • to81 joined the channel
      • 2017-06-04 15540, 2017

      • zas
        reosarevok: yes, they don't care about responses from the ws, i doubt they'll even notice they were blocked
      • 2017-06-04 15552, 2017

      • zas
        if someone complains, well, we'll handle the case ;)
      • 2017-06-04 15543, 2017

      • SothoTalKer
        zas: does this mean the WS should be a bit less overloaded? ^-^
      • 2017-06-04 15558, 2017

      • zas
        not really
      • 2017-06-04 15507, 2017

      • zas
        because those never hit backends
      • 2017-06-04 15534, 2017

      • SothoTalKer
        hmh
      • 2017-06-04 15553, 2017

      • zas
        but in the next few days i'll have a look at other kinds of abuse, and see if we can reduce the traffic a bit
      • 2017-06-04 15530, 2017

      • SothoTalKer
        blocking a few bots does help i guess
      • 2017-06-04 15533, 2017

      • hibiscuskazeneko has quit
      • 2017-06-04 15547, 2017

      • SothoTalKer
        and rate limiting the others