#metabrainz

      • ansh[m]
        Reviewed and merged holycow23's PR for introducing a new stat on LB Stats Page.
      • Great work on that!
      • And reviewed some GSoC proposals
      • That’s it for me.
      • mayhem: next?
      • mayhem[m]
        ay!
      • reosarevok[m]
        Still on my list for today: reosarevok, Gautam Shorewala, kellnerd
      • mayhem[m]
        last week was the usual guff running MetaBrainz.
      • but I spent most of my time porting the fast-fuzzy code to C++. while C++ has gotten a lot more modern, I had forgotten how much I hate C++, and what a pain it is to work with.
      • but we need this code to be fast, so I delved in.
      • running the same code in python took 3 minutes.
      • after a week in C++, it's still not working, lol.
      • (I blame nmslib)
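The fast-fuzzy matching mayhem is porting can be illustrated in miniature. This is a hedged editor's sketch using only the standard library's difflib, not the actual fast-fuzzy or nmslib code; the artist names and the `best_fuzzy_match` helper are made up for illustration:

```python
# Generic fuzzy string matching sketch (NOT the real fast-fuzzy code):
# score each candidate against a query with difflib and keep the best.
from difflib import SequenceMatcher

def best_fuzzy_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to query, plus its similarity ratio."""
    def score(candidate: str) -> float:
        # ratio() is 2*M/T where M = matching characters, T = total length
        return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    best = max(candidates, key=score)
    return best, score(best)

if __name__ == "__main__":
    names = ["Portishead", "Porcupine Tree", "Radiohead"]
    print(best_fuzzy_match("portis head", names))
```

This pairwise approach is O(n) per query, which is why an approximate-nearest-neighbour index (the nmslib role) matters once the candidate list is millions of names.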
      • fin. kellnerd go!
      • kellnerd[m] joined the channel
      • kellnerd[m]
        Hey there!
      • Last week I implemented seeding of external links for recordings in Harmony.
      • I mainly did this as a bonus because I wanted to improve Harmony's error messages in case someone looks up a "track" URL instead of a release URL.
      • After I did a release, it turned out that this isn't just a niche feature to give hardcore users RSI syndrome [like I thought](https://github.com/kellnerd/harmony/is...).
      • Apparently many people like it or were even anticipating it (to improve BrainzPlayer's matching accuracy, for example) and the forum thread was flooded by feedback I still have to work through...
      • (I am still busy with university, but I see one more candidate for an MBS ticket in there, the seeded recording form loses ISRCs which were added while the tab was open.)
      • That's it, go reosarevok!
      • reosarevok[m]
        Hi!
      • My main task for the week was schema change tickets, primarily MBS-9253
      • BrainzBot
        MBS-9253: List EP release groups above singles on artist pages https://tickets.metabrainz.org/browse/MBS-9253
      • reosarevok[m]
        I also started MBS-13768
      • BrainzBot
        MBS-13768: Add MBIDs to mediums https://tickets.metabrainz.org/browse/MBS-13768
      • reosarevok[m]
        Other than that, the usual small tickets, releasing prod today, and community work
      • (on second read, that makes it sound like I was picking up trash, doesn't it)
      • suvid[m] joined the channel
      • suvid[m]
        Hi reosarevok
      • Can I also give a review?
      • Thanks
      • reosarevok[m]
        Sure
      • So, suvid go!
      • suvid[m]
        okie
      • reosarevok[m]
      • Still on my list after you: Gautam Shorewala
      • suvid[m]
        So I finished my first draft of GSoC proposal this week and put it on the community forum... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
      • That's it for me :)
      • Gautam Shorewala: you can go!
      • reosarevok[m]
        rustynova: did you want to give a review too btw?
      • Or RustyNova :)
      • Seems neither is around right now - should we have a defence talk, zas?
      • zas[m]
        Yes.
      • rustynova[m]
        Am here
      • reosarevok[m]
        rustynova: ok, do you want to give a melba review? :)
      • zas[m]
        For the past two weeks we have been trying to mitigate web traffic originating from AI web scrapers
      • reosarevok[m]
        zas: let's wait a second :)
      • rustynova[m]
        I've started working on Melba again after being reminded that it existed by the many GSoC applications. I plan to contribute slowly but surely now that it's in a somewhat stable state and not bound by someone else's GSoC.
      • Last week I made three PRs: one to clean up the code of the poller, another to merge a useless crate, and lastly one for wider dependency updates.
      • Planning on checking the tests very soon as some decide to fail if you look at them wrong
      • zas[m]
        ok
      • rustynova[m]
        That's all for me.
      • zas[m]
        :D that was 1 second
      • reosarevok[m]
        Thanks! And thanks for looking into melba :)
      • monkey[m]
        Peachy
      • reosarevok[m]
        zas: let's start again from the beginning on defence :)
      • rustynova[m]
        (I readied my message for when I had the opportunity to check the chat)
      • mayhem[m]
        zas: do you know *which* AI companies are doing this?
      • zas[m]
        So, we get this traffic, and they try hard to evade any rate limit, robots.txt, etc.
      • mayhem: of course not
      • They use rotating residential proxies (basically they do only a few hits a day using one IP) and randomized user agents
      • For us, it meant something like 5 to 10 times the normal traffic on MB website
      • So it translates to up to 10x more servers to handle that, IF it doesn't increase
      • For our users, it meant very degraded response times (for example, from 0.5 seconds to 5 seconds per query)
      • mayhem[m]
        do you have a sample of what the user agent strings look like?
      • zas[m]
        Yes, like thousands of perfectly legit user agent strings
      • mayhem[m]
        copied from other browsers and apps?
      • zas[m]
        They also use techniques to simulate normal browsing
      • mayhem[m]
        here is what I am getting at: I think you should collect as much data as you have to help us identify the bad actors.
      • zas[m]
        (thanks to julian45 for those links)
      • mayhem[m]
        post that on a blog post and ask for help from the general public.
      • zas[m]
        There's no point in "identifying" them... we can't.
      • Some rotational proxy services claim to have 100M IPs
      • So, we need "something" to limit their impact, but I doubt we are able to totally get rid of it
      • mayhem[m]
        do we need a "making sure you're not a bot" page, at least for now?
      • I wonder if we can somehow poison the data we send to them.
      • zas[m]
        Yes, I think so. Because it can get worse very soon
      • julian45[m]
        possibly yeah, which is where cloudflare could help, or running something like [anubis](https://anubis.techaro.lol/), which you can observe in, e.g., gitlab.gnome.org
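The idea behind a tool like anubis is to make each page view cost the client a small proof-of-work computation: negligible for one human's browser, expensive at crawler scale. This is an editor's sketch of that general concept only, not anubis's actual algorithm; the challenge string and difficulty are invented:

```python
# Proof-of-work sketch (generic idea, NOT Anubis's real implementation):
# client must find a nonce whose SHA-256 digest starts with N zero hex
# digits before the server hands over the page.
import hashlib

DIFFICULTY = 3  # leading zero hex digits required (~4096 hashes on average)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce satisfying the difficulty target."""
    nonce = 0
    while not hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

if __name__ == "__main__":
    n = solve("mb-challenge-42")
    print(n, verify("mb-challenge-42", n))
```

The asymmetry is the point: verification is one hash, solving is thousands, and the cost multiplies across the millions of requests a scraper fleet makes.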
      • mayhem[m]
        if we could identify them, we could send bad data.
      • julian45[m]
        mayhem[m]: Re: poisoning, CloudFlare had a product announcement about this recently, one sec
      • > Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.
      • > AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.
      • mayhem[m]
        AI vs AI. <sob> no lol, just <sob>
      • zas[m]
        Also, it would be useful to separate MB website & web service more (using a separate domain), in order to improve filtering (we expect mostly humans on the website, and mostly robots on the web service)
      • yvanzo[m]
        I will look into the captcha stuff for MB again.
      • julian45[m]
        In favor of zas' proposal of separating web service into a subdomain as well
      • mayhem[m]
        do you have a sample of what their requests look like?
      • julian45[m]
        Varies all the time. Usually there's a hit to set-language and then a redirect to a release or something, but that's not always the case. user-agent (advertisement of what browser/client the endpoint is using) also varies a lot
      • reosarevok[m]
        bitmap, yvanzo: is there any reason we should *not* have a subdomain for the API?
      • zas[m]
        Most advanced bots are now able to mimic human traffic 100% (including passing most captchas), so we also need to work on performance improvements and scalability
      • monkey[m]
        Time for auth tokens for all API requests ? 😢
      • mayhem[m]
        reosarevok[m]: I arranged this eaaaarly on. and I was ridiculed for it and it was taken out.
      • outsidecontext[m]
        I find it really strange what effort the AI companies put into this, when for MB they could just run a mirror and get all the data fast and up-to-date
      • mayhem[m]
        just saying. <lolsob>
      • yvanzo[m]
        reosarevok: Just backwards compatibility. This isn’t an easily actionable change either.
      • zas[m]
        We already have ws.musicbrainz.org/ws/2/ btw
      • mayhem[m]
        or download the JSON data, and ingest it!
      • monkey[m]
        outsidecontext[m]: I think this is more about a generic crawling solution that will work for multiple targets, otherwise they would download a copy of the data...
      • mayhem[m]
        LB uses api.lb.org so, let's use api
      • zas[m]
        outsidecontext: yes, I totally agree, but it seems AI-related scrapers aren't that intelligent, they just suck data from the web
      • reosarevok[m]
        s/data from the web//
      • mayhem[m]
        if your whole infra is setup for that, it makes sense. google works the same way.
      • zas[m]
      • mayhem[m]
        great
      • zas[m]
        We need to update documentation, and likely add proper redirects
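The redirect zas mentions could look like this in miniature. This is a hedged editor's sketch only: `api.musicbrainz.org` is the subdomain being discussed, not a deployed host, and in practice the rule would live in the load balancer or gateway config rather than a WSGI app:

```python
# Sketch: permanently redirect legacy /ws/ paths on the main host to a
# dedicated API subdomain, so bot filtering can differ per host.
# The target hostname is an assumption from the discussion above.
def redirect_legacy_ws(environ, start_response):
    """Minimal WSGI app: 301 any /ws/ request to the api. subdomain."""
    path = environ.get("PATH_INFO", "")
    query = environ.get("QUERY_STRING", "")
    if path.startswith("/ws/"):
        target = "https://api.musicbrainz.org" + path + ("?" + query if query else "")
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A permanent (301) redirect keeps existing /ws/2/ clients working while documentation catches up.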
      • reosarevok[m]
        But the redirects will still have the same issue as any other MB site page, right?
      • Otherwise, we'd just filter on /ws/ on the url
      • So I guess the idea is that if people get blocked for looking like bots, you tell them to use the api. subdomain?
      • zas[m]
        reosarevok was also asking if using services like Cloudflare would eventually conflict with our policies
      • reosarevok[m]
        I don't know what the privacy implications are. I understand anubis is self-hosted and might be less problematic in that way?
      • But maybe neither is problematic :)
      • zas[m]
        Cloudflare has a program for open source entities; we may ask for it
      • monkey[m]
        Are both of those adapted for API requests? i.e. not just for web pages?
      • reosarevok[m]
        Well in any case we mostly just need them for web pages
      • I understand the WS can take a lot more hits, it's just the site that struggles?
      • julian45[m]
        With e.g., cloudflare we could simply have it take on requests for main MB.org but not api.MB.org
      • Ditto for anubis
      • zas[m]
        API & (human-oriented) web pages need different kinds of protection
      • mayhem[m]
        if we used anubis, we would need to whitelist google crawlers.
      • zas[m]
        Cloudflare has services for both
      • julian45[m]
        Rejecting non-auth'd API requests would probably also help, and there are such things as "API gateways" (krakend is one, I think) that could also help us in this respect
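The "reject non-auth'd API requests" idea julian45 raises can be sketched as a gateway check. This is an editor's illustration only: the bearer-token header shape and the in-memory token set are assumptions, not MusicBrainz's actual auth scheme or KrakenD configuration:

```python
# Hypothetical gateway-side auth check: only let requests through when
# they carry a known bearer token. A real deployment would consult a
# token store, not a hard-coded set.
VALID_TOKENS = {"lb-token-123", "picard-token-456"}  # invented examples

def authorize(headers: dict) -> bool:
    """Return True when the Authorization header carries a known token."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    return auth.removeprefix("Bearer ") in VALID_TOKENS

if __name__ == "__main__":
    print(authorize({"Authorization": "Bearer lb-token-123"}))
    print(authorize({}))
```

The trade-off discussed in the thread applies: this blocks anonymous scraping of the web service, but breaks every existing unauthenticated client, which is why it pairs naturally with a separate API subdomain and a migration period.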
      • outsidecontext[m]
        probably not helpful for our specific issue, but I found this attempt to have AI bots waste their time on pointless pages interesting: https://fosstodon.org/@aaron@zadzmo.org/1138291...
      • zas[m]
        The problem with API gateways for MB API is that we do too much stuff on gateways right now, so we need to review things and it's not really a turnkey solution, but we definitely need to go in this direction
      • julian45[m]
        outsidecontext[m: This is part of the idea of underlying Cloudflare's "AI labyrinth" and, to an extent, anubis (same idea, different method)
      • mayhem[m]
        I guess AI will really destroy everything, but I didn't expect it to go this way.
      • I think the only real hope we have is.... the AI bubble bursting.
      • monkey[m]
        Eek.
      • reosarevok[m]
        We still need to do something until that happens though :)
      • mayhem[m]
        I think we need to check out anubis for the time being.
      • zas[m]
        I don't expect it to happen... nor that people will stop using LLMs or anything like that; we need solutions now
      • mayhem[m]
        would blacklisting all Alibaba Cloud IPs be a good idea?
      • that at least should take care of some of them
      • reosarevok[m]
        I would assume we've done that already
      • Given how many IPs we've blacklisted
      • julian45[m]
        mayhem[m]: Broadly agreed. In parallel, how about also seeing what we could do with Cloudflare given their offerings to FOSS projects?
      • zas[m]
        As explained, blocking IPs will not work in the long term: they use services that provide an effectively unlimited number of dynamic/residential IPs, and when we block them, we end up blocking our legit users
      • reosarevok[m]
        We've had a lot of support messages asking for unblocking so far
      • mayhem[m]
        I'm not suggesting a total fix with that, but a reduction in severity
      • reosarevok[m]
        Of seemingly perfectly legit accounts
      • zas[m]
        It can mitigate things somewhat (I identified 700 IP blocks used in the last 2 days just for this purpose)
      • I blocked 1M IPs, and keep doing it
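CIDR-block-based blocking like zas describes can be sketched with the standard library. The prefixes below are reserved documentation ranges, not the real blocklist; this is an editor's illustration of the mechanism, and of why it only mitigates (residential proxy IPs simply rotate out of any blocked range):

```python
# Sketch of CIDR blocklist matching (example prefixes are RFC 5737
# documentation ranges, not MetaBrainz's actual blocklist).
import ipaddress

BLOCKED = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    """True when the client address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)

if __name__ == "__main__":
    print(is_blocked("192.0.2.77"), is_blocked("203.0.113.9"))
```

One /24 covers 256 addresses, so 700 blocks on this scale reaches into the hundreds of thousands to millions of IPs, consistent with the 1M figure above; the downside, as noted, is collateral blocking of legit users sharing those residential ranges.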
      • monkey[m] wonders if someone is working on an AI scraper IP addresses blocklist FOSS of sorts