#metabrainz

      • ansh[m]
        Reviewed and merged holycow23's PR for introducing a new stat on LB Stats Page.
      • Great work on that!
      • And reviewed some GSoC proposals
      • That’s it for me.
      • mayhem: next?
      • mayhem[m]
        ay!
      • reosarevok[m]
        Still on my list for today: reosarevok, Gautam Shorewala, kellnerd
      • mayhem[m]
        last week was the usual guff running MetaBrainz.
      • but I spent most of my time porting the fast-fuzzy code to C++. while C++ has gotten a lot more modern, I had forgotten how much I hate C++, and what a pain it is to work with.
      • but we need this code to be fast, so I delved in.
      • running the same code in python took 3 minutes.
      • after a week in C++, it's still not working, lol.
      • (I blame nmslib)
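The fast-fuzzy matching mayhem is porting can be illustrated in miniature. This is a hedged editor's sketch using only the standard library's difflib, not the actual fast-fuzzy or nmslib code; the artist names and the `best_fuzzy_match` helper are made up for illustration:

```python
# Generic fuzzy string matching sketch (NOT the real fast-fuzzy code):
# score each candidate against a query with difflib and keep the best.
from difflib import SequenceMatcher

def best_fuzzy_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to query, plus its similarity ratio."""
    def score(candidate: str) -> float:
        # ratio() is 2*M/T where M = matching characters, T = total length
        return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    best = max(candidates, key=score)
    return best, score(best)

if __name__ == "__main__":
    names = ["Portishead", "Porcupine Tree", "Radiohead"]
    print(best_fuzzy_match("portis head", names))
```

This pairwise approach is O(n) per query, which is why an approximate-nearest-neighbour index (the nmslib role) matters once the candidate list is millions of names.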
      • fin. kellnerd go!
      • kellnerd[m] joined the channel
      • kellnerd[m]
        Hey there!
      • Last week I implemented seeding of external links for recordings in Harmony.
      • I mainly did this as a bonus because I wanted to improve Harmony's error messages in case someone looks up a "track" URL instead of a release URL.
      • After I did a release, it turned out that this isn't just a niche feature to give hardcore users RSI syndrome [like I thought](https://github.com/kellnerd/harmony/is...).
      • Apparently many people like it or were even anticipating it (to improve BrainzPlayer's matching accuracy, for example) and the forum thread was flooded by feedback I still have to work through...
      • (I am still busy with university, but I see one more candidate for an MBS ticket in there, the seeded recording form loses ISRCs which were added while the tab was open.)
      • That's it, go reosarevok!
      • reosarevok[m]
        Hi!
      • My main task for the week was schema change tickets, primarily MBS-9253
      • BrainzBot
        MBS-9253: List EP release groups above singles on artist pages https://tickets.metabrainz.org/browse/MBS-9253
      • reosarevok[m]
        I also started MBS-13768
      • BrainzBot
        MBS-13768: Add MBIDs to mediums https://tickets.metabrainz.org/browse/MBS-13768
      • reosarevok[m]
        Other than that, the usual small tickets, releasing prod today, and community work
      • (on second read, that makes it sound like I was picking up trash, doesn't it)
      • suvid[m] joined the channel
      • suvid[m]
        Hi reosarevok
      • Can I also give a review?
      • Thanks
      • reosarevok[m]
        Sure
      • So, suvid go!
      • suvid[m]
        okie
      • reosarevok[m]
      • Still on my list after you: Gautam Shorewala
      • suvid[m]
        So I finished my first draft of GSoC proposal this week and put it on the community forum... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
      • That's it for me :)
      • Gautam Shorewala: you can go!
      • reosarevok[m]
        rustynova: did you want to give a review too btw?
      • Or RustyNova :)
      • Seems neither is around right now - should we have a defence talk, zas?
      • zas[m]
        Yes.
      • rustynova[m]
        Am here
      • reosarevok[m]
        rustynova: ok, do you want to give a melba review? :)
      • zas[m]
        For the past two weeks we have been trying to mitigate web traffic originating from AI web scrapers
      • reosarevok[m]
        zas: let's wait a second :)
      • rustynova[m]
        I've started working on Melba again after being reminded that it existed by the many GSoC applications. I plan to contribute slowly but surely now that it's in a somewhat stable state and not bound by someone else's GSoC.
      • Last week I made three PRs: one to clean up the code of the poller, another to merge a useless crate, and lastly one for wider dependency updates.
      • Planning on checking the tests very soon as some decide to fail if you look at them wrong
      • zas[m]
        ok
      • rustynova[m]
        That's all for me.
      • zas[m]
        :D that was 1 second
      • reosarevok[m]
        Thanks! And thanks for looking into melba :)
      • monkey[m]
        Peachy
      • reosarevok[m]
        zas: let's start again from the beginning on defence :)
      • rustynova[m]
        (I readied my message for when I had the opportunity to check the chat)
      • mayhem[m]
        zas: do you know *which* AI companies are doing this?
      • zas[m]
        So, we get this traffic, and they try hard to evade any rate limit, robots.txt, etc.
      • mayhem: of course not
      • They use rotating residential proxies (basically they do only a few hits a day using one IP) and randomized user agents
      • For us, it meant something like 5 to 10 times the normal traffic on MB website
      • So it translates to up to 10x more servers to handle that, IF it doesn't increase
      • For our users, it meant very degraded response times (for example, from 0.5 seconds to 5 seconds per query)
      • mayhem[m]
        do you have a sample of what the user agent strings look like?
      • zas[m]
        Yes, like thousands of perfectly legit user agent strings
      • mayhem[m]
        copied from other browsers and apps?
      • zas[m]
        They also use techniques to simulate normal browsing
      • mayhem[m]
        here is what I am getting at: I think you should collect as much data as you have to help us identify the bad actors.
      • zas[m]
        (thanks to julian45 for those links)
      • mayhem[m]
        post that on a blog post and ask for help from the general public.
      • zas[m]
        There's no point in "identifying" them... we can't.
      • Some rotational proxy services claim to have 100M IPs
      • So, we need "something" to limit their impact, but I doubt we are able to totally get rid of it
      • mayhem[m]
        do we need a "making sure you're not a bot" page, at least for now?
      • I wonder if we can somehow poison the data we send to them.
      • zas[m]
        Yes, I think so. Because it can get worse very soon
      • julian45[m]
        possibly yeah, which is where cloudflare could help, or running something like [anubis](https://anubis.techaro.lol/), which you can observe in, e.g., gitlab.gnome.org
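The idea behind a tool like anubis is to make each page view cost the client a small proof-of-work computation: negligible for one human's browser, expensive at crawler scale. This is an editor's sketch of that general concept only, not anubis's actual algorithm; the challenge string and difficulty are invented:

```python
# Proof-of-work sketch (generic idea, NOT Anubis's real implementation):
# client must find a nonce whose SHA-256 digest starts with N zero hex
# digits before the server hands over the page.
import hashlib

DIFFICULTY = 3  # leading zero hex digits required (~4096 hashes on average)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce satisfying the difficulty target."""
    nonce = 0
    while not hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY):
        nonce += 1
    return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

if __name__ == "__main__":
    n = solve("mb-challenge-42")
    print(n, verify("mb-challenge-42", n))
```

The asymmetry is the point: verification is one hash, solving is thousands, and the cost multiplies across the millions of requests a scraper fleet makes.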
      • mayhem[m]
        if we could identify them, we could send bad data.
      • julian45[m]
        mayhem[m]: Re: poisoning, CloudFlare had a product announcement about this recently, one sec
      • > Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.
      • > AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.
      • mayhem[m]
        AI vs AI. <sob> no lol, just <sob>
      • zas[m]
        Also, it would be useful to separate MB website & web service more (using a separate domain), in order to improve filtering (we expect mostly humans on the website, and mostly robots on the web service)
      • yvanzo[m]
        I will look into the captcha stuff for MB again.
      • julian45[m]
        In favor of zas' proposal of separating web service into a subdomain as well
      • mayhem[m]
        do you have a sample of what their requests look like?
      • julian45[m]
        Varies all the time. Usually there's a hit to set-language and then a redirect to a release or something, but that's not always the case. user-agent (advertisement of what browser/client the endpoint is using) also varies a lot
      • reosarevok[m]
        bitmap, yvanzo: is there any reason we should *not* have a subdomain for the API?
      • zas[m]
        Most advanced bots are now able to mimic human traffic 100% (including passing most captchas), so we also need to work on performance improvements and scalability
      • monkey[m]
        Time for auth tokens for all API requests ? 😢
      • mayhem[m]
        reosarevok[m]: I arranged this eaaaarly on. and I was ridiculed for it and it was taken out.
      • outsidecontext[m]
        I find it really strange what effort the AI companies put into this, when for MB they could just run a mirror and get all the data fast and up-to-date
      • mayhem[m]
        just saying. <lolsob>
      • yvanzo[m]
        reosarevok: Just backwards compatibility. This isn’t an easily actionable change either.
      • zas[m]
        We already have ws.musicbrainz.org/ws/2/ btw
      • mayhem[m]
        or download the JSON data, and ingest it!
      • monkey[m]
        outsidecontext[m]: I think this is more about a generic crawling solution that will work for multiple targets, otherwise they would download a copy of the data...
      • mayhem[m]
        LB uses api.lb.org so, let's use api
      • zas[m]
        outsidecontext: yes, I totally agree, but it seems AI-related scrapers aren't that intelligent, they just suck data from the web
      • reosarevok[m]
        s/data from the web//
      • mayhem[m]
        if your whole infra is setup for that, it makes sense. google works the same way.
      • zas[m]
      • mayhem[m]
        great
      • zas[m]
        We need to update documentation, and likely add proper redirects
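The redirect zas mentions could look like this in miniature. This is a hedged editor's sketch only: `api.musicbrainz.org` is the subdomain being discussed, not a deployed host, and in practice the rule would live in the load balancer or gateway config rather than a WSGI app:

```python
# Sketch: permanently redirect legacy /ws/ paths on the main host to a
# dedicated API subdomain, so bot filtering can differ per host.
# The target hostname is an assumption from the discussion above.
def redirect_legacy_ws(environ, start_response):
    """Minimal WSGI app: 301 any /ws/ request to the api. subdomain."""
    path = environ.get("PATH_INFO", "")
    query = environ.get("QUERY_STRING", "")
    if path.startswith("/ws/"):
        target = "https://api.musicbrainz.org" + path + ("?" + query if query else "")
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

A permanent (301) redirect keeps existing /ws/2/ clients working while documentation catches up.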
      • reosarevok[m]
        But the redirects will still have the same issue as any other MB site page, right?
      • Otherwise, we'd just filter on /ws/ on the url
      • So I guess the idea is that if people get blocked for looking like bots, you tell them to use the api. subdomain?
      • zas[m]
        reosarevok was also asking if using services like Cloudflare would eventually conflict with our policies
      • reosarevok[m]
        I don't know what the privacy implications are. I understand anubis is self-hosted and might be less problematic in that way?
      • But maybe neither is problematic :)
      • zas[m]
        Cloudflare has a program for open source entities; we may ask for it
      • monkey[m]
        Are both of those adapted for API requests? i.e. not just for web pages?
      • reosarevok[m]
        Well in any case we mostly just need them for web pages
      • I understand the WS can take a lot more hits, it's just the site that struggles?
      • julian45[m]
        With e.g., cloudflare we could simply have it take on requests for main MB.org but not api.MB.org
      • Ditto for anubis
      • zas[m]
        API & (human-oriented) web pages need different kinds of protection
      • mayhem[m]
        if we used anubis, we would need to whitelist google crawlers.
      • zas[m]
        Cloudflare has services for both
      • julian45[m]
        Rejecting non-auth'd API requests would probably also help, and there are such things as "API gateways" (krakend is one, I think) that could also help us in this respect
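The "reject non-auth'd API requests" idea julian45 raises can be sketched as a gateway check. This is an editor's illustration only: the bearer-token header shape and the in-memory token set are assumptions, not MusicBrainz's actual auth scheme or KrakenD configuration:

```python
# Hypothetical gateway-side auth check: only let requests through when
# they carry a known bearer token. A real deployment would consult a
# token store, not a hard-coded set.
VALID_TOKENS = {"lb-token-123", "picard-token-456"}  # invented examples

def authorize(headers: dict) -> bool:
    """Return True when the Authorization header carries a known token."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return False
    return auth.removeprefix("Bearer ") in VALID_TOKENS

if __name__ == "__main__":
    print(authorize({"Authorization": "Bearer lb-token-123"}))
    print(authorize({}))
```

The trade-off discussed in the thread applies: this blocks anonymous scraping of the web service, but breaks every existing unauthenticated client, which is why it pairs naturally with a separate API subdomain and a migration period.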
      • outsidecontext[m]
        probably not helpful for our specific issue, but I found this attempt to have AI bots waste their time on pointless pages interesting: https://fosstodon.org/@aaron@zadzmo.org/1138291...
      • zas[m]
        The problem with API gateways for MB API is that we do too much stuff on gateways right now, so we need to review things and it's not really a turnkey solution, but we definitely need to go in this direction
      • julian45[m]
        outsidecontext[m: This is part of the idea of underlying Cloudflare's "AI labyrinth" and, to an extent, anubis (same idea, different method)
      • mayhem[m]
        I guess AI will really destroy everything, but I didn't expect it to go this way.
      • I think the only real hope we have is.... the AI bubble bursting.
      • monkey[m]
        Eek.
      • reosarevok[m]
        We still need to do something until that happens though :)
      • mayhem[m]
        I think we need to check out anubis for the time being.
      • zas[m]
        I don't expect it to happen... nor that people will stop using LLMs or anything like that; we need solutions now
      • mayhem[m]
        would blacklisting all Alibaba Cloud IPs be a good idea?
      • that at least should take care of some of them
      • reosarevok[m]
        I would assume we've done that already
      • Given how many IPs we've blacklisted
      • julian45[m]
        mayhem[m]: Broadly agreed. In parallel, how about also seeing what we could do with Cloudflare given their offerings to FOSS projects?
      • zas[m]
        As explained, blocking IPs will not work in the long term: they use services that provide an effectively unlimited number of dynamic/residential IPs, and when we block them, we end up blocking our legit users
      • reosarevok[m]
        We've had a lot of support messages asking for unblocking so far
      • mayhem[m]
        I'm not suggesting a total fix with that, but a reduction in severity
      • reosarevok[m]
        Of seemingly perfectly legit accounts
      • zas[m]
        It can mitigate things somewhat (I identified 700 IP blocks used in the last 2 days just for this purpose)
      • I blocked 1M IPs, and keep doing it
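CIDR-block-based blocking like zas describes can be sketched with the standard library. The prefixes below are reserved documentation ranges, not the real blocklist; this is an editor's illustration of the mechanism, and of why it only mitigates (residential proxy IPs simply rotate out of any blocked range):

```python
# Sketch of CIDR blocklist matching (example prefixes are RFC 5737
# documentation ranges, not MetaBrainz's actual blocklist).
import ipaddress

BLOCKED = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_blocked(ip: str) -> bool:
    """True when the client address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)

if __name__ == "__main__":
    print(is_blocked("192.0.2.77"), is_blocked("203.0.113.9"))
```

One /24 covers 256 addresses, so 700 blocks on this scale reaches into the hundreds of thousands to millions of IPs, consistent with the 1M figure above; the downside, as noted, is collateral blocking of legit users sharing those residential ranges.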
      • monkey[m] wonders if someone is working on an AI scraper IP addresses blocklist FOSS of sorts