Reviewed and merged holycow23's PR for introducing a new stat on LB Stats Page.
2025-03-24 08306, 2025
ansh[m]
Great work on that!
2025-03-24 08327, 2025
ansh[m]
And reviewed some GSoC proposals
2025-03-24 08331, 2025
ansh[m]
That’s it for me.
2025-03-24 08343, 2025
ansh[m]
mayhem: next?
2025-03-24 08350, 2025
mayhem[m]
ay!
2025-03-24 08353, 2025
reosarevok[m]
Still on my list for today: reosarevok, Gautam Shorewala, kellnerd
2025-03-24 08305, 2025
mayhem[m]
last week was the usual guff running MetaBrainz.
2025-03-24 08349, 2025
mayhem[m]
but I spent most of my time porting the fast-fuzzy code to C++. while C++ has gotten a lot more modern, I had forgotten how much I hate C++. and what a pain it is to work with.
2025-03-24 08358, 2025
mayhem[m]
but we need this code to be fast, so I delved in.
2025-03-24 08309, 2025
mayhem[m]
running the same code in python took 3 minutes.
2025-03-24 08321, 2025
mayhem[m]
after a week in C++, it's still not working, lol.
2025-03-24 08328, 2025
mayhem[m]
(I blame nmslib)
2025-03-24 08337, 2025
mayhem[m]
fin. kellnerd go!
2025-03-24 08350, 2025
kellnerd[m] joined the channel
2025-03-24 08350, 2025
kellnerd[m]
Hey there!
2025-03-24 08308, 2025
kellnerd[m]
Last week I implemented seeding of external links for recordings in Harmony.
2025-03-24 08323, 2025
kellnerd[m]
I mainly did this as a bonus because I wanted to improve Harmony's error messages in case someone looks up a "track" URL instead of a release URL.
Apparently many people like it or were even anticipating it (to improve BrainzPlayer's matching accuracy, for example) and the forum thread was flooded by feedback I still have to work through...
2025-03-24 08352, 2025
kellnerd[m]
(I am still busy with university, but I see one more candidate for an MBS ticket in there: the seeded recording form loses ISRCs which were added while the tab was open.)
2025-03-24 08318, 2025
kellnerd[m]
That's it, go reosarevok!
2025-03-24 08328, 2025
reosarevok[m]
Hi!
2025-03-24 08349, 2025
reosarevok[m]
My main task for the week was schema change tickets, primarily MBS-9253
Other than that, the usual small tickets, releasing prod today, and community work
2025-03-24 08356, 2025
reosarevok[m]
(on second read, that makes it sound like I was picking up trash, doesn't it)
2025-03-24 08322, 2025
suvid[m] joined the channel
2025-03-24 08323, 2025
suvid[m]
Hi reosarevok
2025-03-24 08323, 2025
suvid[m]
Can I also give a review?
2025-03-24 08323, 2025
suvid[m]
Thanks
2025-03-24 08333, 2025
reosarevok[m]
Sure
2025-03-24 08347, 2025
reosarevok[m]
So, suvid go!
2025-03-24 08355, 2025
suvid[m]
okie
2025-03-24 08358, 2025
reosarevok[m]
Still in my list after you: Gautam Shorewala
2025-03-24 08325, 2025
suvid[m]
So I finished my first draft of GSoC proposal this week and put it on the community forum... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/RJGkBUrXmrpaIqCqUWskcptr>)
2025-03-24 08341, 2025
suvid[m]
That's it for me :)
2025-03-24 08352, 2025
suvid[m]
Gautam Shorewala: you can go!
2025-03-24 08313, 2025
reosarevok[m]
rustynova: did you want to give a review too btw?
2025-03-24 08333, 2025
reosarevok[m]
Or RustyNova :)
2025-03-24 08344, 2025
reosarevok[m]
Seems neither is around right now - should we have a defence talk, zas?
2025-03-24 08356, 2025
zas[m]
Yes.
2025-03-24 08358, 2025
rustynova[m]
Am here
2025-03-24 08317, 2025
reosarevok[m]
rustynova: ok, do you want to give a melba review? :)
2025-03-24 08325, 2025
zas[m]
For the past 2 weeks we have been trying to mitigate web traffic originating from AI web scrapers
2025-03-24 08338, 2025
reosarevok[m]
zas: let's wait a second :)
2025-03-24 08346, 2025
rustynova[m]
I've started working on Melba again after getting reminded that it existed by the many GSoC applications. I plan to contribute slowly but surely now that it's in a somewhat stable state and not bound by someone else's GSoC.
2025-03-24 08346, 2025
rustynova[m]
Last week I made 3 PRs: one to clean up the code of the poller, another to merge a useless crate, and lastly one for wider dependency updates.
2025-03-24 08346, 2025
rustynova[m]
Planning on checking the tests very soon as some decide to fail if you look at them wrong
2025-03-24 08348, 2025
zas[m]
ok
2025-03-24 08309, 2025
rustynova[m]
That's all for me.
2025-03-24 08318, 2025
zas[m]
:D that was 1 second
2025-03-24 08323, 2025
reosarevok[m]
Thanks! And thanks for looking into melba :)
2025-03-24 08347, 2025
monkey[m]
Peachy
2025-03-24 08354, 2025
reosarevok[m]
zas: let's start again from the beginning on defence :)
2025-03-24 08316, 2025
rustynova[m]
(I readied my message for when I had the opportunity to check the chat)
2025-03-24 08327, 2025
mayhem[m]
zas: do you know *which* AI companies are doing this?
2025-03-24 08347, 2025
zas[m]
So, we get this traffic, and they try hard to evade any rate limit, robots.txt etc..
2025-03-24 08353, 2025
zas[m]
mayhem: of course not
2025-03-24 08338, 2025
zas[m]
They use rotational residential proxies (basically they make only a few hits a day from any one IP) and randomized User Agents
2025-03-24 08306, 2025
zas[m]
For us, it meant something like 5 to 10 times the normal traffic on MB website
2025-03-24 08331, 2025
zas[m]
So it translates to up to 10x more servers to handle that, IF it doesn't increase
2025-03-24 08315, 2025
zas[m]
For our users, it meant very degraded response times (for example, from 0.5 to 5 seconds per query)
2025-03-24 08338, 2025
mayhem[m]
do you have a sample of what the user agent strings look like?
2025-03-24 08305, 2025
zas[m]
Yes, like thousands of perfectly legit user agent strings
2025-03-24 08319, 2025
mayhem[m]
copied from other browsers and apps?
2025-03-24 08326, 2025
zas[m]
They also use techniques to simulate normal browsing
here is what I am getting at: I think you should collect as much data as you have to help us identify the bad actors.
2025-03-24 08306, 2025
zas[m]
(thanks to julian45 for those links)
2025-03-24 08321, 2025
mayhem[m]
post that on a blog post and ask for help from the general public.
2025-03-24 08331, 2025
zas[m]
There's no point in "identifying" them... we can't.
2025-03-24 08305, 2025
zas[m]
Some rotational proxy services claim to have 100M IPs
2025-03-24 08354, 2025
zas[m]
So, we need "something" to limit their impact, but I doubt we are able to totally get rid of it
2025-03-24 08304, 2025
mayhem[m]
do we need a "making sure you're not a bot" page, at least for now?
2025-03-24 08329, 2025
mayhem[m]
I wonder if we can somehow poison the data we send to them.
2025-03-24 08337, 2025
zas[m]
Yes, I think so. Because it can get worse very soon
2025-03-24 08337, 2025
julian45[m]
possibly yeah, which is where cloudflare could help, or running something like [anubis](https://anubis.techaro.lol/), which you can observe in, e.g., gitlab.gnome.org
2025-03-24 08343, 2025
mayhem[m]
if we could identify them, we could send bad data.
2025-03-24 08352, 2025
julian45[m]
mayhem[m]: Re: poisoning, CloudFlare had a product announcement about this recently, one sec
> Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.
2025-03-24 08341, 2025
julian45[m]
> AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.
2025-03-24 08301, 2025
mayhem[m]
AI vs AI. <sob> no lol, just <sob>
2025-03-24 08303, 2025
zas[m]
Also, it would be useful to separate MB website & web service more (using a separate domain), in order to improve filtering (we expect mostly humans on the website, and mostly robots on the web service)
2025-03-24 08313, 2025
yvanzo[m]
I will look into the captcha stuff for MB again.
2025-03-24 08343, 2025
julian45[m]
In favor of zas' proposal of separating web service into a subdomain
2025-03-24 08345, 2025
julian45[m]
* a subdomain as well
2025-03-24 08348, 2025
mayhem[m]
do you have a sample of what their requests look like?
2025-03-24 08327, 2025
julian45[m]
Varies all the time. Usually there's a hit to set-language and then a redirect to a release or something, but that's not always the case. user-agent (advertisement of what browser/client the endpoint is using) also varies a lot
2025-03-24 08328, 2025
reosarevok[m]
bitmap, yvanzo: is there any reason we should *not* have a subdomain for the API?
2025-03-24 08346, 2025
zas[m]
The most advanced bots are now able to mimic human traffic 100% (including most captchas), so we also need to work on performance improvements and scalability
2025-03-24 08300, 2025
monkey[m]
Time for auth tokens for all API requests ? 😢
2025-03-24 08303, 2025
mayhem[m]
reosarevok[m]: I arranged this eaaaarly on. and I was ridiculed for it and it was taken out.
2025-03-24 08313, 2025
outsidecontext[m
I find it really strange how much effort the AI companies put into this, when for MB they could just run a mirror and get all the data fast and up-to-date
2025-03-24 08315, 2025
mayhem[m]
just saying. <lolsob>
2025-03-24 08335, 2025
yvanzo[m]
reosarevok: Just backwards compatibility. This isn’t an easily actionable change either.
outsidecontext[m: I think this is more about a generic crawling solution that will work for multiple targets, otherwise they would download a copy of the data...
We need to update documentation, and likely add proper redirects
2025-03-24 08355, 2025
reosarevok[m]
But the redirects will still have the same issue as any other MB site page, right?
2025-03-24 08306, 2025
reosarevok[m]
Otherwise, we'd just filter on /ws/ on the url
2025-03-24 08326, 2025
reosarevok[m]
So I guess the idea is if people get blocked for looking like bots you tell them to use the api. subdomain or?
2025-03-24 08353, 2025
zas[m]
reosarevok was also asking if using services like Cloudflare might conflict with our policies
2025-03-24 08324, 2025
reosarevok[m]
I don't know what the privacy implications are. I understand anubis is self-hosted and might be less problematic in that way?
2025-03-24 08334, 2025
reosarevok[m]
But maybe neither is problematic :)
2025-03-24 08343, 2025
zas[m]
Cloudflare has a program for open source entities; we could apply for it
2025-03-24 08334, 2025
monkey[m]
Are both of those adapted for API requests? i.e. not just for web pages?
2025-03-24 08355, 2025
reosarevok[m]
Well in any case we mostly just need them for web pages
2025-03-24 08311, 2025
reosarevok[m]
I understand the WS can take a lot more hits, it's just the site that struggles?
2025-03-24 08315, 2025
julian45[m]
With e.g., cloudflare we could simply have it take on requests for main MB.org but not api.MB.org
2025-03-24 08322, 2025
julian45[m]
Ditto for anubis
2025-03-24 08324, 2025
zas[m]
API & (human-oriented) web pages need different kinds of protection
2025-03-24 08328, 2025
mayhem[m]
if we used anubis, we would need to whitelist google crawlers.
2025-03-24 08346, 2025
zas[m]
Cloudflare are services for both
2025-03-24 08357, 2025
zas[m]
s/are/has/
2025-03-24 08309, 2025
julian45[m]
Rejecting non-auth'd API requests would probably also help, and there are such things as "API gateways" (krakend is one, I think) that could also help us in this respect
The problem with API gateways for MB API is that we do too much stuff on gateways right now, so we need to review things and it's not really a turnkey solution, but we definitely need to go in this direction
2025-03-24 08324, 2025
julian45[m]
outsidecontext[m: This is part of the idea underlying Cloudflare's "AI labyrinth" and, to an extent, anubis (same idea, different method)
2025-03-24 08340, 2025
mayhem[m]
I guess AI will really destroy everything, but I didn't expect it to go this way.
2025-03-24 08302, 2025
mayhem[m]
I think the only real hope we have is.... the AI bubble bursting.
2025-03-24 08310, 2025
monkey[m]
Eek.
2025-03-24 08324, 2025
reosarevok[m]
We still need to do something until that happens though :)
2025-03-24 08336, 2025
mayhem[m]
I think we need to check out anubis for the time being.
2025-03-24 08354, 2025
zas[m]
I don't expect it to happen... nor that people will stop using LLMs or anything like that. We need solutions now
2025-03-24 08302, 2025
mayhem[m]
would blacklisting all ali cloud IPs be a good idea?
2025-03-24 08310, 2025
mayhem[m]
that at least should take care of some of them
2025-03-24 08326, 2025
reosarevok[m]
I would assume we've done that already
2025-03-24 08334, 2025
reosarevok[m]
Given how many IPs we've blacklisted
2025-03-24 08334, 2025
julian45[m]
mayhem[m]: Broadly agreed. In parallel, how about also seeing what we could do with Cloudflare given their offerings to FOSS projects?
2025-03-24 08351, 2025
zas[m]
As explained, blocking IPs will not work in the long term: they use services that provide an effectively infinite number of dynamic/residential IPs, and when we block them, we end up blocking our legit users
2025-03-24 08314, 2025
reosarevok[m]
We've had a lot of support messages asking for unblocking so far
2025-03-24 08318, 2025
mayhem[m]
I'm not suggesting a total fix with that, but a reduction in severity
2025-03-24 08323, 2025
reosarevok[m]
Of seemingly perfectly legit accounts
2025-03-24 08328, 2025
zas[m]
It can mitigate things somewhat (I identified 700 IP blocks used in the last 2 days just for this purpose)
2025-03-24 08348, 2025
zas[m]
I blocked 1M IPs, and keep doing it
2025-03-24 08305, 2025
monkey[m] wonders if someone is working on a FOSS blocklist of AI scraper IP addresses of sorts