TOPIC: MusicBrainz Community | See #metabrainz for development and the other *Brainz’s | Channel is logged; see https://musicbrainz.org/doc/IRC for details | Latest Release: https://blog.metabrainz.org/?p=8719 | Picard 2.6 Beta 2 released! https://picard.musicbrainz.org
i've been meaning to join for a while now, but never found the time for it :P
anyways, I've just finished some talks with ArchiveTeam, they're going to queue up all URLs from the latest MB dump later today into one of their projects, and set up a live data feed to archive new URLs as they are added
those will eventually be injected into the wayback machine
MBS-9009: Every time a Homepage/Blog/Discography/Biography URL is submitted to MB, it should also be submitted to the Wayback Machine
reosarevok
Neat!
ROpdebee
edit notes aren't included in the live data feed though, so those won't be archived automatically (yet). reosarevok: Is there any way we could get those in a feed too?
RikkoM joined the channel
rxrog has quit
reosarevok
ROpdebee: probably not, since they're not meant to be public-public (they require login)
pretty happy about this, for like when things get removed from beatport or bandcamp
useful to have an archive to be able to refer back to
musicfan joined the channel
musicfan
I am attempting to add a band named Soraia (https://www.soraia.com/). There's already another unrelated entry with that name, so I'm attempting to disambiguate, but the disambiguation field remains red no matter what, which does not allow me to submit. Is this a known issue or am I doing something wrong?
SirPHOENiX17 joined the channel
SirPHOENiX1 has quit
SirPHOENiX17 is now known as SirPHOENiX1
finalsummer
whatgear looks like a SEO spam site and should be blacklisted. equipboard looks legitimate and has actual user contributions, whatgear looks like just outdated scraped(?) info from the former
Yeah. Added a ton of “official home page” links which clearly aren’t. When called out on one, said, “Oops!”
User reported.
CatQuest
[14:46] <ROpdebee> anyways, I've just finished some talks with ArchiveTeam, they're going to queue up all URLs from the latest MB dump later today into one of their projects, and set up a live data feed to archive new URLs as they are added
wait so urls in the edit notes or what?
because omg
BUT url entities automatically being put in the IA?
👏 🎉 👏
ROpdebee: whooooo
[14:47] <ROpdebee> those will eventually be injected into the wayback machine
I'm very excited about this!!!!
(even if it's not edit note urls today)
musicfan: hi! can you give a screenshot of how you're doing it?
ROpdebee
Well, currently most of the URL entities in the 2021-03-13 mbdump have been grabbed by archiveteam. they'll be uploaded to IA and injected into the wayback machine sometime soon. new or updated URL entities should be grabbed via the live data feed, I've been told that'll be set up later today
as for edit notes and annotations, those will likely be done every three days with new data from the dumps, but i'm still working on extracting the URLs as i'd like to replicate the way the MB server does it to make sure we're parsing them consistently
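(A minimal sketch of the kind of URL extraction described above. This is NOT how the MB server parses edit notes or annotations; the regex, the trailing-punctuation rule, and the function name `extract_urls` are all assumptions made for illustration.)

```python
import re

# Hypothetical pattern, not the MusicBrainz server's own parser:
# match http(s) URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'\]\[]+")

# Characters that usually belong to the surrounding sentence, not the URL.
# Caveat: stripping ")" will truncate URLs that legitimately end in one.
TRAILING_PUNCTUATION = ".,;:!?)'\""


def extract_urls(text):
    """Return the distinct URLs found in free text, in order of first appearance."""
    urls = []
    seen = set()
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

(Whatever rules end up being used, running the same function over both edit notes and annotations keeps the parsing consistent between the two.)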
CatQuest
like reo said I'm not sure edit notes are possible :/ but annotation ones should be
but this is *such* a help! so many old urls are gone and they were the proof or input for many releases. some releases you can't even find any more
I wonder 🤔 would it be possible to get a simple list of urls that no longer resolve + weren't already in the ia?
or rather maybe not a list but to see the number of them
ROpdebee
probably not, the effort that would take would be equivalent to the actual archival
CatQuest
hah, alright
ROpdebee
or maybe a bit less, but would still take as many requests (>6M for URL entities alone)
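(Rough arithmetic for why that many requests is a problem. The request rate is an assumed figure for illustration, not something ArchiveTeam or MB has stated, and `estimated_days` is a hypothetical helper.)

```python
SECONDS_PER_DAY = 86_400


def estimated_days(request_count, requests_per_second):
    """Days needed to issue request_count requests at a steady rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return request_count / requests_per_second / SECONDS_PER_DAY


# Assumed rate of 10 requests per second, for 6 million URL entities.
days = estimated_days(6_000_000, 10)
print(f"about {days:.1f} days")
```

(At that assumed rate, a single pass over the URL entities alone takes roughly a week of continuous requests.)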
CatQuest
yea not viable
anyway this is excellent news! this exact thing is something I've always worried about. it should have been in function ages ago <3
but from now on new urls should be caught so no *new* urls are "lost"
oh. btw maybe you should also do this with BookBrainz.org
still fairly undeveloped but should also eventually be able to link to all kinds of things
would be great to have the url archiving from the get-go
ROpdebee
what could be useful though, is a periodic dump of all recently "used" URLs on MB
CatQuest
used how?
entered?
ROpdebee
say once a day a file is uploaded to some FTP with URLs that have been entered into edit notes, annotations, URL entities which have been edited (or their ARs added/removed/edited)
CatQuest
hm
reosarevok: ?
ROpdebee
that file could be injected into AT's queue immediately with little effort, and you'd get a snapshot of the URLs in the exact state as when they were used
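(A minimal sketch of how such a daily file could be produced, assuming two snapshots of the "used" URLs are available as plain lists. The function names and the one-URL-per-line format are assumptions; no such MB feed exists in this form.)

```python
def newly_used_urls(previous_urls, current_urls):
    """Return URLs in the current snapshot that were absent from the previous one."""
    return sorted(set(current_urls) - set(previous_urls))


def format_feed(urls):
    """Render URLs one per line, the (assumed) format for the uploaded file."""
    return "".join(f"{url}\n" for url in urls)


# Example with made-up snapshots from two consecutive days.
yesterday = ["https://a.example/"]
today = ["https://a.example/", "https://b.example/", "https://b.example/"]
print(format_feed(newly_used_urls(yesterday, today)), end="")
```

(Only the difference between the two snapshots goes into the file, so the queue receives each newly used URL once rather than the whole list every day.)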
CatQuest
reosarevok: how hard would it be to create a dump of the urls entered into edit notes?
i personally don't know the data (just a long standing editor+BBstylecat and MB Instrument Inserter) you'll wanna talk to reo, yvanzo, zas, etc
ROpdebee
yeah i'm just throwing out ideas, as it stands we'd have to download and process fairly large DB dumps to get just a couple thousand new links every three days
to be clear though, we can still get the urls in edit notes from one of the dump files
reosarevok
bitmap: ^ does this seem like something that could be done kinda like with the json dumps?
RikkoM has quit
ROpdebee
also, couple of caveats: The project it's being inserted into currently doesn't archive page requisites (images, css, js, etc) but it's better than nothing I guess
and Spotify links probably aren't useful, since Spotify loads its data dynamically from the API, and those responses aren't captured either. so you'll just get broken pages :( I told them about this, but it's a wontfix situation