TOPIC: MusicBrainz Community | See #metabrainz for development and the other *Brainz’s | Channel is logged; see https://musicbrainz.org/doc/IRC for details | Latest Release: https://blog.metabrainz.org/?p=8719 | Picard 2.6 Beta 2 released! https://picard.musicbrainz.org
i've been meaning to join for a while now, but never found the time for it :P
2021-03-16 07547, 2021
ROpdebee
anyways, I've just finished some talks with ArchiveTeam, they're going to queue up all URLs from the latest MB dump later today into one of their projects, and set up a live data feed to archive new URLs as they are added
2021-03-16 07502, 2021
ROpdebee
those will eventually be injected into the wayback machine
MBS-9009: Every time a Homepage/Blog/Discography/Biography URL is submitted to MB, it should also be submitted to the Wayback Machine
2021-03-16 07511, 2021
reosarevok
Neat!
2021-03-16 07544, 2021
ROpdebee
edit notes aren't included in the live data feed though, so those won't be archived automatically (yet). reosarevok: Is there any way we could get those in a feed too?
2021-03-16 07521, 2021
RikkoM joined the channel
2021-03-16 07508, 2021
rxrog has quit
2021-03-16 07515, 2021
reosarevok
ROpdebee: probably not, since they're not meant to be public-public (they require login)
pretty happy about this, for like when things get removed from beatport or bandcamp
2021-03-16 07545, 2021
darwin
useful to have an archive to be able to refer back to
2021-03-16 07530, 2021
musicfan joined the channel
2021-03-16 07513, 2021
musicfan
I am attempting to add a band named Soraia (https://www.soraia.com/). There's already another unrelated entry with that name, so I'm attempting to disambiguate, but the disambiguation field remains red no matter what, which does not allow me to submit. Is this a known issue or am I doing something wrong?
2021-03-16 07543, 2021
SirPHOENiX17 joined the channel
2021-03-16 07543, 2021
SirPHOENiX1 has quit
2021-03-16 07544, 2021
SirPHOENiX17 is now known as SirPHOENiX1
2021-03-16 07515, 2021
finalsummer
whatgear looks like an SEO spam site and should be blacklisted. equipboard looks legitimate and has actual user contributions, whatgear looks like just outdated scraped(?) info from the former
Yeah. Added a ton of “official home page” links which clearly aren’t. When called out on one, said, “Oops!”
2021-03-16 07532, 2021
crism
User reported.
2021-03-16 07503, 2021
CatQuest
[14:46] <ROpdebee> anyways, I've just finished some talks with ArchiveTeam, they're going to queue up all URLs from the latest MB dump later today into one of their projects, and set up a live data feed to archive new URLs as they are added
2021-03-16 07503, 2021
CatQuest
wait so urls in the edit notes or what?
2021-03-16 07509, 2021
CatQuest
because omg
2021-03-16 07549, 2021
CatQuest
BUT url entities automatically being put in the IA?
2021-03-16 07558, 2021
CatQuest
👏 🎉 👏
2021-03-16 07557, 2021
CatQuest
ROpdebee: whooooo
2021-03-16 07509, 2021
CatQuest
[14:47] <ROpdebee> those will eventually be injected into the wayback machine
2021-03-16 07509, 2021
CatQuest
I'm very excited about this!!!!
2021-03-16 07531, 2021
CatQuest
(even if it's not edit note urls today)
2021-03-16 07530, 2021
CatQuest
musicfan: hi! can you give a screenshot of how you're doing it?
2021-03-16 07553, 2021
ROpdebee
Well, currently most of the URL entities in the 2021-03-13 mbdump have been grabbed by archiveteam. they'll be uploaded to IA and injected into the wayback machine sometime soon. new or updated URL entities should be grabbed via the live data feed, I've been told that'll be set up later today
2021-03-16 07518, 2021
ROpdebee
as for edit notes and annotations, those will likely be done every three days with new data from the dumps, but i'm still working on extracting the URLs as i'd like to replicate the way the MB server does it to make sure we're parsing them consistently
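A minimal sketch of the extraction step described above, pulling http(s) URLs out of free text such as an edit note or annotation. The pattern and the `extract_urls` helper are assumptions made for illustration; this does not replicate how the MB server parses URLs, which is exactly the part still being worked on.

```python
import re

# Assumption: a deliberately simple pattern for http(s) URLs. This is NOT the
# MusicBrainz server's own parsing logic.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Characters that usually belong to the surrounding sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text):
    """Return the URLs found in text, in order of appearance, without repeats."""
    urls = []
    seen = set()
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


note = "See https://www.soraia.com/ and (https://musicbrainz.org/doc/IRC)."
print(extract_urls(note))
```

Stripping trailing punctuation is a trade-off: it cleans up URLs written at the end of a sentence, but it also truncates URLs that legitimately end in a closing parenthesis, so a real implementation would need a more careful rule.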
2021-03-16 07511, 2021
CatQuest
like reo said I'm not sure edit notes are possible :/ but annotation ones should be
2021-03-16 07519, 2021
CatQuest
but this is *such* a help! so many old urls are gone and they were the proof or input for many releases. some releases you can't even find any more
2021-03-16 07506, 2021
CatQuest
I wonder 🤔 would it be possible to get a simple list of urls that no longer resolve + weren't already in the ia?
2021-03-16 07528, 2021
CatQuest
or rather maybe not a list but to see the number of them
2021-03-16 07500, 2021
ROpdebee
probably not, the effort that would take would be equivalent to the actual archival
2021-03-16 07512, 2021
CatQuest
hah, alright
2021-03-16 07528, 2021
ROpdebee
or maybe a bit less, but would still take as many requests (>6M for URL entities alone)
2021-03-16 07540, 2021
CatQuest
yea not viable
2021-03-16 07510, 2021
CatQuest
anyway this is excellent news! this exact thing is something I've always worried about. it should have been in function ages ago <3
2021-03-16 07539, 2021
CatQuest
but from now on new urls should be caught so no *new* urls are "lost"
2021-03-16 07529, 2021
CatQuest
oh. btw maybe you should also do this with BookBrainz.org
2021-03-16 07546, 2021
CatQuest
still fairly undeveloped but should also eventually be able to link to all kinds of things
2021-03-16 07506, 2021
CatQuest
would be great to have the url archiving from the get-go
2021-03-16 07510, 2021
ROpdebee
what could be useful though, is a periodic dump of all recently "used" URLs on MB
2021-03-16 07520, 2021
CatQuest
used how?
2021-03-16 07539, 2021
CatQuest
entered?
2021-03-16 07548, 2021
ROpdebee
say once a day a file is uploaded to some FTP with URLs that have been entered into edit notes, annotations, URL entities which have been edited (or their ARs added/removed/edited)
2021-03-16 07529, 2021
CatQuest
hm
2021-03-16 07533, 2021
CatQuest
reosarevok: ?
2021-03-16 07556, 2021
ROpdebee
that file could be injected into AT's queue immediately with little effort, and you'd get a snapshot of the URLs in the exact state as when they were used
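A rough sketch of what assembling such a daily file could look like, assuming the URLs from each source have already been collected into lists. The function names, the file naming scheme and the one-URL-per-line layout are all hypothetical; nothing here reflects an existing MusicBrainz or ArchiveTeam format.

```python
from datetime import date


def build_daily_url_list(edit_note_urls, annotation_urls, url_entity_urls):
    """Merge URLs from the three sources into one ordered list without repeats."""
    merged = []
    seen = set()
    for source in (edit_note_urls, annotation_urls, url_entity_urls):
        for url in source:
            url = url.strip()
            # Skip blank entries and anything already queued from another source.
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


def daily_filename(day):
    """Hypothetical naming scheme: one file per calendar day."""
    return f"mb-used-urls-{day.isoformat()}.txt"


def render_url_file(urls):
    """One URL per line, which is easy to feed into a queue."""
    return "".join(f"{url}\n" for url in urls)


urls = build_daily_url_list(
    ["https://a.example/1", "https://b.example/"],
    ["https://a.example/1"],
    ["https://c.example/"],
)
print(daily_filename(date(2021, 3, 16)))
print(render_url_file(urls), end="")
```

De-duplicating across sources keeps the file small, so a URL that appears in both an edit note and a URL entity on the same day is only queued once.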
2021-03-16 07558, 2021
CatQuest
reosarevok: how hard would it be to create a dump of the urls entered into edit notes?
2021-03-16 07526, 2021
CatQuest
i personally don't know the data (just a long standing editor+BBstylecat and MB Instrument Inserter) you'd wanna talk to reo, yvanzo, zas, etc
2021-03-16 07539, 2021
ROpdebee
yeah i'm just throwing out ideas, now we'd have to download and process fairly large DB dumps to get just a couple thousand new links every three days
2021-03-16 07506, 2021
ROpdebee
to be clear though, we can still get the urls in edit notes from one of the dump files
2021-03-16 07541, 2021
reosarevok
bitmap: ^ does this seem like something that could be done kinda like with the json dumps?
2021-03-16 07550, 2021
RikkoM has quit
2021-03-16 07526, 2021
ROpdebee
also, couple of caveats: The project it's being inserted into currently doesn't archive page requisites (images, css, js, etc) but it's better than nothing I guess
2021-03-16 07558, 2021
ROpdebee
and Spotify links probably aren't useful, since it loads data dynamically from the API, and those responses aren't captured either. so you'll just get broken pages :( I told them about this, but it's a wontfix situation