in #metabrainz

18:27 PM
Freso

🙋
18:28 PM
Dealt with reports and forum stuff and such, being around+about, etc.
18:28 PM
fin.
18:28 PM
Aaaand… that’s it for reviews, I believe.
18:29 PM
Thank you for your reviews! :)
18:29 PM
We have a few items on the agenda, so…
18:29 PM
ruaok: Expiring domains
18:29 PM
mayhem

so, have a look at this list we have expiring soon: https://gist.github.com/mayhem/436428f8e925c50e...
18:30 PM
the first one was registered the same time as musicbrainz.org -- back then there was no google and search was just starting, so typo domains were a real thing.
18:30 PM
also, there was a thought that people might not find a .org (kinda not well known then) and that a .com should be reserved to redirect to the .org
18:30 PM
kinda outdated concepts.
18:31 PM
I could see keeping bookbrainz.com. any thoughts monkey ?
18:31 PM
musicbrains.org I think we can ditch. any objections?
18:31 PM
Freso

+1
18:31 PM
zas

+1
18:31 PM
monkey

Do we have any data of how many hits they get?
18:31 PM
lucifer

it currently redirects to mb.org though so someone could try to phish?
18:31 PM
monkey

Generally I don't see the need for it
18:31 PM
mayhem

we have no stats.
18:32 PM
but given that some people care, lets keep bb.com and ditch musicbrains.org
18:32 PM
then...
18:32 PM
back then, maybe 15 years ago, we talked about possibly doing foodbrainz and tvbrainz.
18:32 PM
foodbrainz has been done, but I dont recall the name. Freso, do you?
18:32 PM
Freso

I think FilmBrainz is more likely to be a thing sometime than TVBrainz, so not sure TVBrainz is needed. There is already OpenFoodFacts and I haven’t seen much legit community requests for FoodBrainz, so I don’t think that’s going to happen ever.
18:33 PM
mayhem

tvbrainz? What's TV?
18:33 PM
Freso

Yes, OpenFoodFacts. :p
18:33 PM
zas

lucifer: good point, it was one of my concerns, that said there are plenty of domains that can be used to phish
18:33 PM
lucifer

makes sense
18:33 PM
mayhem

I'm open to keeping musicbrains.org -- for phishing concerns.
18:33 PM
but TV and food. we're just not going to do that.
18:34 PM
zas

+1^^
18:34 PM
Freso

+1
18:34 PM
alastairp

+1
18:34 PM
mayhem

so to summarize: keep boobrainz.com, musicbrains.org and ditch the tv and food domains.
18:34 PM
any objections?
18:34 PM
monkey

+1
18:34 PM
CatQuest

boobrainz? awesome
18:34 PM
mayhem

we also have moviebrainz.org which is still a possibility.
18:34 PM
Freso

Yeah.
18:34 PM
mayhem

ok, motion carried, I'll make that happen.
18:34 PM
akshaaatt

I think our domains are unique as is. Would someone try to zip up the domains if we don't have them?
18:35 PM
mayhem

akshaaatt: they always do.
18:35 PM
CatQuest

gamebrainz
18:35 PM
Freso

akshaaatt: 🤷
18:35 PM
akshaaatt

Oh!
18:35 PM
mayhem

ok, onward to the next topic.
18:35 PM
Freso

mayhem: Supporting open source
18:35 PM
mayhem

we'll soon be getting paid for the ODI participation.
18:35 PM
TOPIC: MetaBrainz Community and Development channel | MusicBrainz non-development: #musicbrainz | BookBrainz: #bookbrainz | Channel is logged; see https://musicbrainz.org/doc/IRC for details | Agenda: Supporting open source (ruaok), Next meeting (Freso)
18:35 PM
and I've decided I want to help open source in general a bit, so I want us to lead by example.
18:36 PM
akshaaatt

++
18:36 PM
monkey

+1000
18:36 PM
mayhem

I am dedicating a budget of $6000 annually to this cause -- at least to start.
18:36 PM
the first year is being paid out by Microsoft/ODI.
18:37 PM
I'd like each team member to identify 2 or 3 open source projects that they think should be supported.
18:37 PM
akshaaatt

Sounds superb!
18:37 PM
mayhem

I will create a spreadsheet for this in a minute.
18:37 PM
monkey

Yeah, I love the idea!
18:37 PM
mayhem

propose a project and propose an annual support payment. add a link to where to support the project.
18:37 PM
reosarevok

Neat
18:38 PM
mayhem

then in maybe 2 weeks we can have another meeting where we talk about the chosen projects and allocate funds, ok?
18:38 PM
monkey

👍
18:38 PM
mayhem

then, I want to create a page on meb.org that states this.
18:38 PM
akshaaatt

Can we do something like an internal Gsoc thing where we let students/people provide proposals and pay, mentor them for the projects? This could happen during the months of Nov, Dec, Jan where people are anyway looking for internships
18:38 PM
Freso

Sounds good.
18:38 PM
mayhem

I'm going to state what % of our income we're dedicating to OSS and challenge other companies to do the same or better.
18:39 PM
let's make some noise about this, because imagine if someone like Google did this as well?
18:39 PM
OSS developers are burning out, so lets see if we can help a little.
18:39 PM
fin. back to freso.
18:39 PM
Freso

Freso: Next meeting
18:39 PM
lucifer

sounds great. awesome!
18:40 PM
Freso

This is just a PSA that Europe switches to DST on Sunday, so for Indians and USians and others that do not follow Europe’s DST schedule, note that next week’s meeting will be an hour… uh, later? for you.
18:40 PM
lucifer

earlier
18:40 PM
Freso

Earlier. Thanks lucifer. :p
18:40 PM
alastairp

an hour different
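The "earlier" answer can be checked with a small sketch. It assumes a meeting held at 20:00 European local time (a hypothetical slot, not stated in the log) and uses fixed UTC offsets and a placeholder date instead of a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, so the sketch does not depend on a timezone database.
CET = timezone(timedelta(hours=1))               # Central European winter time
CEST = timezone(timedelta(hours=2))              # Central European summer time
IST = timezone(timedelta(hours=5, minutes=30))   # India, no DST


def meeting_in_india(local_offset: timezone) -> datetime:
    # Assumption: the meeting is at 20:00 European local time.
    # The date is a placeholder; only the offsets matter here.
    meeting = datetime(2000, 1, 3, 20, 0, tzinfo=local_offset)
    return meeting.astimezone(IST)


before = meeting_in_india(CET)    # Europe still on winter time
after = meeting_in_india(CEST)    # Europe switched to DST
shift = after - before            # negative means earlier for India
```

A negative `shift` means the same European wall-clock time arrives earlier for anyone whose clocks did not change.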
18:41 PM
Freso

fin.
18:41 PM
And that wraps up today’s meeting too.
18:41 PM
Thank you everyone for your time! Stay safe out there. :)
18:41 PM
</BANG>
18:41 PM
akshaaatt

Thank you!
18:41 PM
monkey

Cheers !
18:41 PM
lucifer

mayhem: you may be able to speed up the query a bit, change the table to unlogged before writing data and then to logged after writing. downside is something crashes table loses data but you'll run the query again in that case anyway
18:41 PM
TOPIC: MetaBrainz Community and Development channel | MusicBrainz non-development: #musicbrainz | BookBrainz: #bookbrainz | Channel is logged; see https://musicbrainz.org/doc/IRC for details | Agenda: Reviews
18:41 PM
*is if something crashes
18:42 PM
mayhem

the query time is all in the execute() -- the writing is relatively fast.
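A minimal sketch of lucifer's unlogged-table suggestion, for cases where the write phase is the slow part (mayhem notes that here it is not). It assumes any DB-API cursor connected to PostgreSQL; the helper name `unlogged` and the table name in the usage note are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def unlogged(cursor, table: str):
    """Skip write-ahead logging for `table` while bulk-writing to it.

    `table` must be a trusted identifier, since it is interpolated
    directly into the statement.
    """
    cursor.execute(f"ALTER TABLE {table} SET UNLOGGED")
    yield cursor
    # Only reached if the body succeeded. An unlogged table is emptied
    # after a crash, so in that case the whole load is simply rerun.
    cursor.execute(f"ALTER TABLE {table} SET LOGGED")
```

Usage would look like `with unlogged(cur, "tmp_cache"): ...` around the bulk insert. Switching back to logged has its own cost, so whether this is a net win would need measuring.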
18:42 PM
lucifer

ah 👍
18:43 PM
alastairp

mayhem: is the query significantly faster when you just apply it to a few rows (i.e. when a replication packet comes in?)
18:44 PM
mayhem

it is instantaneous when I request only one row, so I would expect so.
18:44 PM
team: https://docs.google.com/spreadsheets/d/16Ih1vaC...
18:44 PM
team members should be able to edit
18:44 PM
alastairp

yeah, so I guess it doesn't matter how long it takes (within reason...) if we're only going to run it once
18:44 PM
PrathameshG: hi, I'm here. not sure how late it is for you or if you want to talk
18:45 PM
what have you managed to do?
18:45 PM
lucifer

oh but how do you apply it to some rows? lookup edits and figure out the entities that need to be refreshed
18:45 PM
Sophist-UK joined the channel
18:45 PM
alastairp

lucifer: the plan is to consume replication packets, which say which rows have changed
18:46 PM
PrathameshG

alastairp: Hey there, DW I'll be online for another hour.
18:46 PM
lucifer

i see, makes sense.
18:47 PM
mayhem

lucifer: this is why we are so damned anal about the created/last_updated columns. so that downstream users like this can clearly understand what changed.
18:48 PM
and I have a GIN index on an ARRAY column that lists artist_mbids, so I can quickly mark rows as dirty.
18:48 PM
and then one query to select the dirty rows in a CTE with the rest of the query.
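A rough sketch of the dirty-marking idea described above. The table `cache`, its columns, and the index name are hypothetical, not mayhem's actual schema; the SQL strings show the PostgreSQL side (the `&&` array-overlap operator is what a GIN index on an array column can serve), and `mark_dirty` does the same overlap test in memory.

```python
# Hypothetical schema: cache(id, artist_mbids uuid[], dirty boolean).
CREATE_INDEX = (
    "CREATE INDEX cache_artist_mbids_idx ON cache USING GIN (artist_mbids)"
)
# Flag every row whose artist array overlaps the changed artists.
MARK_DIRTY = "UPDATE cache SET dirty = TRUE WHERE artist_mbids && %s::uuid[]"
# Pick up the dirty rows in a CTE that the main query can then join on.
SELECT_DIRTY = """
WITH dirty_rows AS (
    SELECT id, artist_mbids FROM cache WHERE dirty
)
SELECT id, artist_mbids FROM dirty_rows
"""


def mark_dirty(rows, changed_artist_mbids):
    """In-memory equivalent of MARK_DIRTY; returns ids of all dirty rows."""
    changed = set(changed_artist_mbids)
    for row in rows:
        if changed.intersection(row["artist_mbids"]):
            row["dirty"] = True
    return [row["id"] for row in rows if row["dirty"]]
```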
18:48 PM
lucifer

yup makes sense. nice! :D
18:48 PM
mayhem

it is going to be really beefy after that.
18:49 PM
lucifer

one day, this will hopefully land and make it all automatic https://commitfest.postgresql.org/23/2138/ !!
18:49 PM
Incremental refresh of materialized views
18:51 PM
PrathameshG

alastairp: I was on a vacation and got stung by a bee on my hand so sadly I wasn't able to do much 🤦‍♂️
18:51 PM
Although, so far I've managed to get used to my environment on bono, and tried out loading and testing some of the data sets.
18:51 PM
monkey

That a buzzkill…
18:52 PM
That's*
18:52 PM
alastairp

monkey: buzzz off
18:52 PM
monkey

Yes honey.
18:52 PM
alastairp

PrathameshG: oops, hope everything is ok. but don't worry about it - we're happy for you to just play around and look at the data. no pressure to do anything
18:53 PM
lucifer

oh i forgot to mention earlier (~2hrs ago), LB prod updated
18:53 PM
alastairp

PrathameshG: so it sounds like you now know how to do things, but you're not sure what to do?
18:53 PM
PrathameshG

Nono, it's completely alright, I'd be more than happy to take responsibility and take up targets and try to complete them
18:53 PM
Yes, that's exactly what's happening right now. I don't have a clue what I have to do
18:57 PM
alastairp

so first of all - last week you were talking about a bunch of interesting ideas that you had to look up mlhd data in last.fm and other sources
18:57 PM
PrathameshG

Yes that's right
18:57 PM
alastairp

just because you have an account on bono don't think that you only have to do what we suggest, feel free to use the resources for your own project too
18:58 PM
that being said, let me explain to you what we wanted to do with the mlhd
18:58 PM
PrathameshG

Thanks a lot, I was thinking of running some network intensive stuff on it.
18:58 PM
Please go ahead
18:59 PM
alastairp

some history: experience has shown us that a lot of data in last.fm is wrong. about 10 or so years ago (someone correct me if I'm wrong), musicbrainz made a big change to its database structure, and we introduced a number of new concepts. it seems that maybe last.fm were late to identify this change
19:00 PM
the result of this is that sometimes when they give you a "recording mbid", this might not actually be correct. it might be a track mbid (a "track" is a recording on a specific release - you could have only 1 recording, 2 releases, and 2 tracks)
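The "1 recording, 2 releases, 2 tracks" case can be written down as a tiny data model. This is only an illustration: the class and field names are made up for the sketch and the ids are placeholders, not real MBIDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    mbid: str            # track MBID, one per appearance on a release
    recording_mbid: str  # the recording this track plays
    release_mbid: str    # the release the track sits on


# The same recording appears on two releases, giving two tracks.
tracks = [
    Track("track-1", "recording-A", "release-X"),
    Track("track-2", "recording-A", "release-Y"),
]
recordings = {t.recording_mbid for t in tracks}
releases = {t.release_mbid for t in tracks}
```

A file that stores `track-1` where `recording-A` was meant still holds a valid id, just one of the wrong kind.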
19:00 PM
so one thing that we're not sure about, is if all of the ids in the data files are actually correct and if they actually exist in musicbrainz
19:01 PM
the first thing we wanted to do was look through the data files and cross-reference it with the database and see which things exist, which things have been deleted for some reason, and which things are in the database but with the wrong id
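The cross-referencing step could be sketched as a set lookup per id. This assumes the known recording and track MBIDs have already been loaded into Python sets; the function names and labels are invented for the example.

```python
from collections import Counter


def classify(mbid, recordings, tracks):
    """Label one id taken from the "recording mbid" column of a data file."""
    if mbid in recordings:
        return "ok"          # exists as a recording
    if mbid in tracks:
        return "track_mbid"  # in the database, but it is a track id
    return "missing"         # deleted, or not in the database at all


def summarize(mbids, recordings, tracks):
    """Count how many ids fall under each label."""
    return Counter(classify(m, recordings, tracks) for m in mbids)
```

For example, `summarize(file_ids, recording_ids, track_ids)` gives a per-file count of each outcome, which is enough to see how widespread the track-vs-recording mix-up is.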
19:03 PM
PrathameshG

Alright, so we've to revalidate all the MBIDs
19:03 PM
Sounds very doable
19:05 PM
So I'll just start by writing scripts to evaluate the mbids and cross check them with the musicbrainz db.
19:05 PM
I'll keep giving you updates for the same :)
19:05 PM
And of course, if you've anything to add on it, please go ahead and mention it
19:07 PM
Uploaded file: https://uploads.kiwiirc.com/files/dc138fefc8586...
19:08 PM
^ Found this above snippet that mayhem posted in the last convo. Will try to implement some of it along the way 👌
19:08 PM
alastairp

feel free to share with us the code that you write, there are many tricks that one can apply to make something like this comparison fast
19:08 PM
PrathameshG

Yes absolutely
19:09 PM
alastairp

yes right - that's a broader outline of what we hope to do
19:09 PM
PrathameshG

Firstly, I was wondering if I'd have to hit the database on each row?
19:09 PM
That's gonna be really intense for the database
19:09 PM
alastairp

yeah, exactly ;)
19:10 PM
PrathameshG

Alrighty, I'll get started with 1 dataset first then :))
19:10 PM
Will update you soon.
19:10 PM
alastairp

databases are really good at doing things in bulk, my first intuition would be to process at least 1 file at once
19:10 PM
lucifer

for checking whether the mbid is track mbid or recording mbid stuff, you could probably get away with just reading the index which should be fast enough. also do lookup in batches.
19:11 PM
alastairp

for example, I'm just looking through some of the sample 00 files - some have 150k rows, some have 40k rows. postgresql will have no problem if you pass in a query with 100,000 parameters
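A minimal sketch of a batched lookup, rather than one query per row. It assumes a psycopg2-style cursor (which adapts a Python list to a PostgreSQL array) and that recording MBIDs live in a `recording.gid` column; both are assumptions of this sketch, and the helper names are hypothetical.

```python
from itertools import islice


def batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def existing_recordings(cursor, mbids, batch_size=100_000):
    """Return the subset of `mbids` that exist as recordings."""
    found = set()
    for chunk in batches(mbids, batch_size):
        # One round trip per batch; the whole chunk is a single array parameter.
        cursor.execute(
            "SELECT gid::text FROM recording WHERE gid = ANY(%s::uuid[])",
            (chunk,),
        )
        found.update(row[0] for row in cursor.fetchall())
    return found
```

With files of 40k to 150k rows, a batch size around 100,000 means one or two queries per file.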
19:11 PM
PrathameshG

Got it 👍 lucifer alastairp
19:11 PM
alastairp

however I wonder if there are some even easier ways to do this, I'm just trying something - one moment
19:12 PM
PrathameshG

yep
19:12 PM
Also, just to confirm our primary concern is with the recording-mbid right?
19:14 PM
alastairp

yes, right - because we already have these relationships in the musicbrainz database our plan is to ignore the artist and release ids and re-compute them
19:15 PM
PrathameshG

Sounds good. So I'll just drop the artist/release ID columns during analysis. Will speed up the process a bit
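Dropping the other columns could look like the sketch below. The file layout is an assumption (tab-separated, with the recording MBID in the fourth column); the real data files may differ, so `column` is a parameter. The sample ids are placeholders.

```python
import csv
import io


def recording_mbids(lines, column=3):
    """Collect the distinct non-empty values of one column of a TSV stream."""
    reader = csv.reader(lines, delimiter="\t")
    return {row[column] for row in reader if len(row) > column and row[column]}


# Assumed layout: timestamp, artist id, release id, recording id.
sample = io.StringIO(
    "1\tart-1\trel-1\trec-1\n"
    "2\tart-1\trel-2\trec-1\n"
    "3\tart-2\t\t\n"
)
unique = recording_mbids(sample)
```

Collapsing to distinct ids first also shrinks the lookup, since the same recording is usually listened to many times within one file.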