#metabrainz

14:46 PM
alastairp

and then the explain endpoint remains html-only

2022-07-20 20132, 2022

14:53 PM
lucifer

mayhem: alastairp: it still does, https://github.com/metabrainz/data-set-hoster/blo…

2022-07-20 20139, 2022

14:54 PM
lucifer

i think it could be made to work by printing the log first, then letting the dataset hoster handle the result formatting.

2022-07-20 20157, 2022

14:54 PM
lucifer

or alternatively some code duplication for the short term is fine

2022-07-20 20149, 2022

15:18 PM
mayhem

lucifer: alastairp and I were just chatting...

2022-07-20 20135, 2022

15:19 PM
alastairp

lucifer: yeah, I was thinking about adding the formatting in the ds hoster, but the log output of the mapper is so specific (e.g. bold indicators in certain places) that I don't think we can make it generic

2022-07-20 20152, 2022

15:19 PM
mayhem

last night I realized that if we added release support for the mapper, we could solve two problems that we are currently seeing: Listening to a compilation while watching data in the listening now viewer (it does the wrong thing). Identifying albums in a listen stream.

2022-07-20 20110, 2022

15:20 PM
mayhem

I didn't know how to deal with releases, but now I understand. so now I feel armed to go back and add release support.

2022-07-20 20119, 2022

15:20 PM
alastairp

I worked out that there's actually no need to make this machine-readable. the key is just to make mapper + explain endpoints return the same data, which just means using consistent stop word handling

2022-07-20 20135, 2022

15:20 PM
mayhem

which means that alastairp you and I should spend some time at the summit to work out how to make a better mapper.

2022-07-20 20139, 2022

15:21 PM
mayhem

ahhh, the tests finally pass for https://github.com/metabrainz/listenbrainz-server…

2022-07-20 20129, 2022

15:24 PM
lucifer

mayhem: alastairp: makes sense. sounds good to me.

2022-07-20 20141, 2022

15:25 PM
alastairp

mayhem:

2022-07-20 20141, 2022

15:25 PM
alastairp

CREATE TABLE mapping.release_group_secondary_type_sort ( secondary_type integer, sort integer )

2022-07-20 20145, 2022

15:25 PM
mayhem

once these improvements are in, the mapper is going to be incredible.

2022-07-20 20146, 2022

15:25 PM
alastairp

INSERT INTO mapping.release_group_secondary_type_sort values (%s, %s);", tuple((id, type_id)))

2022-07-20 20157, 2022

15:25 PM
alastairp

is that inserting in the correct columns?

2022-07-20 20115, 2022
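
A minimal sketch of one way to make the column question moot: name the columns in the INSERT so the values cannot land in the wrong place. It uses an in-memory sqlite3 database rather than the real database, so the `mapping.` schema prefix is dropped and the placeholders are `?` instead of `%s`; the helper name and the inserted values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE release_group_secondary_type_sort "
    "(secondary_type INTEGER, sort INTEGER)"
)


def insert_sort_row(conn, secondary_type, sort):
    """Insert one row, naming the columns explicitly (hypothetical helper)."""
    conn.execute(
        "INSERT INTO release_group_secondary_type_sort (secondary_type, sort) "
        "VALUES (?, ?)",
        (secondary_type, sort),
    )


# Hypothetical values: secondary type 8 gets sort position 1, type 6 gets 2.
insert_sort_row(conn, secondary_type=8, sort=1)
insert_sort_row(conn, secondary_type=6, sort=2)

rows = conn.execute(
    "SELECT secondary_type, sort "
    "FROM release_group_secondary_type_sort ORDER BY sort"
).fetchall()
print(rows)
```

Keyword arguments at the call site also address the point about variable names: the reader sees which value is the type and which is the sort position.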

15:27 PM
mayhem

https://www.irccloud.com/pastebin/EUHf0esb/

2022-07-20 20119, 2022

15:27 PM
mayhem

this is the resultant table.

2022-07-20 20147, 2022

15:27 PM
mayhem

https://www.irccloud.com/pastebin/2N7I2vS1/

2022-07-20 20151, 2022

15:27 PM
mayhem

with proper order.

2022-07-20 20148, 2022

15:28 PM
mayhem

so, I guess so, but perhaps the variable names could be named better.

2022-07-20 20114, 2022

15:29 PM
alastairp

ah, I just saw the way that you were doing the join

2022-07-20 20121, 2022

15:29 PM
alastairp

ON rgst.id = rgsts.secondary_type

2022-07-20 20139, 2022

15:29 PM
alastairp

mm

2022-07-20 20115, 2022

15:30 PM
alastairp

but does this mean that sec-type 8 (dj-mix) is sorted before type 6 (live)?

2022-07-20 20153, 2022

15:31 PM
lucifer

each group should be assigned the same id so that the next sort gets a chance, no?

2022-07-20 20106, 2022

15:32 PM
lucifer

*all items in a group

2022-07-20 20107, 2022

15:33 PM
mayhem

the sec type only comes into play for groups that have the same primary type, no?

2022-07-20 20142, 2022

15:33 PM
alastairp

you're saying so that you would get the earliest of (compilation or soundtrack or live), rather than the earliest compilation, or if there are no compilations the earliest soundtrack, or if there are no.... etc, lucifer?

2022-07-20 20102, 2022

15:34 PM
lucifer

alastairp: yup.

2022-07-20 20127, 2022

15:34 PM
alastairp

when I mentioned groups yesterday I didn't mean to imply that I considered them all the same, I think it's OK to have an explicit ordering

2022-07-20 20104, 2022

15:35 PM
lucifer

ah ok, my understanding was the other way around. makes sense to have it this way then.

2022-07-20 20122, 2022
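
The two readings discussed above can be sketched as two sort keys. This is an illustration only: the rank values, release group names, and dates are hypothetical, and the real mapper does this ordering in SQL.

```python
from datetime import date

# Hypothetical explicit ranks for secondary types; lower sorts first.
EXPLICIT_RANK = {"compilation": 1, "soundtrack": 2, "live": 3}

release_groups = [
    {"name": "A", "secondary_type": "live", "date": date(1990, 1, 1)},
    {"name": "B", "secondary_type": "soundtrack", "date": date(1995, 1, 1)},
    {"name": "C", "secondary_type": "compilation", "date": date(2000, 1, 1)},
]


def pick_explicit(groups):
    """Explicit ordering: earliest compilation, else earliest soundtrack, etc."""
    return min(groups, key=lambda g: (EXPLICIT_RANK[g["secondary_type"]], g["date"]))


def pick_shared_rank(groups):
    """Shared rank: all types tie, so the next sort key (date) decides."""
    return min(groups, key=lambda g: (1, g["date"]))


explicit_choice = pick_explicit(release_groups)["name"]
shared_choice = pick_shared_rank(release_groups)["name"]
print(explicit_choice, shared_choice)
```

With explicit ranks the compilation wins even though it is the newest; with a shared rank the oldest release group wins regardless of type.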

15:52 PM
Pratha-Fish

alastairp: Hi, I am back. Took longer than I expected lol

2022-07-20 20140, 2022

15:52 PM
Pratha-Fish

Also, looks like the conversion is completed too :)

2022-07-20 20145, 2022

15:52 PM
Pratha-Fish

It took 83.1 Hours in total

2022-07-20 20153, 2022

15:52 PM
alastairp

Pratha-Fish: yeah, I was going to ask you about that

2022-07-20 20100, 2022

15:53 PM
mayhem

merely a round off error from 50 hours, no worries. :)

2022-07-20 20112, 2022

15:53 PM
Pratha-Fish

☠️

2022-07-20 20116, 2022

15:53 PM
alastairp

same order of magnitude, and still less than a week

2022-07-20 20132, 2022

15:53 PM
alastairp

next time we need to do this, let's multi-thread it 8x

2022-07-20 20141, 2022
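
A rough sketch of what running the conversion over 8 workers could look like with the standard library. `convert_file` is a hypothetical placeholder for the real per-file work, and threads are an assumption: if the conversion is CPU-bound, a process pool would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_file(path):
    """Placeholder for the real per-file conversion (hypothetical)."""
    return (path, len(path))


def convert_all(paths, workers=8):
    """Run convert_file over all paths using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input paths.
        return list(pool.map(convert_file, paths))


results = convert_all(["a.txt.gz", "bb.txt.gz"])
print(results)
```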

15:53 PM
alastairp

Pratha-Fish: so... any track ids?

2022-07-20 20150, 2022

15:53 PM
Pratha-Fish

alastairp: let's check!

2022-07-20 20100, 2022

15:54 PM
Pratha-Fish

give me a sec

2022-07-20 20128, 2022

15:58 PM
Pratha-Fish

https://usercontent.irccloud-cdn.com/file/FkoAfBn…

2022-07-20 20134, 2022

15:58 PM
Pratha-Fish

alastairp: ^ Nothing found :)

2022-07-20 20124, 2022

15:59 PM
alastairp

ah, this means that there are no files that have a log entry?

2022-07-20 20138, 2022

15:59 PM
alastairp

so either it means that nothing was found, or it means that you have a bug in writing your logs ;)

2022-07-20 20156, 2022

16:01 PM
Pratha-Fish

The logger returns None if the list of track-mbids in rec-mbids is empty; if it has anything in it, it just returns a list

2022-07-20 20113, 2022

16:02 PM
Pratha-Fish

I also tested it out before running, so hopefully it worked well :)

2022-07-20 20138, 2022

16:02 PM
Pratha-Fish

Also, the timelogs are quite interesting too https://usercontent.irccloud-cdn.com/file/JxDsqkr…

2022-07-20 20140, 2022

16:02 PM
alastairp

that's great. I also tested the data independently myself, so I think that we're pretty safe in deciding this

2022-07-20 20102, 2022

16:03 PM
Pratha-Fish

🎉

2022-07-20 20111, 2022

16:03 PM
alastairp

it means that we can remove track lookups from everything (remember also to put this in our doc explaining why we no longer have it!)

2022-07-20 20133, 2022

16:03 PM
Pratha-Fish

Definitely

2022-07-20 20134, 2022

16:03 PM
alastairp

yeah, I imagine that some files are large (users who have a lot of scrobbles)

2022-07-20 20138, 2022

16:03 PM
lucifer

you can also try adding an erroneous entry to a file manually, then run it on 3-4 files including the maligned file to confirm it wasn't a logger issue.

2022-07-20 20129, 2022

16:04 PM
Pratha-Fish

lucifer: sure I'll try it out. I have the faulty data ready too

2022-07-20 20126, 2022

16:05 PM
alastairp

Pratha-Fish: ./49/49dc8e61-67ca-4ad1-bf53-437856924777.txt.gz is the largest file

2022-07-20 20140, 2022

16:05 PM
alastairp

you could try and run it as a one-off to see if it takes ~12 seconds

2022-07-20 20147, 2022

16:05 PM
Pratha-Fish

okie

2022-07-20 20132, 2022

16:07 PM
alastairp

Pratha-Fish: I'll move the rec_track_checker/MLHD directory to /data, don't panic when it disappears

2022-07-20 20139, 2022

16:07 PM
Pratha-Fish

sure

2022-07-20 20104, 2022

16:08 PM
alastairp

https://www.irccloud.com/pastebin/m4PJwYUX/

2022-07-20 20108, 2022

16:08 PM
alastairp

that's amazing

2022-07-20 20117, 2022

16:08 PM
alastairp

!m Pratha-Fish

2022-07-20 20118, 2022

16:08 PM
BrainzBot

You're doing good work, Pratha-Fish!

2022-07-20 20149, 2022

16:08 PM
Pratha-Fish

!!!

2022-07-20 20152, 2022

16:08 PM
alastairp

Pratha-Fish: I'm about to head back home, but I have a few other things that I want to run past you

2022-07-20 20104, 2022

16:09 PM
Pratha-Fish

alastairp: yes please

2022-07-20 20151, 2022

16:09 PM
alastairp

1) lookup reports: I used your code to do lookups of more metadata and then generated an html report. check out this:

2022-07-20 20152, 2022

16:10 PM
alastairp

https://www.irccloud.com/pastebin/DPHsaNbY/

2022-07-20 20111, 2022

16:11 PM
alastairp

html is a bit nicer to view, as we can do links etc

2022-07-20 20124, 2022

16:11 PM
Pratha-Fish

Oh that's interesting

2022-07-20 20129, 2022

16:11 PM
alastairp

see also: https://wolf.metabrainz.org/~alastair/mlhd-lookup… and my comments to ansh

2022-07-20 20152, 2022

16:11 PM
alastairp

you can now host stuff on wolf, just make a 'public_html' directory on wolf and put the stuff there

2022-07-20 20118, 2022

16:12 PM
alastairp

(and also, there are some weird unicode issues in that html file and I don't know why, we need to debug it further)

2022-07-20 20105, 2022

16:13 PM
Pratha-Fish

Yea, noticed the unicode issue

2022-07-20 20139, 2022

16:13 PM
Pratha-Fish

Also, the jinja part is pretty interesting. I heard a bit about it through flask, but didn't know it could be used like that :)

2022-07-20 20118, 2022

16:14 PM
alastairp

yes right. there's no requirement to only use jinja as part of a webserver

2022-07-20 20133, 2022

16:14 PM
Pratha-Fish

alastairp: Does the public_html dir need to be in my home directory or can it be hosted anywhere?

2022-07-20 20135, 2022

16:15 PM
alastairp

we could have used any templating system, but I just reached for jinja because we have experience with it

2022-07-20 20102, 2022

16:16 PM
alastairp

Pratha-Fish: the configuration is such that ~your_username/file.html maps to /home/your_username/public_html/file.html

2022-07-20 20131, 2022
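
The mapping described above, written out as a small function. This only illustrates the path rule; the function name is hypothetical, and it does not reflect how the web server itself is configured or how it guards against unsafe paths.

```python
import posixpath


def public_html_path(url_path):
    """Map '/~user/file.html' to '/home/user/public_html/file.html'."""
    if not url_path.startswith("/~"):
        raise ValueError("not a per-user URL path")
    user, _, rest = url_path[2:].partition("/")
    if not user:
        raise ValueError("missing username")
    return posixpath.join("/home", user, "public_html", rest)


mapped = public_html_path("/~snaek/test.html")
print(mapped)
```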

16:18 PM
Pratha-Fish

Ah I see

2022-07-20 20110, 2022

16:19 PM
Pratha-Fish

Also, if let's say I hosted a txt file in that directory, would ~snaek/file.txt just send the txt file as a download?

2022-07-20 20119, 2022

16:19 PM
alastairp

try it and see

2022-07-20 20125, 2022

16:19 PM
Pratha-Fish

OMW

2022-07-20 20103, 2022

16:24 PM
Pratha-Fish

alastairp: yes, looks like it works that way. txt files are previewed directly in the browser, while binary and markdown files are sent as downloads

2022-07-20 20121, 2022

16:24 PM
alastairp

the markdown one is interesting. because that's text

2022-07-20 20150, 2022

16:24 PM
Pratha-Fish

+1

2022-07-20 20123, 2022

16:25 PM
alastairp

we use nginx as the webserver

2022-07-20 20132, 2022

16:26 PM
Pratha-Fish

Is there any particular tradeoffs b/w nginx and apache?

2022-07-20 20135, 2022

16:26 PM
Pratha-Fish

*are

2022-07-20 20145, 2022

16:26 PM
alastairp

typically it will inspect the file and know what it is, therefore it will know if it should tell the browser to display it or download it

2022-07-20 20152, 2022

16:26 PM
alastairp

`curl -D - https://wolf.metabrainz.org/~snaek/test.html`

2022-07-20 20105, 2022

16:27 PM
alastairp

includes: `content-type: text/html`

2022-07-20 20118, 2022

16:27 PM
alastairp

whereas the same one on README.md shows `content-type: application/octet-stream`

2022-07-20 20126, 2022

16:27 PM
Pratha-Fish

wow that's interesting

2022-07-20 20130, 2022

16:27 PM
alastairp

so the browser sees that and says "whoops, I'd better download this"

2022-07-20 20149, 2022

16:27 PM
alastairp

try a zip file or something as well, based on the content-type the browser will choose what to do

2022-07-20 20101, 2022

16:28 PM
Pratha-Fish

yep

2022-07-20 20107, 2022

16:28 PM
alastairp

if we could configure nginx to send `text/plain` as the content type then the browser would probably display it

2022-07-20 20132, 2022
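
A small sketch of the behaviour seen above, assuming the server picks the content type from the file extension using a lookup table with a fallback, which is how a typical nginx `types` / `default_type` setup behaves. The table below is a hypothetical stand-in, not the configuration on wolf.

```python
import posixpath

# Hypothetical extension table standing in for the server's MIME type config.
CONTENT_TYPES = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".zip": "application/zip",
}
DEFAULT_TYPE = "application/octet-stream"


def content_type_for(filename, extra=None):
    """Return the content type for a filename, falling back to the default."""
    table = dict(CONTENT_TYPES)
    if extra:
        table.update(extra)
    ext = posixpath.splitext(filename)[1].lower()
    return table.get(ext, DEFAULT_TYPE)


before = content_type_for("README.md")
after = content_type_for("README.md", extra={".md": "text/plain"})
print(before, after)
```

An unknown extension falls through to the default, which browsers treat as a download; adding a `.md` entry is the kind of change that would make the browser display the file instead.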

16:28 PM
alastairp

regarding nginx/apache, no great difference. I used to use apache, then one day I started using nginx. about 15 years ago it had some features that made it faster

2022-07-20 20106, 2022

16:29 PM
alastairp

Pratha-Fish: anyway, back to what we were discussing

2022-07-20 20118, 2022

16:29 PM
alastairp

I think that it is now a top priority to start moving some of these notebooks to scripts (as you said you had started doing). It's very difficult for me to share code with you in the notebook, because every time I run your notebook, the outputs change and it causes really annoying git diffs

2022-07-20 20141, 2022

16:29 PM
alastairp

with a python script, I'd be able to open a pull request on your repo to add this html template, for example

2022-07-20 20117, 2022

16:30 PM
Pratha-Fish

I see, I'll get that one done ASAP then

2022-07-20 20148, 2022

16:30 PM
alastairp

so let's focus on that. I'd like a script that I can run which takes the list of files (df2_artist_rec_names_artist_list.txt), looks up the necessary data, does the mapping lookup, and then writes the debug html

2022-07-20 20106, 2022

16:31 PM
alastairp

here's another interesting item which I found:

2022-07-20 20122, 2022

16:31 PM
alastairp

you use requests_cache

2022-07-20 20126, 2022

16:31 PM
alastairp

https://www.irccloud.com/pastebin/ynmThNWm/

2022-07-20 20137, 2022

16:31 PM
alastairp

this loop is interesting

2022-07-20 20104, 2022

16:32 PM
Pratha-Fish

yes, requests cache just made testing faster without putting load on the mapping API

2022-07-20 20119, 2022

16:32 PM
alastairp

do you see that even if the lookup that you do returns data from the cache, you'll still sleep for 0.5 seconds?

2022-07-20 20104, 2022

16:33 PM
Pratha-Fish

yes, that one was placed there to reduce load on the server

2022-07-20 20124, 2022

16:33 PM
alastairp

but if you don't access the server, there's no reason to reduce load on it

2022-07-20 20147, 2022

16:33 PM
Pratha-Fish

Hmmmmmmmmm

2022-07-20 20154, 2022

16:33 PM
Pratha-Fish

never thought of it!

2022-07-20 20102, 2022

16:34 PM
alastairp

mmmhm

2022-07-20 20149, 2022

16:34 PM
alastairp

this is why I suggested an alternative, of saving the result of the lookup to a file, and skipping the lookup if the file with the result already exists

2022-07-20 20124, 2022

16:35 PM
alastairp

https://www.irccloud.com/pastebin/ee8o97PK/

2022-07-20 20131, 2022

16:35 PM
alastairp

so you only sleep if you do the lookup

2022-07-20 20149, 2022
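
The pastebin contents are not shown here, so the following is only a sketch of the pattern as described: save each lookup result to a file, skip the lookup when the file already exists, and sleep only after a real lookup. The function names, the JSON file layout, and the stand-in lookup are all assumptions.

```python
import json
import os
import tempfile
import time


def cached_lookup(key, lookup, cache_dir, delay=0.5):
    """Return the saved result for key, or do the lookup, save it, then sleep."""
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        # Result already on disk: no server access, so no need to sleep.
        with open(path) as f:
            return json.load(f)
    result = lookup(key)
    with open(path, "w") as f:
        json.dump(result, f)
    time.sleep(delay)  # only reached when the server was actually contacted
    return result


# Stand-in for the real mapping lookup; records how often it is called.
calls = []


def fake_lookup(key):
    calls.append(key)
    return {"key": key}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_lookup("abc", fake_lookup, tmp, delay=0)
    second = cached_lookup("abc", fake_lookup, tmp, delay=0)

print(first, second, calls)
```

Because results persist on disk, rerunning after an interruption picks up where the previous run stopped.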

16:35 PM
alastairp

there may be a method in requests_cache that tells you if a lookup was returned from the cache, I don't know

2022-07-20 20131, 2022

16:36 PM
Pratha-Fish

Yes that's better, it'll even act as a "start where left" functionality if the test ends abruptly somewhere ig

2022-07-20 20144, 2022

16:36 PM
alastairp

yep, exactly! that's why I use this pattern

2022-07-20 20155, 2022

16:36 PM
alastairp

https://requests-cache.readthedocs.io/en/stable/u…

2022-07-20 20100, 2022

16:37 PM
alastairp

The following attributes are available on responses: from_cache: indicates if the response came from the cache

2022-07-20 20112, 2022

16:37 PM
alastairp

so there we go, you could also use that

2022-07-20 20120, 2022
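
A sketch of the `from_cache` variant. `polite_get` is a hypothetical helper that accepts any session object with a `get` method; with a requests_cache session the response carries `from_cache`, and a missing attribute is treated as "not cached". The fake session below only stands in for a real one so the example runs without network access.

```python
import time


def polite_get(session, url, delay=0.5, sleep=time.sleep):
    """Fetch url and pause only when the response did not come from the cache."""
    response = session.get(url)
    if not getattr(response, "from_cache", False):
        sleep(delay)
    return response


class FakeResponse:
    def __init__(self, from_cache):
        self.from_cache = from_cache


class FakeSession:
    """Stand-in for a caching session: the second request for a URL is a hit."""

    def __init__(self):
        self.seen = set()

    def get(self, url):
        hit = url in self.seen
        self.seen.add(url)
        return FakeResponse(from_cache=hit)


sleeps = []
session = FakeSession()
polite_get(session, "https://example.org/lookup", sleep=sleeps.append)
polite_get(session, "https://example.org/lookup", sleep=sleeps.append)
print(sleeps)
```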

16:37 PM
Pratha-Fish

Excellent

2022-07-20 20128, 2022

16:37 PM
alastairp

anyway, it was just something that I noticed and wanted to tell you

2022-07-20 20136, 2022

16:37 PM
alastairp

OK, one last thing - then I need to go home

2022-07-20 20101, 2022

16:38 PM
Pratha-Fish

Thanks for informing :)

2022-07-20 20105, 2022

16:38 PM
alastairp

after talking some stuff over with mayhem yesterday and today, I think that the query that I gave you to get the artist of a recording is incorrect

2022-07-20 20159, 2022

16:38 PM
alastairp

https://musicbrainz.org/release-group/d406dc76-09… this is a good demonstration of the problem, it was one of the examples in the mis-matched report that I generated

2022-07-20 20108, 2022

16:39 PM
alastairp

~ Release group by onelinedrawing

2022-07-20 20127, 2022

16:39 PM
alastairp

however, click on onelinedrawing and it takes you to an artist called "Jonah Matranga"

2022-07-20 20124, 2022

16:40 PM
Pratha-Fish

Interesting

2022-07-20 20126, 2022

16:40 PM
alastairp

This is because MB allows you to have "artist credits" for releases and recordings, which may be different to the name of the person (because of stylistic reasons, or maybe their name changed but they're the same person, or... there are many reasons)

2022-07-20 20107, 2022

16:41 PM
alastairp

anyway, I think that the query I gave you uses the artist's "official" name, rather than the credit. but the mapping code uses the credit

2022-07-20 20126, 2022

16:41 PM
alastairp

this is probably why we were getting so many empty matches

2022-07-20 20133, 2022

16:41 PM
alastairp

so, let's fix the query

2022-07-20 20100, 2022

16:42 PM
Pratha-Fish

🧠 ✨

2022-07-20 20142, 2022

16:42 PM
Pratha-Fish

That makes me think, does the release-MBID column really help us with anything?

2022-07-20 20155, 2022

16:42 PM
alastairp

not at the moment :)

2022-07-20 20118, 2022

16:43 PM
Pratha-Fish

Great, one less thing to worry about

2022-07-20 20130, 2022

16:43 PM
alastairp

let's talk about that in a few weeks. I came up with some ideas with mayhem yesterday. when we've finished these current tasks we can address it, but it's not useful for us yet

2022-07-20 20139, 2022

16:43 PM
alastairp

ok, so

2022-07-20 20140, 2022

16:43 PM
alastairp

select recording.gid as rec_gid, array_agg(artist.gid) as artist_credit_list from recording join artist_credit ac on ac.id=artist_credit join artist_credit_name acn on acn.artist_credit=ac.id join artist on artist.id = acn.artist group by recording.gid limit 10;

2022-07-20 20105, 2022

16:44 PM
alastairp

I gave you a query like this, which gives you the artists on a recording, but then we used the `artist` table to look up the actual artist, right?

2022-07-20 20135, 2022

16:44 PM
Pratha-Fish

that's right
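
The difference between the artist's "official" name and the credited name can be seen on a toy copy of the relevant tables. This is a sketch in an in-memory sqlite3 database, assuming the credited name is stored in `artist_credit_name.name` as in the MusicBrainz schema; the tables are cut down to a few columns, the gid and recording title are hypothetical placeholders, and this is not the corrected query from the conversation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artist (id INTEGER, gid TEXT, name TEXT);
    CREATE TABLE artist_credit (id INTEGER, name TEXT);
    CREATE TABLE artist_credit_name (
        artist_credit INTEGER, position INTEGER, artist INTEGER, name TEXT
    );
    CREATE TABLE recording (id INTEGER, gid TEXT, name TEXT, artist_credit INTEGER);

    INSERT INTO artist VALUES (1, 'artist-gid-1', 'Jonah Matranga');
    INSERT INTO artist_credit VALUES (1, 'onelinedrawing');
    INSERT INTO artist_credit_name VALUES (1, 0, 1, 'onelinedrawing');
    INSERT INTO recording VALUES (1, 'rec-gid-1', 'Some Song', 1);
    """
)

rows = conn.execute(
    """
    SELECT recording.gid,
           artist.name AS official_name,  -- what a lookup in artist returns
           acn.name    AS credited_name   -- what the release page displays
      FROM recording
      JOIN artist_credit ac ON ac.id = recording.artist_credit
      JOIN artist_credit_name acn ON acn.artist_credit = ac.id
      JOIN artist ON artist.id = acn.artist
    """
).fetchall()
print(rows)
```

Matching on `official_name` here would compare "Jonah Matranga" against metadata that says "onelinedrawing", which fits the empty matches described above.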