i think it could be made to work by printing log first then letting the dataset hoster handle the result formatting.
2022-07-20 20157, 2022
lucifer
or alternatively some code duplication for the short term is fine
2022-07-20 20149, 2022
mayhem
lucifer: alastairp and I were just chatting...
2022-07-20 20135, 2022
alastairp
lucifer: yeah, I was thinking about adding the formatting in the ds hoster, but the log output of the mapper is so specific (e.g. bold indicators in certain places) that I don't think we can make it generic
2022-07-20 20152, 2022
mayhem
last night I realized that if we added release support for the mapper, we could solve two problems that we are currently seeing: Listening to a compilation while watching data in the listening now viewer (it does the wrong thing). Identifying albums in a listen stream.
2022-07-20 20110, 2022
mayhem
I didn't know how to deal with releases, but now I understand. so now I feel armed to go back and add release support.
2022-07-20 20119, 2022
alastairp
I worked out that there's actually no need to make this machine-readable. the key is just to make mapper + explain endpoints return the same data, which just means using consistent stop word handling
2022-07-20 20135, 2022
mayhem
which means that alastairp you and I should spend some time at the summit to work out how to make a better mapper.
so, I guess so, but perhaps the variable names could be named better.
2022-07-20 20114, 2022
alastairp
ah, I just saw the way that you were doing the join
2022-07-20 20121, 2022
alastairp
ON rgst.id = rgsts.secondary_type
2022-07-20 20139, 2022
alastairp
mm
2022-07-20 20115, 2022
alastairp
but does this mean that sec-type 8 (dj-mix) is sorted before type 6 (live)?
2022-07-20 20153, 2022
lucifer
each group should be assigned the same id so that the next sort gets a chance, no?
2022-07-20 20106, 2022
lucifer
*all items in a group
2022-07-20 20107, 2022
mayhem
the sec type only comes into play for groups that have the same primary type, no?
2022-07-20 20142, 2022
alastairp
you're saying so that you would get the earliest of (compilation or soundtrack or live), rather than the earliest compilation, or if there are no compilations the earliest soundtrack, or if there are no.... etc, lucifer?
2022-07-20 20102, 2022
lucifer
alastairp: yup.
2022-07-20 20127, 2022
alastairp
when I mentioned groups yesterday I didn't mean to imply that I considered them all the same, I think it's OK to have an explicit ordering
2022-07-20 20104, 2022
lucifer
ah ok, my understanding was the other way around. makes sense to have it this way then.
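[note: a minimal sketch of the two orderings discussed above, assuming made-up rows and a made-up rank order. "explicit" gives each secondary type its own rank, so the earliest compilation wins even if a live release is older. "grouped" gives every type the same rank, so the next sort key (the year) decides. none of these names come from the mapper code.]

```python
# Hypothetical rows: (secondary_type, release_year). Not real MusicBrainz data.
releases = [
    ("live", 1995),
    ("soundtrack", 1999),
    ("compilation", 2003),
]

# Explicit ordering: each secondary type has its own rank (assumed order).
EXPLICIT_RANK = {"compilation": 0, "soundtrack": 1, "live": 2}

# Grouped ordering: all types share one rank, so the year breaks the tie.
GROUPED_RANK = {"compilation": 0, "soundtrack": 0, "live": 0}


def pick_first(rows, rank):
    """Return the row that sorts first by (rank of its type, year)."""
    return min(rows, key=lambda row: (rank[row[0]], row[1]))


explicit_choice = pick_first(releases, EXPLICIT_RANK)
grouped_choice = pick_first(releases, GROUPED_RANK)
```

[with these rows the explicit ordering picks the 2003 compilation and the grouped ordering picks the 1995 live release.]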
2022-07-20 20122, 2022
Pratha-Fish
alastairp: Hi, I am back. Took longer than I expected lol
2022-07-20 20140, 2022
Pratha-Fish
Also, looks like the conversion is completed too :)
2022-07-20 20145, 2022
Pratha-Fish
It took 83.1 Hours in total
2022-07-20 20153, 2022
alastairp
Pratha-Fish: yeah, I was going to ask you about that
2022-07-20 20100, 2022
mayhem
merely a round off error from 50 hours, no worries. :)
2022-07-20 20112, 2022
Pratha-Fish
☠️
2022-07-20 20116, 2022
alastairp
same order of magnitude, and still less than a week
2022-07-20 20132, 2022
alastairp
next time we need to do this, let's multi-thread it 8x
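[note: a rough sketch of running a per-file job with 8 workers, using the standard library. `convert_file` is a hypothetical stand-in, not the real conversion code. threads mostly help when the work waits on disk or network; for CPU-heavy work `ProcessPoolExecutor` from the same module is the usual swap.]

```python
from concurrent.futures import ThreadPoolExecutor


def convert_file(path):
    """Hypothetical stand-in for the real per-file conversion step."""
    return path.upper()


def convert_all(paths, workers=8):
    """Run convert_file over paths with a worker pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_file, paths))


results = convert_all(["a.txt.gz", "b.txt.gz", "c.txt.gz"])
```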
that's great. I also tested the data independently myself, so I think that we're pretty safe in deciding this
2022-07-20 20102, 2022
Pratha-Fish
🎉
2022-07-20 20111, 2022
alastairp
it means that we can remove track lookups from everything (remember also to put this in our doc explaining why we no longer have it!)
2022-07-20 20133, 2022
Pratha-Fish
Definitely
2022-07-20 20134, 2022
alastairp
yeah, I imagine that some files are large (users who have a lot of scrobbles)
2022-07-20 20138, 2022
lucifer
you can also try adding an erroneous entry to a file manually, then run it on 3-4 files including the maligned file to confirm it wasn't a logger issue.
2022-07-20 20129, 2022
Pratha-Fish
lucifer: sure I'll try it out. I have the faulty data ready too
2022-07-20 20126, 2022
alastairp
Pratha-Fish: ./49/49dc8e61-67ca-4ad1-bf53-437856924777.txt.gz is the largest file
2022-07-20 20140, 2022
alastairp
you could try and run it as a one-off to see if it takes ~12 seconds
2022-07-20 20147, 2022
Pratha-Fish
okie
2022-07-20 20132, 2022
alastairp
Pratha-Fish: I'll move the rec_track_checker/MLHD directory to /data, don't panic when it disappears
whereas the same one on README.md shows `content-type: application/octet-stream`
2022-07-20 20126, 2022
Pratha-Fish
wow that's interesting
2022-07-20 20130, 2022
alastairp
so the browser sees that and says "whoops, I'd better download this"
2022-07-20 20149, 2022
alastairp
try a zip file or something as well, based on the content-type the browser will choose what to do
2022-07-20 20101, 2022
Pratha-Fish
yep
2022-07-20 20107, 2022
alastairp
if we could configure nginx to send `text/plain` as the content type then the browser would probably display it
2022-07-20 20132, 2022
alastairp
regarding nginx/apache, no great difference. I used to use apache, then one day I started using nginx. about 15 years ago it had some features that made it faster
2022-07-20 20106, 2022
alastairp
Pratha-Fish: anyway, back to what we were discussing
2022-07-20 20118, 2022
alastairp
I think that it is now a top priority to start moving some of these notebooks to scripts (as you said you had started doing). It's very difficult for me to share code with you in the notebook, because every time I run your notebook, the outputs change and it causes really annoying git diffs
2022-07-20 20141, 2022
alastairp
with a python script, I'd be able to open a pull request on your repo to add this html template, for example
2022-07-20 20117, 2022
Pratha-Fish
I see, I'll get that one done ASAP then
2022-07-20 20148, 2022
alastairp
so let's focus on that. I'd like a script that I can run which takes the list of files (df2_artist_rec_names_artist_list.txt), looks up the necessary data, does the mapping lookup, and then writes the debug html
yes, requests cache just made testing faster without putting load on the mapping API
2022-07-20 20119, 2022
alastairp
do you see that even if the lookup that you do returns data from the cache, you'll still sleep for 0.5 seconds?
2022-07-20 20104, 2022
Pratha-Fish
yes, that one was placed there to reduce load on the server
2022-07-20 20124, 2022
alastairp
but if you don't access the server, there's no reason to reduce load on it
2022-07-20 20147, 2022
Pratha-Fish
Hmmmmmmmmm
2022-07-20 20154, 2022
Pratha-Fish
never thought of it!
2022-07-20 20102, 2022
alastairp
mmmhm
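[note: a minimal sketch of the point above, sleeping only when the server was actually contacted. it assumes the session is a requests-cache session, whose responses carry a `from_cache` attribute. `polite_get` and its parameters are hypothetical names, not code from the notebook.]

```python
import time


def polite_get(session, url, delay=0.5, sleep=time.sleep):
    """Fetch url and pause only if the response came from the server."""
    response = session.get(url)
    # requests-cache sets from_cache=True on cached responses; plain
    # requests responses lack the attribute, so treat them as uncached.
    if not getattr(response, "from_cache", False):
        sleep(delay)
    return response
```

[the `sleep` parameter is only there so the delay can be swapped out when testing.]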
2022-07-20 20149, 2022
alastairp
this is why I suggested an alternative, of saving the result of the lookup to a file, and skipping the lookup if the file with the result already exists
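[note: a small sketch of that alternative, assuming each lookup result can be stored as JSON and that the key is safe to use as a file name. `cached_lookup` and `fake_lookup` are hypothetical names for illustration only.]

```python
import json
import tempfile
from pathlib import Path


def cached_lookup(key, lookup, cache_dir):
    """Return lookup(key), reusing a saved JSON result when one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Result already on disk: skip the lookup entirely.
        return json.loads(path.read_text(encoding="utf-8"))
    result = lookup(key)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_lookup(key):
    """Stand-in for the real mapping lookup; records each call."""
    calls.append(key)
    return {"query": key, "match": None}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_lookup("example-key", fake_lookup, tmp)
    second = cached_lookup("example-key", fake_lookup, tmp)
```

[the second call reads the saved file, so `fake_lookup` only runs once and no sleep is needed for it.]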
however, click on onelinedrawing and it takes you to an artist called "Jonah Matranga"
2022-07-20 20124, 2022
Pratha-Fish
Interesting
2022-07-20 20126, 2022
alastairp
This is because MB allows you to have "artist credits" for releases and recordings, which may be different to the name of the person (because of stylistic reasons, or maybe their name changed but they're the same person, or... there are many reasons)
2022-07-20 20107, 2022
alastairp
anyway, I think that the query I gave you uses the artist's "official" name, rather than the credit. but the mapping code uses the credit
2022-07-20 20126, 2022
alastairp
this is probably why we were getting so many empty matches
2022-07-20 20133, 2022
alastairp
so, let's fix the query
2022-07-20 20100, 2022
Pratha-Fish
🧠 ✨
2022-07-20 20142, 2022
Pratha-Fish
That makes me think, does the release-MBID column really help us with anything?
2022-07-20 20155, 2022
alastairp
not at the moment :)
2022-07-20 20118, 2022
Pratha-Fish
Great, one less thing to worry about
2022-07-20 20130, 2022
alastairp
let's talk about that in a few weeks. I came up with some ideas with mayhem yesterday. when we've finished these current tasks we can address it, but it's not useful for us yet
2022-07-20 20139, 2022
alastairp
ok, so
2022-07-20 20140, 2022
alastairp
select recording.gid as rec_gid, array_agg(artist.gid) as artist_credit_list from recording join artist_credit ac on ac.id=artist_credit join artist_credit_name acn on acn.artist_credit=ac.id join artist on artist.id = acn.artist group by recording.gid limit 10;
2022-07-20 20105, 2022
alastairp
I gave you a query like this, which gives you the artists on a recording, but then we used the `artist` table to look up the actual artist, right?