In the Figma file, second frame, first page: top artists of 2022
2023-01-02 00213, 2023
jasje
top albums of 2022
2023-01-02 00245, 2023
lucifer
jasje: you can construct the url for the image this way: `https://archive.org/download/mbid-{caa_release_mbid}/mbid-{caa_release_mbid}-{caa_id}_thumb500.jpg`
2023-01-02 00219, 2023
lucifer
items in the top album list have the caa_id and caa_release_mbid fields present and not null if cover art is available.
2023-01-02 00230, 2023
lucifer
if the fields are missing, then no cover art is available.
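A minimal sketch (Python, hypothetical helper rather than actual ListenBrainz code) of building that thumbnail URL from a top-albums item, following lucifer's description above:

```python
from typing import Optional

def cover_art_url(item: dict) -> Optional[str]:
    """Build the 500px Cover Art Archive thumbnail URL for a top-albums item."""
    caa_id = item.get("caa_id")
    caa_release_mbid = item.get("caa_release_mbid")
    # If either field is missing or null, no cover art is available.
    if not caa_id or not caa_release_mbid:
        return None
    return (
        f"https://archive.org/download/mbid-{caa_release_mbid}/"
        f"mbid-{caa_release_mbid}-{caa_id}_thumb500.jpg"
    )
```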
[musicbrainz-android] dependabot[bot] opened pull request #175 (master…dependabot/gradle/org.jetbrains.kotlin-kotlin-gradle-plugin-1.8.0): Bump kotlin-gradle-plugin from 1.7.10 to 1.8.0 https://github.com/metabrainz/musicbrainz-android…
2023-01-02 00245, 2023
BrainzGit
[musicbrainz-android] dependabot[bot] closed pull request #163 (master…dependabot/gradle/org.jetbrains.kotlin-kotlin-gradle-plugin-1.7.22): Bump kotlin-gradle-plugin from 1.7.10 to 1.7.22 https://github.com/metabrainz/musicbrainz-android…
2023-01-02 00253, 2023
BrainzGit
[musicbrainz-android] dependabot[bot] opened pull request #176 (master…dependabot/gradle/com.squareup.okhttp3-mockwebserver-5.0.0-alpha.11): Bump mockwebserver from 5.0.0-alpha.7 to 5.0.0-alpha.11 https://github.com/metabrainz/musicbrainz-android…
2023-01-02 00229, 2023
Toasty joined the channel
2023-01-02 00221, 2023
Pratha-Fish
Hi alastairp, Hope you had a great holiday :)
2023-01-02 00244, 2023
alastairp
hi Pratha-Fish, how are you?
2023-01-02 00255, 2023
alastairp
I had a busy break, but it was fulfilling
2023-01-02 00228, 2023
Pratha-Fish
These breaks never last long enough 🥲
2023-01-02 00256, 2023
Pratha-Fish
Even my college was going to open back around ~20th Jan, but it looks like they changed their mind and started it again from today itself
2023-01-02 00207, 2023
alastairp
your college is so weird
2023-01-02 00225, 2023
Pratha-Fish
alastairp: You have seen nothing yet 💀
2023-01-02 00234, 2023
alastairp
if mine changed anything from what was agreed 2 years ago, the unions would shut everything down
2023-01-02 00222, 2023
Pratha-Fish
Well, you went to a pretty good college. My college campus itself looks like something out of Far Cry 3... With gangs and stuff
2023-01-02 00245, 2023
alastairp
😬
2023-01-02 00200, 2023
Pratha-Fish
Thankfully some good friends make college life manageable haha
2023-01-02 00233, 2023
Pratha-Fish
But anyway, hopefully I'll be taking more days off ahead, so work shouldn't be much of a problem
Sorry you had to go through all that effort even though I had 4 months to see the work through 🥲
2023-01-02 00244, 2023
Toasty has quit
2023-01-02 00255, 2023
alastairp
Pratha-Fish: yes, right. loading some more data from musicbrainz, and then being very detailed about how we treated each bit of data and what we do with it in each case
2023-01-02 00224, 2023
Pratha-Fish
Hmm
2023-01-02 00229, 2023
alastairp
Pratha-Fish: you would be surprised... remember that 90% of what I wrote in this change was easy for me because 1) I've done this kind of thing for 15 years, or 2) you did all of the heavy lifting to answer all of our unknown questions about the data
2023-01-02 00242, 2023
alastairp
I'm still really happy with where we got to
2023-01-02 00235, 2023
Pratha-Fish
Well that was one hell of a learning curve haha. I still don't know how we got to the end in the first place lol
2023-01-02 00245, 2023
alastairp
the key was to work out which bits of data your conversion code had left behind; I tried to discuss this carefully in the comments in the process_df_new function
2023-01-02 00225, 2023
Pratha-Fish
Thanks for the comments, they have been pretty helpful
2023-01-02 00239, 2023
Pratha-Fish
I'll try to leave more along the way too
2023-01-02 00255, 2023
Pratha-Fish
alastairp: So can you give me a brief overview of the new changes? And what steps we need to take next
2023-01-02 00239, 2023
Pratha-Fish
^ Whenever you're free that is
2023-01-02 00210, 2023
alastairp
Pratha-Fish: your code worked well for items which had a recording mbid, and for which the recording mbid was valid
2023-01-02 00227, 2023
alastairp
so for recordings we go through our steps: look up if there is a redirect and replace it if necessary; look up if there is a canonical id and replace it if necessary; and then look up artist and release information
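A rough sketch of those three steps, assuming the redirect, canonical-id, and recording-info mappings are available as plain dicts (the names here are hypothetical, not the actual ListenBrainz/MLHD code):

```python
def resolve_recording(recording_mbid, redirects, canonical_ids, recording_info):
    # 1) follow a redirect if the MBID has been merged into another recording
    recording_mbid = redirects.get(recording_mbid, recording_mbid)
    # 2) replace it with the canonical recording MBID if one exists
    recording_mbid = canonical_ids.get(recording_mbid, recording_mbid)
    # 3) look up artist and release information for the final MBID
    info = recording_info.get(recording_mbid)
    if info is None:
        return None  # unknown recording; fall back to the release/artist cases
    return {"recording_mbid": recording_mbid, **info}
```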
2023-01-02 00217, 2023
alastairp
there's one pending item that came up in my testing here where we have what are called "non album tracks" in musicbrainz - that is, a recording with no related album. It turns out that there were quite a few of these, and I think it's due to bad data in the mlhd; we need to come up with a better way of looking these up
2023-01-02 00203, 2023
alastairp
Then I was looking at the case of "what if there is an artist and release id, but no recording?" - you had already considered this a bit (your keep_missing, turn_blank parameters)
2023-01-02 00246, 2023
alastairp
when I started looking at the data in detail, a few things became clear to me. first, we were talking about having 2 datasets, one with only rows that have all columns (artist, release, recording), and one that has all rows from mlhd (without throwing away bad data)
2023-01-02 00222, 2023
alastairp
I realised that we could actually make a single dataset that contains both of these, by making one set of files with the "all column" data, and another set of files with the same filenames in a separate directory containing _only_ the incomplete rows. This means if you need only all-column data, you read just 1 file, and if you want all rows, you read 2 files and merge the rows together
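For illustration, a sketch of reading that layout back, assuming Parquet files with identical names under hypothetical "complete/" and "incomplete/" directories:

```python
import os
import pandas as pd

def load_listens(base_dir: str, filename: str, include_incomplete: bool = False) -> pd.DataFrame:
    # Rows that have artist, release, and recording MBIDs.
    complete = pd.read_parquet(os.path.join(base_dir, "complete", filename))
    if not include_incomplete:
        return complete
    # The same filename in the other directory holds only the incomplete rows,
    # so concatenating the two gives back every row from the original file.
    incomplete = pd.read_parquet(os.path.join(base_dir, "incomplete", filename))
    return pd.concat([complete, incomplete], ignore_index=True)
```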
2023-01-02 00223, 2023
Pratha-Fish
"non album tracks" that's a new one
2023-01-02 00246, 2023
alastairp
this is great because it means we don't need to make the dataset 2x as big to get all of the necessary data
2023-01-02 00203, 2023
Pratha-Fish
What do you mean by "column data" here?
2023-01-02 00229, 2023
alastairp
I mean the columns from the mlhd: timestamp, artist id, release id, recording id
2023-01-02 00247, 2023
Pratha-Fish
Ah I see
2023-01-02 00258, 2023
alastairp
so I mean "rows for which we have an artist id, a release id, and a recording id"
2023-01-02 00207, 2023
Pratha-Fish
And while reading this, something just popped up in my mind
2023-01-02 00229, 2023
alastairp
because there are a bunch of rows with only an artist and release id, and sometimes there are rows with only a timestamp (we know someone listened to something at this time, but there's no record of what it was)
2023-01-02 00224, 2023
Pratha-Fish
Given that people tend to listen to the same songs again and again, we can just make a dataset of unique rows, and then couple it with another dataset with just the row_ID and a list of timestamps at which the track was listened to
2023-01-02 00208, 2023
alastairp
yes, possibly. there are actually databases that are designed to consider/store data in this alternate format. I'm not sure how much space it would save, but we can definitely try it and find out
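As an illustration of the idea, a pandas sketch that splits a listens DataFrame into a unique-tracks table and a row_id → timestamps table (the column names are assumed):

```python
import pandas as pd

def split_unique_and_timestamps(df: pd.DataFrame):
    key_cols = ["artist_mbid", "release_mbid", "recording_mbid"]
    # One row per distinct track, keyed by a synthetic row_id.
    unique = df[key_cols].drop_duplicates().reset_index(drop=True)
    unique["row_id"] = unique.index
    # Map every listen back to its row_id and collect the listen timestamps.
    timestamps = (
        df.merge(unique, on=key_cols)
          .groupby("row_id")["timestamp"]
          .apply(list)
          .reset_index()
    )
    return unique, timestamps
```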
2023-01-02 00255, 2023
Pratha-Fish
Yes exactly
2023-01-02 00222, 2023
Pratha-Fish
I came across another last.fm dataset too a while ago, and they seemed to distribute the data this way
2023-01-02 00211, 2023
alastairp
the other thing that came up was some of our previous questions about recordings, but in this case applied to releases. So, I did the same process - 1) perform a redirect lookup, 2) find a canonical release id (this is a new dataset that I made only a month ago), 3) find the artist of the release
2023-01-02 00226, 2023
alastairp
this brought up another question about whether the release id is actually the correct field to use - we realised that in probably 99% of the cases where people want to use this dataset, they're really just interested in knowing the general concept of "what album did someone listen to", not the specific version/format/year that it was released in
2023-01-02 00245, 2023
alastairp
in this case, the release group id is a better choice. so we need to work out how to add this to the dataset
2023-01-02 00241, 2023
Pratha-Fish
That definitely sounds a lot better
2023-01-02 00242, 2023
Pratha-Fish
Also, with our previous code, we were clearly missing out on some huge chunks of data where the recording MBID wasn't present, but the release MBID or artist MBID was
2023-01-02 00205, 2023
Pratha-Fish
And I guess we can also derive artist MBIDs from release MBIDs too
2023-01-02 00231, 2023
alastairp
right, so I addressed that in the second and third parts of the process_df_new function
2023-01-02 00201, 2023
alastairp
see that there is an 'if recording' check, and then if things are successful there is a 'continue', otherwise it falls through to the 'if release' and 'if artist' cases
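Schematically (not the actual process_df_new code; the lookup dicts and helper below are hypothetical), the fall-through looks like:

```python
def classify_rows(rows, recording_info, release_info, artist_info):
    complete, incomplete = [], []
    for row in rows:  # each row: dict with timestamp + artist/release/recording MBIDs
        rec = recording_info.get(row.get("recording_mbid"))
        if rec is not None:
            complete.append({**row, **rec})
            continue  # fully resolved, skip the fallback cases
        rel = release_info.get(row.get("release_mbid"))
        if rel is not None:
            incomplete.append({**row, **rel})
            continue
        art = artist_info.get(row.get("artist_mbid"))
        incomplete.append({**row, **art} if art is not None else row)  # may be timestamp only
    return complete, incomplete
```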
2023-01-02 00236, 2023
Pratha-Fish
Great!
2023-01-02 00254, 2023
Pratha-Fish
It makes me wonder, how's the processing time looking as of now?
2023-01-02 00214, 2023
alastairp
I don't really know - I think slower than your original code, but not by much