alastairp: the changes for series and author were already there in 452 right?
alastairp
yes, I had a look and I saw the commits there
ansh
So we did not have to merge 453 and 455
The commits would be repeated
alastairp
correct. in fact, I only merged 452, and github automatically closed 453 and 455
because it saw that their commits also existed in `master`
ansh
Oh I thought you merged all three
alastairp
nope, magic
yvanzo
al-bondigas: It should not require any change to files provided by git. If you made any changes, revert them.
alastairp
ansh: did you move 456 back to draft? and 458 is pending further work?
ansh
For 456, I'll quickly rebase and make it for all the entity types
For 458, the error message is not getting translated
alastairp
that sounds like a bug
ansh
Since it is an API, we don't have an option to translate it
We are not providing any language to it
alastairp
ahhh I thought it was a user message
my mistake. no problem about the translation, then
how about the issue of encountering a server error if we forget to update the mapping?
I think that just using a generic word in this error is OK. it's been like this since we added other entities to CB
ansh
For adding a new entity, we already have 4-5 mappings iirc
alastairp
I wonder if we could move these all to one place
ansh
The generic word also sounds good, but in CB-444, monkey mentioned "Is there a way we can use the type of entity we are submitting a review for?", so I did it that way
alastairp
ah, interesting. this error is passed through directly to the BB user from the CB API
ansh
I think it is also being done in LB
alastairp
even in this case, I don't see a problem with saying "item" instead of album
or "entity"
but in the case that we do want to show the actual entity type, we absolutely should have only one mapping of entity types to names, and ideally it should be in an easy to find place (the link that I just posted is a bad place)
I don't know what the behaviour of gettext is if we try and use a translated value in the API without first setting it up. perhaps we can have a plain dict in review.py, and then wrap the values in gettext for the version in user.py
though it's not clear if gettext will generate the .po files correctly if we call gettext in a loop, rather than passing it a constant
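A minimal sketch of the pattern described above, with hypothetical names (`N_`, `ENTITY_TYPE_NAMES`, and the entity keys are illustrative, not CritiqueBrainz's actual identifiers): keep a plain dict of constants in review.py, mark the literals with a no-op marker so the extraction tooling still writes them into the .po files, and apply gettext only at display time.

```python
# review.py: a plain dict of constants, nothing translated at import time.
# N_ does nothing at runtime, but if it is registered as an extraction
# keyword (e.g. `pybabel extract -k N_`), the string literals still end
# up in the .po files. This sidesteps the "gettext in a loop" problem:
# extraction tools only see string literals, never values of variables.
def N_(message):
    return message

ENTITY_TYPE_NAMES = {  # hypothetical mapping: one place for all entity types
    "release_group": N_("album"),
    "event": N_("event"),
    "place": N_("place"),
    "artist": N_("artist"),
}

# user.py: wrap the stored constant in gettext only at display time,
# inside a request context where the user's locale is known.
# (user.py would import ENTITY_TYPE_NAMES from review.py.)
from flask_babel import gettext

def display_entity_type(entity_type: str) -> str:
    return gettext(ENTITY_TYPE_NAMES[entity_type])
```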
ansh
I am done with the changes, just testing it before pushing
alastairp
great
Pratha-Fish
alastairp: Hi, are you currently free for a discussion?
ansh
alastairp: Pushed the changes
yvanzo
al-bondigas: I just tested creating a database from the current master HEAD, which is tagged v-2022-08-09, and it worked flawlessly, so it isn’t a regression.
al-bondigas: Since it is an issue with your setup, I recommend you restart from scratch after clearing everything as follows: `sudo docker-compose down --remove-orphans --rmi all --volumes && cd .. && rm -fr musicbrainz-docker`
alastairp
ansh: that might have been just before I added to the enum
ansh
Okay
alastairp
Pratha-Fish: remind me of where we got to in the last few weeks. From what I remember, we were fixing some bugs in querying/sorting external APIs, and we got metadata matches for almost everything in the catalogue, but there are still some unmatched?
Pratha-Fish
alastairp: yes, that's right
We were working on a script that took MLHD as input, and first cleaned the rec mbids to canonical mbids
alastairp
ah that's right, we had the standalone scripts, too
Pratha-Fish
and then looked up the canonical mbid using either MBC or the mapper API
The MBC looked a lot more accurate, so we went ahead with it and started debugging it
However, I also noticed a much more severe issue today
alastairp
and I remember that we also discussed loading the underlying MBC table into memory and directly looking up the data instead of using the API (to make it faster). Did you get anywhere with that?
what issues have you found?
Pratha-Fish
alastairp: yes, the current version looks up MBC in memory :)
The issue:
When fetching canonical MBIDs for MLHD recording mbids, my script loaded 400k rows, and found canonical MBIDs for all but 12k rows
e.g.
This is the original MBID from MLHD: 07d193fb-d411-41fc-b9ee-9c5902de137a
This is the canonical MBID according to my script: caeea2b6-f751-4ea8-9c9f-ac092a1b8f08
https://musicbrainz.org/release/0b765eaf-d58e-4... annotation: "Released by Amy Lee and Ben Moody in 1998 at a concert at Vino's in Little Rock, Arkansas, as a printed CD-R. The EP sold out that night"
alastairp
so, are the acoustic recordings from the Brazil release of the album the same recordings from the 1998 live concert?
unsure unless we listen to them
that is, maybe the solution is to actually merge the recordings in the MB database instead
Pratha-Fish
Is that verified by acoustic fingerprinting?
(as if I know what it actually is in the first place)
alastairp
we could look at the fingerprints if both recording mbids have one submitted
Pratha-Fish
Right
alastairp
this is in fact one of the things that mayhem and I discussed that we could do with the data - take a look at these differences in data and give a list to the community, saying "here is something that perhaps could be improved in the MB database"
Pratha-Fish
Yea, I remember you guys discussing that one :)
As for the mapping issue that we were talking about,
There's 2 stages in the mapping script:
1) Cleaning MLHD_rec_mbid into MLHD_canonical_mbid (wherever possible)
2) Using this MLHD_canonical_mbid to lookup rec_name and artist_credit_name
Now, the 1st stage only uses the MB_rec_redirect and MB_rec_canonical tables to look up original rec_mbids and fetch their respective canonical MBIDs
so, the 1st stage doesn't use MBC at all. Could it still be a sorting issue?
alastairp
do you have code for the first stage that you can link to me?
Pratha-Fish
The 2 highlighted lines here just loop through the MLHD data, and replace any rec_mbids if they're found in the redirect table
alastairp
right, this is doing just the redirect table?
Pratha-Fish
It does the same thing again, but with the canonical rec mbid table
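Since the linked code isn't reproduced here, a minimal sketch of the two-pass cleanup as described, assuming the two tables are loaded into memory as plain MBID-to-MBID dicts (all names hypothetical):

```python
# Hypothetical sketch of stage 1: two passes over the MLHD recording
# MBIDs, first through the gid_redirect mapping, then through the
# canonical-recording mapping.
def clean_mbids(mlhd_mbids, gid_redirect, canonical_redirect):
    # Pass 1: follow merged-recording redirects.
    redirected = [gid_redirect.get(m, m) for m in mlhd_mbids]
    # Pass 2: map to the canonical recording where one exists.
    return [canonical_redirect.get(m, m) for m in redirected]
```

Note that `.get(m, m)` keeps an MBID unchanged when it has no redirect entry; a lookup that instead drops unmatched rows would silently lose most of the input, since most MBIDs are already canonical.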
alastairp
ahh, I see
so you do mbid -> mbid_gid_redirect and then do the canonical lookup, and then... use that mbid to get metadata, and look up mbc?
Pratha-Fish
exactly
alastairp
right. I think I see where some confusion may be
our expectation is that the canonical mbid table and mbc should probably give the same results in almost all cases, but to verify this we could test:
mbid -> mbid_gid_redirect -> canonical
and separately
mbid -> mbid_gid_redirect -> metadata -> mbc
Pratha-Fish
what does "metadata" here stand for?
alastairp
the artist credit name and recording title
Pratha-Fish
Looks like so far we have been using the mbid -> mbid_gid_redirect -> metadata -> mbc way
and the metadata here is fetched using
alastairp
now, keep in mind the trick here: the canonical mbid mapping is actually based on the data from the mbc table, so I expect that the metadata will actually be the same regardless of whether we look up redirect -> canonical -> metadata -> mbc or redirect -> metadata -> mbc
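A hedged sketch of that check: run both paths over a sample of MBIDs and count disagreements. `mbc_lookup` and the dict names are hypothetical stand-ins for the in-memory tables described earlier.

```python
# gid_redirect, canonical_redirect: dicts of MBID -> MBID.
# metadata: dict of MBID -> (artist_credit_name, recording_title).
# mbc_lookup(artist, title) -> MBID stands in for the mbc table lookup.
def compare_paths(sample_mbids, gid_redirect, canonical_redirect,
                  metadata, mbc_lookup):
    mismatches = []
    for mbid in sample_mbids:
        redirected = gid_redirect.get(mbid, mbid)
        # Path A: redirect -> canonical
        canonical = canonical_redirect.get(redirected, redirected)
        # Path B: redirect -> metadata -> mbc
        artist, title = metadata[redirected]
        via_mbc = mbc_lookup(artist, title)
        if canonical != via_mbc:
            mismatches.append((mbid, canonical, via_mbc))
    return mismatches
```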
so these 12k rows that you mentioned - are these from where the mbc lookup and the canonical lookup give different mbids?
do you mean that 388k rows keep the same mbid, and 12k of them change?
Pratha-Fish
Basically:
I load 400k rows -> check in mbid_gid_redirect -> check in canonical_recording_redirect
After this "cleanup" process, only 12k rows are left out of 400k
alastairp
I don't know what you mean by "rows are left"
Pratha-Fish
12k rows (12k mbids) are deemed canonical, and the rest are just dropped for unknown reasons
alastairp
oh, you're _missing_ 388k rows?
Pratha-Fish
exactly
Let me explain with an example
alastairp
oops, that doesn't sound good
Pratha-Fish
Right
I must be doing something incredibly stupid in the cleanup process, because our earlier tests showed almost all recording MBIDs were good to go without any cleanup
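One plausible failure mode, offered as a guess rather than a diagnosis of the actual script: an inner join keeps only the MBIDs that appear in the redirect table, and most MBIDs have no redirect entry because they are already canonical. A minimal pandas illustration:

```python
import pandas as pd

mlhd = pd.DataFrame({"rec_mbid": ["a", "b", "c", "d"]})
redirect = pd.DataFrame({"old": ["b"], "new": ["b2"]})

# Inner merge (the default): only rows WITH a redirect entry survive,
# here 1 row out of 4; the other 3 vanish.
inner = mlhd.merge(redirect, left_on="rec_mbid", right_on="old")

# Left merge keeps every input row; fall back to the original MBID
# where there is no redirect entry.
left = mlhd.merge(redirect, left_on="rec_mbid", right_on="old", how="left")
left["rec_mbid"] = left["new"].fillna(left["rec_mbid"])
```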