No more errors and it seems like it's starting to decrease... hopefully that was all there is to it then?
2022-03-25 08459, 2022
reosarevok
But keeping an eye
2022-03-25 08406, 2022
lucifer
maybe some missed error handling in sir. it disconnected temporarily for some reason but then didn't reconnect, while the rest of the code assumed it did.
2022-03-25 08433, 2022
alastairp
morning
2022-03-25 08454, 2022
lucifer
🤦 i forgot to rebuild the mapping container so it used the wrong commit :( building again
2022-03-25 08412, 2022
alastairp
I'm without internet at home, I'll hang around working offline, and try and jump in on tethering every now and again until things are back to normal
2022-03-25 08415, 2022
lucifer
morning!
2022-03-25 08441, 2022
reosarevok
Hmm, the queues are actually slowly rising again, but no errors. Maybe it's just the issue where if it's too high it doesn't come back down on its own. I'll try the whole saving-the-messages thing yvanzo documented
2022-03-25 08404, 2022
d4rkie joined the channel
2022-03-25 08439, 2022
reosarevok
All saved, let's see what happens now
2022-03-25 08412, 2022
mayhem returns after a surprise visit from a friend
2022-03-25 08455, 2022
reosarevok
Oh no, unexpected socialising
2022-03-25 08403, 2022
reosarevok shudders :D
2022-03-25 08420, 2022
reosarevok
Ok, sir seems to be working fine and processing messages again
2022-03-25 08432, 2022
reosarevok
Will start queuing the saved messages in small batches
2022-03-25 08417, 2022
mayhem
do we have examples for two recordings that are in conflict I can look at?
2022-03-25 08421, 2022
PrathameshG has quit
2022-03-25 08411, 2022
mayhem
the vacuum analyze on bono finished. but swapping the new mb_metadata_cache table into production at gaga didn't finish. odd.
2022-03-25 08433, 2022
atj
reosarevok: hello
2022-03-25 08440, 2022
mayhem
ah! stopping listenbrainz-web-test allowed it to finish.
2022-03-25 08456, 2022
reosarevok
atj: hi! seemingly not actually a rabbitmq issue after all, so no need anymore :)
ah, I see. would it be cheeky to call that a data problem?
2022-03-25 08405, 2022
reosarevok
Yes
2022-03-25 08417, 2022
reosarevok
I mean, if it's really dropping 3 million recordings, yes :D
2022-03-25 08418, 2022
mayhem
"David Guetta,JD Davis"
2022-03-25 08439, 2022
reosarevok
Well, if it was "David Guetta, JD Davis" with a space you'd get the same :)
2022-03-25 08440, 2022
mayhem
"David Guetta & J.D. Davis"
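A minimal sketch of why these credits collide, assuming the lookup key is built by stripping every non-word character and lowercasing (the function name `combined_lookup`, the exact regex, and the placeholder title are assumptions for illustration, not the mapper's confirmed code):

```python
import re


def combined_lookup(artist_credit_name: str, recording_name: str) -> str:
    """Hypothetical normalisation: drop all non-word characters, then lowercase."""
    return re.sub(r"[^\w]+", "", artist_credit_name + recording_name).lower()


# The three artist credit spellings quoted in the discussion.
credits = [
    "David Guetta,JD Davis",
    "David Guetta, JD Davis",
    "David Guetta & J.D. Davis",
]

# "Example Title" is a placeholder; the log does not name the recording.
keys = {combined_lookup(credit, "Example Title") for credit in credits}
print(keys)
```

Under that assumption all three spellings reduce to one key, so rows that differ only in punctuation or join phrase end up competing for the same lookup.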
2022-03-25 08404, 2022
mayhem
and this isn't an argument for merging this stuff into a single AC?
2022-03-25 08408, 2022
reosarevok
If one release prints it one way and the other the other way, it's correct to have it like that (probably with a space after the comma though)
2022-03-25 08424, 2022
mayhem
ah yes, fair.
2022-03-25 08424, 2022
reosarevok
You can't, acs explicitly need to have the same credit, join phrases, etc
2022-03-25 08459, 2022
mayhem
ok.
2022-03-25 08403, 2022
lucifer
for our purposes, we'd mark one as canonical and redirect all others to it though
2022-03-25 08405, 2022
mayhem
I really have no idea how to resolve this.
2022-03-25 08406, 2022
reosarevok
I did suggest that an option might be to not use ACs but AC artist MBIDs for deduping
2022-03-25 08417, 2022
reosarevok
But that might not work for matching to messybrainz :)
2022-03-25 08427, 2022
reosarevok
I think the least bad option is what lucifer said
2022-03-25 08408, 2022
mayhem
that could work if the underlying audio is the same track. does that appear to be the case?
2022-03-25 08414, 2022
mayhem
no
2022-03-25 08421, 2022
lucifer
uh yeah in this case no.
2022-03-25 08425, 2022
reosarevok
You just take all the MBIDs for a specific combined_lookup and throw it into canonical_recording
2022-03-25 08435, 2022
lucifer
consider track length too?
2022-03-25 08449, 2022
reosarevok
I mean, you'll already be conflating actually-different-recordings anyway
2022-03-25 08454, 2022
mayhem
track length would open a greater can of worms.
2022-03-25 08401, 2022
lucifer
yeah indeed
2022-03-25 08409, 2022
lucifer
not to mention that we don't have it for most listens
2022-03-25 08416, 2022
reosarevok
AFAICT, you're already merging live and studio versions if they have the same title + ac, no
2022-03-25 08419, 2022
reosarevok
?
2022-03-25 08429, 2022
mayhem
yeah
2022-03-25 08438, 2022
reosarevok
So it doesn't seem any different to me
2022-03-25 08452, 2022
mayhem
and there are a lot of liberties that have been taken here in order to get a decent mapping.
2022-03-25 08459, 2022
reosarevok
As I said, yes, there's a small chance you'll conflate a track with a very common name by two different artists with the same name
2022-03-25 08407, 2022
reosarevok
But it seems about as minor as the punctuation-only issue tbh
2022-03-25 08418, 2022
mayhem
"You just take all the MBIDs for a specific combined_lookup and throw it into canonical_recording"
2022-03-25 08423, 2022
mayhem
how do you feel about that lucifer ?
2022-03-25 08440, 2022
lucifer
i guess the automatic mapper should continue to do this, but in future let users override the mapping for specific listens.
2022-03-25 08418, 2022
reosarevok
There's two ways to do that cleanly, a) you specifically exclude the *first* mbid and only throw the others in or b) (probably easier) you literally throw all into canonical_recording at first, then remove any mbids from canonical_recording that already appear on the main table
2022-03-25 08432, 2022
reosarevok
(since you're also maybe going to get some dupes that *are* just dupes)
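Option (b) can be sketched roughly as follows. This is an illustration only: the function name `split_canonical`, the in-memory dicts standing in for the main table and canonical_recording, and the assumption that rows arrive in priority order are all hypothetical.

```python
def split_canonical(rows):
    """rows: iterable of (combined_lookup, recording_mbid), best match first.

    Returns (main, canonical_recording) where main maps each lookup to its
    first MBID and canonical_recording maps every other MBID to that one.
    """
    main = {}
    canonical_recording = {}
    for lookup, mbid in rows:
        # The first MBID seen for a lookup becomes the canonical one.
        canonical = main.setdefault(lookup, mbid)
        # Step 1: throw every MBID into canonical_recording.
        canonical_recording[mbid] = canonical
    # Step 2: remove MBIDs that already appear in the main table.
    for mbid in main.values():
        canonical_recording.pop(mbid, None)
    return main, canonical_recording


rows = [("k1", "a"), ("k1", "b"), ("k1", "a"), ("k2", "c")]
main, canonical_recording = split_canonical(rows)
print(main)
print(canonical_recording)
```

The repeated ("k1", "a") row is a dupe that is just a dupe: it lands in canonical_recording in step 1 and is dropped again in step 2, so no special case is needed for it.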
2022-03-25 08401, 2022
lucifer
mayhem: yeah i agree with that unless it is entirely different artists and recordings.
2022-03-25 08408, 2022
mayhem
it would be great if we could get rid of the dedup step at the end and have the alg produce data without dups.
2022-03-25 08429, 2022
lucifer
reosarevok: do you know of an example of that? like the Prodigy one you mentioned
2022-03-25 08417, 2022
reosarevok
lucifer: an example where there's two different acs, but the same artist?
2022-03-25 08446, 2022
lucifer
uh no, different ac different artist but after removing punctuation it becomes the same.
2022-03-25 08400, 2022
reosarevok
Oh
2022-03-25 08427, 2022
reosarevok
Well, it's easier to find ones with different join phrases probably
2022-03-25 08435, 2022
lucifer
but again it's likely to be an edge case so i am in favor of letting users handle that.
here's the best match we could find, if you don't like it feel free to change it.
2022-03-25 08410, 2022
reosarevok
If it's an edge case then it can't be the cause for a 3 million recording difference? :)
2022-03-25 08429, 2022
reosarevok
But yes, in general, "match as best you can but allow to change it" seems sensible to me
2022-03-25 08430, 2022
mayhem
lucifer: I see you commented out the dedup step for canonical recordings as well. just for testing or was there a solid reason for that?
2022-03-25 08405, 2022
reosarevok
Unrelatedly, I'm slowly requeuing all those sir messages, seems like all is good
2022-03-25 08406, 2022
lucifer
mayhem: testing to confirm that dedup is also not removing rows unexpectedly.
2022-03-25 08412, 2022
mayhem
ok.
2022-03-25 08428, 2022
mayhem
then I think we should change the dedup step to insert found rows into canonical_recording
2022-03-25 08445, 2022
mayhem
it is much harder to do this earlier since we process data AC by AC.
2022-03-25 08423, 2022
lucifer
reosarevok: oh yes, for different join phrases i say redirect. that's not the edge case i am talking about. the one i called an edge case is different artists before removing punctuation, same after.
2022-03-25 08424, 2022
mayhem
s/DELETE FROM/ INSERT INTO/
2022-03-25 08430, 2022
lucifer
yeah makes sense
2022-03-25 08437, 2022
lucifer
insert into followed by a delete.
2022-03-25 08442, 2022
mayhem
yes.
2022-03-25 08406, 2022
mayhem
my schedule is discombobulated for the next 5 hours. if you're free to take a stab at it, please do lucifer .
2022-03-25 08418, 2022
mayhem
I'll be back this afternoon to continue working on this.
2022-03-25 08420, 2022
lucifer
somewhat related: canonical recordings and mbid mapping need to be in the same db for that. can't use --timescale.
2022-03-25 08420, 2022
reosarevok
You can probably do the delete + insert in the same query
2022-03-25 08435, 2022
mayhem
mb_metadata_cache, however, looks good now. caa_ids are present.
2022-03-25 08415, 2022
mayhem
lucifer: then, let's make it so that either all or none of the produced tables are stored in TS.
2022-03-25 08419, 2022
mayhem
that should work, no?
2022-03-25 08423, 2022
lucifer
yes
2022-03-25 08432, 2022
mayhem
great.
2022-03-25 08447, 2022
mayhem
reosarevok: thanks for all your help. I knew this would be easier with you helping.
2022-03-25 08453, 2022
lucifer
yes probably, but 2 data-modifying statements in 1 cte may be asking for problems, since pg doesn't guarantee which order those will run in.
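One way to sidestep the ordering question is to keep the "insert into followed by a delete" as two separate statements inside a single transaction. The sketch below uses SQLite purely so it runs standalone; the table layout and column names (`mbid_mapping`, `canonical_recording`, `id`, `combined_lookup`, `recording_mbid`, `canonical_recording_mbid`) are assumptions, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mbid_mapping (
        id INTEGER PRIMARY KEY,
        combined_lookup TEXT,
        recording_mbid TEXT
    );
    CREATE TABLE canonical_recording (
        recording_mbid TEXT,
        canonical_recording_mbid TEXT
    );
    INSERT INTO mbid_mapping (combined_lookup, recording_mbid)
    VALUES ('k1', 'a'), ('k1', 'b'), ('k2', 'c');
""")

with conn:  # one transaction: commits on success, rolls back on error
    # 1) Record every duplicate row as a redirect to the row being kept
    #    (the lowest id per combined_lookup).
    conn.execute("""
        INSERT INTO canonical_recording
        SELECT m.recording_mbid, k.recording_mbid
          FROM mbid_mapping m
          JOIN mbid_mapping k ON k.combined_lookup = m.combined_lookup
         WHERE k.id = (SELECT MIN(id) FROM mbid_mapping
                        WHERE combined_lookup = m.combined_lookup)
           AND m.id <> k.id
    """)
    # 2) Only then delete the duplicates from the mapping table.
    conn.execute("""
        DELETE FROM mbid_mapping
         WHERE id NOT IN (SELECT MIN(id) FROM mbid_mapping
                           GROUP BY combined_lookup)
    """)

mapping_rows = conn.execute(
    "SELECT combined_lookup, recording_mbid FROM mbid_mapping ORDER BY id"
).fetchall()
redirect_rows = conn.execute("SELECT * FROM canonical_recording").fetchall()
print(mapping_rows, redirect_rows)
```

Because the statements run in a fixed order and commit together, the redirect rows are always written before the duplicates disappear, and a failure in either step leaves both tables untouched.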