ruaok: not sure i understand what's wrong with those tracks. acousticbrainz.org/b58da12b-3182-4afc-b5ff-7646... shows 10 submissions, 9 of which are 185 bpm and the dataset hoster shows 184 so checks out. am i missing something here?
those outlier peaks on the right side of the graphs? I'd guess those are wrong.
(through no fault of your own)
yvanzo
O’Moin
ruaok
moin, yvanzo!
lucifer: it might be a good idea to pull up some of the tracks that make up monkey's right-hand spike and see if those tracks are all correct. I rather doubt it.
so I'm thinking out loud - there are a few fields on the bpm histogram that might be useful. e.g. I'm looking at `bpm_histogram_first_peak_weight`; given the name of the field, it might indicate how "strong" a bpm estimate is
monkey
Maybe the alg didn't get fed enough African rhythms
ruaok
lucifer: thanks. I'll turn that into a playlist for inspection in a bit.
alastairp
in which case we could remove items which have less certainty
I think monkey might be on to something too - I suspect it's pretty good for pop/rockish songs, so maybe we could filter out there as well
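[editor's note: a minimal sketch of the certainty filter proposed above, dropping submissions whose `bpm_histogram_first_peak_weight` is low. The threshold and the dict layout are illustrative assumptions, not AB's actual schema or a tuned value.]

```python
def filter_confident(submissions, min_weight=0.5):
    """Keep only submissions whose first histogram peak carries
    enough weight to trust the reported BPM. min_weight is an
    illustrative threshold, not a tuned value."""
    return [s for s in submissions
            if s.get("bpm_histogram_first_peak_weight", 0.0) >= min_weight]

subs = [
    {"bpm": 185.0, "bpm_histogram_first_peak_weight": 0.82},
    {"bpm": 123.0, "bpm_histogram_first_peak_weight": 0.31},
]
print(filter_confident(subs))  # only the 185.0 submission survives
```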
ruaok
also reminds me of the problem where the alg might pick the wrong range... something about BPM being twice or half of the true value...
alastairp
as a very basic filter, _maybe_ we could assume that a fast track is also loud?
yes - exactly that. sometimes it might pick a value twice or half, due to misidentifying the peaks
ruaok
ok, so a possible approach is to identify these cases and then adjust BPM?
note that the bpm was identified as 123 once and 185 the other 9 times
alastairp
this is one of the reasons why we return the 1st and 2nd histogram peak, too. it's possible that we could ignore items where these 2 peaks are close (and therefore the algorithm is uncertain about which one to choose)
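[editor's note: one reading of "these 2 peaks are close" is that the second peak's weight rivals the first's. A sketch under that assumption — field semantics are guessed from the names, and the ratio is arbitrary:]

```python
def is_ambiguous(first_peak_weight: float, second_peak_weight: float,
                 ratio: float = 0.8) -> bool:
    """True when the second histogram peak carries almost as much
    weight as the first, i.e. the extractor had two nearly equally
    plausible tempos and its choice shouldn't be trusted."""
    if first_peak_weight <= 0:
        return True
    return second_peak_weight / first_peak_weight >= ratio

print(is_ambiguous(0.8, 0.75))  # True: peaks nearly tied
print(is_ambiguous(0.8, 0.2))   # False: clear winner
```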
ruaok
it is clearly not 123 either.
alastairp
lucifer: are you doing any filtering/processing on this data?
lucifer
nope
ruaok
185 / 2, quite possible.
monkey
Maybe it'll be worth doing some sorting by mood instead of BPM?
lucifer
actually yes, alastairp. selecting the bpm which is closest to the mean of all bpms of that recording mbid.
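[editor's note: the selection rule described above — pick the submitted BPM closest to the mean across all submissions for a recording MBID — sketched out; the function name is invented:]

```python
from statistics import mean

def pick_bpm(bpms):
    """From all submissions for one recording MBID, pick the BPM
    closest to the mean of the submitted values."""
    m = mean(bpms)
    return min(bpms, key=lambda b: abs(b - m))

# Nine submissions at 185 and one at 123 pull the mean to 178.8,
# so the 185 cluster wins.
print(pick_bpm([185] * 9 + [123]))  # 185
```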
ruaok
mood is a much higher level of data, thus even more unreliable, monkey
I trust nothing of the high level data in AB.
alastairp
the cantelows example - I could imagine that it's finding the high plucked notes as "beats", and therefore miscalculating the BPM
monkey
I mean, BPM doesn't look super reliable or appropriate for what we're doing, but strong confidence in Aggressive/Not Aggressive might make for a better sorting
alastairp
lucifer: note that the ?n=0 submission has 123 as the first peak and 185 as the 2nd peak
monkey
That was mostly what the rollercoaster effect was for me. Calm song followed by an aggressive one.
lucifer
alastairp: currently i am ignoring those fields just using bpm field. i have those peak fields available as well in the dump if we want to try some stuff out.
ruaok
lucifer: any chance you could make a dump of the mood as you did for BPM/key?
lucifer
ruaok: a nicer version to import to bono for playlist
sure, can do but it'll probably take a long time. took 2 days to dump bpm/key.
ruaok
shit.
lucifer
maybe if i dump on frank it'll be faster? saving network trips to kiss.
ruaok
ok, then I'll try fetching from AB for testing.
lucifer
yeah that sounds better. if something pans out, we can do a full dump.
alastairp
dortmund and rosamerica are definitely "better" genres, but they have very few categories, we'd be better off finishing the genre import and using tags instead
well, we could always dump moods + weights at the same time
lucifer
yup that's possible
alastairp
(although moods require a few joins into separate tables)
lucifer: yes, I'd dump directly on frank
ruaok
could we create a dataset hoster that gives access to moods given a list of MBIDs?
lucifer
ab image doesn't have psql so i dumped through lb-web. i'll start a temporary postgres container on frank this time then.
ruaok
meaning that the dataset hoster queries frank and takes care of the picking of the instance of the MBID
lucifer
bono has a subset of ab db fwiw.
this data is accessible through ab api so we could just use that.
alastairp
yes to both of those - we can prototype it on bono and directly connect to db, if it works, release ds hoster on kiss connecting to frankdb
lucifer
and opt the bono ip out of ratelimit if that becomes an issue.
ruaok
I'll work on the AB api for now -- if that shows promise we can expand on that.
but first let me make a playlist from those MBIDs
alastairp
that being said, we already have the bulk get specific (ll) feature API for AB, which is basically that. not sure why we never finished the hl version of this
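[editor's note: for the fetch-from-AB route, a sketch of querying the bulk low-level endpoint mentioned here. The URL shape and the response layout (MBID → submission offset → document) follow my reading of the AB API docs; verify against them before relying on this.]

```python
def bulk_lowlevel_url(mbids, base="https://acousticbrainz.org"):
    """Build a bulk low-level request URL; recording MBIDs are
    separated by semicolons in the recording_ids parameter."""
    return f"{base}/api/v1/low-level?recording_ids=" + ";".join(mbids)

def extract_bpms(response_json):
    """Pull rhythm.bpm out of each returned document. Each MBID maps
    to one or more numbered submissions (offsets)."""
    return {(mbid, offset): doc["rhythm"]["bpm"]
            for mbid, offsets in response_json.items()
            for offset, doc in offsets.items()}

# Offline demo with a canned response (no network call):
fake = {"some-mbid": {"0": {"rhythm": {"bpm": 185.0}},
                      "1": {"rhythm": {"bpm": 92.5}}}}
print(extract_bpms(fake))
```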
lucifer: I sometimes use docker exec on the pg instance on frank to get a psql shell
lucifer
wondering if pg_dump could be faster than \copy, alastairp
that is, pg_dump the whole table, transfer it to michael, and bring up a pg instance there. import the dump and let spark connect to the pg instance directly.
alastairp
ah, right. I'm not sure where the current slowdown is - is it due to getting a field from the json? or was it slow last time because of the round trip between two different servers?
if you're after a pg dump
frank /home/alastair/acousticbrainz-pgdump-2021-06-03.pgdump
lucifer
ah nice.
the size of pgdumps is too large so probably not a good idea to do this.
alastairp
so, let's try your previous dump, including the weights, directly from frank and see if it's any faster