we finally found out it was because mayhem was reading audio files from a disk slower than the one storing the database, hiding the performance gain (which concerns only the database).
outsidecontext
ah, ok :) happy to test this later here as well
mayhem
new SSD arrives later this week, so that problem goes away. :)
Sophist-UK joined the channel
Sophist_UK has quit
rana_satyaraj
I'm new here, I have set up the ListenBrainz development environment, but having trouble finding something to work on. Can anyone point me in the right direction, maybe give me some tasks to do? It could be anything as long as it's coding.
mayhem
rana_satyaraj: hi! I'm looking but I can never find the "easy first bugs" label in jira
lucifer: Hello! Did you see LB-1455 by any chance? Wondering if it is due to how often we rebuild the cache or if there's something else going on there that prevents it from being added to the cache.
zas: you can run create on an existing DB file and it will make the new table for you.
zas
great, but I think we'll still need better handling of schema updates at some point
mayhem
yep.
I didn't think we'd need it that soon, lol.
zas
:D
outsidecontext: The way we manage the catalog of audio files in listenbrainz-content-resolver could be done in Picard btw. In order to speed up music collection updates/tag resync etc
monkey
Oh boy. mayhem do I have a fun mapping pickle for you !
These two are not the same recording and not the same artist: pray (by Eve) and Pray (by EVE)
one of the things I was thinking about is that getting this right is... hard.
I could get the right results by changing the window size to 5 or 10 seconds.
but obviously that doesn't work in the real world.
the thing I had always wondered about is using machine learning to really solve this problem.
musicListenerSam
hmm, perhaps instead of creating a voting classifier for multiple models we could begin with a voting classifier for specific window sizes
that would be a start
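A minimal sketch of the window-size voting idea: run the same estimator with several window lengths and pool the estimates that agree. Everything here (the `vote_bpm` name, the tolerance, the sample values) is illustrative, not actual bpm-detector code.

```python
# Hypothetical voting step over per-window BPM estimates. Estimates within
# `tolerance` BPM of each other are pooled into one bucket via rounding, and
# the winning bucket's mean is returned, which suppresses a single outlier
# window (e.g. a half-tempo detection).
from collections import Counter

def vote_bpm(estimates, tolerance=1.0):
    """Majority vote over BPM estimates from different window sizes."""
    buckets = Counter(round(e / tolerance) for e in estimates)
    winner, _ = buckets.most_common(1)[0]
    members = [e for e in estimates if round(e / tolerance) == winner]
    return sum(members) / len(members)

# e.g. the 5 s and 10 s windows mostly agree; one window votes half tempo
print(vote_bpm([120.2, 119.9, 60.1, 120.0]))  # ~120, outlier discarded
```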
mayhem
not sure that is the right approach.
I have a feeling that we should pick a middle of the road window size.
and then use a peak detector -- that part is the trickiest.
what if instead we feed the generated data to something like a neural net?
we'd need to build a decent training data set, with audio files and expected (verified) BPM values.
then we can train a BPM classifier with that data.
what do you think?
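The fixed-window-plus-peak-detector route could look roughly like this: autocorrelate an onset-strength envelope and pick the strongest beat-period lag. This is a toy sketch on a synthetic click track, not the real bpm-detector code, and the function names are made up.

```python
# Hypothetical peak-detector sketch: score every candidate beat-period lag
# by autocorrelation of the onset envelope, keep the strongest, and convert
# that lag back to BPM.
def estimate_bpm(envelope, frame_rate, bpm_min=60, bpm_max=200):
    """Return the BPM of the strongest autocorrelation peak, or None."""
    n = len(envelope)
    lag_min = int(frame_rate * 60 / bpm_max)   # shortest beat period, frames
    lag_max = int(frame_rate * 60 / bpm_min)   # longest beat period, frames
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        score = sum(envelope[i] * envelope[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return None if best_lag is None else frame_rate * 60.0 / best_lag

# synthetic onset envelope: one click every 0.5 s at 100 frames/s -> 120 BPM
frame_rate = 100
envelope = [1.0 if i % 50 == 0 else 0.0 for i in range(10 * frame_rate)]
print(estimate_bpm(envelope, frame_rate))  # -> 120.0
```

Real onset envelopes are far noisier than this click track, which is exactly where the peak picking gets tricky and why a learned model is being discussed.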
musicListenerSam
yup , we will surely need to start with the data
i think the neural network is the right approach
the algos can often fail in more dynamic scenarios, where the neural network thrives
training the BPM classifier with a good dataset would be a huge plus
riksucks has quit
ig i'll look into the dataset building for now then. ig spotify has a lot of bpm data, or so i've heard
mayhem
it does.
but I don't know if we can trust it.
arsh has quit
AcousticBrainz has this data, but we can't rely on it.
vscode_ has quit
so my take was to make a collection of releases, from many different genres and work out a BPM value for each track in the collection.
musicListenerSam
in that case, where else can we look for reliable sources of BPM data?
hmm
mayhem
what do you think is a good training dataset size for this?
there are other algorithms out there. we could download as many as we can find, run them all, pull in AB/Spotify data and if we get agreement, the track goes into the collection.
that might, however, select for easy cases, so we may need to hand-resolve the edge cases.
Shubh has quit
musicListenerSam
frankly speaking, if i take releases with an average track duration of 3 minutes, and since audio files are large, i think 1 GB worth of data would be a good start. something that can be achieved in the beginning
ya i agree with that
we could run a script matching the two: the bpm data from the spotify api vs the algo, and if it's the same, it passes
ShivamAwasthi has quit
to enhance accuracy, we could run multiple algorithms in parallel and set a threshold for acceptance
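The acceptance rule being described might look like the following. Names, tolerances, and thresholds are all invented for illustration; the point is only the shape of the check.

```python
# Hypothetical consensus check: run several BPM sources on a track (our
# algorithm, Spotify, AcousticBrainz, ...) and accept the track into the
# training set only when enough of them agree within a tolerance.
def accept_track(bpm_values, tolerance=2.0, min_agreeing=2):
    """Return the consensus BPM if at least `min_agreeing` sources fall
    within `tolerance` BPM of each other, else None (the track is set
    aside for hand resolution)."""
    for candidate in bpm_values:
        agreeing = [b for b in bpm_values if abs(b - candidate) <= tolerance]
        if len(agreeing) >= min_agreeing:
            return sum(agreeing) / len(agreeing)
    return None

print(accept_track([120.0, 121.0, 60.0]))  # two sources agree -> 120.5
print(accept_track([120.0, 98.0, 60.0]))   # no agreement -> None
```

As noted in the discussion, such a filter tends to select for easy cases, so the rejected tracks are exactly the ones worth resolving by hand.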
mayhem
agreed.
musicListenerSam
that way we would reduce the number of false BPM results in the dataset.
Freso has quit
mayhem
let me see what I can do to collect this dataset.
musicListenerSam
as far as the edge cases are concerned, once we have a neural network that works on the larger chunk of data, certain cases should stand out, say soft music or some other case that the dataset misses. we can then work towards those data needs specifically, perhaps using attention modelling of some sort
mayhem
zas: are you following this convo?
musicListenerSam
shouldn't be that hard once we're at that point
mayhem
for ambient and classical music, i.e. music without a clear beat, we should ideally say: nope, can't determine BPM, rather than giving the wrong BPM.
zas
mayhem: yes
musicListenerSam
hmm, i'll look into the dataset creation as well.
perhaps for classical and ambient we can set a confidence threshold for the model's prediction
mayhem
so, we're trying to come up with a machine learning BPM algorithm -- I have a feeling it's been done before, but none of these approaches ever made it to open source.
musicListenerSam
below a certain prediction confidence we just say: no BPM detected
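The confidence cutoff being proposed is simple to state in code. This assumes a model that emits a `(bpm, confidence)` pair; the function name and the 0.7 threshold are made up for the sketch.

```python
# Hypothetical reporting step: when the (imagined) model's confidence is
# below a threshold, report "no BPM" instead of a probably-wrong number --
# e.g. for ambient or classical material without a clear beat.
def report_bpm(prediction, min_confidence=0.7):
    bpm, confidence = prediction
    if confidence < min_confidence:
        return None  # "can't determine BPM"
    return bpm

print(report_bpm((128.0, 0.93)))  # confident -> 128.0
print(report_bpm((71.0, 0.22)))   # low confidence -> None
```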
mayhem
musicListenerSam: why don't you use my music service as a source of music for now? let me worry about the dataset.
musicListenerSam
okay
mayhem
zas: your collection has more breadth than mine does.
would you be willing to contribute 5 albums each from punk, jazz, metal for the training dataset?
zas
np, but genres like jazz & metal are rather fuzzy
mayhem
yep, understood. they are just poorly represented in my collection.
zas
bpm of doom metal is near zero, while bpm of death metal is rather high
mayhem
which is why I want both.
the more edge-casey sorts of music you can help us with, the better.
musicListenerSam
hmm, ok so ig i understood what i need to do next (y). i'll ping you mayhem in case of any more exciting developments, and if we feel the changes are an improvement we can add them to the bpm-detector repo
mayhem
yep. if you give me your github handle, I'll give you commit access to that repo.
and tomorrow, I will start building a test dataset.
should be good to start testing with tomorrow, but significant size will take some time still.
musicListenerSam
okay . sure
we'll start testing in small batches anyway, so the dataset size shouldn't be a hurdle for now