yvanzo: Yes, that is exactly what I would like to accomplish. But the trained file by Leo_Verto isn't available to work on, or I haven't found it so far.
yvanzo
Yes, this is why a trainer must be made first.
Leo_Verto
I can provide the trained model, it might make sense to re-train it on newer spam though.
yvanzo
When Leo_Verto worked on it, he had direct access to private data. It simplified development, but it also made it difficult to fully open source it.
We want to avoid this by providing everything that is necessary to train a model except the private data, hence the need for dummy data.
The trainer must be deployed on MeB servers along with the MB DB, so it will have access to private data.
diru1100
Yes, that is my intention too :) But a model trained on dummy data would have a very bad foundation from the beginning, unless the dummy data closely resembles the real spam.
This brings me to my other issue: what I found is that the data we keep through online learning is minimal relative to the data the model was trained on. This might not change the model's perception anytime soon. To change this, I want to retrain the model every week/month with new editor data. For this to happen, we first have to send the data to the model for testing, send the results back to SpamNinja, let them classify the SB result, store them back, and train the model based on the final review.
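A minimal sketch of the review-then-retrain cycle described above, assuming a toy word-count classifier in place of the real model. The names `train`, `predict`, `review_cycle`, and the `reviewer` callback are hypothetical and only stand in for the SpamNinja review step; none of them come from the actual project code.

```python
from collections import Counter


def train(examples):
    """Count how often each word appears in spam and in non-spam text."""
    counts = {True: Counter(), False: Counter()}
    for text, is_spam in examples:
        counts[is_spam].update(text.lower().split())
    return counts


def predict(model, text):
    """Flag text as spam when its words were seen more often in spam."""
    words = text.lower().split()
    spam_score = sum(model[True][w] for w in words)
    ham_score = sum(model[False][w] for w in words)
    return spam_score > ham_score


def review_cycle(model, training_set, incoming, reviewer):
    """One cycle: predict, collect the reviewer's final label, retrain."""
    for text in incoming:
        guess = predict(model, text)
        final_label = reviewer(text, guess)  # a human confirms or overrides
        training_set.append((text, final_label))
    # Retrain from scratch on the old data plus the reviewed data.
    return train(training_set)


# Dummy data only; a real deployment would read private editor data.
dummy = [("buy cheap pills now", True), ("i edit jazz releases", False)]
model = train(dummy)
before = predict(model, "watches")
model = review_cycle(model, dummy, ["cheap watches here"], lambda text, guess: True)
after = predict(model, "watches")
```

Running the cycle on a schedule (weekly or monthly) gives the periodic retraining variant; calling it after every reviewed report would be closer to online learning.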
reosarevok
yvanzo: did I get it correctly in the Muziekweb email that we should also change the cleanup to standardize to .nl?
yvanzo
reosarevok: yup, I think so
reosarevok
Ok
yvanzo
reosarevok: and probably remove language too.
reosarevok
We already do, it seems
yvanzo
diru1100: let's have something based on dummy data that can be retrained with dummy feedback first :)
We can restart from a properly trained model after that, does that make sense?
diru1100
Yes, you want to test out the whole process with dummy data everywhere first?
OK sure, I can use dummy data and do it, np. Which approach do you want me to follow for updating the model? The online learning one or the retrain-every-week/month one? Or shall we test both of those as well?
diru1100: online learning at least, both if you feel it could be worth it :)
and if you want to do both, start with the simplest one, so something can be tested sooner.
Leo_Verto
I think the problem with online learning is that the email and website tokenizers will need to be recreated once in a while to include new spam domains. This automatically invalidates all previously trained models.
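To illustrate the invalidation problem, here is a small sketch assuming the tokenizer maps each known domain to an integer index. That is an assumption for illustration; the real tokenizers are not shown in this log, and `build_vocab` is a hypothetical helper.

```python
def build_vocab(domains):
    """Map each domain to an integer index, in sorted order."""
    return {d: i for i, d in enumerate(sorted(set(domains)))}


# Vocabulary the existing model was trained with.
old_vocab = build_vocab(["example.com", "spam.test"])

# Vocabulary rebuilt after a new spam domain shows up.
new_vocab = build_vocab(["example.com", "newspam.test", "spam.test"])

# Domains whose index changed: a model trained on the old indices would
# now read these inputs as a different domain.
shifted = [d for d in old_vocab if old_vocab[d] != new_vocab[d]]
```

Under this assumption, any index shift means the weights learned for the old indices no longer line up with the inputs, so the model has to be retrained after the vocabulary is rebuilt.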
KindTwo joined the channel
KindOne has quit
reosarevok
bitmap: rebased
bitmap
thx
KindOne joined the channel
KindTwo has quit
reosarevok
Guess I should be rebasing pretty much everything really!
yvanzo: I think I can complete phase 1 with just dummy data, and it doesn't involve any update methods. I will research which way is better in the meantime, and we can go with that?
yvanzo
works for me!
diru1100
Cool :)
yvanzo
:)
KindTwo joined the channel
KindOne has quit
KindTwo is now known as KindOne
shivam-kapila
ruaok: Added GET API endpoints too to the PR
ruaok: Mr_Monkey's listens couldn't be reached after 3 pages because of this check. Three weeks of data are missing, but the window is 15 days.
shivam-kapila: timescale-rebased-again is pushed -- something is making the connection to the db blow up during the integration tests. Maybe you have a moment to look at it -- I suspect I'll be afk all day tomorrow.
zas
bitmap, yvanzo: it seems the mb website container on ludwig is misbehaving. Can you have a look?
OK, I'll just restart it, the logs don't help at all (zillions of "Can't use an undefined value as a subroutine reference at /home/musicbrainz/carton-local/lib/perl5/Plack/Util.pm line 14")
I also restarted the ws container on ludwig, everything is back to normal now. Not sure what happened (but it seems both containers were affected).
ZaphodBeeblebrox has quit
Chinmay3199 has quit
rdswift
zas: The Picard docs refer to the following as basic tags, but I haven't yet found a release that will produce them. Do you know if they are still valid, or have they been deprecated? musicbrainz_originalalbumid, musicbrainz_originalartistid, musicbrainz_releasetrackid, originalalbum, originalartist