rayyan_seliya123: let's start with just 78rpm/cylinder for now. you should update your prototype or rewrite it from scratch to work with the rest of the codebase. https://github.com/metabrainz/listenbrainz-serv...
you won't need to create new models, we don't use sqlalchemy as an orm anyway.
you can see apple and spotify follow the same structure whereas soundcloud has a different one. you should check what data is available in the IA and then either map it to the existing apple/spotify or soundcloud format. if neither is suitable then we can think of a new format.
rayyan_seliya123
<lucifer[m]> "you can see apple and spotify..." <- Thanks for the detailed guidance! I’ve reviewed the existing codebase and understand that for the 78rpm/cylinder collections, the Internet Archive data is mostly track-level, so I’ll map it to the SoundCloud format as you suggested.
For moving forward, would you prefer that I work directly in the main ListenBrainz repo through PRs, or should I start in a separate branch and then merge my work in? I want to follow whatever workflow you think is best for the project.
Let me know what you prefer, and I’ll get started accordingly!
lucifer[m]
@rayyan_seliya123:matrix.org: work with the LB repo through PRs.
rayyan_seliya123
lucifer[m]: Okk fine 👍
_BrainzGit
[listenbrainz-server] amCap1712 opened pull request #3292 (master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-serv...
lucifer[m]
monkey: the current similarity scores on LB should be using this new algorithm
monkey[m]
Ooh, OK
lucifer[m]
do they seem sensible to you?
mayhem[m] is reading the PR right now
i have a dump of the score before this change if you want to compare.
monkey[m]
Damn, I don't have an older version saved to compare, but let me look
holycow23[m]
<lucifer[m]> "i'll fix the errors and let..." <- Hey lucifer, any update on this
lucifer[m]
holycow23: not yet
holycow23[m]
Okay
mayhem[m]
lucifer: it's hard to judge the cosine similarity without having prior data.
monkey[m]
Would love to compare to see if it was the case before, but I'm already seeing two users with whom I have 6+ artists in common at 0% compatibility, which feels wrong.
But I've always thought the similarity scores were low
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/yhckSFyYztsIjnTgUrAmpKDY
I noticed that the closest person to me is now much stronger, while the others are weaker.
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/txxvtmQvDEICRXZHaGRmpFeD
lucifer[m]
the first row is pearson coefficient and the second row is cosine similarity
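For readers comparing the two measures, here is a minimal sketch in plain Python. It is not the code from the PR; the function names and the listen counts are made up for illustration. It relies on the standard identity that the Pearson coefficient is the cosine similarity of the mean-centered vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: no signal means no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_coefficient(a, b):
    """Pearson correlation, computed as cosine similarity of centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    return cosine_similarity(centered_a, centered_b)


# Hypothetical listen counts per artist for two users (illustrative only).
user_a = [10, 0, 3, 0, 7]
user_b = [8, 1, 0, 0, 5]
print(cosine_similarity(user_a, user_b))
print(pearson_coefficient(user_a, user_b))
```

The only difference between the two is the centering step, so the scores tend to be close when most entries of the vectors are zero or small relative to a few large ones.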
monkey[m]
Well, they seem very close
mayhem[m]
oh wow. well, I guess I haven't looked at similarity data in a while.
lucifer[m]
mayhem: user similarities have not updated in a few days because the job always OOM'ed.
last week it OOM'ed in a way that brought down the cluster, so i changed it.
monkey[m]
FWIW i think the similarity calculations need to be reviewed, but when it comes to fixing the OOM, and given how small the differences between the numbers I see are, I would consider them equivalent.
lucifer[m]
we can implement and experiment with pearson coefficient, it's just that we'd have to implement something manually, which is doable.
i went with column similarities because it exists there and was a smaller fix.
mayhem[m]
I think we should keep it for the time being and ask the community for feedback.
monkey[m]
Might be worth calculating the average difference between the two methods for all the users you have data for, but... From my point of view they are both equally low.
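A rough sketch of that comparison, assuming the scores from each method are available as dictionaries keyed by user name. The names and numbers below are hypothetical and are not taken from the dumps mentioned in the chat.

```python
def mean_absolute_difference(scores_a, scores_b):
    """Average |a - b| over the users present in both score dictionaries.

    Returns None when the two dictionaries share no users.
    """
    common_users = scores_a.keys() & scores_b.keys()
    if not common_users:
        return None
    total = sum(abs(scores_a[user] - scores_b[user]) for user in common_users)
    return total / len(common_users)


# Hypothetical similarity scores for the same users under each method.
pearson_scores = {"user1": 0.10, "user2": 0.30}
cosine_scores = {"user1": 0.20, "user2": 0.30, "user3": 0.90}
print(mean_absolute_difference(pearson_scores, cosine_scores))
```

Users that appear in only one of the two sets are ignored here; whether that is the right choice depends on how missing scores should be interpreted.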
mayhem[m]
the downside to that is that everyone has an opinion on how it should work and there'd be "it's just a little tweak" comments.
(in ML, it's never just a little tweak.)
monkey[m]
Little tweak, big refactor
lucifer[m]
fwiw, i don't recall any particular reason for implementing it with pearson coefficient the first time.
i do think there is value in experimenting and improving similarities but we'd need to do it more rigorously, define proper test datasets as a reference, etc.
monkey[m]
Agreed.
For my numbers the differences were sub-percentage point, which makes virtually no difference, so OK from me.
_BrainzGit
[listenbrainz-server] amCap1712 merged pull request #3292 (master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-serv...
fettuccinae[m]
mayhem: ping
mayhem[m]
Pong
fettuccinae[m]
For authorization of endpoints, each project can have an auth token generated from MeB and saved in the secrets of both MeB and the project.
That way, when a project makes a request, we can authorize it using either the token or the owner_id of the token sent. Is this approach okay?
mayhem[m]
I think so, but lucifer is more on top of OAuth-related questions. lucifer: ?
lucifer[m]
@fettuccinae:matrix.org: not sure what you mean, but the workflow would be as follows: the project (LB/BB/MB) connects to MeB to obtain an auth token and uses that token in the request to post notifications to MeB. MeB validates whether the token has the relevant scopes and is owned by one of the hardcoded client ids in the configuration. if yes, it proceeds; otherwise it rejects the request.
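The validation step described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `TokenInfo` shape, the client ids, and the scope name are hypothetical and do not come from the MeB codebase.

```python
from dataclasses import dataclass

# Hypothetical values; the real client ids and scope live in MeB's configuration.
ALLOWED_CLIENT_IDS = frozenset({"lb-client-id", "bb-client-id", "mb-client-id"})
REQUIRED_SCOPE = "notification"


@dataclass(frozen=True)
class TokenInfo:
    """What the server is assumed to know about a presented token."""
    client_id: str
    scopes: frozenset
    expired: bool = False


def is_authorized(token, required_scope=REQUIRED_SCOPE,
                  allowed_client_ids=ALLOWED_CLIENT_IDS):
    """Accept only unexpired tokens with the scope and a whitelisted owner."""
    if token is None or token.expired:
        return False
    has_scope = required_scope in token.scopes
    owned_by_known_client = token.client_id in allowed_client_ids
    return has_scope and owned_by_known_client


good = TokenInfo("lb-client-id", frozenset({"notification"}))
unknown_owner = TokenInfo("someone-else", frozenset({"notification"}))
print(is_authorized(good), is_authorized(unknown_owner))
```

Because the check depends only on the token's scopes and owner, tokens can expire and be reissued without touching the configuration of either side.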
fettuccinae[m]
lucifer[m]: i was thinking an admin user could generate an auth token for each project through https://metabrainz.org/profile#, and then this token could be hardcoded in the configuration of both the project and MeB. So when a project sends a request with this token, MeB verifies it against the saved token in its config and allows the request
lucifer[m]
@fettuccinae:matrix.org: no we don't want to do that for multiple reasons. it makes token rotation hard and we cannot have expiring tokens this way.
fettuccinae[m]
ohh, but how can the project get tokens in the authorization flow if login is required for /oauth2/authorize?
lucifer[m]
unless there is a strong reason we should stick to using the oauth way. i am running a bit behind schedule on client credentials grant but only the testing is pending, once that is done it should be available for use in your project.
with the client credentials grant, you won't need the manual /oauth2/authorize step.
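For context, a client credentials token request as defined in RFC 6749 section 4.4 can be sketched as follows. This only builds the headers and form body; it does not contact any server. How MeB expects the client to authenticate (HTTP Basic, as assumed here, or credentials in the body) and which scope name applies are not stated in the chat, so treat those as assumptions.

```python
import base64
from urllib.parse import urlencode


def build_client_credentials_request(client_id, client_secret, scope):
    """Return (headers, body) for an OAuth2 client credentials token request.

    Assumes the server accepts HTTP Basic client authentication.
    """
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # No user login or redirect is involved: the client asks for a token directly.
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return headers, body


# Hypothetical credentials and scope, for illustration only.
headers, body = build_client_credentials_request("id", "secret", "notification")
print(body)
```

The resulting headers and body would be sent as a POST to the server's token endpoint, and the access token in the response is then used on later requests.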
fettuccinae[m]
Ohh, got it, thanks. I'll add todos for this and work on other things.
lucifer[m]
holycow23: i tested the dumps locally and everything seems to work fine, let's try again when you are around.
holycow23[m]
lucifer[m]: I can try right now
lucifer[m]
holycow23: okay try running `./develop.sh spark format` once and share its output.
incremental dump imported fine, it's still importing the sample dump.
should be done in less than 5 mins.
holycow23[m]
Okay
lucifer[m]
update the logs when you see another Request done!
holycow23[m]
Updated
Got a request done!
lucifer[m]
that succeeded as well.
okay now run `./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity`
holycow23[m]
Done
lucifer[m]
update the request consumer logs after another request done
holycow23[m]
Okay
Will this take time?
lucifer[m]
should be done by now.
update the logs anyway and i'll take a look
holycow23[m]
Updated
lucifer[m]
yeah seems to be still running, let's wait. this is not optimized for running locally.
holycow23[m]
okay
lucifer[m]
it took 16s on my PC but docker-desktop is probably slower.
holycow23[m]
Okay
lucifer[m]
anything new in logs
holycow23[m]
Nope
It's been at this stage for a long time now
lucifer[m]
okay, check spark_reader logs
and see if there are any messages for user_entity.
holycow23[m] uploaded an image: (53KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/xHVHOWWKtTxmXkJyFKQqfDFu/image.png >
check the logs above and below this, there are a lot of debug messages that might drown the user entity message. you can do a grep on the logs if possible for user_entity to confirm.
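If `grep` is not at hand, the same filtering can be sketched in Python. The first two sample lines are the ones quoted in this conversation; the third is a hypothetical line added only so the filter has something to match.

```python
def grep_lines(lines, needle):
    """Return the lines that contain `needle`, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if needle in line]


sample_log = [
    "2025-06-03 14:22:34,058 listenbrainz.webserver DEBUG Received a message, adding to internal processing queue...\n",
    "2025-06-03 14:22:34,059 listenbrainz.webserver INFO Received message for import_incremental_dump\n",
    # Hypothetical line, not taken from any real log output.
    "2025-06-03 14:25:00,000 listenbrainz.webserver INFO Received message for user_entity\n",
]
matches = grep_lines(sample_log, "user_entity")
print(matches)
```

For a saved log file, an open file object can be passed in place of the list, since iterating over it also yields one line at a time.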
holycow23[m]
it's been this throughout except these two lines
`2025-06-03 14:22:34,058 listenbrainz.webserver DEBUG Received a message, adding to internal processing queue...`
`2025-06-03 14:22:34,059 listenbrainz.webserver INFO Received message for import_incremental_dump`
lucifer[m]
i see
try running `./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity` again i guess
and see if there's anything new in request_consumer logs
holycow23[m]
listenbrainzspark hasn't changed after running the command
holycow23[m] uploaded an image: (32KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/QvxGdpfqwgXBUfjmeXvjcUeZ/image.png >
lucifer[m]
yeah that is fine, what about the request consumer logs
holycow23[m]
it's the same
no updates
lucifer[m]
i see, you can stop the containers.
`./develop.sh spark down`
and then run `./develop.sh spark up` to bring it back up again