rayyan_seliya123: let's start with just 78rpm/cylinder for now. you should update your prototype or rewrite it from scratch to work with the rest of the codebase. https://github.com/metabrainz/listenbrainz-server…
2025-06-03 15424, 2025
lucifer[m]
you won't need to create new models, we don't use sqlalchemy as an orm anyway.
you can see apple and spotify follow the same structure whereas soundcloud has a different one. you should check what data is available in the IA and then either map it to the existing apple/spotify or soundcloud format. if neither is suitable then we can think of a new format.
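A minimal sketch of the track-level mapping discussed above; the field names on both sides are assumptions for illustration, not the actual ListenBrainz or SoundCloud importer schema.

```python
# Hypothetical sketch only: field names are assumptions, not the real
# ListenBrainz/SoundCloud importer schema.

def map_ia_track(ia_item: dict) -> dict:
    """Flatten Internet Archive item metadata into a track-level record."""
    return {
        "track_id": ia_item["identifier"],
        "name": ia_item.get("title", ""),
        "artist": ia_item.get("creator", ""),
        "release_year": ia_item.get("year"),
        "url": f"https://archive.org/details/{ia_item['identifier']}",
    }

example = {
    "identifier": "78_example-item",  # made-up identifier
    "title": "Example Song",
    "creator": "Example Artist",
    "year": "1928",
}
track = map_ia_track(example)
```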
2025-06-03 15438, 2025
Maxr1998_ has quit
2025-06-03 15426, 2025
Maxr1998 joined the channel
2025-06-03 15437, 2025
rayyan_seliya123
<lucifer[m]> "you can see apple and spotify..." <- Thanks for the detailed guidance! I’ve reviewed the existing codebase and understand that for the 78rpm/cylinder collections, the Internet Archive data is mostly track-level, so I’ll map it to the SoundCloud format as you suggested.
2025-06-03 15437, 2025
rayyan_seliya123
For moving forward, would you prefer that I work directly in the main ListenBrainz repo through PRs, or should I start in a separate branch and then merge my work in? I want to follow whatever workflow you think is best for the project.
2025-06-03 15437, 2025
rayyan_seliya123
Let me know what you prefer, and I’ll get started accordingly!
2025-06-03 15418, 2025
lucifer[m]
[@rayyan_seliya123:matrix.org](https://matrix.to/#/@rayyan_seliya123:matrix.org) work with LB repo through PRs.
2025-06-03 15445, 2025
rayyan_seliya123
lucifer[m]: Okk fine 👍
2025-06-03 15454, 2025
_BrainzGit
[listenbrainz-server] 14amCap1712 opened pull request #3292 (03master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-server…
2025-06-03 15455, 2025
lucifer[m]
monkey: the current similarity scores on LB should be using this new algorithm
2025-06-03 15402, 2025
monkey[m]
Ooh, OK
2025-06-03 15406, 2025
lucifer[m]
do they seem sensible to you?
2025-06-03 15418, 2025
mayhem[m] is reading the PR right now
2025-06-03 15420, 2025
lucifer[m]
i have a dump of the score before this change if you want to compare.
2025-06-03 15426, 2025
monkey[m]
Damn, I don't have an older version saved to compare, but let me look
2025-06-03 15446, 2025
holycow23[m]
<lucifer[m]> "i'll fix the errors and let..." <- Hey lucifer, any update on this
2025-06-03 15453, 2025
lucifer[m]
holycow23: not yet
2025-06-03 15400, 2025
holycow23[m]
Okay
2025-06-03 15450, 2025
mayhem[m]
lucifer: it's hard to judge the cosine similarity without having prior data.
2025-06-03 15450, 2025
monkey[m]
Would love to compare to see if it was the case before, but I'm already seeing two users with whom I have 6+ artists in common at 0% compatibility, which feels wrong.
2025-06-03 15450, 2025
monkey[m]
But I've always thought the similarity scores were low
2025-06-03 15454, 2025
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/yhckSFyYztsIjnTgUrAmpKDY
I noticed that the closest person to me is now much stronger, while the others are weaker.
2025-06-03 15459, 2025
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/txxvtmQvDEICRXZHaGRmpFeD
2025-06-03 15418, 2025
lucifer[m]
the first row is pearson coefficient and the second row is cosine similarity
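For reference, the two measures differ only in mean-centering: the Pearson coefficient is cosine similarity applied to mean-centered vectors, which is why the two rows come out so close. A small self-contained illustration with made-up listen counts:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson coefficient = cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

# made-up listen counts for two users over four artists
listens_a = [10, 0, 3, 5]
listens_b = [8, 1, 2, 6]
```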
2025-06-03 15445, 2025
monkey[m]
Well, they seem very close
2025-06-03 15459, 2025
mayhem[m]
oh wow. well, I guess I haven't looked at similarity data in a while.
2025-06-03 15401, 2025
lucifer[m]
mayhem: user similarities have not updated in a few days because it always OOM'ed.
2025-06-03 15440, 2025
lucifer[m]
the last week it OOM'ed in a way that brought down the cluster so i changed it.
2025-06-03 15402, 2025
monkey[m]
FWIW i think the similarity calculations need to be reviewed, but when it comes to fixing the OOM, and given the small differences between the numbers I see, I would consider them equivalent.
2025-06-03 15429, 2025
lucifer[m]
we can implement and experiment with pearson coefficient, it's just that we'd have to implement something manually, which is doable.
2025-06-03 15450, 2025
lucifer[m]
i went with column similarities because it exists there and was a smaller fix.
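The built-in lucifer refers to computes pairwise cosine similarity between the columns of a matrix (in Spark, `RowMatrix.columnSimilarities`). A plain-Python sketch of the idea; the layout (users as columns of a listens matrix) and the numbers are assumptions for illustration:

```python
import math

# made-up listens matrix: rows = recordings, columns = users (assumed layout)
matrix = [
    [3, 2, 0],
    [0, 1, 4],
    [1, 0, 1],
]

def column(m, j):
    return [row[j] for row in m]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# upper-triangular pairwise similarities, like columnSimilarities returns
n_cols = len(matrix[0])
sims = {
    (i, j): cosine(column(matrix, i), column(matrix, j))
    for i in range(n_cols)
    for j in range(i + 1, n_cols)
}
```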
2025-06-03 15418, 2025
mayhem[m]
I think we should keep it for the time being and ask the community for feedback.
2025-06-03 15429, 2025
monkey[m]
Might be worth calculating the average difference between the two methods for all the users you have data for, but... from my point of view they are both equally low.
2025-06-03 15450, 2025
mayhem[m]
the downside to that is that everyone has an opinion on how it should work and there'd be "its just a little tweak" comments.
2025-06-03 15407, 2025
mayhem[m]
(in ML, it's never just a little tweak.)
2025-06-03 15404, 2025
monkey[m]
Little tweak, big refactor
2025-06-03 15407, 2025
lucifer[m]
fwiw, i don't recall any particular reason for implementing it with pearson coefficient the first time.
2025-06-03 15456, 2025
lucifer[m]
i do think there is value in experimenting with and improving similarities but we'd need to do it more rigorously, define proper test datasets as a reference etc etc
2025-06-03 15422, 2025
monkey[m]
Agreed.
2025-06-03 15408, 2025
monkey[m]
For my numbers the differences were sub-percentage point, which makes virtually no difference, so OK from me.
2025-06-03 15457, 2025
_BrainzGit
[listenbrainz-server] 14amCap1712 merged pull request #3292 (03master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-server…
2025-06-03 15431, 2025
fettuccinae[m]
mayhem: ping
2025-06-03 15427, 2025
mayhem[m]
Pong
2025-06-03 15406, 2025
fettuccinae[m]
For authorization of endpoints, each project can have an auth token generated from MeB and saved in the secrets of both MeB and the project.
2025-06-03 15406, 2025
fettuccinae[m]
That way, when a project makes a request, we can authorize it using either the token or the owner_id of the token sent. Is this approach okay?
2025-06-03 15434, 2025
mayhem[m]
I think so, but lucifer: is more on top of oauth-related questions. lucifer: ?
2025-06-03 15413, 2025
lucifer[m]
@fettuccinae:matrix.org: not sure what you mean, but the workflow would be as follows: the project (LB/BB/MB) connects to MeB to obtain an auth token and uses that auth token in the request to post notifications to MeB; MeB validates whether the token has the relevant scopes and is owned by one of the hardcoded client ids in the configuration; if yes it proceeds, otherwise it rejects the request.
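A hedged sketch of the validation step described above; the names (`ALLOWED_CLIENT_IDS`, `REQUIRED_SCOPE`, the token-info shape) are assumptions for illustration, not the actual MeB code:

```python
# Hypothetical names throughout; only the check described in chat is real:
# the token must have the relevant scope and be owned by a hardcoded client id.
ALLOWED_CLIENT_IDS = {"listenbrainz", "bookbrainz", "musicbrainz"}
REQUIRED_SCOPE = "notifications"

def is_authorized(token_info: dict) -> bool:
    """token_info is the (assumed) result of introspecting the bearer token."""
    return (
        token_info.get("active", False)
        and REQUIRED_SCOPE in token_info.get("scopes", [])
        and token_info.get("client_id") in ALLOWED_CLIENT_IDS
    )

good = {"active": True, "scopes": ["notifications"], "client_id": "listenbrainz"}
bad = {"active": True, "scopes": ["notifications"], "client_id": "other-app"}
```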
2025-06-03 15431, 2025
fettuccinae[m]
lucifer[m]: i was thinking an admin user could generate an auth token for projects through https://metabrainz.org/profile#, and then this token could be hardcoded in the configuration of both the project and MeB. So when a project sends a request with this token, MeB verifies it against the saved token in the config and allows the request
2025-06-03 15447, 2025
lucifer[m]
@fettuccinae:matrix.org: no we don't want to do that for multiple reasons. it makes token rotation hard and we cannot have expiring tokens this way.
2025-06-03 15439, 2025
fettuccinae[m]
ohh, but how can the project get tokens in the authorization flow if login is required for /oauth2/authorize?
2025-06-03 15442, 2025
lucifer[m]
unless there is a strong reason we should stick to using the oauth way. i am running a bit behind schedule on client credentials grant but only the testing is pending, once that is done it should be available for use in your project.
2025-06-03 15407, 2025
lucifer[m]
with the client credentials grant, you won't need the manual /oauth2/authorization.
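For reference, a client credentials grant (RFC 6749 §4.4) is a single server-to-server POST to the token endpoint, with no user login involved. A sketch; the endpoint URL and scope name are assumptions, since the MeB grant was still being tested at the time:

```python
import urllib.parse

def build_token_request(client_id: str, client_secret: str) -> tuple[str, str]:
    """Build the (url, form body) for a client_credentials token request."""
    url = "https://metabrainz.org/oauth2/token"  # assumed endpoint
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "notifications",  # hypothetical scope name
    })
    return url, body

url, body = build_token_request("example-client", "example-secret")
```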
2025-06-03 15433, 2025
fettuccinae[m]
Ohh, got it, thanks. I'll add todo's for this and work on other things.
2025-06-03 15449, 2025
Kladky has quit
2025-06-03 15429, 2025
Kladky joined the channel
2025-06-03 15458, 2025
Kladky has quit
2025-06-03 15436, 2025
Kladky joined the channel
2025-06-03 15408, 2025
lucifer[m]
holycow23: i tested the dumps locally and everything seems to work fine, lets try again when you are around.
2025-06-03 15436, 2025
holycow23[m]
lucifer[m]: I can try right now
2025-06-03 15457, 2025
lucifer[m]
holycow23: okay try running `./develop.sh spark format` once and share its output.
incremental dump imported fine, its still importing the sample dump.
2025-06-03 15449, 2025
lucifer[m]
should be done in less than 5 mins.
2025-06-03 15407, 2025
holycow23[m]
Okay
2025-06-03 15419, 2025
lucifer[m]
update the logs when you see another Request done!
2025-06-03 15437, 2025
holycow23[m]
Updated
2025-06-03 15446, 2025
holycow23[m]
Got a request done!
2025-06-03 15452, 2025
lucifer[m]
that succeeded as well.
2025-06-03 15425, 2025
lucifer[m]
okay now run `./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity`
2025-06-03 15443, 2025
holycow23[m]
Done
2025-06-03 15408, 2025
lucifer[m]
update the request consumer logs after another request done
2025-06-03 15428, 2025
holycow23[m]
Okay
2025-06-03 15404, 2025
holycow23[m]
Will this take time?
2025-06-03 15436, 2025
lucifer[m]
should be done by now.
2025-06-03 15447, 2025
lucifer[m]
update the logs anyway and i'll take a look
2025-06-03 15400, 2025
holycow23[m]
Updated
2025-06-03 15450, 2025
lucifer[m]
yeah seems to be still running, lets wait. this is not optimized for running locally.
2025-06-03 15401, 2025
holycow23[m]
okay
2025-06-03 15406, 2025
lucifer[m]
it took 16s on my PC but docker-desktop is probably slower.
2025-06-03 15427, 2025
holycow23[m]
Okay
2025-06-03 15432, 2025
lucifer[m]
anything new in the logs?
2025-06-03 15439, 2025
holycow23[m]
Nope
2025-06-03 15409, 2025
holycow23[m]
It's been at this stage for a while now
2025-06-03 15442, 2025
lucifer[m]
okay, check spark_reader logs
2025-06-03 15458, 2025
lucifer[m]
and see if there are any messages for user_entity.
2025-06-03 15424, 2025
holycow23[m] uploaded an image: (53KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/xHVHOWWKtTxmXkJyFKQqfDFu/image.png >
2025-06-03 15437, 2025
lucifer[m]
check the logs above and below this, there are a lot of debug messages that might drown the user entity message. you can do a grep on the logs if possible for user_entity to confirm.
2025-06-03 15436, 2025
holycow23[m]
its been this throughout except these two lines
2025-06-03 15436, 2025
holycow23[m]
`2025-06-03 14:22:34,058 listenbrainz.webserver DEBUG Received a message, adding to internal processing queue...`
2025-06-03 15436, 2025
holycow23[m]
`2025-06-03 14:22:34,059 listenbrainz.webserver INFO Received message for import_incremental_dump`
2025-06-03 15443, 2025
lucifer[m]
i see
2025-06-03 15458, 2025
lucifer[m]
try running `./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity` again i guess
2025-06-03 15410, 2025
lucifer[m]
and see if there's anything new in request_consumer logs
2025-06-03 15426, 2025
holycow23[m]
listenbrainzspark hasn't changed after running the command
2025-06-03 15453, 2025
holycow23[m] uploaded an image: (32KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/QvxGdpfqwgXBUfjmeXvjcUeZ/image.png >
2025-06-03 15446, 2025
lucifer[m]
yeah that is fine, what about the request consumer logs
2025-06-03 15402, 2025
holycow23[m]
its the same
2025-06-03 15403, 2025
holycow23[m]
no update
2025-06-03 15447, 2025
lucifer[m]
i see, you can stop the containers.
2025-06-03 15452, 2025
lucifer[m]
`./develop.sh spark down`
2025-06-03 15408, 2025
lucifer[m]
and then run `./develop.sh spark up` to bring it back up again