in #metabrainz

1:11 AM
d4rkie joined the channel
1:13 AM
d4rk-ph0enix has quit
1:55 AM
Jigen joined the channel
1:56 AM
Goemon has quit
1:57 AM
ApeKattQuest has quit
1:59 AM
ApeKattQuest joined the channel
1:59 AM
ApeKattQuest has quit
1:59 AM
ApeKattQuest joined the channel
3:26 AM
dabeglavins has quit
4:47 AM
pite has quit
6:04 AM
lucifer[m]

rayyan_seliya123: let's start with just 78rpm/cylinder for now. you should update your prototype or rewrite it from scratch to work with the rest of the codebase. https://github.com/metabrainz/listenbrainz-serv...
6:05 AM
you won't need to create new models, we don't use sqlalchemy as an orm anyway.
6:06 AM
look at the existing pydantic models at https://github.com/metabrainz/listenbrainz-serv... and the SQL tables at https://github.com/metabrainz/listenbrainz-serv...
6:07 AM
you can see apple and spotify follow the same structure whereas soundcloud has a different one. you should check what data is available in the IA and then either map it to the existing apple/spotify or soundcloud format. if neither is suitable then we can think of a new format.
7:23 AM
Maxr1998_ has quit
7:24 AM
Maxr1998 joined the channel
8:00 AM
rayyan_seliya123

<lucifer[m]> "you can see apple and spotify..." <- Thanks for the detailed guidance! I’ve reviewed the existing codebase and understand that for the 78rpm/cylinder collections, the Internet Archive data is mostly track-level, so I’ll map it to the SoundCloud format as you suggested.
8:00 AM
For moving forward, would you prefer that I work directly in the main ListenBrainz repo through PRs, or should I start in a separate branch and then merge my work in? I want to follow whatever workflow you think is best for the project.
8:00 AM
Let me know what you prefer, and I’ll get started accordingly!
8:08 AM
lucifer[m]

rayyan_seliya123: work with the LB repo through PRs.
8:12 AM
rayyan_seliya123

lucifer[m]: Okk fine 👍
10:04 AM
_BrainzGit

[listenbrainz-server] amCap1712 opened pull request #3292 (master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-serv...
10:08 AM
lucifer[m]

monkey: the current similarity scores on LB should be using this new algorithm
10:09 AM
monkey[m]

Ooh, OK
10:09 AM
lucifer[m]

do they seem sensible to you?
10:09 AM
mayhem[m] is reading the PR right now
10:09 AM
i have a dump of the score before this change if you want to compare.
10:09 AM
monkey[m]

Damn, I don't have older version saved to compare, but let me look
10:10 AM
holycow23[m]

<lucifer[m]> "i'll fix the errors and let..." <- Hey lucifer, any update on this
10:10 AM
lucifer[m]

holycow23: not yet
10:11 AM
holycow23[m]

Okay
10:11 AM
mayhem[m]

lucifer: it's hard to judge the cosine similarity without having prior data.
10:11 AM
monkey[m]

Would love to compare to see if it was the case before, but I'm already seeing two users with whom I have 6+ artists in common at 0% compatibility, which feels wrong.
10:11 AM
But I've always thought the similarity scores were low
10:11 AM
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/yhckSFyYztsIjnTgUrAmpKDY
10:12 AM
_BrainzGit

[musicbrainz-server] reosarevok opened pull request #3552 (master…MBS-14047): MBS-14047: Support medium in NotFound https://github.com/metabrainz/musicbrainz-serve...
10:12 AM
BrainzBot

MBS-14047: ISE when trying to reach non-existing medium MBID https://tickets.metabrainz.org/browse/MBS-14047
10:12 AM
lucifer[m]

monkey: ^
10:12 AM
mayhem[m]

I noticed that the closest person to me is now much stronger, while the others are weaker.
10:12 AM
lucifer[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/txxvtmQvDEICRXZHaGRmpFeD
10:13 AM
lucifer[m]

the first row is pearson coefficient and the second row is cosine similarity
10:13 AM
monkey[m]

Well, they seem very close
10:13 AM
mayhem[m]

oh wow. well, I guess I haven't looked at similarity data in a while.
10:14 AM
lucifer[m]

mayhem: user similarities have not updated in a few days because it always OOM'ed.
10:14 AM
last week it OOM'ed in a way that brought down the cluster, so i changed it.
10:15 AM
monkey[m]

FWIW I think the similarity calculations need to be reviewed, but when it comes to fixing the OOM, and given the tiny differences between the numbers I see, I would consider them equivalent.
10:15 AM
lucifer[m]

we can implement and experiment with pearson coefficient, it's just that we'd have to implement something manually. which is doable.
10:15 AM
i went with column similarities because it exists there and was a smaller fix.
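For readers comparing the two measures discussed above, here is a minimal pure-Python sketch. It is not the ListenBrainz implementation; the function names and the listen-count vectors are made up for illustration. It shows that the pearson coefficient is the cosine similarity of the mean-centered vectors, which is one reason the two can land close together on similar inputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Undefined for a zero vector; this sketch reports 0.0 instead.
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_coefficient(a, b):
    """Pearson correlation, computed as cosine similarity of centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    centered_a = [x - mean_a for x in a]
    centered_b = [y - mean_b for y in b]
    # A constant vector centers to all zeros, so the guard above returns 0.0.
    return cosine_similarity(centered_a, centered_b)


# Hypothetical per-artist listen counts for two users (same artist order).
user_one = [10, 0, 5, 0]
user_two = [8, 1, 4, 0]

print("cosine :", cosine_similarity(user_one, user_two))
print("pearson:", pearson_coefficient(user_one, user_two))
```

The only difference between the two functions is the centering step, so any gap between the scores comes from how far each user's mean listen count is from zero.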
10:16 AM
mayhem[m]

I think we should keep it for the time being and ask the community for feedback.
10:16 AM
monkey[m]

Might be worth calculating the average difference between the two methods for all the users you have data for, but... From my point of view they are both equally low.
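The comparison suggested here could be sketched roughly as follows. This assumes both sets of scores are available as dictionaries keyed by user pair; that layout, the function name and the sample values are assumptions for illustration, not the format of the actual dump.

```python
def mean_absolute_difference(pearson_scores, cosine_scores):
    """Average |pearson - cosine| over the user pairs present in both dicts."""
    shared_pairs = pearson_scores.keys() & cosine_scores.keys()
    if not shared_pairs:
        return 0.0
    total = sum(
        abs(pearson_scores[pair] - cosine_scores[pair]) for pair in shared_pairs
    )
    return total / len(shared_pairs)


# Hypothetical scores; pairs missing from either side are ignored.
old_scores = {("alice", "bob"): 0.10, ("alice", "carol"): 0.30}
new_scores = {
    ("alice", "bob"): 0.12,
    ("alice", "carol"): 0.29,
    ("bob", "carol"): 0.50,
}

print(mean_absolute_difference(old_scores, new_scores))
```

A single average can hide a few large shifts, so looking at the largest per-pair difference as well may be worthwhile.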
10:16 AM
mayhem[m]

the downside to that is that everyone has an opinion on how it should work and there'd be "it's just a little tweak" comments.
10:17 AM
(in ML, its never just a little tweak.)
10:18 AM
monkey[m]

Little tweak, big refactor
10:18 AM
lucifer[m]

fwiw, i don't recall any particular reason for implementing it with pearson coefficient the first time.
10:18 AM
i do think there is value in experimenting and improving similarities but we'd need to do it more rigorously, define proper test datasets as a reference etc etc
10:19 AM
monkey[m]

Agreed.
10:20 AM
For my numbers the differences were sub-percentage point, which makes virtually no difference, so OK from me.
11:58 AM
_BrainzGit

[listenbrainz-server] amCap1712 merged pull request #3292 (master…similar-users): Use cosine similarity instead of pearson coefficient for similar users https://github.com/metabrainz/listenbrainz-serv...
12:22 PM
fettuccinae[m]

mayhem: ping
12:34 PM
mayhem[m]

Pong
12:37 PM
fettuccinae[m]

For authorization of endpoints, each project can have an auth token generated from MeB and saved in secrets of both MeB and the project.
12:37 PM
That way, when a project makes a request, we can authorize it using either the token or the owner_id of the token sent. Is this approach okay?
12:39 PM
mayhem[m]

I think so, but lucifer: is more on top of OAuth-related questions. lucifer: ?
12:42 PM
lucifer[m]

@fettuccinae:matrix.org: not sure what you mean. but the workflow would be as follows: the project (LB/BB/MB) connects to MeB to obtain an auth token, and uses that auth token in the request to post notifications to MeB. MeB validates whether the token has the relevant scopes and is owned by one of the hardcoded client ids in the configuration. if yes then it proceeds, otherwise it rejects the request.
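The validation step described in this message might look something like the sketch below. Everything named here is hypothetical: the `AccessToken` shape, the client ids, the `notification` scope name and the expiry check are assumptions for illustration and are not taken from the MeB codebase.

```python
from dataclasses import dataclass

# Hypothetical configuration values; the real client ids and scope are not
# given in the discussion.
ALLOWED_CLIENT_IDS = frozenset({"lb-client-id", "bb-client-id", "mb-client-id"})
REQUIRED_SCOPE = "notification"


@dataclass(frozen=True)
class AccessToken:
    client_id: str
    scopes: frozenset
    expires_at: float  # Unix timestamp; expiry handling is an assumption here.


def is_authorized(token, now, allowed_client_ids=ALLOWED_CLIENT_IDS,
                  required_scope=REQUIRED_SCOPE):
    """Return True only if the token is present, unexpired, carries the
    required scope, and belongs to one of the configured clients."""
    if token is None:
        return False
    if token.expires_at <= now:
        return False
    if required_scope not in token.scopes:
        return False
    return token.client_id in allowed_client_ids


token = AccessToken("lb-client-id", frozenset({"notification"}), expires_at=2000.0)
print(is_authorized(token, now=1000.0))
```

Checking the client id against configuration rather than comparing a stored token string keeps the tokens themselves short-lived and replaceable.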
12:46 PM
fettuccinae[m]

lucifer[m]: i was thinking an admin user could generate auth token for projects through https://metabrainz.org/profile#, and then this token could be hardcoded in the configuration of both project and Meb. So when a project sends a request with this token, MeB verifies it against the saved token in config and allows the request
12:47 PM
lucifer[m]

@fettuccinae:matrix.org: no we don't want to do that for multiple reasons. it makes token rotation hard and we cannot have expiring tokens this way.
12:48 PM
fettuccinae[m]

ohh, but how can the project get tokens in the authorization for if login is required for /oauth2/authorize.
12:48 PM
lucifer[m]

unless there is a strong reason we should stick to using the oauth way. i am running a bit behind schedule on client credentials grant but only the testing is pending, once that is done it should be available for use in your project.
12:48 PM
fettuccinae[m]

s/for/flow/
12:49 PM
lucifer[m]

with the client credentials grant, you won't need the manual /oauth2/authorization.
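For context, a client credentials token request as described in RFC 6749 (section 4.4) can be sketched like this. The token URL, client id, secret and scope below are placeholders, and whether MeB's token endpoint expects HTTP Basic client authentication is an assumption; the sketch only builds the request and does not send it.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_client_credentials_request(token_url, client_id, client_secret, scope):
    """Build (but do not send) a POST request for the client credentials grant."""
    body = urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    # The client authenticates itself; no user login or redirect is involved.
    credentials = base64.b64encode(
        f"{client_id}:{client_secret}".encode("utf-8")
    ).decode("ascii")
    return Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder values, not real MeB endpoints or credentials.
request = build_client_credentials_request(
    "https://metabrainz.example/oauth2/token",
    "example-client-id",
    "example-client-secret",
    "notification",
)
print(request.get_method(), request.full_url)
```

Because the project authenticates with its own client id and secret, the token can expire and be requested again without anyone logging in.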
12:50 PM
fettuccinae[m]

Ohh, got it, thanks. I'll add todo's for this and work on other things.
13:31 PM
Kladky has quit
13:34 PM
Kladky joined the channel
13:41 PM
Kladky has quit
13:43 PM
Kladky joined the channel
14:14 PM
lucifer[m]

holycow23: i tested the dumps locally and everything seems to work fine, lets try again when you are around.
14:14 PM
holycow23[m]

lucifer[m]: I can try right now
14:14 PM
lucifer[m]

holycow23: okay try running `./develop.sh spark format` once and share its output.
14:15 PM
pite joined the channel
14:18 PM
holycow23[m]

Do I share the entire log?
14:19 PM
lucifer[m]

the last few lines should be enough
14:19 PM
holycow23[m]

https://gist.github.com/granth23/95232d5ed5c0ef...
14:19 PM
I have updated it here
14:19 PM
lucifer[m]

looks good.
14:19 PM
holycow23[m]

Okay
14:20 PM
lucifer[m]

now run ./develop.sh up web -d and then ./develop.sh spark up -d
14:20 PM
holycow23[m] uploaded an image: (64KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/dRohsmAEdwgpTRGQvgvGwfdz/image.png >
14:20 PM
holycow23[m] uploaded an image: (81KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/PfVlTkHjAfqJjpsMQprdMams/image.png >
14:21 PM
./develop.sh manage spark request_import_incremental
14:21 PM
./develop.sh manage spark request_import_sample
14:21 PM
holycow23[m] uploaded an image: (56KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/fePYeFuAPcQmXyMssZTTvchx/image.png >
14:22 PM
monitor the logs for the request consumer container and share them when its done executing these commands.
14:23 PM
holycow23[m]

https://gist.github.com/granth23/95232d5ed5c0ef...
14:23 PM
Updated here
14:23 PM
lucifer[m]

incremental dump imported fine, its still importing the sample dump.
14:23 PM
should be done in less than 5 mins.
14:24 PM
holycow23[m]

Okay
14:24 PM
lucifer[m]

update the logs when you see another Request done!
14:24 PM
holycow23[m]

Updated
14:24 PM
Got a request done!
14:24 PM
lucifer[m]

that succeeded as well.
14:25 PM
okay now run ./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity
14:25 PM
holycow23[m]

Done
14:26 PM
lucifer[m]

update the request consumer logs after another request done
14:26 PM
holycow23[m]

Okay
14:29 PM
Will this take time?
14:30 PM
lucifer[m]

should be done by now.
14:30 PM
update the logs anyway and i'll take a look
14:32 PM
holycow23[m]

Updated
14:32 PM
lucifer[m]

yeah seems to be still running, lets wait. this is not optimized for running locally.
14:33 PM
holycow23[m]

okay
14:34 PM
lucifer[m]

it took 16s on my PC but docker-desktop is probably slower.
14:34 PM
holycow23[m]

Okay
14:34 PM
lucifer[m]

anything new in logs
14:34 PM
holycow23[m]

Nope
14:35 PM
It's been at this stage for a long time now
14:35 PM
lucifer[m]

okay, check spark_reader logs
14:35 PM
and see if there are any messages for user_entity.
14:38 PM
holycow23[m] uploaded an image: (53KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/xHVHOWWKtTxmXkJyFKQqfDFu/image.png >
14:39 PM
check the logs above and below this, there are a lot of debug messages that might drown the user entity message. you can do a grep on the logs if possible for user_entity to confirm.
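If a shell grep is not convenient, the same filtering can be done on saved log text with a few lines of Python. This is a small sketch that assumes the container logs have already been captured as a string; the helper name is made up, and the sample lines are the two quoted later in this conversation.

```python
def grep_lines(log_text, needle):
    """Return the log lines that contain the given substring."""
    return [line for line in log_text.splitlines() if needle in line]


sample_log = (
    "2025-06-03 14:22:34,058 listenbrainz.webserver DEBUG "
    "Received a message, adding to internal processing queue...\n"
    "2025-06-03 14:22:34,059 listenbrainz.webserver INFO "
    "Received message for import_incremental_dump\n"
)

# An empty list means no user_entity message reached this container.
print(grep_lines(sample_log, "user_entity"))
print(grep_lines(sample_log, "import_incremental_dump"))
```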
14:45 PM
holycow23[m]

its been this throughout except these two lines
14:45 PM
`2025-06-03 14:22:34,058 listenbrainz.webserver DEBUG Received a message, adding to internal processing queue...`
14:45 PM
`2025-06-03 14:22:34,059 listenbrainz.webserver INFO Received message for import_incremental_dump`
14:45 PM
lucifer[m]

i see
14:45 PM
try running ./develop.sh manage spark request_user_stats --entity artists --range this_week --type entity again i guess
14:46 PM
and see if there's anything new in request_consumer logs
14:47 PM
holycow23[m]

listenbrainzspark hasn't changed after running the command
14:47 PM
holycow23[m] uploaded an image: (32KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/QvxGdpfqwgXBUfjmeXvjcUeZ/image.png >
14:48 PM
lucifer[m]

yeah that is fine, what about the request consumer logs
14:49 PM
holycow23[m]

its the same
14:49 PM
no updates
14:49 PM
lucifer[m]

i see, you can stop the containers.
14:49 PM
./develop.sh spark down
14:50 PM
and then run ./develop.sh spark up to bring it back up again
14:50 PM
./develop.sh manage spark request_user_stats --entity recordings --range this_week --type entity
14:50 PM
then run recording stats instead.
14:51 PM
holycow23[m] uploaded an image: (130KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/pfpKdaqrTcgkGUGRHSZpNxfr/image.png >
14:51 PM
holycow23[m]

I ran recordings only and received this
14:52 PM
lucifer[m]

./develop.sh manage spark request_user_stats --entity release_groups --range this_week --type entity
14:52 PM
try release groups.
14:52 PM
holycow23[m]

Okay
14:53 PM
I wait now right?
14:53 PM
lucifer[m]

yes, what do the logs show
14:54 PM
holycow23[m]

request consumer?
14:54 PM
lucifer[m]

yes
14:54 PM
holycow23[m]

https://gist.github.com/granth23/95232d5ed5c0ef...
14:54 PM
updated current status
14:55 PM
lucifer[m]

i see. okay.
14:55 PM
./develop.sh manage spark request_user_stats --type listening_activity --range this_week