pristine___: i might be wrong but moving the `UserRecommendationsRecord` and `UserRecommendationsMessage` to somewhere inside `listenbrainz_spark` folder should probably fix the issue
right now, the `data` folder at the root is not included in the source zip, so spark is unable to find those files and hence errors out
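The failure is reproducible without Spark at all: Python can only import packages that are actually inside a zip placed on `sys.path`, which is how the source zip shipped to executors behaves. A minimal sketch (the package names here are made up):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip that, like the Spark source zip, contains only the inner
# package and not a repo-root folder (names are made up for illustration).
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "source.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("inner_pkg/__init__.py", "VALUE = 42\n")

sys.path.insert(0, zip_path)

import inner_pkg                # inside the zip: imports fine
print(inner_pkg.VALUE)          # 42

try:
    import missing_pkg          # not in the zip, like the root `data` folder
except ModuleNotFoundError as err:
    print("missing:", err.name)  # missing: missing_pkg
```

Anything under the zipped package imports cleanly on the executors, while a module that only exists at the repo root raises `ModuleNotFoundError` there, even though it works locally where the full checkout is on the path.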
pristine___: i've triggered a new recommendations job, but i don't think i can do much else, the data validation errors need to be fixed.
pristine___: the job failed again, i'm not sure what the issue is, it'll need to be debugged in dev i guess. if this needs to be fixed quick, we should revert https://github.com/metabrainz/listenbrainz-serv... and deploy again.
other than that, things look reasonable to me, so i'm stepping away for now.
pristine___
> although the question is, if the errors are in request consumer, why is it affecting the site?
It affects the site in that users see "recommendations for the user not generated, check back later". It's a valid message, but if the script runs successfully users will be able to see their recs.
iliekcomputers: ^
iliekcomputers
Why is the site not showing the old recommendations
pristine___
Because the older recs don't match the Pydantic format. I triggered generate recommendations so that the recs are in the suitable format and will be shown on the site, but the script failed because of a `data module not found` error
iliekcomputers: I don't think there is a need to revert the PR. I will just open a PR to remove the data module usage from recommend.py and it will work.
iliekcomputers
We should investigate why the import fails
Does it work in dev?
pristine___
The error is weird, though, because the data module works for one script and doesn't for the other
Yeah, works in dev.
iliekcomputers
That is very weird
Let's open a ticket to investigate what exactly the issue is
pristine___
Right. Also, regarding the site, it looks better with this PR, in that *check back later* is a better message than an ISE, imo
Cool. I will open a ticket
iliekcomputers
Cool, I didn't really understand the urgency of this, but I'm happy with that plan
pristine___
iliekcomputers: steps:
1. Open a PR to remove the data module usage from recommend.py (your comment on the PR last night will fulfill this purpose).
2. Merge the PR, restart the request consumer.
3. Open a ticket to fix the issue.
iliekcomputers: Urgency of what?
iliekcomputers
Like how urgent fixing the error was
Steps look good to me
pristine___
iliekcomputers: So that users can see their recs. The rec feature is new, so users might be interested in checking their recs; I just don't want them to see the *check back later* message when a few days back we said go check your recs and give feedback.
iliekcomputers
Makes sense, which is why I suggested reverting
pristine___
iliekcomputers: But then a few users will get an ISE, no? *Check back later* is better than an ISE, and recs are better than *check back later*. Give me an hour, I have just woken up; I will make a PR in an hour or so.
iliekcomputers
Sure.
v6lur joined the channel
_lucifer
alastairp: there are some issues regarding gh:CB#311.
It is not working as expected because the create revision function itself calls other functions like `review.get_by_id` and `avg_rating.update`.
it seems that the created revision is not yet committed when the other two operations are executed, hence there is a mismatch.
i think this can probably be fixed if we pass the connection to those functions as well, but that means adding an optional connection parameter to almost all of the db operations
i am not sure if there is a better solution
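For illustration, the mismatch is reproducible with stdlib `sqlite3` as a stand-in for the real database code: a row inserted in one connection's open transaction is visible to reads on that same connection, but not to a second connection until commit. That is why threading the one connection through the helper functions would fix it.

```python
import os
import sqlite3
import tempfile

# One "writer" connection opens a transaction and inserts a revision row
# without committing -- like the create-revision function mid-flight.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE revision (id INTEGER PRIMARY KEY)")
writer.commit()
writer.execute("INSERT INTO revision (id) VALUES (1)")  # uncommitted

# A second connection -- like a helper that opens its own -- can't see it.
reader = sqlite3.connect(path)
seen_same = writer.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
seen_other = reader.execute("SELECT COUNT(*) FROM revision").fetchone()[0]
print(seen_same, seen_other)  # 1 0
```

The same isolation applies in Postgres, so helpers that open their own connection read the pre-revision state until the outer transaction commits.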
Gazooo794 has quit
Gazooo794 joined the channel
BrainzGit
[listenbrainz-server] vansika opened pull request #1110 (master…redundant-recommend-code): remove redundant dict->pydantic->dict conversion from recommend.py https://github.com/metabrainz/listenbrainz-serv...
pristine___
Though I still don't know why the recs for the user aren't in the expected format, at least the user will no longer see an ISE. In the meantime I will try to look into this.
iliekcomputers
sounds good.
gr0uch0mars joined the channel
ruaok
iliekcomputers: looks promising!
iliekcomputers
i lifted the lichess text :D
ruaok
all art is theft. :)
v6lur has quit
Glycem has quit
Glycem joined the channel
pristine___
ruaok: what if postgres' `unaccent` and python's `unidecode` give different results for the same accented string?
ruaok
we'll miss matches.
I think it might be best for me to move to unidecode for my next round of mapping work.
pristine___
> we'll miss matches.
ruaok: Right. We already miss matches since we are joining on MSIDs rn, and given that unidecode and unaccent results may differ, we will again miss matches. So I was wondering whether devoting time rn to creating matchable fields for artist_name and track_name is a good step. I mean, shouldn't we wait till the mapping also uses unidecode?
ruaok
it's a matter of timing and the severity of the problem
timing: I won't be doing mapping work until after the summit
severity: you're going to get many many more matches on text, but you're going to lose .0001% of those to funky decode mismatches. I bet you won't be able to tell.
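A sketch of where such mismatches come from, using stdlib `unicodedata` as a rough stand-in for what Postgres' `unaccent` does (decompose, then drop combining marks):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Rough stand-in for Postgres' unaccent: drop combining marks."""
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

print(strip_accents("Beyoncé"))    # Beyonce
print(strip_accents("Motörhead"))  # Motorhead

# Characters with no canonical decomposition pass through unchanged here,
# while unidecode transliterates them ("ø" -> "o", "ß" -> "ss") -- the kind
# of divergence that would silently drop a match on the normalized field.
print(strip_accents("ø"))          # still "ø"
```

For the common accented-Latin cases the two approaches agree, which is why the mismatch rate should be tiny.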
pristine___
Right
Cool. The missing MB data endpoint will tell us the matches we missed anyway
ruaok
yep.
gr0uch0mars has quit
MajorLurker has quit
shivam-kapila
iliekcomputers: what's your display res?
Mineo has quit
Mineo joined the channel
_lucifer
pristine___: ping
pristine___
_lucifer: pong
_lucifer
available to discuss as we decided the other day?
pristine___
Let's start from here, normalization of the input?
_lucifer
sure, i had a question before that
how does hdfs fit in the picture with spark?
pristine___
Yeah, so spark does all the processing of the data, and that data is stored in a distributed file system; here, that distributed file system is HDFS
_lucifer
ok makes sense, yes so let's continue
pristine___
Nice
So remember you were talking about that medium blog?
_lucifer
yup
pristine___
Do you have a link?
_lucifer
let me see if i can find it
pristine___
Cool. Rn, all we do is just count the number of times a user has listened to a song and feed it as such into the recommender
I guess it is affecting user-user similarity
in a not so good way, no?
_lucifer
yeah right, that is affecting the recs
in a bad way, at least theoretically
i am unable to find the link but the basic idea is this
|X - average| / mean
pristine___
X here is the playcount?
_lucifer
yes, and my bad, it should be |playcount - average| / std. deviation
average is here to counteract the user's own listening tendencies
and std deviation is to bring all users onto the same rating scale
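A sketch of that normalization, written without the absolute value so that below-average and above-average playcounts keep their sign (the standard per-user z-score):

```python
from statistics import mean, stdev

def normalize_playcounts(playcounts: list[int]) -> list[float]:
    """Per-user z-score: subtract the user's average playcount to cancel
    their overall listening volume, then divide by the standard deviation
    to put every user on the same rating scale."""
    mu = mean(playcounts)
    sigma = stdev(playcounts)
    return [(x - mu) / sigma for x in playcounts]

# A heavy listener and a light listener come out identical:
print(normalize_playcounts([100, 200, 300]))  # [-1.0, 0.0, 1.0]
print(normalize_playcounts([1, 2, 3]))        # [-1.0, 0.0, 1.0]
```

Feeding these scaled values instead of raw playcounts would stop heavy listeners from dominating the user-user similarity.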
pristine___
Right.
shivam-kapila
I smell variance here
pristine___
_lucifer: have you seen the rating beyond the limit error in Sentry?
_lucifer
i do have a sentry account 😅. is it open for all?
shivam-kapila
You will need an invite
_lucifer
you can share the stack trace for the time being i guess
pristine___
Not sure. But I will tell you. The ratings given by the recommender to recordings belong to (-1, 3)
Though we initially thought they'd be in (-1, 1)
I am still not sure about (-1, 3) but that's what I have seen till now.
These ratings don't make much sense
imo
And they are directly dependent on what we feed in, ig
i.e playcount
_lucifer
yeah, (-1, 3) does not make sense at all; it's either open-ended and the ratings should be considered relative to each other, or yeah, that
pristine___
> |X - average| / mean
Is that the only metric we have? I am not really good at this stuff, but I guess if we have a few metrics/ways of normalization, we can compare and find the better one.
shivam-kapila
Average == mean??
_lucifer
not the mean, the standard deviation; my mistake, as I said above 😓
there are many ways to normalize yes
this is the most basic one imo
pristine___
Hmm.. Okay. So to start, we know that the current way of feeding playcounts isn't really cool, since user interactions can only be interpreted as positive feedback (implicit feedback)
We can start with the formula you shared above
_lucifer
yeah, we can experiment and compare the results
pristine___
Would you like to work on this? I mean simply treat the generated playcounts with the above formula and compare results
You will need to have some data sets in HDFS on your local machine, and you are good to go
_lucifer
yeah sure, i am currently setting up spark locally and will try to generate recs locally