pristine___: i might be wrong but moving the `UserRecommendationsRecord` and `UserRecommendationsMessage` to somewhere inside `listenbrainz_spark` folder should probably fix the issue
2020-09-26 27055, 2020
_lucifer
right now, the `data` folder at the root is not included in the source zip, so spark is unable to find those files and hence errors
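A minimal sketch of the failure mode described above, assuming the job's code reaches the workers only through a source zip placed on the import path. The package names `demo_spark_pkg` and `demo_data_pkg` are hypothetical stand-ins, not the real ListenBrainz modules.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_source_zip(zip_path, packages):
    """Write a zip holding a tiny __init__.py for each listed package."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name in packages:
            zf.writestr(f"{name}/__init__.py", f"NAME = {name!r}\n")


def can_import(module_name):
    """Return True if the module can be imported, False if it is not found."""
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False


# Only one of the two packages is packed into the zip (hypothetical names).
zip_path = os.path.join(tempfile.mkdtemp(), "source.zip")
build_source_zip(zip_path, ["demo_spark_pkg"])

# Python can import packages straight from a zip that is on sys.path.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()

inside_zip = can_import("demo_spark_pkg")   # packaged, so importable
outside_zip = can_import("demo_data_pkg")   # left out of the zip
print(inside_zip, outside_zip)
```

Anything imported by the job has to live inside the zipped tree, which is why moving the models under the packaged folder would likely make the import resolve.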
pristine___: i've triggered a new recommendations job, but i don't think i can do much else, the data validation errors need to be fixed.
2020-09-26 27046, 2020
iliekcomputers
pristine___: the job failed again, i'm not sure what the issue is, it'll need to be debugged in dev i guess. if this needs to be fixed quick, we should revert https://github.com/metabrainz/listenbrainz-server… and deploy again.
2020-09-26 27051, 2020
iliekcomputers
other than that, things look reasonable to me, so i'm stepping away for now.
2020-09-26 27037, 2020
pristine___
> although the question is, if the errors are in request consumer, why is it affecting the site?
2020-09-26 27033, 2020
pristine___
It is affecting the site in that users see "recommendations for the user not generated, check back later". It's a valid message, but if the script runs successfully users will be able to see their recs.
2020-09-26 27039, 2020
pristine___
iliekcomputers: ^
2020-09-26 27036, 2020
iliekcomputers
Why is the site not showing the old recommendations
2020-09-26 27057, 2020
pristine___
Because the older recs do not conform to the Pydantic format. I triggered generate recommendations so that the recs are in a suitable format and will be shown on the site, but the script failed because of a `data module not found` error
2020-09-26 27024, 2020
pristine___
iliekcomputers: I don't think there is a need to revert the PR, I will just open a PR to remove the data module usage from recommend.py and it will work.
2020-09-26 27011, 2020
iliekcomputers
We should investigate why the import fails
2020-09-26 27020, 2020
iliekcomputers
Does it work in dev?
2020-09-26 27020, 2020
pristine___
Though it is weird, the error. Because the data module works for one script and doesn't for the other
2020-09-26 27030, 2020
pristine___
Yeah, works in dev.
2020-09-26 27047, 2020
iliekcomputers
That is very weird
2020-09-26 27019, 2020
iliekcomputers
Let's open a ticket to investigate what exactly the issue is
2020-09-26 27034, 2020
pristine___
Right. Also, regarding the site, it looks better with this PR in a way that check back later is a better message than ISE, imo
2020-09-26 27041, 2020
pristine___
Cool. I will open a ticket
2020-09-26 27010, 2020
iliekcomputers
Cool, I didn't really understand the urgency of this, but I'm happy with that plan
2020-09-26 27006, 2020
pristine___
iliekcomputers: steps
2020-09-26 27006, 2020
pristine___
1. Open a PR to remove data module usage from recommend.py (your comment on the PR last night will fulfill this purpose)
2020-09-26 27006, 2020
pristine___
2. Merge the PR, restart request consumer.
2020-09-26 27006, 2020
pristine___
3. Open a ticket to fix the issue.
2020-09-26 27015, 2020
pristine___
iliekcomputers: Urgency of what?
2020-09-26 27016, 2020
iliekcomputers
Like how urgent fixing the error was
2020-09-26 27058, 2020
iliekcomputers
Steps look good to me
2020-09-26 27013, 2020
pristine___
iliekcomputers: So that users can see their recs. I am of the opinion that the rec feature is new, so users might be interested in checking their recs. I just don't want them to see the check back later message when a few days back we told them to go check their recs and give feedback.
2020-09-26 27009, 2020
iliekcomputers
Makes sense, which is why I suggested reverting
2020-09-26 27036, 2020
pristine___
iliekcomputers: But then a few users will get ISE, no? *Check back later* is better than ISE, and recs better than *check back later*. Give me an hour, have just woken up, I will make a PR in an hour or so.
2020-09-26 27011, 2020
iliekcomputers
Sure.
2020-09-26 27017, 2020
v6lur joined the channel
2020-09-26 27040, 2020
_lucifer
alastairp: there are some issues regarding gh:CB#311.
It is not working as expected because the create revision function itself calls other functions like `review.get_by_id` and `avg_rating.update`.
2020-09-26 27001, 2020
_lucifer
it seems that the created revision is not yet committed when the other two operations are executed hence there is a mismatch.
2020-09-26 27026, 2020
_lucifer
i think this can be probably fixed if we pass the connection to those functions as well but that means adding an optional connection to almost all of the db operations
2020-09-26 27054, 2020
_lucifer
i am not sure if there is a better solution
2020-09-26 27001, 2020
Gazooo794 has quit
2020-09-26 27048, 2020
Gazooo794 joined the channel
2020-09-26 27057, 2020
BrainzGit
[listenbrainz-server] vansika opened pull request #1110 (master…redundant-recommend-code): remove redundant dict->pydantic->dict conversion from recommend.py https://github.com/metabrainz/listenbrainz-server…
Though I still don't know why recs for the user aren't in the expected format, at least the user will no longer see an ISE. In the meantime I will try to look into this.
2020-09-26 27045, 2020
iliekcomputers
sounds good.
2020-09-26 27015, 2020
gr0uch0mars joined the channel
2020-09-26 27054, 2020
ruaok
iliekcomputers: looks promising!
2020-09-26 27024, 2020
iliekcomputers
i lifted the lichess text :D
2020-09-26 27038, 2020
ruaok
all art is theft. :)
2020-09-26 27011, 2020
v6lur has quit
2020-09-26 27047, 2020
Glycem has quit
2020-09-26 27013, 2020
Glycem joined the channel
2020-09-26 27002, 2020
pristine___
ruaok: what if postgres' `unaccent` and python's `unidecode` give different results for the same accented string
2020-09-26 27044, 2020
ruaok
we'll miss matches.
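A minimal sketch of how two normalizers can disagree on the same string and so produce join keys that never match. Neither function below is Postgres `unaccent` or Python `unidecode`; both are hypothetical stand-ins built on the standard library to show the effect.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose with NFKD and drop combining marks.

    Letters that have no decomposition (such as the slashed o) pass through.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical hand-written table standing in for a transliteration library.
TRANSLIT_TABLE = str.maketrans(
    {"\u00d8": "O", "\u00f8": "o", "\u00f6": "o", "\u00e9": "e"}
)


def transliterate(text):
    return text.translate(TRANSLIT_TABLE)


names = ["Bj\u00f6rk", "Beyonc\u00e9", "M\u00f8"]

# Names whose two normalized forms differ would be missed by a text join.
mismatches = [
    name for name in names
    if strip_combining_marks(name) != transliterate(name)
]
print(mismatches)
```

Using the same normalizer on both sides of the join removes this class of mismatch, whichever normalizer is picked.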
2020-09-26 27003, 2020
ruaok
I think it might be best for me to move to unidecode for my next round of mapping work.
2020-09-26 27058, 2020
pristine___
> we'll miss matches.
2020-09-26 27037, 2020
pristine___
ruaok: Right. We already miss matches since we are joining on msids rn, and because unidecode and unaccent results may differ, we will again miss matches. So I was wondering if devoting time to creating matchable fields rn for artist_name and track_name is a good step. I mean, shouldn't we wait till mapping also uses unidecode?
2020-09-26 27013, 2020
ruaok
it's a matter of timing and severity of the problem
2020-09-26 27022, 2020
ruaok
timing: I won't be doing mapping work until after the summit
2020-09-26 27000, 2020
ruaok
severity: you're going to get many many more matches on text, but you're going to lose .0001% of those to funky decode mismatches. I bet you won't be able to tell.
2020-09-26 27010, 2020
pristine___
Right
2020-09-26 27050, 2020
pristine___
Cool. The missing mb data endpoint will tell us anyway the matches we missed
2020-09-26 27058, 2020
ruaok
yep.
2020-09-26 27041, 2020
gr0uch0mars has quit
2020-09-26 27000, 2020
MajorLurker has quit
2020-09-26 27025, 2020
shivam-kapila
iliekcomputers: whats your display res
2020-09-26 27012, 2020
Mineo has quit
2020-09-26 27022, 2020
Mineo joined the channel
2020-09-26 27033, 2020
_lucifer
pristine___: ping
2020-09-26 27054, 2020
pristine___
_lucifer: pong
2020-09-26 27019, 2020
_lucifer
available for discussing as we decided the other day?
Let's start from here, normalization of the input?
2020-09-26 27046, 2020
_lucifer
sure, i had a question before that
2020-09-26 27000, 2020
_lucifer
how does hdfs fit in the picture with spark?
2020-09-26 27058, 2020
pristine___
Yeah, so spark does all the processing of data, and that data is stored in a distributed system, here that distributed system is HDFS
2020-09-26 27034, 2020
_lucifer
ok makes sense, yes so let's continue
2020-09-26 27042, 2020
pristine___
Nice
2020-09-26 27059, 2020
pristine___
So remember you were talking about that medium blog?
2020-09-26 27003, 2020
_lucifer
yup
2020-09-26 27007, 2020
pristine___
Do you have a link?
2020-09-26 27028, 2020
_lucifer
let me see if i can find it
2020-09-26 27011, 2020
pristine___
Cool. Rn, all we do is just count the number of times a user has listened to a song, feed it as such in the recommender
2020-09-26 27037, 2020
pristine___
I guess it is affecting user-user similarity
2020-09-26 27054, 2020
pristine___
in a not so good way, no?
2020-09-26 27034, 2020
_lucifer
yeah right, that is affecting the recs
2020-09-26 27054, 2020
_lucifer
in a bad way, at least theoretically
2020-09-26 27029, 2020
_lucifer
i am unable to find the link but the basic idea is this
2020-09-26 27059, 2020
_lucifer
|X - average| / mean
2020-09-26 27044, 2020
pristine___
X here is the playcount?
2020-09-26 27051, 2020
_lucifer
yes, and my bad, it should be |playcount - average| / std. deviation
2020-09-26 27035, 2020
_lucifer
average is here to counteract the user's own listening tendencies
2020-09-26 27057, 2020
_lucifer
and std deviation is to bring all users onto the same rating scale
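A minimal sketch of the per-user normalization discussed here, assuming playcounts arrive as `(user, recording, playcount)` tuples. The function name, the row shape, and the `absolute` flag are hypothetical; with `absolute=False` this is the standard signed z-score.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_playcounts(rows, absolute=True):
    """Scale each playcount by its user's own average and std. deviation.

    rows: list of (user, recording, playcount) tuples.
    Returns a list of (user, recording, score) tuples in the same order.
    """
    counts_by_user = defaultdict(list)
    for user, _recording, playcount in rows:
        counts_by_user[user].append(playcount)

    # Per-user average and population standard deviation.
    stats = {
        user: (mean(counts), pstdev(counts))
        for user, counts in counts_by_user.items()
    }

    scored = []
    for user, recording, playcount in rows:
        average, std = stats[user]
        if std == 0:
            # All of this user's playcounts are equal; nothing to scale.
            score = 0.0
        else:
            deviation = playcount - average
            score = (abs(deviation) if absolute else deviation) / std
        scored.append((user, recording, score))
    return scored


rows = [
    ("light_listener", "rec_1", 1),
    ("light_listener", "rec_2", 3),
    ("heavy_listener", "rec_1", 10),
    ("heavy_listener", "rec_2", 30),
]
scores = normalize_playcounts(rows)
for row in scores:
    print(row)
```

Here the heavy listener plays everything ten times as often, yet both users end up with scores on the same scale, which is the point of dividing by the standard deviation.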
2020-09-26 27036, 2020
pristine___
Right.
2020-09-26 27048, 2020
shivam-kapila
I smell variance here
2020-09-26 27019, 2020
pristine___
_lucifer: have you seen the rating beyond the limit error in Sentry?
2020-09-26 27049, 2020
_lucifer
i don't have a sentry account 😅. is it open for all?
2020-09-26 27018, 2020
shivam-kapila
You will need an invite
2020-09-26 27042, 2020
_lucifer
you can share the stack trace for the time being i guess
2020-09-26 27001, 2020
pristine___
Not sure. But I will tell you. The ratings given by the recommender to recordings belong to (-1, 3)
2020-09-26 27022, 2020
pristine___
Though we initially thought them to be in (-1, 1)
2020-09-26 27044, 2020
pristine___
I am still not sure about (-1, 3) but that's what I have seen till now.
2020-09-26 27053, 2020
pristine___
These ratings don't make much sense
2020-09-26 27058, 2020
pristine___
imo
2020-09-26 27014, 2020
pristine___
And they are directly dependent on what we feed in, ig
2020-09-26 27025, 2020
pristine___
i.e. playcount
2020-09-26 27036, 2020
_lucifer
yeah, (-1, 3) does not make sense at all, it's either open-ended and the ratings should be considered relative to each other or yeah that
2020-09-26 27050, 2020
pristine___
> |X - average| / mean
2020-09-26 27041, 2020
pristine___
Is that the only metric we have? I am not really good at this stuff, but I guess if we have a few metrics/ways of normalization, we can compare and find the better one.
2020-09-26 27042, 2020
shivam-kapila
Average == mean??
2020-09-26 27001, 2020
_lucifer
not mean, the standard deviation; my mistake, as I said above 😓
2020-09-26 27034, 2020
_lucifer
there are many ways to normalize yes
2020-09-26 27050, 2020
_lucifer
this is the most basic one imo
2020-09-26 27004, 2020
pristine___
Hmm.. Okay. So to start, we know that the current way of feeding playcount isn't really cool since user interactions can only be interpreted as positive feedback (implicit feedback)
2020-09-26 27024, 2020
pristine___
We can start with the formula you shared above
2020-09-26 27000, 2020
_lucifer
yeah, we can experiment and compare the results
2020-09-26 27011, 2020
pristine___
Would you like to work on this? I mean simply treat the generated playcounts with the above formula and compare results
2020-09-26 27043, 2020
pristine___
You will need to have some data sets in hdfs on your local machine, and you are good to go
2020-09-26 27043, 2020
_lucifer
yeah sure, i am currently setting up spark locally and will try to generate recs locally