ruaok: the messages sent back to lemmy should be one message per user or all users in a single message?
2021-03-04 06325, 2021
reosarevok
yvanzo: I know you've talked about configurable columns for data display before - do you know if we have a ticket for that? (https://tickets.metabrainz.org/browse/MBS-11414 is about that and I'm wondering if it's a dupe)
2021-03-04 06326, 2021
BrainzBot
MBS-11414: Collection view should allow managing (thus adding missing) columns
2021-03-04 06304, 2021
ruaok
_lucifer: all in one.
2021-03-04 06314, 2021
_lucifer
👍
2021-03-04 06331, 2021
ruaok
updating the table row by row would be painfully slow. I plan to insert rows into a new table and then atomically swap the tables into production.
2021-03-04 06319, 2021
_lucifer
ruaok: i just pushed the initial implementation for user similarity. i was going through how lemmy requests similar users and think that it'll probably need a couple of changes.
2021-03-04 06345, 2021
ruaok
ok, what does it need?
2021-03-04 06305, 2021
_lucifer
since we decided earlier to separate dataframe creation, the similar-user request should just send a threshold
2021-03-04 06326, 2021
ruaok
ah, ok. np, will fix.
2021-03-04 06332, 2021
_lucifer
before sending a request for similar users, we need to manually request dataframes
2021-03-04 06342, 2021
ruaok
makes sense.
2021-03-04 06331, 2021
_lucifer
that part uses days instead of years, so the request should send the number of days instead of years
2021-03-04 06306, 2021
ruaok
theoretically that part should not need any changes right?
2021-03-04 06310, 2021
ruaok
just use days, yes?
2021-03-04 06313, 2021
ruaok
years argument removed.
2021-03-04 06323, 2021
_lucifer
the days part no. but the request should now send a job_type as well to denote whether the dataframe is being generated for recommendations or user similarity
2021-03-04 06306, 2021
ruaok
what are the two exact string values possible for job_type?
2021-03-04 06349, 2021
_lucifer
i am using "recommendation" and "user_similarity" for now but that can be changed
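Putting the pieces of the discussion together, the two requests might look something like the following. This is only a sketch: the `query` names and the surrounding message shape are assumptions, and only `threshold`, `days`, and the `job_type` values come from the conversation itself.

```python
# Hypothetical request payloads -- field names other than "threshold",
# "days" and "job_type" are illustrative, not the actual lemmy protocol.

# Step 1: explicitly request dataframe generation first.
dataframe_request = {
    "query": "cf.create_dataframes",       # assumed query name
    "params": {
        "days": 180,                       # window in days, not years
        "job_type": "user_similarity",     # or "recommendation_recording"
    },
}

# Step 2: request similar users; per the discussion it now only
# needs to send a threshold.
similar_users_request = {
    "query": "cf.similar_users",           # assumed query name
    "params": {
        "threshold": 0.5,
    },
}

print(dataframe_request["params"]["job_type"])  # user_similarity
```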
2021-03-04 06335, 2021
ruaok
I hope to have similar artist collaborative filtering soon. that will make the candidate set selection for recording CF work a lot better.
2021-03-04 06300, 2021
ruaok
will the dataframes generated for "recommendation" be suitable for artist recommendation and recording recommendation?
2021-03-04 06317, 2021
ruaok
if not, we should rename "recommendation" to "recommendation_recording".
2021-03-04 06307, 2021
_lucifer
i think those will be different, we were able to reuse the dataframes in this case because we use recordings for both things
2021-03-04 06344, 2021
_lucifer
recommendation_recording sounds good, on a similar note will we want to have user_similarity based on artists?
2021-03-04 06310, 2021
ruaok
going with "recommendation_recording" then.
2021-03-04 06332, 2021
ruaok
> on a similar note will we want to have user_similarity based on artists?
2021-03-04 06351, 2021
ruaok
I don't see an immediate need for that -- we need to look at the results of what you've created so far.
2021-03-04 06356, 2021
ruaok
then we'll see.
2021-03-04 06332, 2021
ruaok
but if you find yourself bored, you could work on the CF artists feature. I theory most of it is copypasta.
2021-03-04 06333, 2021
_lucifer
makes sense.
2021-03-04 06347, 2021
ruaok
In theory...
2021-03-04 06301, 2021
_lucifer
sure, but we need to test and iron out this feature first :)
2021-03-04 06321, 2021
ruaok
agreed.
2021-03-04 06348, 2021
ruaok
the data saving is the primary task for today. hopefully we can test later this afternoon.
2021-03-04 06319, 2021
_lucifer
i'll be unavailable between 2-6PM CET. let's do it after 6 today or tomorrow
2021-03-04 06309, 2021
ruaok
ok, I won't be available after 6pm, so let's see what we can do before then. or tomorrow.
2021-03-04 06345, 2021
_lucifer
cool. in the meanwhile, i'll work on documenting the spark side and writing unit tests.
iliekcomputers: Hi! Do we have a definitive format for the `user/XXX/feed` API endpoint? I know /feed/listens was returning a list of listens, but I've assumed the following structure instead and wanted to compare with your plan:
similar users, for instance. we would have to diff the existing data to the new data to update it. or just blow it all away and insert new and swap tables in.
2021-03-04 06303, 2021
ruaok
the latter is MUCH faster, much less error prone.
2021-03-04 06311, 2021
alastairp
data structure is the same?
2021-03-04 06316, 2021
ruaok
exactly the same.
2021-03-04 06334, 2021
alastairp
just throwing some ideas around without knowing the problem area too well: views?
2021-03-04 06336, 2021
ruaok
and TRUNCATE would have an exclusive lock on the table for too long.
With the code change, that returns "Acte 1, no. 7 : Chœur : « Voyons brigadier »"
2021-03-04 06313, 2021
reosarevok
Is that actually wrong though?
2021-03-04 06357, 2021
MajorLurker joined the channel
2021-03-04 06332, 2021
MajorLurker has quit
2021-03-04 06333, 2021
ruaok
iliekcomputers: you up for a quick technical discussion?
2021-03-04 06343, 2021
iliekcomputers
Yep
2021-03-04 06350, 2021
ruaok
cool.
2021-03-04 06326, 2021
ruaok
for the similar users feature I need to create a parallel table, populate it, and then swap it in, in one transaction.
2021-03-04 06340, 2021
ruaok
nothing challenging here.
2021-03-04 06355, 2021
ruaok
more a "how do we do this cleanly" question.
2021-03-04 06304, 2021
ruaok
we have the table definition in create_tables.sql.
2021-03-04 06325, 2021
ruaok
but now I need to run that single table creation script again as part of an INSERT INTO query.
2021-03-04 06356, 2021
ruaok
which duplicates a critical table definition and that blows.
2021-03-04 06343, 2021
ruaok
any idea how to have that knowledge live in both the code and the .sql file?
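One way to keep the definition living only in `create_tables.sql` is PostgreSQL's `LIKE` clause in `CREATE TABLE`, which clones an existing table's structure without restating it. A sketch, with table names taken from the discussion:

```sql
-- Clone the structure (columns, defaults, constraints, indexes) of the
-- production table without repeating its definition; the definition
-- itself continues to live only in create_tables.sql.
CREATE TABLE similar_users_tmp (LIKE similar_users INCLUDING ALL);
```

This way the code never needs its own copy of the column list; it only needs to know the name of the table to clone.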
2021-03-04 06356, 2021
iliekcomputers
What do you mean by parallel table?
2021-03-04 06323, 2021
ruaok
the similar_users table is in production. now we want to update the table with new data from spark.
2021-03-04 06359, 2021
ruaok
the fastest way to do this is not to diff the table, but to create a new parallel table with the same table structure, INSERT INTO, CREATE INDEX, then in a transaction RENAME TABLE.
2021-03-04 06323, 2021
ruaok
this allows the table to always be available with no downtime.
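The create → populate → index → rename flow described above can be sketched end to end as follows. This uses Python's stdlib `sqlite3` purely so the example is self-contained and runnable; in production this would run against PostgreSQL, where DDL such as `ALTER TABLE ... RENAME` is likewise transactional. Table and column names are illustrative.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode so we
# can manage BEGIN/COMMIT explicitly, mirroring the intended PostgreSQL flow.
conn = sqlite3.connect(":memory:", isolation_level=None)
cur = conn.cursor()

# Existing production table, which stays readable while the rebuild runs.
cur.execute("CREATE TABLE similar_users (user_id INTEGER, similar TEXT)")
cur.execute("INSERT INTO similar_users VALUES (1, 'old-data')")

# 1. Create a parallel table with the same structure.
cur.execute("CREATE TABLE similar_users_tmp (user_id INTEGER, similar TEXT)")

# 2. Bulk-insert the fresh data from spark into the parallel table.
cur.execute("INSERT INTO similar_users_tmp VALUES (1, 'new-data')")

# 3. Build indexes on the new table before it goes live.
cur.execute("CREATE INDEX similar_users_tmp_idx ON similar_users_tmp (user_id)")

# 4. Swap the tables in a single transaction; readers never see a gap.
cur.execute("BEGIN")
cur.execute("ALTER TABLE similar_users RENAME TO similar_users_old")
cur.execute("ALTER TABLE similar_users_tmp RENAME TO similar_users")
cur.execute("COMMIT")
cur.execute("DROP TABLE similar_users_old")

row = cur.execute("SELECT similar FROM similar_users WHERE user_id = 1").fetchone()
print(row[0])  # new-data
```

Because the rename happens inside one transaction, queries against `similar_users` either see the complete old data or the complete new data, never a partially loaded table.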
2021-03-04 06319, 2021
iliekcomputers
So every time new data comes in, we'll create a new table, drop the old one and rename the new one?