ruaok: should the messages sent back to lemmy be one message per user, or all users in a single message?
reosarevok
yvanzo: I know you've talked about configurable columns for data display before - do you know if we have a ticket for that? (https://tickets.metabrainz.org/browse/MBS-11414 is about that and I'm wondering if it's a dupe)
BrainzBot
MBS-11414: Collection view should allow managing (thus adding missing) columns
ruaok
_lucifer: all in one.
_lucifer
👍
ruaok
updating the table row by row would be painfully slow. I plan to insert rows into a new table and then atomically swap the tables into production.
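(As a rough illustration, the all-in-one payload might look something like the sketch below; the field names and structure are hypothetical, not the actual message format.)

```python
# Hypothetical shape of the single "all users in one message" payload that
# spark would send back; field names and values are illustrative only.
similar_users_message = {
    "type": "similar_users",
    "data": [
        {"user_name": "user_a", "similar_users": {"user_b": 0.87, "user_c": 0.54}},
        {"user_name": "user_b", "similar_users": {"user_a": 0.87, "user_d": 0.41}},
        # ... one entry per user, all in the same message
    ],
}
```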
_lucifer
ruaok: i just pushed the initial implementation for user similarity. i was going through how lemmy requests similar users and think that it'll probably need a couple of changes.
ruaok
ok, what does it need?
_lucifer
since we decided earlier to separate dataframe creation, the similar users request should just send a threshold
ruaok
ah, ok. np, will fix.
_lucifer
before sending a request for similar users, we need to manually request dataframes
ruaok
makes sense.
_lucifer
that part uses days instead of years, so the request should send the number of days instead of years
ruaok
theoretically that part should not need any changes right?
just use days, yes?
years argument removed.
_lucifer
the days part, no. but the request should now also send a job_type, to denote whether the dataframe is being generated for recommendations or for user similarity
ruaok
what are the two exact string values possible for job_type?
_lucifer
i am using "recommendation" and "user_similarity" for now but that can be changed
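(For concreteness, the two requests being discussed might look roughly like the sketch below. Only the job_type values come from the conversation; the other key names and the query identifiers are assumptions.)

```python
# Sketch of the two requests lemmy would send, per the discussion above.
# Only the job_type values are from the conversation; the other key names
# and the query identifiers are assumptions.

# 1. Ask spark to (re)generate the dataframes first, passing days rather
#    than years, plus a job_type saying what the dataframes will be used for.
create_dataframes_request = {
    "query": "cf.create_dataframes",    # hypothetical query name
    "params": {
        "days": 180,                    # window size in days
        "job_type": "user_similarity",  # or "recommendation"
    },
}

# 2. Then ask for the similar users themselves; since dataframe creation
#    is now a separate step, only a threshold needs to be sent.
similar_users_request = {
    "query": "cf.similar_users",        # hypothetical query name
    "params": {
        "threshold": 0.5,
    },
}
```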
ruaok
I hope to have similar artist collaborative filtering soon. that will make the candidate set selection for recording CF work a lot better.
will the dataframes generated for "recommendation" be suitable for artist recommendation and recording recommendation?
if not, we should rename "recommendation" to "recommendation_recording".
_lucifer
i think those will be different; we were able to reuse the dataframes in this case because we use recordings for both things
recommendation_recording sounds good. on a similar note, will we want to have user_similarity based on artists?
ruaok
going with "recommendation_recording" then.
> on a similar note will we want to have user_similarity based on artists?
I don't see an immediate need for that -- we need to look at the results of what you've created so far.
then we'll see.
but if you find yourself bored, you could work on the CF artists feature. In theory most of it is copypasta.
_lucifer
makes sense.
ruaok
In theory...
_lucifer
sure, but we need to test and iron out this feature first :)
ruaok
agreed.
the data saving is the primary task for today. hopefully we can test later this afternoon.
_lucifer
i'll be unavailable between 2 and 6 PM CET. let's do it after 6 today, or tomorrow
ruaok
ok, I won't be available after 6 PM, so let's see what we can do before then. or tomorrow.
_lucifer
cool. in the meanwhile, i'll work on documenting the spark side and writing unit tests.
iliekcomputers: Hi! Do we have a definitive format for the `user/XXX/feed` API endpoint? I know /feed/listens was returning a list of listens, but I've assumed the following structure instead and wanted to compare with your plan:
ruaok
similar users, for instance. we would have to diff the existing data against the new data to update it, or just blow it all away, insert the new data, and swap the tables in.
the latter is MUCH faster and much less error-prone.
alastairp
data structure is the same?
ruaok
exactly the same.
alastairp
just throwing some ideas around without knowing the problem area too well: views?
ruaok
and TRUNCATE would have an exclusive lock on the table for too long.
With the code change, that returns "Acte 1, no. 7 : Chœur : « Voyons brigadier »"
Is that actually wrong though?
ruaok
iliekcomputers: you up for a quick technical discussion?
iliekcomputers
Yep
ruaok
cool.
for the similar users feature I need to create a parallel table, populate it, and then swap it in, in one transaction.
nothing challenging here.
more a "how do we do this cleanly" question.
we have the table definition in create_tables.sql.
but now I need to run that single table creation script again as part of an INSERT INTO query.
which duplicates a critical table definition and that blows.
any idea how to have that knowledge live in code and the .sql file?
iliekcomputers
What do you mean by parallel table?
ruaok
the similar_users table is in production. now we want to update the table with new data from spark.
the fastest way to do this is not to diff the table, but to create a new parallel table with the same table structure, INSERT INTO, CREATE INDEX, then in a transaction RENAME TABLE.
this allows the table to always be available with no downtime.
iliekcomputers
So every time new data comes in, we'll create a new table, drop the old one and rename the new one?
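(A minimal sketch of the swap ruaok is describing, assuming PostgreSQL and psycopg2. The table is called similar_users per the discussion above, but the column names and the helper function are hypothetical; CREATE TABLE ... (LIKE ...) is one way to reuse the definition from create_tables.sql without duplicating it in code.)

```python
import psycopg2
from psycopg2.extras import execute_values


def swap_in_similar_users(conn, rows):
    """Load freshly computed similarity data into a scratch table and
    atomically swap it into place, so the live similar_users table is
    never unavailable.

    `rows` is assumed to be an iterable of (user_id, data) tuples; the
    real column layout lives in create_tables.sql.
    """
    with conn:  # one transaction; commits on success, rolls back on error
        with conn.cursor() as cur:
            # Reuse the live table's structure (including indexes and
            # constraints) instead of duplicating its DDL in code.
            cur.execute("DROP TABLE IF EXISTS similar_users_new")
            cur.execute(
                "CREATE TABLE similar_users_new (LIKE similar_users INCLUDING ALL)"
            )

            # Bulk insert the new data received from spark.
            execute_values(
                cur,
                "INSERT INTO similar_users_new (user_id, data) VALUES %s",
                rows,
            )

            # Swap the tables; readers only ever see a complete table.
            cur.execute("ALTER TABLE similar_users RENAME TO similar_users_old")
            cur.execute("ALTER TABLE similar_users_new RENAME TO similar_users")
            cur.execute("DROP TABLE similar_users_old")
```

(INCLUDING ALL copies the indexes up front to keep the sketch short; for a very large load it may be faster to create the indexes after the bulk insert, as ruaok suggests.)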