in #bookbrainz

4:33 AM
pbryan

CatQuest KassOtsimine I see this disambiguation has been ping-ponging. Wanted to understand the background from you. https://beta.bookbrainz.org/author/036b53cf-029...
5:32 AM
riya9 joined the channel
5:41 AM
riya9 has quit
6:00 AM
Leftmost has quit
6:00 AM
maddy007[m] has quit
6:00 AM
Leftmost joined the channel
6:03 AM
maddy007[m] joined the channel
7:59 AM
riya60 joined the channel
8:00 AM
riya60 has quit
8:03 AM
RohanSasne joined the channel
8:03 AM
riya98 joined the channel
8:12 AM
kellnerd joined the channel
8:14 AM
RohanSasne has quit
8:24 AM
riya98 has quit
8:43 AM
riyaku11 joined the channel
9:52 AM
riyaku11

hey monkey! I have posted my proposal for gsoc 2023 on the community forum. Kindly review it and suggest any changes if required.
9:52 AM
https://community.metabrainz.org/t/gsoc-2023-pr...
9:53 AM
riyaku11 has quit
11:10 AM
kellnerd has quit
11:53 AM
kellnerd joined the channel
12:18 PM
riyaku11 joined the channel
12:25 PM
riyaku11 has quit
12:31 PM
Sauradip_Ghosh joined the channel
12:39 PM
kellnerd

Hey monkey, are you up for a discussion about the importer project?
12:39 PM
I have a few important questions about general design decisions which should be made before I can focus on one of the possible approaches for my proposal.
12:43 PM
monkey

Hi kellnerd, I'm about to go have some lunch, but definitely up for talking after that, in maybe 1h, 1h30?
12:44 PM
kellnerd

Sounds good to me
12:49 PM
I have written down some notes with my thoughts. Do you want to have a look at them to refresh your memory, or would you prefer that I ask you specific questions after lunch?
13:54 PM
Sauradip_Ghosh has quit
14:04 PM
Vivekumar08 joined the channel
14:10 PM
Vivekumar08 has quit
14:29 PM
monkey

I'm back and have coffee in hand kellnerd, shoot :)
14:30 PM
kellnerd

Ok, give me a quick moment, I'm just filling out the GSoC registration form :)
14:30 PM
(Almost done)
14:30 PM
monkey

👍
14:34 PM
kellnerd

Ok, so I want my proposal to address two shortcomings of the GSoC 2018 project which I have identified:
14:34 PM
1. Importing relationships
14:34 PM
2. Repeating the import process
14:35 PM
monkey

I agree with your assessment, those are two required features
14:35 PM
kellnerd

Since both topics depend on the design decisions for the other one, we will probably jump between topics a few times.
14:36 PM
Let's start with #2
14:36 PM
(My notes for reference: https://gist.github.com/kellnerd/4e598a8ad6b0ed...)
14:37 PM
So there are three degrees of repeatability in my notes
14:37 PM
1. Identifying Already Imported Entities
14:38 PM
This should already work using the origin_source_id and origin_id columns of the bookbrainz.link_import table
14:39 PM
(Which are badly named IMO, but that's a minor issue we can skip now)
14:40 PM
So when we are rerunning the import process using the latest dump, we can identify already imported entities and prevent adding duplicates.
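[Editor's sketch, not from the chat: one way the duplicate check described above might look. It assumes the (origin_source_id, origin_id) pairs have already been read from bookbrainz.link_import into a set; the `ExternalRecord` class, the `split_new_and_known` helper, and the sample values are all hypothetical.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalRecord:
    origin_source_id: int  # which external source the record came from
    origin_id: str         # the record's identifier within that source
    name: str


def split_new_and_known(records, link_import_keys):
    """Partition dump records by whether their key is already in link_import."""
    new, known = [], []
    for record in records:
        key = (record.origin_source_id, record.origin_id)
        (known if key in link_import_keys else new).append(record)
    return new, known


# Made-up keys standing in for rows of bookbrainz.link_import.
already_imported = {(1, "OL123A")}
dump = [
    ExternalRecord(1, "OL123A", "Author A"),
    ExternalRecord(1, "OL456A", "Author B"),
]
new, known = split_new_and_known(dump, already_imported)
```

[Only the records in `new` would be created as fresh pending entities; what happens to the ones in `known` is the question discussed next.]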
14:40 PM
monkey

Following so far, and that sounds right
14:41 PM
ShivamAwasthi joined the channel
14:41 PM
kellnerd

The question is now, do we want to skip already imported entities and if not, do we want to update only pending imports (2) or also accepted imports (3).
14:42 PM
I would say that we can at least do (2), as it should be easy to overwrite the pending data with the latest changes, which might be upstream changes or caused by improvements of the importer script.
14:43 PM
monkey

(2) is definitely a good idea
14:43 PM
kellnerd

(3) is where it gets more difficult, but that would be nice to have anyway
14:44 PM
Currently the import data is discarded once an entity is accepted, so we'd have to change that as a first step.
14:44 PM
monkey

As for (3), there may be two separate workflows: one that creates new entities in the _import tables, and one that suggests edits to an already validated import_entity, which would be made against the proper entity
14:45 PM
Not sure what shape that would take, and in any case I don't think it is something we would want to automate
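[Editor's sketch, not from the chat: the three degrees of repeatability discussed so far, written as a small decision function. The enum names and `decide_action` are hypothetical. The only behaviour taken from the conversation is that pending imports may be overwritten, while changes to accepted entities are only suggested and never applied automatically.]

```python
from enum import Enum


class ImportState(Enum):
    NOT_IMPORTED = "not imported"
    PENDING = "pending"
    ACCEPTED = "accepted"


class Action(Enum):
    CREATE_PENDING = "create pending entity"
    OVERWRITE_PENDING = "overwrite pending data"
    SUGGEST_EDIT = "suggest edit for human review"
    SKIP = "skip"


def decide_action(state, data_changed):
    """Pick what a repeated import run does with one external record."""
    if state is ImportState.NOT_IMPORTED:
        return Action.CREATE_PENDING
    if not data_changed:
        return Action.SKIP
    if state is ImportState.PENDING:
        # Degree (2): pending data can simply be replaced.
        return Action.OVERWRITE_PENDING
    # Degree (3): accepted entities are never changed automatically.
    return Action.SUGGEST_EDIT
```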
14:45 PM
kellnerd

Exactly, that's similar to what I've outlined in https://gist.github.com/kellnerd/4e598a8ad6b0ed...
14:46 PM
monkey

Yep, reading it now
14:46 PM
kellnerd

We need a way to compare the pending entity level data of the last import with the current one.
14:47 PM
(This is where it gets tricky once relationships are involved, but we will come to that later.)
14:48 PM
Give me a sign when you've followed that section (except for the hyperlink) or have any questions.
14:49 PM
monkey

And another level of complication on top of that if we want: there is also a possible distinction of pieces of data that could be modified automatically, for example we could decide to automatically apply a second run that only adds identifiers.
14:49 PM
kellnerd

Terminology (pending/accepted/external/queued entity) is explained right at the top of the file, by the way.
14:50 PM
monkey

but I think that's getting into the weeds
14:51 PM
I've followed so far
14:51 PM
kellnerd

Yeah, importing specific data for existing entities (based on their identifiers) might be something for a bot user.
14:51 PM
Good, then we will jump to topic #1 (relationships) now before we revisit the updating problem.
14:52 PM
vivekumar08 joined the channel
14:52 PM
Well, in the document it's the next section, not the first...
14:53 PM
Let's start with the status quo: Relationship sets are intentionally empty for pending entities in order to simplify the task.
14:54 PM
Relationship data might be written to the import_metadata JSONB column but is not used so far.
14:54 PM
I've identified two approaches which both have their advantages and disadvantages.
14:55 PM
vivekumar08 has quit
14:55 PM
(1) Store representations of the pending entity's relationships in the import_metadata JSON column and create proper relationships only when the user accepts the import.
14:56 PM
(2) Allow pending entities to also have relationship sets and, as a consequence, BBIDs. Relationship target entities will also be stored as pending entities.
14:57 PM
(1) Requires additional logic to display and handle relationships while (2) uses the existing infrastructure
14:58 PM
For (1) my summary is relatively short: https://gist.github.com/kellnerd/4e598a8ad6b0ed...
14:59 PM
ShivamAwasthi has quit
15:00 PM
vivekumar08 joined the channel
15:00 PM
Its main disadvantage (besides having separate code paths to handle these temporary representations) would be that the data_id of an accepted entity can no longer be the same as the data_id of the equivalent pending entity.
15:00 PM
vivekumar08 has quit
15:00 PM
That's because the former includes a relationship set and the latter has none.
15:01 PM
vivekumar08 joined the channel
15:02 PM
This makes the necessary comparisons for updating an accepted entity more complicated.
15:03 PM
Following me so far?
15:03 PM
monkey

So far reading your notes, option 2 seems like a more appropriate approach, keeping a lot closer to how we manage accepted entities. At a glance that would create fewer issues down the line
15:04 PM
And I was just gonna suggest unidirectional relationships when I read that part :)
15:04 PM
kellnerd

Ah, so you're already ahead of me ;)
15:04 PM
Of course, reading is faster than writing
15:08 PM
monkey

I'm trying to rack my brain for edge cases, so far haven't thought of any blocker
15:08 PM
kellnerd

The only disadvantages of (2) are that we lose the clean separation of the import and entity tables, and that making a mistake here could lead to unidirectional relationships between accepted entities.
15:08 PM
Cyclic dependencies ;)
15:08 PM
monkey

Ouch
15:09 PM
kellnerd

But these should not happen in good source data and we can add a check for that
15:09 PM
monkey

We do have some of those unidirectional relationships in the database already from old bugs in the interface, I think this sort of thing can be resolved as a post-processing job run regularly
15:10 PM
vivekumar08

Excuse me for a moment. Hi monkey, I uploaded my proposal; kindly review it at https://community.metabrainz.org/t/gsoc-2023-pr..., and if any changes or suggestions are required, let me know
15:10 PM
vivekumar08 has quit
15:10 PM
monkey

OK
15:10 PM
Oh, gone.
15:11 PM
>we lose the clean separation of the import and entity tables
15:11 PM
Do you mean because we use UUIDs of accepted entities in the import table?
15:11 PM
s
15:12 PM
Or are you referring to:
15:12 PM
> combine the entity and import tables and only add one new flag here
15:12 PM
kellnerd

There would no longer be an import table if we completely follow my proposal
15:12 PM
^yeah that one
15:13 PM
monkey

I don't exactly follow the requirement for this change
15:13 PM
kellnerd

I'll explain
15:13 PM
monkey

> we need a new flag for relationship source/target entities now in order to know whether the BBID belongs to an accepted entity or to a pending import
15:14 PM
Can we not fetch from the entity table first, then if no results fetch from the import table ?
15:14 PM
kellnerd

That's the easiest way to handle relationship source and target entity columns without both of them having an additional flag which indicates the table in which we can find their BBID values.
15:14 PM
We could do, but that violates the current DB constraints
15:15 PM
`ALTER TABLE bookbrainz.relationship ADD FOREIGN KEY (source_bbid) REFERENCES bookbrainz.entity (bbid);`
15:15 PM
monkey

REFERENCES bookbrainz.entity (bbid)
15:15 PM
Yeah, seeing that now, thanks for the explanation :)
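[Editor's sketch, not from the chat: a reproduction of the constraint problem using an in-memory SQLite database as a stand-in for PostgreSQL. The tables are heavily simplified and the BBID values are made up; only the foreign key on relationship.source_bbid mirrors the statement quoted above.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE entity (bbid TEXT PRIMARY KEY);
    CREATE TABLE entity_import (bbid TEXT PRIMARY KEY);  -- simplified import table
    CREATE TABLE relationship (
        id INTEGER PRIMARY KEY,
        source_bbid TEXT NOT NULL REFERENCES entity (bbid),
        target_bbid TEXT NOT NULL REFERENCES entity (bbid)
    );
    INSERT INTO entity (bbid) VALUES ('accepted-author');
    INSERT INTO entity_import (bbid) VALUES ('pending-work');
""")


def try_insert(source_bbid, target_bbid):
    """Return True if the relationship row satisfies the foreign keys."""
    try:
        conn.execute(
            "INSERT INTO relationship (source_bbid, target_bbid) VALUES (?, ?)",
            (source_bbid, target_bbid),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# A BBID that exists only in the import table is rejected as a source.
pending_to_accepted = try_insert("pending-work", "accepted-author")
# A relationship between rows of the entity table is accepted.
accepted_to_accepted = try_insert("accepted-author", "accepted-author")
```

[This is why "fetch from entity first, then from the import table" does not work on its own: the relationship row cannot be stored unless the BBID lives in the entity table.]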
15:17 PM
kellnerd

When I first studied the schema I wondered why the entity table and separate entity_header tables are necessary at all because they seemed identical for all entity types ;)
15:18 PM
monkey

Yeah, a little cumbersome if you ask me.
15:19 PM
kellnerd

So if you agree that this is a sensible way to handle relationships, this would also simplify the comparison for updating already accepted entities (as we have discussed before).
15:20 PM
We can simply check if the pending entity (which we never delete) refers to the same data_id as the master revision of the accepted entity.
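[Editor's sketch, not from the chat: the data_id comparison described above. The two dataclasses and `can_update_without_conflict` are hypothetical, and the numeric data_id values are made up.]

```python
from dataclasses import dataclass


@dataclass
class PendingEntity:
    bbid: str
    data_id: int  # data written by the previous import run (never deleted)


@dataclass
class AcceptedEntity:
    bbid: str
    master_data_id: int  # data_id of the entity's current master revision


def can_update_without_conflict(pending, accepted):
    """True if the accepted entity still holds exactly the imported data."""
    return pending.data_id == accepted.master_data_id


pending = PendingEntity(bbid="work-1", data_id=100)
untouched = AcceptedEntity(bbid="work-1", master_data_id=100)
edited = AcceptedEntity(bbid="work-1", master_data_id=101)

untouched_ok = can_update_without_conflict(pending, untouched)
edited_ok = can_update_without_conflict(pending, edited)
```

[An entity that was edited and later reverted to the imported revision would point at data_id 100 again, so it would also pass the check.]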
15:20 PM
monkey

Yes. That and the easier reuse of code and DB rows makes it a preferable option IMO
15:21 PM
kellnerd

This way even weird edge cases like doing changes and reverting back to the revision of the import still allow for conflict-free updates.
15:21 PM
*allows
15:21 PM
monkey

Indeed
15:22 PM
Vivekumar08 joined the channel
15:22 PM
Vivekumar08 has quit
15:23 PM
kellnerd

^ Flaky internet connection I guess
15:25 PM
monkey

After some more thought I also think the is_import flag on the entity table (and dropping entity_import) does make sense.
15:27 PM
I think we *may* also want to show bidirectional relationships, even if one side hasn't been accepted yet. The original goal was to show import entities along with accepted entities, but with a clear indication that it is imported/non-user-validated. So it does make sense to be using BBIDs everywhere and a single entity table
15:29 PM
kellnerd

Yes, that would make it engaging for users to import entities which are related to their favourite author.
15:30 PM
(for example)
15:30 PM
monkey

Indeed.
15:30 PM
Harder to get users to confirm/validate data if we don't show it on the website :)
15:31 PM
(i.e. if it's hidden on some "expert user" sort of page)
15:32 PM
kellnerd

But we would definitely have to think about how we handle updating the relationship sets of accepted entities when we import related pending entities. I don't want revision histories to be spammed with these updates...
15:33 PM
monkey

I think the goal should be to make them look (as much as possible) equivalent to a user creating a new entity
15:33 PM
Which would update the relationship set of any related entity
15:34 PM
kellnerd

Maybe we could update all unidirectional relationships to bidirectional relationships after the import process has finished?
15:34 PM
That way we would get one new revision per entity (and per import run) at most.
15:35 PM
monkey

I'm not 100% convinced we want unidirectional rels, for the reason I stated above: we do want to show the imported entities
15:36 PM
kellnerd

Yeah the idea was to have no more unidirectional rels *after* the import, only while the process is still running.
15:37 PM
But we could also silently create new relationship sets for accepted entities for which we delay creating a new revision until we are done.
15:38 PM
monkey

I'm not sure I follow. What do you mean by "while the process is still running" ? Do you mean the parsing of records and creation of import entities?
15:39 PM
kellnerd

As long as the consumer is running
15:39 PM
monkey

Right.
15:39 PM
kellnerd

I don't want to create a new revision every time the consumer processes a new entity which is related to an accepted entity.
15:40 PM
monkey

I see.
15:40 PM
Say an Author exists in BB, and we import 10 new Works from somewhere else, we don't want 10 new revisions linking each import_work to the existing author.
15:40 PM
kellnerd

So I'm thinking about ways to create only one revision per import run for an accepted entity at most.
15:41 PM
Exactly that
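[Editor's sketch, not from the chat: one possible way to get at most one revision per accepted entity and import run, by collecting the new relationships while the consumer runs and grouping them afterwards. `plan_revisions` and the BBID strings are hypothetical; how the grouped result would be turned into an actual revision is left open, as it is in the discussion.]

```python
from collections import defaultdict


def plan_revisions(imported_relationships):
    """Group (accepted_bbid, pending_bbid) pairs by accepted entity.

    Each key of the result stands for a single planned revision that adds
    all of that entity's new relationships at once.
    """
    pending_by_accepted = defaultdict(list)
    for accepted_bbid, pending_bbid in imported_relationships:
        pending_by_accepted[accepted_bbid].append(pending_bbid)
    return dict(pending_by_accepted)


# The example from the chat: one existing Author, ten newly imported Works.
links = [("author-1", f"work-{n}") for n in range(1, 11)]
revisions = plan_revisions(links)
```

[Here `revisions` has a single entry for the author, carrying all ten works, instead of ten separate revisions.]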