Hey monkey, are you up for a discussion about the importer project?
2023-03-27 08629, 2023
kellnerd
I have a few important questions about general design decisions which should be made before I can focus on one of the possible approaches for my proposal.
2023-03-27 08658, 2023
monkey
Hi kellnerd, I'm about to go have some lunch, but definitely up for talking after that, in maybe 1h, 1h30?
2023-03-27 08631, 2023
kellnerd
Sounds good to me
2023-03-27 08650, 2023
kellnerd
I have written down some notes with my thoughts. Do you want to have a look at them to refresh your memory, or would you prefer that I shoot specific questions at you after lunch?
2023-03-27 08646, 2023
Sauradip_Ghosh has quit
2023-03-27 08628, 2023
Vivekumar08 joined the channel
2023-03-27 08626, 2023
Vivekumar08 has quit
2023-03-27 08620, 2023
monkey
I'm back and have coffee in hand kellnerd, shoot :)
2023-03-27 08625, 2023
kellnerd
Ok, give me a quick moment, I'm just filling out the GSoC registration form :)
2023-03-27 08636, 2023
kellnerd
(Almost done)
2023-03-27 08655, 2023
monkey
👍
2023-03-27 08628, 2023
kellnerd
Ok, so I want my proposal to address two shortcomings of the GSoC 2018 project which I have identified:
2023-03-27 08645, 2023
kellnerd
1. Importing relationships
2023-03-27 08657, 2023
kellnerd
2. Repeating the import process
2023-03-27 08618, 2023
monkey
I agree with your assessment, those are two required features
2023-03-27 08652, 2023
kellnerd
Since both topics depend on the design decisions for the other one, we will probably jump between topics a few times.
So there are three degrees of repeatability in my notes
2023-03-27 08631, 2023
kellnerd
1. Identifying Already Imported Entities
2023-03-27 08626, 2023
kellnerd
This should already work using the origin_source_id and origin_id columns of the bookbrainz.link_import table
2023-03-27 08600, 2023
kellnerd
(Which are badly named IMO, but that's a minor issue we can skip now)
2023-03-27 08606, 2023
kellnerd
So when we are rerunning the import process using the latest dump, we can identify already imported entities and prevent adding duplicates.
2023-03-27 08635, 2023
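The duplicate check described here could be sketched roughly as follows. This is a minimal in-memory illustration, not BookBrainz code: only the column names origin_source_id and origin_id come from the discussion. The class `ImportIndex`, its methods, and the sample IDs are hypothetical, and a real importer would presumably query bookbrainz.link_import rather than a dict.

```python
from dataclasses import dataclass, field


@dataclass
class ImportIndex:
    """In-memory stand-in for a lookup on (origin_source_id, origin_id)."""

    # Maps (origin_source_id, origin_id) to a hypothetical internal import id.
    _seen: dict = field(default_factory=dict)

    def register(self, origin_source_id: int, origin_id: str, import_id: int) -> None:
        """Remember that this external record has been imported."""
        self._seen[(origin_source_id, origin_id)] = import_id

    def is_already_imported(self, origin_source_id: int, origin_id: str) -> bool:
        """Return True if the same record from the same source was seen before."""
        return (origin_source_id, origin_id) in self._seen


# Made-up sample data: source 1 previously delivered the record "example-1".
index = ImportIndex()
index.register(origin_source_id=1, origin_id="example-1", import_id=42)

for source_id, record_id in [(1, "example-1"), (1, "example-2")]:
    status = "skip/update" if index.is_already_imported(source_id, record_id) else "new"
    print(source_id, record_id, status)
```

Keying on the pair rather than on origin_id alone matters because two different sources may reuse the same external identifier.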
monkey
Following so far, and that sounds right
2023-03-27 08609, 2023
ShivamAwasthi joined the channel
2023-03-27 08642, 2023
kellnerd
The question is now, do we want to skip already imported entities and if not, do we want to update only pending imports (2) or also accepted imports (3).
2023-03-27 08647, 2023
kellnerd
I would say that we can at least do (2), as it should be easy to overwrite the pending data with the latest changes, which might be upstream changes or caused by improvements to the importer script.
2023-03-27 08606, 2023
monkey
(2) is definitely a good idea
2023-03-27 08636, 2023
kellnerd
(3) is where it gets more difficult, but that would be nice to have anyway
2023-03-27 08628, 2023
kellnerd
Currently the import data is discarded once an entity is accepted, so we'd have to change that as a first step.
2023-03-27 08631, 2023
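The three degrees of repeatability discussed so far can be summarised as a small decision function. This is only a sketch of the options as stated in the conversation. The names `ImportState`, `Action`, and `decide_action` are hypothetical, and whether (3) is ever automated is left open here, as it is in the discussion.

```python
from enum import Enum
from typing import Optional


class ImportState(Enum):
    PENDING = "pending"    # imported, not yet accepted by a user
    ACCEPTED = "accepted"  # imported and accepted as a proper entity


class Action(Enum):
    CREATE = "create a new pending entity"
    UPDATE_PENDING = "overwrite the pending data"           # degree (2)
    REVIEW_ACCEPTED = "compare with the accepted entity"    # degree (3)


def decide_action(existing: Optional[ImportState]) -> Action:
    """Pick what a repeated import run does with one external record."""
    if existing is None:
        # Degree (1): the record was never imported, so nothing is duplicated.
        return Action.CREATE
    if existing is ImportState.PENDING:
        return Action.UPDATE_PENDING
    return Action.REVIEW_ACCEPTED


print(decide_action(None).value)
print(decide_action(ImportState.PENDING).value)
print(decide_action(ImportState.ACCEPTED).value)
```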
monkey
As for (3), there may be two separate workflows: one that creates new entities in the _import tables, and one that suggests edits to an already validated import_entity, which would be made against the proper entity
2023-03-27 08618, 2023
monkey
Not sure what shape that would take, and in any case I don't think it is something we would want to automate
We need a way to compare the pending entity level data of the last import with the current one.
2023-03-27 08604, 2023
kellnerd
(This is where it gets tricky when relationships would be involved, but we are coming to that one later.)
2023-03-27 08652, 2023
kellnerd
Give me a sign when you've followed that section (except for the hyperlink) or have any questions.
2023-03-27 08649, 2023
monkey
And another level of complication on top of that if we want: there is also a possible distinction of pieces of data that could be modified automatically, for example we could decide to automatically apply a second run that only adds identifiers.
2023-03-27 08649, 2023
kellnerd
Terminology (pending/accepted/external/queued entity) is explained right at the top of the file, by the way.
2023-03-27 08600, 2023
monkey
but I think that's getting into the weeds
2023-03-27 08601, 2023
monkey
I've followed so far
2023-03-27 08601, 2023
kellnerd
Yeah, importing specific data for existing entities (based on their identifiers) might be something for a bot user.
2023-03-27 08654, 2023
kellnerd
Good, then we will jump to topic #1 (relationships) now before we revisit the updating problem.
2023-03-27 08617, 2023
vivekumar08 joined the channel
2023-03-27 08629, 2023
kellnerd
Well, in the document it's the next section, not the first...
2023-03-27 08642, 2023
kellnerd
Let's start with the status quo: Relationship sets are intentionally empty for pending entities in order to simplify the task.
2023-03-27 08622, 2023
kellnerd
Relationship data might be written to the import_metadata JSONB column but is not used so far.
2023-03-27 08658, 2023
kellnerd
I've identified two approaches which both have their advantages and disadvantages.
2023-03-27 08607, 2023
vivekumar08 has quit
2023-03-27 08659, 2023
kellnerd
(1) Store representations of the pending entity's relationships in the import_metadata JSON column and create proper relationships only when the user accepts the import.
2023-03-27 08611, 2023
kellnerd
(2) Allow pending entities to also have relationship sets and, as a consequence, BBIDs. Relationship target entities will also be stored as pending entities.
2023-03-27 08609, 2023
kellnerd
(1) Requires additional logic to display and handle relationships while (2) uses the existing infrastructure
Its main disadvantage (besides having separate code paths to handle these temporary representations) would be that the data_id of an accepted entity can no longer be the same as the data_id of the equivalent pending entity.
2023-03-27 08656, 2023
vivekumar08 has quit
2023-03-27 08659, 2023
kellnerd
That's because the former includes a relationship set and the latter has none.
2023-03-27 08611, 2023
vivekumar08 joined the channel
2023-03-27 08651, 2023
kellnerd
This makes the necessary comparisons for updating an accepted entity more complicated.
2023-03-27 08618, 2023
kellnerd
Following me so far?
2023-03-27 08656, 2023
monkey
So far reading your notes, option 2 seems like a more appropriate approach, keeping a lot closer to how we manage accepted entities. At a glance that would create fewer issues down the line
2023-03-27 08620, 2023
monkey
And I was just gonna suggest unidirectional relationships when I read that part :)
2023-03-27 08642, 2023
kellnerd
Ah, so you're already ahead of me ;)
2023-03-27 08655, 2023
kellnerd
Of course, reading is faster than writing
2023-03-27 08603, 2023
monkey
I'm trying to rack my brain for edge cases, so far haven't thought of any blocker
2023-03-27 08608, 2023
kellnerd
The only disadvantages of (2) are that we lose the clean separation of the import and entity tables, and that making a mistake here could lead to unidirectional relationships between accepted entities.
2023-03-27 08641, 2023
kellnerd
Cyclic dependencies ;)
2023-03-27 08651, 2023
monkey
Ouch
2023-03-27 08606, 2023
kellnerd
But these should not happen in good source data and we can add a check for that
2023-03-27 08631, 2023
monkey
We do have some of those unidirectional relationships in the database already from old bugs in the interface, I think this sort of thing can be resolved as a post-processing job run regularly
>we lose the clean separation of the import and entity tables
2023-03-27 08618, 2023
monkey
Do you mean because we use UUIDs of accepted entities in the import table?
2023-03-27 08619, 2023
monkey
s
2023-03-27 08600, 2023
monkey
Or are you referring to:
2023-03-27 08602, 2023
monkey
> combine the entity and import tables and only add one new flag here
2023-03-27 08615, 2023
kellnerd
There would no longer be an import table if we completely follow my proposal
2023-03-27 08622, 2023
kellnerd
^yeah that one
2023-03-27 08611, 2023
monkey
I don't exactly follow the requirement for this change
2023-03-27 08626, 2023
kellnerd
I'll explain
2023-03-27 08629, 2023
monkey
> we need a new flag for relationship source/target entities now in order to know whether the BBID belongs to an accepted entity or to a pending import
2023-03-27 08600, 2023
monkey
Can we not fetch from the entity table first, then if no results fetch from the import table ?
2023-03-27 08631, 2023
kellnerd
That's the easiest way to handle relationship source and target entity columns without both of them having an additional flag which indicates the table in which we can find their BBID values.
2023-03-27 08650, 2023
kellnerd
We could do, but that violates the current DB constraints
Yeah, seeing that now, thanks for the explanation :)
2023-03-27 08613, 2023
kellnerd
When I first studied the schema I wondered why the entity table and separate entity_header tables are necessary at all because they seemed identical for all entity types ;)
2023-03-27 08607, 2023
monkey
Yeah, a little cumbersome if you ask me.
2023-03-27 08650, 2023
kellnerd
So if you agree that this is a sensible way to handle relationships, this would also simplify the comparison for updating already accepted entities (as we have discussed before).
2023-03-27 08637, 2023
kellnerd
We can simply check if the pending entity (which we never delete) refers to the same data_id as the master revision of the accepted entity.
2023-03-27 08648, 2023
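The data_id comparison kellnerd describes might look like the sketch below. It assumes, as proposed in the conversation, that the pending entity is kept after acceptance. The classes and the field name `master_revision_data_id` are hypothetical simplifications, not the actual BookBrainz schema.

```python
from dataclasses import dataclass


@dataclass
class PendingEntity:
    data_id: int  # data row created by the last import run (never deleted)


@dataclass
class AcceptedEntity:
    master_revision_data_id: int  # data row of the current master revision


def can_update_without_conflict(pending: PendingEntity, accepted: AcceptedEntity) -> bool:
    """True if the accepted entity still carries exactly the imported data.

    If both point at the same data_id, nobody has changed the entity since
    the import (or the changes were reverted), so a newer import can replace
    the data without overwriting user edits.
    """
    return pending.data_id == accepted.master_revision_data_id


pending = PendingEntity(data_id=7)
untouched = AcceptedEntity(master_revision_data_id=7)  # same data as imported
edited = AcceptedEntity(master_revision_data_id=9)     # a user changed something

print(can_update_without_conflict(pending, untouched))
print(can_update_without_conflict(pending, edited))
```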
monkey
Yes. That and the easier reuse of code and DB rows makes it a preferable option IMO
2023-03-27 08625, 2023
kellnerd
This way even weird edge cases like doing changes and reverting back to the revision of the import still allow for conflict-free updates.
2023-03-27 08636, 2023
kellnerd
*allows
2023-03-27 08651, 2023
monkey
Indeed
2023-03-27 08612, 2023
Vivekumar08 joined the channel
2023-03-27 08641, 2023
Vivekumar08 has quit
2023-03-27 08609, 2023
kellnerd
^ Flaky internet connection I guess
2023-03-27 08638, 2023
monkey
After some more thought I also think the is_import flag on the entity table (and dropping entity_import) does make sense.
2023-03-27 08656, 2023
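To make the single-table idea concrete, here is a heavily simplified sketch using an in-memory SQLite database. Only the `entity` table name, the `is_import` flag, and the use of BBIDs come from the discussion. The column layout, the `relationship` table shape, and the helper functions are assumptions for illustration, and the real BookBrainz schema is considerably more involved.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses below
conn.executescript(
    """
    CREATE TABLE entity (
        bbid      TEXT PRIMARY KEY,
        type      TEXT NOT NULL,
        is_import INTEGER NOT NULL DEFAULT 0  -- 1 = pending import, 0 = accepted
    );
    CREATE TABLE relationship (
        id          INTEGER PRIMARY KEY,
        source_bbid TEXT NOT NULL REFERENCES entity (bbid),
        target_bbid TEXT NOT NULL REFERENCES entity (bbid)
    );
    """
)

author_bbid = str(uuid.uuid4())  # an already accepted entity
work_bbid = str(uuid.uuid4())    # a pending import

conn.execute("INSERT INTO entity VALUES (?, 'Author', 0)", (author_bbid,))
conn.execute("INSERT INTO entity VALUES (?, 'Work', 1)", (work_bbid,))
# Both ends live in the same table, so no per-column "which table" flag is needed.
conn.execute(
    "INSERT INTO relationship (source_bbid, target_bbid) VALUES (?, ?)",
    (author_bbid, work_bbid),
)


def is_pending(bbid: str) -> bool:
    """Return True if the entity with this BBID is still a pending import."""
    row = conn.execute(
        "SELECT is_import FROM entity WHERE bbid = ?", (bbid,)
    ).fetchone()
    return bool(row[0])


def accept_import(bbid: str) -> None:
    """Accepting an import only flips the flag; the BBID stays the same."""
    conn.execute("UPDATE entity SET is_import = 0 WHERE bbid = ?", (bbid,))
```

Because both relationship columns reference the same table, the foreign key constraints keep working for pending and accepted entities alike, which is the constraint problem raised earlier in the conversation.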
monkey
I think we *may* also want to show bidirectional relationships, even if one side hasn't been accepted yet. The original goal was to show import entities along with accepted entities, but with a clear indication that it is imported/non-user-validated. So it does make sense to be using BBIDs everywhere and a single entity table
2023-03-27 08652, 2023
kellnerd
Yes, it would be engaging for users to import entities which are related to their favourite author.
2023-03-27 08610, 2023
kellnerd
(for example)
2023-03-27 08628, 2023
monkey
Indeed.
2023-03-27 08650, 2023
monkey
Harder to get users to confirm/validate data if we don't show it on the website :)
2023-03-27 08610, 2023
monkey
(i.e. if it's hidden on some "expert user" sort of page)
2023-03-27 08603, 2023
kellnerd
But we would definitely have to think about how we handle updating the relationship sets of accepted entities when we import related pending entities. I don't want revision histories to be spammed with these updates...
2023-03-27 08605, 2023
monkey
I think the goal should be to make them look (as much as possible) equivalent to a user creating a new entity
2023-03-27 08656, 2023
monkey
Which would update the relationship set of any related entity
2023-03-27 08605, 2023
kellnerd
Maybe we could update all unidirectional relationships to bidirectional relationships after the import process has finished?
2023-03-27 08649, 2023
kellnerd
That way we would get one new revision per entity (and per import run) at most.
2023-03-27 08601, 2023
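kellnerd's suggestion of at most one new revision per entity and import run could be approached by collecting relationship changes first and applying them in one batch afterwards. The sketch below shows only the grouping step. The function name, the tuple representation of relationships, and the sample BBID strings are hypothetical.

```python
from collections import defaultdict


def group_new_relationships(relationships, accepted_bbids):
    """Collect, per accepted entity, all relationships added during one run.

    relationships:  iterable of (source_bbid, target_bbid) pairs created by
                    the importer during a single run.
    accepted_bbids: set of BBIDs that belong to accepted entities.
    Returns a dict mapping each affected accepted BBID to its new relationships,
    so that a later step can create exactly one revision per key.
    """
    per_entity = defaultdict(list)
    for source, target in relationships:
        for bbid in (source, target):
            if bbid in accepted_bbids:
                per_entity[bbid].append((source, target))
    return dict(per_entity)


# Made-up run: three imported works are all linked to one accepted author.
run_relationships = [
    ("author-1", "work-1"),
    ("author-1", "work-2"),
    ("author-1", "work-3"),
]
grouped = group_new_relationships(run_relationships, accepted_bbids={"author-1"})

for bbid, rels in grouped.items():
    print(f"{bbid}: 1 revision covering {len(rels)} new relationships")
```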
monkey
I'm not 100% convinced we want unidirectional rels, for the reason I stated above: we do want to show the imported entities
2023-03-27 08602, 2023
kellnerd
Yeah the idea was to have no more unidirectional rels *after* the import, only while the process is still running.
2023-03-27 08606, 2023
kellnerd
But we could also silently create new relationship sets for accepted entities for which we delay creating a new revision until we are done.
2023-03-27 08650, 2023
monkey
I'm not sure I follow. What do you mean by "while the process is still running" ? Do you mean the parsing of records and creation of import entities?
2023-03-27 08607, 2023
kellnerd
As long as the consumer is running
2023-03-27 08614, 2023
monkey
Right.
2023-03-27 08653, 2023
kellnerd
I don't want to create a new revision every time the consumer processes a new entity which is related to an accepted entity.
2023-03-27 08613, 2023
monkey
I see.
2023-03-27 08648, 2023
monkey
Say an Author exists in BB, and we import 10 new Works from somewhere else, we don't want 10 new revisions linking each import_work to the existing author.
2023-03-27 08650, 2023
kellnerd
So I'm thinking about ways to create only one revision per import run for an accepted entity at most.