Hey monkey, are you up for a discussion about the importer project?
I have a few important questions about general design decisions which should be made before I can focus on one of the possible approaches for my proposal.
monkey
Hi kellnerd, I'm about to go have some lunch, but definitely up for talking after that, in maybe 1h, 1h30?
kellnerd
Sounds good to me
I have written down some notes with my thoughts, do you want to have a look at them to refresh your memory or do you prefer if I will shoot you with specific questions after lunch?
Sauradip_Ghosh has quit
Vivekumar08 joined the channel
Vivekumar08 has quit
monkey
I'm back and have coffee in hand kellnerd, shoot :)
kellnerd
Ok, give me a quick moment, I'm just filling out the GSoC registration form :)
(Almost done)
monkey
š
kellnerd
Ok, so I want my proposal to address two shortcomings of the GSoC 2018 project which I have identified:
1. Importing relationships
2. Repeating the import process
monkey
I agree with your assessment, those are two required features
kellnerd
Since both topics depend on the design decisions for the other one, we will probably jump between topics a few times.
So there are three degrees of repeatability in my notes
1. Identifying Already Imported Entities
This should already work using the origin_source_id and origin_id columns of the bookbrainz.link_import table
(Which are badly named IMO, but that's a minor issue we can skip now)
So when we are rerunning the import process using the latest dump, we can identify already imported entities and prevent adding duplicates.
monkey
Following so far, and that sounds right
ShivamAwasthi joined the channel
kellnerd
The question is now, do we want to skip already imported entities and if not, do we want to update only pending imports (2) or also accepted imports (3).
I would say, that we can at least do (2) as it should be easy to overwrite the pending data with the latest changes, which might be upstream changes or caused by improvements of the importer script.
monkey
(2) is definitely a good idea
kellnerd
(3) is where it gets more difficult, but that would be nice to have anyway
Currently the import data is discarded once an entity is accepted, so we'd have to change that as a first step.
monkey
As for (3), there may be two separate workflows: one that creates new entities in the _import tables, and one that suggests edits to an already validate import_entity, which would be made against the proper entity
Not sure what shape that would take, and in any case I don't think it is something we would want to automate
We need a way to compare the pending entity level data of the last import with the current one.
(This is where it gets tricky when relationships would be involved, but we are coming to that one later.)
Give me a sign when you've followed that section (except for the hyperlink) or have any questions.
monkey
And another level of complication on top of that if we want: there is also a possible distinction of pieces of data that could be modified automatically, for example we could decide to automatically apply a second run that only adds identifiers.
kellnerd
Terminology (pending/accepted/external/queued entity) is explained right at the top of the file, by the way.
monkey
but I think that's getting into the weeds
I've followed so far
kellnerd
Yeah, importing specific data for existing entities (based on their identifiers) might be something for a bot user.
Good, then we will jump to topic #1 (relationships) now before we will revisit the updating problem.
vivekumar08 joined the channel
Well, in the document it's the next section, not the first...
Let's start with the status quo: Relationship sets are intentionally empty for pending entities in order to simplify the task.
Relationship data might be written to the import_metadata JSONB column but is not used so far.
I've identified two approaches which both have their advantages and disadvantages.
vivekumar08 has quit
(1) Store representations of the pending entity's relationships in the import_metadata JSON column and create proper relationships only when the user accepts the import.
(2) Allow pending entities to also have relationship sets and, as a consequence, BBIDs. Relationship target entities will also be stored as pending entities.
(1) Requires additional logic to display and handle relationships while (2) uses the existing infrastructure
Its main disadvantage (besides having separate code paths to handle these temporary representations) would be that the data_id of an accepted entity can no longer be the same as the data_id of the equivalent pending entity.
vivekumar08 has quit
That's because the former includes a relationship set and the later has none.
vivekumar08 joined the channel
This makes the necessary comparisons for updating an accepted entity more complicated.
Following me so far?
monkey
So far reading your notes, option 2 seems like a more appropriate approach, keeping a lot closer to how we manage accepted entities. At a glance that would create fewer issues down the line
And I was just gonna suggest unidirectional relationships when I read that part :)
kellnerd
Ah, so you're already ahead of me ;)
Of course, reading is faster than writing
monkey
I'm trying to rack my brain for edge cases, so far haven't thought of any blocker
kellnerd
The only disadvantages of (2) are that we lose the clean separation of the import and entity tables, and that making a mistake here could lead to unidirectional relationships between accepted entities.
Cyclic dependencies ;)
monkey
Ouch
kellnerd
But these should not happen in good source data and we can add a check for that
monkey
We do have some of those unidirectional relationships in the database already from old bugs in the interface, I think this sort of thing can be resolved as a post-processing job run regularly
>we lose the clean separation of the import and entity tables
Do you mean because we use UUIDs of accepted entities in the import table?
s
Or are you refering to:
> combine the entity and import tables and only add one new flag here
kellnerd
There would be be no longer an import table if we completely follow my proposal
^yeah that one
monkey
I don't follow exaclty the requirement for this change
kellnerd
I'll explain
monkey
> we need a new flag for relationship source/target entities now in order to know whether the BBID belongs to an accepted entity or to a pending import
Can we not fetch from the entity table first, then if no results fetch from the import table ?
kellnerd
That's the easiest way to handle relationship source and target entity columns without both of them having an additional flag which indicates the table in which we can find their BBID values.
We could do, but that violates the current DB constraints
Yeah, seeing that now, thanks for the explanation :)
kellnerd
When I first studied the schema I wondered why the entity table and separate entity_header tables areĀ necessary at all because they seemed identical for all entity types ;)
monkey
Yeah, a little cumbersome if you ask me.
kellnerd
So when you agree that this is a sensible way to handle relationships, this would also simplify the comparison for updating already accepted entities (as we have discussed before).
We can simply check if the pending entity (which we never delete) refers to the same data_id as the master revision of the accepted entity.
monkey
Yes. That and the easier reuse of code and DB rows makes it a preferable option IMO
kellnerd
This way even weird edge cases like doing changes and reverting back to the revision of the import still allow for conflict-free updates.
*allows
monkey
Indeed
Vivekumar08 joined the channel
Vivekumar08 has quit
kellnerd
^ Flaky internet connection I guess
monkey
After some more thought I also think the is_import flag on the entity table (and dropping entity_import) does make sense.
I think we *may* also want to show bidirectional relationships, even if one side hasn't been accepted yet. The original goal was to show import entities along with accepted entities, but with a clear indication that it is imported/non-user-validated. So it does make sense to be using BBIDs everywhere and a single entity table
kellnerd
Yes, that would be engaging for users to import entities which are related to their favourite author.
(for example)
monkey
Indeed.
Harder to get users to confirm/validate data if we don't show it on the website :)
(i.e. if it's hidden on some "expert user" sort of page)
kellnerd
But we would definitely have to think about how we handle updating the relationship sets of accepted entities when we import related pending entities. I don't want revision histories to be spammed with these updates...
monkey
I think the goal should be to make them look (as much as possible) equivalent to a user creating a new entity
Which would update the relationship set of any related entity
kellnerd
Maybe we could update all unidirectional relationships to bidirectional relationships after the import process has finished?
That way we would get one new revision per entity (and per import run) at most.
monkey
I'm not 100% convinced we want unidirectional rels, for the reason I stated above: we do want to show the imported entities
kellnerd
Yeah the idea was to have no more unidirectional rels *after* the import, only while the process is still running.
But we could also silently create new relationship sets for accepted entities for which we delay creating a new revision until we are done.
monkey
I'm not sure I follow. What do you mean by "while the process is still running" ? Do you mean the parsing of records and creation of import entities?
kellnerd
As long as the consumerĀ is running
monkey
Right.
kellnerd
I don't want to create a new revision every time the consumer processes a new entity which is related to an accepted entity-
monkey
I see.
Say an Author exists in BB, and we import 10 new Works from somewhere else, we don't want 10 new revisions liking each import_work to the existing author.
kellnerd
So I'm thinking about ways to create only one revision per import run for an accepted entity at msot-