#bookbrainz

/

      • pbryan
        CatQuest KassOtsimine I see this disambiguation has been ping-ponging. Wanted to understand the background from you. https://beta.bookbrainz.org/author/036b53cf-029...
      • riya9 joined the channel
      • riya9 has quit
      • Leftmost has quit
      • maddy007[m] has quit
      • Leftmost joined the channel
      • maddy007[m] joined the channel
      • riya60 joined the channel
      • riya60 has quit
      • RohanSasne joined the channel
      • riya98 joined the channel
      • kellnerd joined the channel
      • RohanSasne has quit
      • riya98 has quit
      • riyaku11 joined the channel
      • riyaku11
        hey monkey! I have posted my proposal for gsoc 2023 on the community forum. Kindly review it and suggest any changes if required.
      • riyaku11 has quit
      • kellnerd has quit
      • kellnerd joined the channel
      • riyaku11 joined the channel
      • riyaku11 has quit
      • Sauradip_Ghosh joined the channel
      • kellnerd
        Hey monkey, are you up for a discussion about the importer project?
      • I have a few important questions about general design decisions which should be made before I can focus on one of the possible approaches for my proposal.
      • monkey
        Hi kellnerd, I'm about to go have some lunch, but definitely up for talking after that, in maybe 1h, 1h30?
      • kellnerd
        Sounds good to me
      • I have written down some notes with my thoughts, do you want to have a look at them to refresh your memory or do you prefer if I will shoot you with specific questions after lunch?
      • Sauradip_Ghosh has quit
      • Vivekumar08 joined the channel
      • Vivekumar08 has quit
      • monkey
        I'm back and have coffee in hand kellnerd, shoot :)
      • kellnerd
        Ok, give me a quick moment, I'm just filling out the GSoC registration form :)
      • (Almost done)
      • monkey
        šŸ‘
      • kellnerd
        Ok, so I want my proposal to address two shortcomings of the GSoC 2018 project which I have identified:
      • 1. Importing relationships
      • 2. Repeating the import process
      • monkey
        I agree with your assessment, those are two required features
      • kellnerd
        Since both topics depend on the design decisions for the other one, we will probably jump between topics a few times.
      • Let's start with #2
      • So there are three degrees of repeatability in my notes
      • 1. Identifying Already Imported Entities
      • This should already work using the origin_source_id and origin_id columns of the bookbrainz.link_import table
      • (Which are badly named IMO, but that's a minor issue we can skip now)
      • So when we are rerunning the import process using the latest dump, we can identify already imported entities and prevent adding duplicates.
      • monkey
        Following so far, and that sounds right
      • ShivamAwasthi joined the channel
      • kellnerd
        The question is now, do we want to skip already imported entities and if not, do we want to update only pending imports (2) or also accepted imports (3).
      • I would say, that we can at least do (2) as it should be easy to overwrite the pending data with the latest changes, which might be upstream changes or caused by improvements of the importer script.
      • monkey
        (2) is definitely a good idea
      • kellnerd
        (3) is where it gets more difficult, but that would be nice to have anyway
      • Currently the import data is discarded once an entity is accepted, so we'd have to change that as a first step.
      • monkey
        As for (3), there may be two separate workflows: one that creates new entities in the _import tables, and one that suggests edits to an already validate import_entity, which would be made against the proper entity
      • Not sure what shape that would take, and in any case I don't think it is something we would want to automate
      • kellnerd
        Exactly, that's similar to what I've outlined in https://gist.github.com/kellnerd/4e598a8ad6b0ed...
      • monkey
        Yep, reading it now
      • kellnerd
        We need a way to compare the pending entity level data of the last import with the current one.
      • (This is where it gets tricky when relationships would be involved, but we are coming to that one later.)
      • Give me a sign when you've followed that section (except for the hyperlink) or have any questions.
      • monkey
        And another level of complication on top of that if we want: there is also a possible distinction of pieces of data that could be modified automatically, for example we could decide to automatically apply a second run that only adds identifiers.
      • kellnerd
        Terminology (pending/accepted/external/queued entity) is explained right at the top of the file, by the way.
      • monkey
        but I think that's getting into the weeds
      • I've followed so far
      • kellnerd
        Yeah, importing specific data for existing entities (based on their identifiers) might be something for a bot user.
      • Good, then we will jump to topic #1 (relationships) now before we will revisit the updating problem.
      • vivekumar08 joined the channel
      • Well, in the document it's the next section, not the first...
      • Let's start with the status quo: Relationship sets are intentionally empty for pending entities in order to simplify the task.
      • Relationship data might be written to the import_metadata JSONB column but is not used so far.
      • I've identified two approaches which both have their advantages and disadvantages.
      • vivekumar08 has quit
      • (1) Store representations of the pending entity's relationships in the import_metadata JSON column and create proper relationships only when the user accepts the import.
      • (2) Allow pending entities to also have relationship sets and, as a consequence, BBIDs. Relationship target entities will also be stored as pending entities.
      • (1) Requires additional logic to display and handle relationships while (2) uses the existing infrastructure
      • For (1) my summary is relatively short: https://gist.github.com/kellnerd/4e598a8ad6b0ed...
      • ShivamAwasthi has quit
      • vivekumar08 joined the channel
      • Its main disadvantage (besides having separate code paths to handle these temporary representations) would be that the data_id of an accepted entity can no longer be the same as the data_id of the equivalent pending entity.
      • vivekumar08 has quit
      • That's because the former includes a relationship set and the later has none.
      • vivekumar08 joined the channel
      • This makes the necessary comparisons for updating an accepted entity more complicated.
      • Following me so far?
      • monkey
        So far reading your notes, option 2 seems like a more appropriate approach, keeping a lot closer to how we manage accepted entities. At a glance that would create fewer issues down the line
      • And I was just gonna suggest unidirectional relationships when I read that part :)
      • kellnerd
        Ah, so you're already ahead of me ;)
      • Of course, reading is faster than writing
      • monkey
        I'm trying to rack my brain for edge cases, so far haven't thought of any blocker
      • kellnerd
        The only disadvantages of (2) are that we lose the clean separation of the import and entity tables, and that making a mistake here could lead to unidirectional relationships between accepted entities.
      • Cyclic dependencies ;)
      • monkey
        Ouch
      • kellnerd
        But these should not happen in good source data and we can add a check for that
      • monkey
        We do have some of those unidirectional relationships in the database already from old bugs in the interface, I think this sort of thing can be resolved as a post-processing job run regularly
      • vivekumar08
        Excuse for a moment, hii monkey I uploaded my proposal, kindly review it at https://community.metabrainz.org/t/gsoc-2023-pr..., and if any changes or suggestions are required, lemme know
      • vivekumar08 has quit
      • monkey
        OK
      • Oh, gone.
      • >we lose the clean separation of the import and entity tables
      • Do you mean because we use UUIDs of accepted entities in the import table?
      • s
      • Or are you refering to:
      • > combine the entity and import tables and only add one new flag here
      • kellnerd
        There would be be no longer an import table if we completely follow my proposal
      • ^yeah that one
      • monkey
        I don't follow exaclty the requirement for this change
      • kellnerd
        I'll explain
      • monkey
        > we need a new flag for relationship source/target entities now in order to know whether the BBID belongs to an accepted entity or to a pending import
      • Can we not fetch from the entity table first, then if no results fetch from the import table ?
      • kellnerd
        That's the easiest way to handle relationship source and target entity columns without both of them having an additional flag which indicates the table in which we can find their BBID values.
      • We could do, but that violates the current DB constraints
      • `ALTER TABLE bookbrainz.relationship ADD FOREIGN KEY (source_bbid) REFERENCES bookbrainz.entity (bbid);`
      • monkey
        REFERENCES bookbrainz.entity (bbid)
      • Yeah, seeing that now, thanks for the explanation :)
      • kellnerd
        When I first studied the schema I wondered why the entity table and separate entity_header tables areĀ  necessary at all because they seemed identical for all entity types ;)
      • monkey
        Yeah, a little cumbersome if you ask me.
      • kellnerd
        So when you agree that this is a sensible way to handle relationships, this would also simplify the comparison for updating already accepted entities (as we have discussed before).
      • We can simply check if the pending entity (which we never delete) refers to the same data_id as the master revision of the accepted entity.
      • monkey
        Yes. That and the easier reuse of code and DB rows makes it a preferable option IMO
      • kellnerd
        This way even weird edge cases like doing changes and reverting back to the revision of the import still allow for conflict-free updates.
      • *allows
      • monkey
        Indeed
      • Vivekumar08 joined the channel
      • Vivekumar08 has quit
      • kellnerd
        ^ Flaky internet connection I guess
      • monkey
        After some more thought I also think the is_import flag on the entity table (and dropping entity_import) does make sense.
      • I think we *may* also want to show bidirectional relationships, even if one side hasn't been accepted yet. The original goal was to show import entities along with accepted entities, but with a clear indication that it is imported/non-user-validated. So it does make sense to be using BBIDs everywhere and a single entity table
      • kellnerd
        Yes, that would be engaging for users to import entities which are related to their favourite author.
      • (for example)
      • monkey
        Indeed.
      • Harder to get users to confirm/validate data if we don't show it on the website :)
      • (i.e. if it's hidden on some "expert user" sort of page)
      • kellnerd
        But we would definitely have to think about how we handle updating the relationship sets of accepted entities when we import related pending entities. I don't want revision histories to be spammed with these updates...
      • monkey
        I think the goal should be to make them look (as much as possible) equivalent to a user creating a new entity
      • Which would update the relationship set of any related entity
      • kellnerd
        Maybe we could update all unidirectional relationships to bidirectional relationships after the import process has finished?
      • That way we would get one new revision per entity (and per import run) at most.
      • monkey
        I'm not 100% convinced we want unidirectional rels, for the reason I stated above: we do want to show the imported entities
      • kellnerd
        Yeah the idea was to have no more unidirectional rels *after* the import, only while the process is still running.
      • But we could also silently create new relationship sets for accepted entities for which we delay creating a new revision until we are done.
      • monkey
        I'm not sure I follow. What do you mean by "while the process is still running" ? Do you mean the parsing of records and creation of import entities?
      • kellnerd
        As long as the consumerĀ  is running
      • monkey
        Right.
      • kellnerd
        I don't want to create a new revision every time the consumer processes a new entity which is related to an accepted entity-
      • monkey
        I see.
      • Say an Author exists in BB, and we import 10 new Works from somewhere else, we don't want 10 new revisions liking each import_work to the existing author.
      • kellnerd
        So I'm thinking about ways to create only one revision per import run for an accepted entity at msot-
      • *most
      • Exactly that