#bookbrainz

4:33 AM
pbryan

CatQuest KassOtsimine I see this disambiguation has been ping-ponging. Wanted to understand the background from you. https://beta.bookbrainz.org/author/036b53cf-0291-…

2023-03-27 08612, 2023

5:32 AM
riya9 joined the channel

2023-03-27 08606, 2023

5:41 AM
riya9 has quit

2023-03-27 08613, 2023

6:00 AM
Leftmost has quit

2023-03-27 08613, 2023

6:00 AM
maddy007[m] has quit

2023-03-27 08624, 2023

6:00 AM
Leftmost joined the channel

2023-03-27 08618, 2023

6:03 AM
maddy007[m] joined the channel

2023-03-27 08617, 2023

7:59 AM
riya60 joined the channel

2023-03-27 08639, 2023

8:00 AM
riya60 has quit

2023-03-27 08617, 2023

8:03 AM
RohanSasne joined the channel

2023-03-27 08641, 2023

8:03 AM
riya98 joined the channel

2023-03-27 08633, 2023

8:12 AM
kellnerd joined the channel

2023-03-27 08639, 2023

8:14 AM
RohanSasne has quit

2023-03-27 08622, 2023

8:24 AM
riya98 has quit

2023-03-27 08618, 2023

8:43 AM
riyaku11 joined the channel

2023-03-27 08634, 2023

9:52 AM
riyaku11

hey monkey! I have posted my proposal for gsoc 2023 on the community forum. Kindly review it and suggest any changes if required.

2023-03-27 08639, 2023

9:52 AM
riyaku11

https://community.metabrainz.org/t/gsoc-2023-prop…

2023-03-27 08619, 2023

9:53 AM
riyaku11 has quit

2023-03-27 08632, 2023

11:10 AM
kellnerd has quit

2023-03-27 08636, 2023

11:53 AM
kellnerd joined the channel

2023-03-27 08602, 2023

12:18 PM
riyaku11 joined the channel

2023-03-27 08647, 2023

12:25 PM
riyaku11 has quit

2023-03-27 08635, 2023

12:31 PM
Sauradip_Ghosh joined the channel

2023-03-27 08629, 2023

12:39 PM
kellnerd

Hey monkey, are you up for a discussion about the importer project?

2023-03-27 08629, 2023

12:39 PM
kellnerd

I have a few important questions about general design decisions which should be made before I can focus on one of the possible approaches for my proposal.

2023-03-27 08658, 2023

12:43 PM
monkey

Hi kellnerd, I'm about to go have some lunch, but definitely up for talking after that, in maybe 1h, 1h30?

2023-03-27 08631, 2023

12:44 PM
kellnerd

Sounds good to me

2023-03-27 08650, 2023

12:49 PM
kellnerd

I have written down some notes with my thoughts. Do you want to have a look at them to refresh your memory, or would you prefer that I fire specific questions at you after lunch?

2023-03-27 08646, 2023

13:54 PM
Sauradip_Ghosh has quit

2023-03-27 08628, 2023

14:04 PM
Vivekumar08 joined the channel

2023-03-27 08626, 2023

14:10 PM
Vivekumar08 has quit

2023-03-27 08620, 2023

14:29 PM
monkey

I'm back and have coffee in hand kellnerd, shoot :)

2023-03-27 08625, 2023

14:30 PM
kellnerd

Ok, give me a quick moment, I'm just filling out the GSoC registration form :)

2023-03-27 08636, 2023

14:30 PM
kellnerd

(Almost done)

2023-03-27 08655, 2023

14:30 PM
monkey

👍

2023-03-27 08628, 2023

14:34 PM
kellnerd

Ok, so I want my proposal to address two shortcomings of the GSoC 2018 project which I have identified:

2023-03-27 08645, 2023

14:34 PM
kellnerd

1. Importing relationships

2023-03-27 08657, 2023

14:34 PM
kellnerd

2. Repeating the import process

2023-03-27 08618, 2023

14:35 PM
monkey

I agree with your assessment, those are two required features

2023-03-27 08652, 2023

14:35 PM
kellnerd

Since both topics depend on the design decisions for the other one, we will probably jump between topics a few times.

2023-03-27 08603, 2023

14:36 PM
kellnerd

Let's start with #2

2023-03-27 08643, 2023

14:36 PM
kellnerd

(My notes for reference: https://gist.github.com/kellnerd/4e598a8ad6b0ed15…)

2023-03-27 08613, 2023

14:37 PM
kellnerd

So there are three degrees of repeatability in my notes

2023-03-27 08631, 2023

14:37 PM
kellnerd

1. Identifying Already Imported Entities

2023-03-27 08626, 2023

14:38 PM
kellnerd

This should already work using the origin_source_id and origin_id columns of the bookbrainz.link_import table

2023-03-27 08600, 2023

14:39 PM
kellnerd

(Which are badly named IMO, but that's a minor issue we can skip now)

2023-03-27 08606, 2023

14:40 PM
kellnerd

So when we are rerunning the import process using the latest dump, we can identify already imported entities and prevent adding duplicates.
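
A minimal sketch of that duplicate check, assuming each dump record and each link_import row is available as a dict carrying the origin_source_id and origin_id columns mentioned above. The names `ExternalKey` and `split_new_and_known` are hypothetical and not part of the BookBrainz code base.

```python
from typing import NamedTuple


class ExternalKey(NamedTuple):
    """Identifies a record by its external source and its ID within that source."""
    origin_source_id: int
    origin_id: str


def split_new_and_known(records, link_import_rows):
    """Partition dump records into (not yet imported, already imported)."""
    # Build the set of keys that a previous import run has already stored.
    known = {
        ExternalKey(row["origin_source_id"], row["origin_id"])
        for row in link_import_rows
    }
    new, seen = [], []
    for record in records:
        key = ExternalKey(record["origin_source_id"], record["origin_id"])
        (seen if key in known else new).append(record)
    return new, seen
```

Only the records in the first list would be inserted as new pending entities; what happens to the second list is the question discussed next.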

2023-03-27 08635, 2023

14:40 PM
monkey

Following so far, and that sounds right

2023-03-27 08609, 2023

14:41 PM
ShivamAwasthi joined the channel

2023-03-27 08642, 2023

14:41 PM
kellnerd

The question now is: do we want to skip already imported entities and, if not, do we want to update only pending imports (2) or also accepted imports (3)?

2023-03-27 08647, 2023

14:42 PM
kellnerd

I would say that we can at least do (2), as it should be easy to overwrite the pending data with the latest changes, which might be upstream changes or caused by improvements to the importer script.
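
The degrees of repeatability could be sketched as a small decision function. This is only an illustration: the state strings "pending" and "accepted" follow the terminology of the notes, while `Action` and `plan_reimport` are hypothetical names, and degree (3) is deliberately reduced to "leave untouched" here.

```python
from enum import Enum


class Action(Enum):
    CREATE_PENDING = "create a new pending entity"
    OVERWRITE_PENDING = "overwrite the pending data"
    LEAVE_ACCEPTED = "leave the accepted entity untouched"


def plan_reimport(existing_state):
    """Decide what a rerun does with one record.

    existing_state is None (never imported), "pending" or "accepted".
    """
    if existing_state is None:
        return Action.CREATE_PENDING
    if existing_state == "pending":
        # Degree (2): nobody has reviewed the data yet, so replacing it is safe.
        return Action.OVERWRITE_PENDING
    if existing_state == "accepted":
        # Degree (3) would need a comparison against the previous import data.
        return Action.LEAVE_ACCEPTED
    raise ValueError(f"unknown import state: {existing_state!r}")
```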

2023-03-27 08606, 2023

14:43 PM
monkey

(2) is definitely a good idea

2023-03-27 08636, 2023

14:43 PM
kellnerd

(3) is where it gets more difficult, but that would be nice to have anyway

2023-03-27 08628, 2023

14:44 PM
kellnerd

Currently the import data is discarded once an entity is accepted, so we'd have to change that as a first step.

2023-03-27 08631, 2023

14:44 PM
monkey

As for (3), there may be two separate workflows: one that creates new entities in the _import tables, and one that suggests edits to an already validated import_entity, which would be made against the proper entity

2023-03-27 08618, 2023

14:45 PM
monkey

Not sure what shape that would take, and in any case I don't think it is something we would want to automate

2023-03-27 08649, 2023

14:45 PM
kellnerd

Exactly, that's similar to what I've outlined in https://gist.github.com/kellnerd/4e598a8ad6b0ed15…

2023-03-27 08605, 2023

14:46 PM
monkey

Yep, reading it now

2023-03-27 08631, 2023

14:46 PM
kellnerd

We need a way to compare the pending entity level data of the last import with the current one.

2023-03-27 08604, 2023

14:47 PM
kellnerd

(This is where it gets tricky when relationships are involved, but we will come to that later.)

2023-03-27 08652, 2023

14:48 PM
kellnerd

Give me a sign when you've followed that section (except for the hyperlink) or have any questions.

2023-03-27 08649, 2023

14:49 PM
monkey

And another level of complication on top of that if we want: there is also a possible distinction of pieces of data that could be modified automatically, for example we could decide to automatically apply a second run that only adds identifiers.

2023-03-27 08649, 2023

14:49 PM
kellnerd

Terminology (pending/accepted/external/queued entity) is explained right at the top of the file, by the way.

2023-03-27 08600, 2023

14:50 PM
monkey

but I think that's getting into the weeds

2023-03-27 08601, 2023

14:51 PM
monkey

I've followed so far

2023-03-27 08601, 2023

14:51 PM
kellnerd

Yeah, importing specific data for existing entities (based on their identifiers) might be something for a bot user.

2023-03-27 08654, 2023

14:51 PM
kellnerd

Good, then we will jump to topic #1 (relationships) now before we revisit the updating problem.

2023-03-27 08617, 2023

14:52 PM
vivekumar08 joined the channel

2023-03-27 08629, 2023

14:52 PM
kellnerd

Well, in the document it's the next section, not the first...

2023-03-27 08642, 2023

14:53 PM
kellnerd

Let's start with the status quo: Relationship sets are intentionally empty for pending entities in order to simplify the task.

2023-03-27 08622, 2023

14:54 PM
kellnerd

Relationship data might be written to the import_metadata JSONB column but is not used so far.

2023-03-27 08658, 2023

14:54 PM
kellnerd

I've identified two approaches which both have their advantages and disadvantages.

2023-03-27 08607, 2023

14:55 PM
vivekumar08 has quit

2023-03-27 08659, 2023

14:55 PM
kellnerd

(1) Store representations of the pending entity's relationships in the import_metadata JSON column and create proper relationships only when the user accepts the import.

2023-03-27 08611, 2023

14:56 PM
kellnerd

(2) Allow pending entities to also have relationship sets and, as a consequence, BBIDs. Relationship target entities will also be stored as pending entities.

2023-03-27 08609, 2023

14:57 PM
kellnerd

(1) Requires additional logic to display and handle relationships while (2) uses the existing infrastructure

2023-03-27 08615, 2023

14:58 PM
kellnerd

For (1) my summary is relatively short: https://gist.github.com/kellnerd/4e598a8ad6b0ed15…

2023-03-27 08657, 2023

14:59 PM
ShivamAwasthi has quit

2023-03-27 08620, 2023

15:00 PM
vivekumar08 joined the channel

2023-03-27 08634, 2023

15:00 PM
kellnerd

Its main disadvantage (besides having separate code paths to handle these temporary representations) would be that the data_id of an accepted entity can no longer be the same as the data_id of the equivalent pending entity.

2023-03-27 08656, 2023

15:00 PM
vivekumar08 has quit

2023-03-27 08659, 2023

15:00 PM
kellnerd

That's because the former includes a relationship set and the latter has none.

2023-03-27 08611, 2023

15:01 PM
vivekumar08 joined the channel

2023-03-27 08651, 2023

15:02 PM
kellnerd

This makes the necessary comparisons for updating an accepted entity more complicated.

2023-03-27 08618, 2023

15:03 PM
kellnerd

Following me so far?

2023-03-27 08656, 2023

15:03 PM
monkey

So far reading your notes, option 2 seems like a more appropriate approach, keeping a lot closer to how we manage accepted entities. At a glance that would create fewer issues down the line

2023-03-27 08620, 2023

15:04 PM
monkey

And I was just gonna suggest unidirectional relationships when I read that part :)

2023-03-27 08642, 2023

15:04 PM
kellnerd

Ah, so you're already ahead of me ;)

2023-03-27 08655, 2023

15:04 PM
kellnerd

Of course, reading is faster than writing

2023-03-27 08603, 2023

15:08 PM
monkey

I'm trying to rack my brain for edge cases, so far haven't thought of any blocker

2023-03-27 08608, 2023

15:08 PM
kellnerd

The only disadvantages of (2) are that we lose the clean separation of the import and entity tables, and that making a mistake here could lead to unidirectional relationships between accepted entities.

2023-03-27 08641, 2023

15:08 PM
kellnerd

Cyclic dependencies ;)

2023-03-27 08651, 2023

15:08 PM
monkey

Ouch

2023-03-27 08606, 2023

15:09 PM
kellnerd

But these should not happen in good source data and we can add a check for that

2023-03-27 08631, 2023

15:09 PM
monkey

We do have some of those unidirectional relationships in the database already from old bugs in the interface. I think this sort of thing can be resolved by a post-processing job that is run regularly

2023-03-27 08618, 2023

15:10 PM
vivekumar08

Excuse me for a moment. Hi monkey, I uploaded my proposal, kindly review it at https://community.metabrainz.org/t/gsoc-2023-prop…, and if any changes or suggestions are required, let me know

2023-03-27 08625, 2023

15:10 PM
vivekumar08 has quit

2023-03-27 08641, 2023

15:10 PM
monkey

OK

2023-03-27 08647, 2023

15:10 PM
monkey

Oh, gone.

2023-03-27 08602, 2023

15:11 PM
monkey

>we lose the clean separation of the import and entity tables

2023-03-27 08618, 2023

15:11 PM
monkey

Do you mean because we use UUIDs of accepted entities in the import table?

2023-03-27 08619, 2023

15:11 PM
monkey

s

2023-03-27 08600, 2023

15:12 PM
monkey

Or are you referring to:

2023-03-27 08602, 2023

15:12 PM
monkey

> combine the entity and import tables and only add one new flag here

2023-03-27 08615, 2023

15:12 PM
kellnerd

There would no longer be an import table if we completely follow my proposal

2023-03-27 08622, 2023

15:12 PM
kellnerd

^yeah that one

2023-03-27 08611, 2023

15:13 PM
monkey

I don't exactly follow the requirement for this change

2023-03-27 08626, 2023

15:13 PM
kellnerd

I'll explain

2023-03-27 08629, 2023

15:13 PM
monkey

> we need a new flag for relationship source/target entities now in order to know whether the BBID belongs to an accepted entity or to a pending import

2023-03-27 08600, 2023

15:14 PM
monkey

Can we not fetch from the entity table first, then if no results fetch from the import table?

2023-03-27 08631, 2023

15:14 PM
kellnerd

That's the easiest way to handle relationship source and target entity columns without both of them having an additional flag which indicates the table in which we can find their BBID values.

2023-03-27 08650, 2023

15:14 PM
kellnerd

We could do, but that violates the current DB constraints

2023-03-27 08637, 2023

15:15 PM
kellnerd

`ALTER TABLE bookbrainz.relationship ADD FOREIGN KEY (source_bbid) REFERENCES bookbrainz.entity (bbid);`

2023-03-27 08642, 2023

15:15 PM
monkey

REFERENCES bookbrainz.entity (bbid)

2023-03-27 08655, 2023

15:15 PM
monkey

Yeah, seeing that now, thanks for the explanation :)
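
The constraint problem can be reproduced in miniature. The sketch below uses SQLite through Python's sqlite3 module instead of the real PostgreSQL schema, so it is an approximation: the tables are reduced to their BBID columns, BBIDs are plain strings instead of UUIDs, the entity_import table holding BBIDs and the foreign key on target_bbid are assumptions, and `try_link` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE entity (bbid TEXT PRIMARY KEY);
CREATE TABLE entity_import (bbid TEXT PRIMARY KEY);  -- assumption: pending imports get BBIDs
CREATE TABLE relationship (
    id INTEGER PRIMARY KEY,
    source_bbid TEXT NOT NULL REFERENCES entity (bbid),
    target_bbid TEXT NOT NULL REFERENCES entity (bbid)  -- assumed to mirror source_bbid
);
""")
conn.execute("INSERT INTO entity VALUES ('accepted-author')")
conn.execute("INSERT INTO entity_import VALUES ('pending-work')")


def try_link(source_bbid, target_bbid):
    """Insert a relationship and report whether the constraints allowed it."""
    try:
        conn.execute(
            "INSERT INTO relationship (source_bbid, target_bbid) VALUES (?, ?)",
            (source_bbid, target_bbid),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, `try_link('pending-work', 'accepted-author')` is rejected because 'pending-work' exists only in entity_import, which is why a "look in entity first, then in the import table" fallback at query time does not help as long as the foreign key points at entity alone.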

2023-03-27 08613, 2023

15:17 PM
kellnerd

When I first studied the schema I wondered why the entity table and separate entity_header tables are necessary at all because they seemed identical for all entity types ;)

2023-03-27 08607, 2023

15:18 PM
monkey

Yeah, a little cumbersome if you ask me.

2023-03-27 08650, 2023

15:19 PM
kellnerd

So if you agree that this is a sensible way to handle relationships, this would also simplify the comparison for updating already accepted entities (as we have discussed before).

2023-03-27 08637, 2023

15:20 PM
kellnerd

We can simply check if the pending entity (which we never delete) refers to the same data_id as the master revision of the accepted entity.
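
A minimal sketch of that comparison, assuming data_id values can be compared directly. The classes and the `update_strategy` function are hypothetical, and the "unchanged" shortcut for a rerun that produces identical data is an added assumption, not something stated in the discussion.

```python
from dataclasses import dataclass


@dataclass
class PendingEntity:
    data_id: int  # data written by the previous import run (kept, never deleted)


@dataclass
class AcceptedEntity:
    master_data_id: int  # data_id of the accepted entity's master revision


def update_strategy(pending, accepted, new_data_id):
    """Classify what a rerun of the import means for an accepted entity."""
    if new_data_id == pending.data_id:
        return "unchanged"  # the source yields the same data as last time
    if accepted.master_data_id == pending.data_id:
        # Nobody changed the entity since it was accepted (or the changes
        # were reverted), so the new data can replace it without conflicts.
        return "conflict-free"
    return "needs-review"  # user edits exist on top of the imported data
```

For example, an entity that was edited and later reverted to the imported revision has a master data_id equal to the pending one again, so it falls into the "conflict-free" case.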

2023-03-27 08648, 2023

15:20 PM
monkey

Yes. That and the easier reuse of code and DB rows makes it a preferable option IMO

2023-03-27 08625, 2023

15:21 PM
kellnerd

This way even weird edge cases like making changes and reverting back to the revision of the import still allow for conflict-free updates.

2023-03-27 08636, 2023

15:21 PM
kellnerd

*allows

2023-03-27 08651, 2023

15:21 PM
monkey

Indeed

2023-03-27 08612, 2023

15:22 PM
Vivekumar08 joined the channel

2023-03-27 08641, 2023

15:22 PM
Vivekumar08 has quit

2023-03-27 08609, 2023

15:23 PM
kellnerd

^ Flaky internet connection I guess

2023-03-27 08638, 2023

15:25 PM
monkey

After some more thought I also think the is_import flag on the entity table (and dropping entity_import) does make sense.
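
A rough sketch of the single-table idea, again using SQLite via Python's sqlite3 as a stand-in for the real PostgreSQL schema. Only the is_import flag comes from the discussion; the reduced columns, the string BBIDs and the helpers `accept_import` and `pending_bbids` are assumptions, as is the idea that accepting an import simply clears the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE entity (
    bbid TEXT PRIMARY KEY,
    is_import INTEGER NOT NULL DEFAULT 0 CHECK (is_import IN (0, 1))
);
CREATE TABLE relationship (
    id INTEGER PRIMARY KEY,
    source_bbid TEXT NOT NULL REFERENCES entity (bbid),
    target_bbid TEXT NOT NULL REFERENCES entity (bbid)
);
INSERT INTO entity (bbid, is_import) VALUES ('accepted-author', 0), ('pending-work', 1);
-- Both ends live in the same table, so the foreign keys are satisfied.
INSERT INTO relationship (source_bbid, target_bbid) VALUES ('pending-work', 'accepted-author');
""")


def accept_import(bbid):
    """Assumed acceptance step: clear the flag and keep the BBID."""
    conn.execute("UPDATE entity SET is_import = 0 WHERE bbid = ? AND is_import = 1", (bbid,))


def pending_bbids():
    """List the BBIDs that still belong to pending imports."""
    rows = conn.execute("SELECT bbid FROM entity WHERE is_import = 1 ORDER BY bbid")
    return [row[0] for row in rows]
```

In this sketch the relationship row survives acceptance unchanged, since the BBID it points to never moves between tables.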

2023-03-27 08656, 2023

15:27 PM
monkey

I think we *may* also want to show bidirectional relationships, even if one side hasn't been accepted yet. The original goal was to show import entities along with accepted entities, but with a clear indication that it is imported/non-user-validated. So it does make sense to be using BBIDs everywhere and a single entity table

2023-03-27 08652, 2023

15:29 PM
kellnerd

Yes, that would encourage users to import entities which are related to their favourite author.

2023-03-27 08610, 2023

15:30 PM
kellnerd

(for example)

2023-03-27 08628, 2023

15:30 PM
monkey

Indeed.

2023-03-27 08650, 2023

15:30 PM
monkey

Harder to get users to confirm/validate data if we don't show it on the website :)

2023-03-27 08610, 2023

15:31 PM
monkey

(i.e. if it's hidden on some "expert user" sort of page)

2023-03-27 08603, 2023

15:32 PM
kellnerd

But we would definitely have to think about how we handle updating the relationship sets of accepted entities when we import related pending entities. I don't want revision histories to be spammed with these updates...

2023-03-27 08605, 2023

15:33 PM
monkey

I think the goal should be to make them look (as much as possible) equivalent to a user creating a new entity

2023-03-27 08656, 2023

15:33 PM
monkey

Which would update the relationship set of any related entity

2023-03-27 08605, 2023

15:34 PM
kellnerd

Maybe we could update all unidirectional relationships to bidirectional relationships after the import process has finished?

2023-03-27 08649, 2023

15:34 PM
kellnerd

That way we would get one new revision per entity (and per import run) at most.

2023-03-27 08601, 2023

15:35 PM
monkey

I'm not 100% convinced we want unidirectional rels, for the reason I stated above: we do want to show the imported entities

2023-03-27 08602, 2023

15:36 PM
kellnerd

Yeah the idea was to have no more unidirectional rels *after* the import, only while the process is still running.

2023-03-27 08606, 2023

15:37 PM
kellnerd

But we could also silently create new relationship sets for accepted entities for which we delay creating a new revision until we are done.

2023-03-27 08650, 2023

15:38 PM
monkey

I'm not sure I follow. What do you mean by "while the process is still running"? Do you mean the parsing of records and creation of import entities?

2023-03-27 08607, 2023

15:39 PM
kellnerd

As long as the consumer is running

2023-03-27 08614, 2023

15:39 PM
monkey

Right.

2023-03-27 08653, 2023

15:39 PM
kellnerd

I don't want to create a new revision every time the consumer processes a new entity which is related to an accepted entity.

2023-03-27 08613, 2023

15:40 PM
monkey

I see.

2023-03-27 08648, 2023

15:40 PM
monkey

Say an Author exists in BB, and we import 10 new Works from somewhere else, we don't want 10 new revisions linking each import_work to the existing author.

2023-03-27 08650, 2023

15:40 PM
kellnerd

So I'm thinking about ways to create only one revision per import run for an accepted entity at most.

2023-03-27 08605, 2023

15:41 PM
kellnerd

Exactly that
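
One way to picture the "one revision per accepted entity and import run" idea is to collect relationship additions while the consumer runs and to emit revisions only at the end. The `RevisionBatcher` class and the shape of the returned dicts are purely hypothetical; how BookBrainz would actually create the revisions is left open in the discussion.

```python
from collections import defaultdict


class RevisionBatcher:
    """Collects new relationships per accepted entity during one import run."""

    def __init__(self):
        # accepted entity BBID -> BBIDs of newly imported related entities
        self._new_links = defaultdict(list)

    def add_relationship(self, accepted_bbid, pending_bbid):
        """Remember a relationship instead of creating a revision right away."""
        self._new_links[accepted_bbid].append(pending_bbid)

    def finish_run(self):
        """Return one revision description per affected accepted entity."""
        revisions = [
            {"bbid": bbid, "added_relationships": list(targets)}
            for bbid, targets in self._new_links.items()
        ]
        self._new_links.clear()
        return revisions
```

In the Author example above, ten calls to `add_relationship` for the same Author followed by `finish_run()` would describe a single revision that adds all ten Works at once.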