Maybe give me another 10 minutes. I want to have some more observations.
2024-07-10 19222, 2024
mayhem[m]
ok, np
2024-07-10 19204, 2024
bitmap[m]
reosarevok[m]: I was hoping for a way to reproduce the wide char one by just browsing to a page, but I think the `EditExternalLinks` one suffices too, since that goes through Catalyst
2024-07-10 19239, 2024
reosarevok[m]
Yeah, I didn't look further since that one did hit it already :)
2024-07-10 19222, 2024
ericd[m]
<ericd[m]> "I'll check" <- ah not a bug. it's just that MB has that many releases to return :D
2024-07-10 19200, 2024
mayhem[m]
ericd[m]: Then maybe limit the number of items we return?
2024-07-10 19256, 2024
ericd[m]
mayhem[m]: yeah, I will change it to some more reasonable amount, or this may confuse users
2024-07-10 19228, 2024
mayhem[m]
if the list is truncated, maybe we can add a link to where they can see the rest on the web?
2024-07-10 19210, 2024
ericd[m]
<mayhem[m]> "if the list is truncated..." <- make sense. i will add a link in the feed content.
I got messed in it for quite some time. Had written a lot of code for Error handling when I realised I am just over complicating stuff, so went with straightforward impl
2024-07-10 19213, 2024
lucifer[m]
<mayhem[m]> "what was this in ref to?" <- Some GitHub action failure on a troi PR
2024-07-10 19232, 2024
mayhem[m]
ah.
2024-07-10 19205, 2024
djl has quit
2024-07-10 19218, 2024
djl joined the channel
2024-07-10 19232, 2024
yvanzo[m]
yellowhatpro: Ok, so are you using `this_error` atm?
2024-07-10 19214, 2024
theflash[m] joined the channel
2024-07-10 19214, 2024
theflash[m] uploaded an image: (407KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/DcGBWAPCriZamWJOFhWqglJe/IMG_7483.PNG >
2024-07-10 19222, 2024
yellowhatpro[m]
I haven't yet, but as rustynova suggested, I will explore and use it
2024-07-10 19227, 2024
reosarevok[m]
"We will rule over this error, and we will call it... this_error"
2024-07-10 19237, 2024
theflash[m]
akshaaatt[m]: hey, I have implemented pagination in the feed; when I am using LazyVStack, the duplicate events are not being loaded at once
2024-07-10 19217, 2024
yellowhatpro[m]
I will be focusing on these 2 points in the current pr:
2024-07-10 19217, 2024
yellowhatpro[m]
- api mocking
2024-07-10 19217, 2024
yellowhatpro[m]
- dealing with rate limiting
2024-07-10 19231, 2024
bitmap[m]
you will add some tests for `make_archival_network_request` using the API mocking, yes?
2024-07-10 19251, 2024
yellowhatpro[m]
Yes will add tests for this
2024-07-10 19255, 2024
bitmap[m]
besides the lack of tests and RustyNova's suggestions I think it looks pretty good
2024-07-10 19210, 2024
bitmap[m]
but I'd make sure to remove your API key too
2024-07-10 19241, 2024
yvanzo[m]
It should be retrieved from a configuration file instead.
2024-07-10 19243, 2024
yellowhatpro[m]
aah yes. Will remove it soon.
2024-07-10 19243, 2024
yellowhatpro[m]
Also, should we use some MeB account for archiving?
2024-07-10 19216, 2024
yvanzo[m]
Maybe for deployment. Does the account matter for development?
2024-07-10 19238, 2024
yellowhatpro[m]
Nope, for dev I am using my own id and key
2024-07-10 19229, 2024
yvanzo[m]
Having a configuration file should probably be a priority as there are a number of hard-coded values in the code that would better fit that too.
2024-07-10 19235, 2024
yellowhatpro[m]
yvanzo[m]: Yeah, I will add them in the env file itself; I thought to add them in the final commits of the PR. The current credentials don't matter much
2024-07-10 19242, 2024
bitmap[m]
for MBS we wrote a small service that mocks the IA's S3 API, you could also write something similar here that mocks the /save endpoint (for development)
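A rough sketch of what mocking the /save endpoint in a test could look like, assuming the `wiremock` crate (not named in the chat) plus reqwest; `make_archival_network_request` is only referenced by name here, and the request shape is an assumption:

```rust
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

#[tokio::test]
async fn save_endpoint_can_be_mocked() {
    // Stand-in for the Wayback Machine's /save endpoint.
    let server = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/save"))
        .respond_with(ResponseTemplate::new(200).set_body_string("job queued"))
        .mount(&server)
        .await;

    // The real test would call make_archival_network_request with its base URL
    // pointed at server.uri() instead of web.archive.org.
    let resp = reqwest::Client::new()
        .post(format!("{}/save", server.uri()))
        .send()
        .await
        .unwrap();
    assert!(resp.status().is_success());
}
```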
2024-07-10 19244, 2024
yellowhatpro[m]
bitmap[m]: Oh, we are using IA's API in MBS, are we dealing with rate limiting in that as well?
2024-07-10 19211, 2024
yvanzo[m]
yellowhatpro: `.env` might be too limited. A TOML file might be more appropriate. See the crate `config` for example.
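A minimal sketch of the TOML-plus-`config`-crate approach being suggested; the struct fields, file name, and environment prefix are hypothetical:

```rust
use config::{Config, File};
use serde::Deserialize;

/// Illustrative settings; the actual fields in the project may differ.
#[derive(Debug, Deserialize)]
struct Settings {
    ia_access_key: String,
    ia_secret_key: String,
    poll_interval_secs: u64,
}

fn load_settings() -> Result<Settings, config::ConfigError> {
    Config::builder()
        // e.g. a config.toml next to the binary
        .add_source(File::with_name("config"))
        // allow overrides such as APP_IA_ACCESS_KEY from the environment
        .add_source(config::Environment::with_prefix("APP"))
        .build()?
        .try_deserialize()
}
```

This keeps credentials out of the repository while giving the other hard-coded values one place to live.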
2024-07-10 19216, 2024
yellowhatpro[m]
yvanzo[m]: ok will make it work soon
2024-07-10 19253, 2024
yellowhatpro[m]
Okk, gonna explore this_error and config crates then
2024-07-10 19217, 2024
yvanzo[m]
No, we aren’t using the same API from MBS.
2024-07-10 19242, 2024
bitmap[m]
yellowhatpro[m]: not really, we just sleep 1s between each event (but each event may take 1-2s to process). if we hit the rate limit (which is rare), it's just retried later
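A minimal sketch of that "sleep between events, retry rate-limited ones later" approach translated to Rust; the endpoint, form parameters, and retry handling are assumptions, not the MBS code:

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Archive each URL with a fixed pause between requests, collecting the ones
/// that hit the rate limit so a later pass can retry them.
async fn process_events(urls: Vec<String>, client: &reqwest::Client) -> Vec<String> {
    let mut retry_later = Vec::new();

    for url in urls {
        let resp = client
            .post("https://web.archive.org/save")
            .form(&[("url", url.as_str())])
            .send()
            .await;

        match resp {
            // 429 means we hit the rate limit: park the URL for later.
            Ok(r) if r.status() == reqwest::StatusCode::TOO_MANY_REQUESTS => {
                retry_later.push(url)
            }
            Ok(_) => {}
            Err(_) => retry_later.push(url),
        }

        // Fixed 1s pause between events, as described above.
        sleep(Duration::from_secs(1)).await;
    }

    retry_later
}
```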
2024-07-10 19254, 2024
yellowhatpro[m]
ohh alright.
2024-07-10 19207, 2024
bitmap[m]
but yes, it's a completely different API
2024-07-10 19243, 2024
yellowhatpro[m]
bitmap[m]: Right. I should try something similar.
2024-07-10 19243, 2024
yellowhatpro[m]
Maybe I should just apply some math and since I am mostly polling, the time can be configured
2024-07-10 19245, 2024
bitmap[m]
for importing existing edits we may want something that takes better advantage of the rate limit though, since that process will take a while
2024-07-10 19208, 2024
yvanzo[m]
It is okay to start with simplistic rate limiting indeed. It can be improved later on, once everything starts working together.
2024-07-10 19227, 2024
yvanzo[m]
Ideally, it should be what bitmap mentioned: different threads or processes to handle polling and requesting.
2024-07-10 19208, 2024
yellowhatpro[m]
Ummm a doubt here
2024-07-10 19228, 2024
yellowhatpro[m]
Ok nvm. I thought you meant we have to create multiple threads for requesting
2024-07-10 19219, 2024
yellowhatpro[m]
yvanzo[m]: Yupp I am running polling and archiving in different threads
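A sketch of that split, assuming a tokio-based setup where the poller hands URLs to the archiver over a channel; the function names in the comments are placeholders:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<String>(100);

    // Polling task: periodically pull fresh URLs and hand them off.
    let poller = tokio::spawn(async move {
        loop {
            // let urls = poll_internet_archive_urls(&pool).await; (hypothetical)
            let urls: Vec<String> = Vec::new();
            for url in urls {
                if tx.send(url).await.is_err() {
                    return; // archiver gone, shut down
                }
            }
            tokio::time::sleep(Duration::from_secs(10)).await;
        }
    });

    // Archiving task: consume URLs and fire the archival requests.
    let archiver = tokio::spawn(async move {
        while let Some(url) = rx.recv().await {
            // make_archival_network_request(&url).await; (from the PR under review)
            let _ = url;
        }
    });

    let _ = tokio::join!(poller, archiver);
}
```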
2024-07-10 19241, 2024
yvanzo[m]
Threading isn’t in the main goals, so at most just make a note about it for the stretch goals if you want a reminder of it.
2024-07-10 19215, 2024
yvanzo[m]
Great if you have some kind of threading already. :)
2024-07-10 19250, 2024
yvanzo[m]
Yes, multiple threads for requesting might be a thing if we can be allowed a higher rate limit.
2024-07-10 19220, 2024
yellowhatpro[m]
<bitmap[m]> "for importing existing edits..." <- Regarding this, If we are rate limited, then should I focus on maximizing the requests.
2024-07-10 19220, 2024
yellowhatpro[m]
For ex, if I don't have any URL to process in current poll (while making a request), should I devote that time to archive the existing ones?
2024-07-10 19234, 2024
yellowhatpro[m]
yvanzo[m]: Ohh, did you mean archiving x URLs in parallel?
2024-07-10 19252, 2024
yvanzo[m]
Yes (as stretch goals)
2024-07-10 19221, 2024
yellowhatpro[m]
Got it ✅
2024-07-10 19201, 2024
bitmap[m]
yellowhatpro[m]: > <@yellowhatpro:matrix.org> Regarding this, if we are rate limited, then should I focus on maximizing the requests?
2024-07-10 19201, 2024
bitmap[m]
> For ex, if I don't have any URL to process in current poll (while making a request), should I devote that time to archive the existing ones?
2024-07-10 19201, 2024
bitmap[m]
if there's still work to do you should maximize the requests you can do, ideally
2024-07-10 19231, 2024
bitmap[m]
but you can start with something simple as yvanzo said
2024-07-10 19213, 2024
yellowhatpro[m]
btw there has to be another task that will do the cleanup/re-archival of URLs that couldn't get archived in the first place. That will also repeat after x amount of time. That has to be done after I am done with the archival part
2024-07-10 19233, 2024
bitmap[m]
not sure what you meant by "archive the existing ones" though, do you mean older edits?
2024-07-10 19202, 2024
yellowhatpro[m]
<bitmap[m]> "for importing existing edits..." <- yupp older ones. I thought you were referring to them when you said importing existing edits
2024-07-10 19203, 2024
bitmap[m]
yeah, I was, but I was under the impression that there was only one edit counter that is incremented; so it starts from the beginning, and doesn't process new edits until all previous ones have been processed
2024-07-10 19215, 2024
yellowhatpro[m]
I mean it's configurable; we can have it start either from the beginning or from the latest one
2024-07-10 19238, 2024
yellowhatpro[m]
I haven't really thought about which would be the better option. But later, if we go with the trigger impl, we will have to start with the latest edits, which keep on incrementing.
2024-07-10 19239, 2024
yellowhatpro[m]
But in any case, I will try to archive all the previous ones as well
2024-07-10 19209, 2024
yvanzo[m]
yellowhatpro: There is a feature in GitHub to mark your PRs as drafts if needed.
2024-07-10 19212, 2024
yellowhatpro[m]
ok, should I make the WIP PR a draft?
2024-07-10 19205, 2024
yvanzo[m]
It seems to be synonymous indeed :)
2024-07-10 19233, 2024
yellowhatpro[m]
Okii made it a draft one
2024-07-10 19239, 2024
bitmap[m]
<yellowhatpro[m]> "I mean its configurable, we..." <- I assumed "if I don't have any URL to process in current poll (while making a request), should I devote that time to archive the existing ones" was within a single process -- i.e. how are you keeping track of which edits have been processed in that case
2024-07-10 19201, 2024
yellowhatpro[m]
Edits processed means when I am adding them to the `internet_archive_urls` table, right?
2024-07-10 19201, 2024
yellowhatpro[m]
I am just tracking the last edit in that case
2024-07-10 19248, 2024
yellowhatpro[m]
Sorry I get suuper confused sometimes
2024-07-10 19233, 2024
fletchto99 has quit
2024-07-10 19257, 2024
fletchto99 joined the channel
2024-07-10 19232, 2024
yvanzo[m]
No worries, it should become more clear once you have API requests in the loop.
2024-07-10 19217, 2024
bitmap[m]
maybe I misunderstood you :) I thought you were talking about prioritizing the processing of new (recent) URLs, and then processing old (existing) URLs only if there are no recent ones polled -- which would require separate counters
2024-07-10 19258, 2024
bitmap[m]
btw, if the service is stopped, where are edit_note_start_idx and edit_data_start_idx read from such that it can continue from where it left off?
yellowhatpro[m]
Yeah, right. But `internet_archive_urls` is the only place for now where I can look for the data. Is there any other way I can keep the latest edit data and edit note id?
2024-07-10 19203, 2024
yvanzo[m]
Probably a separate table last_processed_rows
2024-07-10 19208, 2024
bitmap[m]
you could introduce a new table to store them
2024-07-10 19228, 2024
yellowhatpro[m]
alright then, a new table coming right up ✅
2024-07-10 19223, 2024
bitmap[m]
that's why I was asking about prioritizing recent edits (so that they are archived right away such that the state of the page at the time the edit or note was entered is preserved) over older ones
2024-07-10 19236, 2024
yellowhatpro[m]
what do you refer to when you say the state of the page??
2024-07-10 19236, 2024
yellowhatpro[m]
The recent rows ?
2024-07-10 19204, 2024
bitmap[m]
the content of the page being archived
2024-07-10 19217, 2024
yvanzo[m]
bitmap: Prioritizing recent edits certainly is a longer term goal.
2024-07-10 19245, 2024
bitmap[m]
with the last_processed_rows table you could potentially keep separate counters for recent vs. historical edits later on
2024-07-10 19224, 2024
yellowhatpro[m]
bitmap[m]: oh nice, now I am able to process things
2024-07-10 19257, 2024
yvanzo[m]
It will have to be as flexible as possible, but if we just start with one row pointer per table, that would be a good start.
2024-07-10 19208, 2024
yellowhatpro[m]
cool, each row in last_processed_rows pointing to the latest processed rows of different tables (edit_data and edit_table currently)
2024-07-10 19209, 2024
yellowhatpro[m]
latest processed during polling, regardless of whether it contains a URL or not
2024-07-10 19233, 2024
yvanzo[m]
yellowhatpro: At first glance, what columns do you imagine for this new table?
2024-07-10 19257, 2024
yellowhatpro[m]
id, latest_row_processed, table_name
2024-07-10 19201, 2024
yellowhatpro[m]
as of now
2024-07-10 19233, 2024
yvanzo[m]
Yup, even though id is probably unneeded (or I’m missing the point).
2024-07-10 19200, 2024
yellowhatpro[m]
yeah it's not needed
2024-07-10 19210, 2024
yvanzo[m]
Or just use it to refer to the id in the other table?
2024-07-10 19235, 2024
pranav[m]
akshaaatt: I’ll try to get the stats page in soon before mid term evals
2024-07-10 19240, 2024
yellowhatpro[m]
yeah right
2024-07-10 19213, 2024
yvanzo[m]
You might also need a `column` column, as not every table has an `id` column.
2024-07-10 19239, 2024
yvanzo[m]
(or `id_column` if it helps with clarity)
2024-07-10 19230, 2024
yellowhatpro[m]
`id_column` will refer to `id` in the case of edit_note and `edit` in the case of edit_data, right?
2024-07-10 19228, 2024
yvanzo[m]
That should work.
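A sketch of the `last_processed_rows` table as discussed (one row pointer per source table, plus a column naming that table's id column); the SQL types and the sqlx usage are assumptions:

```rust
use sqlx::PgPool;

/// Create the pointer table; `table_name` identifies the source table and
/// `id_column` names its id column ('id' for edit_note, 'edit' for edit_data).
async fn create_last_processed_rows(pool: &PgPool) -> Result<(), sqlx::Error> {
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS last_processed_rows (
             table_name           TEXT PRIMARY KEY,
             id_column            TEXT NOT NULL,
             latest_row_processed BIGINT NOT NULL
         )",
    )
    .execute(pool)
    .await?;
    Ok(())
}

/// Advance the pointer after a poll, regardless of whether the rows held URLs.
async fn update_pointer(pool: &PgPool, table: &str, latest: i64) -> Result<(), sqlx::Error> {
    sqlx::query(
        "UPDATE last_processed_rows SET latest_row_processed = $1 WHERE table_name = $2",
    )
    .bind(latest)
    .bind(table)
    .execute(pool)
    .await?;
    Ok(())
}
```

Keeping one row per source table also leaves room for the separate recent vs. historical counters bitmap mentioned.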
2024-07-10 19258, 2024
discordbrainz
<05rustynova> bitmap: "not really, we just sleep 1s between each event..." That's the easy part. Now deal with an async and parallel environment and it starts messing itself up in .23 femtoseconds. Either you do the clean way and use semaphores, holding permits until the next refresh window, or you do the ugly way and just hold a mutex until prev_request_start + 1. I had to do the latter one for MB_RS as it doesn't have the http
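A sketch of the semaphore variant described here, where each request takes a permit and only returns it once the refresh window has elapsed, capping how many requests can start per window; the wrapper is illustrative, not MB_RS code:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

/// The semaphore would be shared as Arc::new(Semaphore::new(limit)) across
/// all requesting tasks.
async fn rate_limited_get(
    sem: Arc<Semaphore>,
    window: Duration,
    client: reqwest::Client,
    url: String,
) -> Result<reqwest::Response, reqwest::Error> {
    // Take an owned permit so it can be moved into the release task below.
    let permit = sem.clone().acquire_owned().await.expect("semaphore closed");
    let resp = client.get(url.as_str()).send().await;

    // Hold the permit until the next refresh window instead of dropping it
    // immediately, so at most `limit` requests begin per window.
    tokio::spawn(async move {
        tokio::time::sleep(window).await;
        drop(permit);
    });

    resp
}
```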