Maybe give me another 10 minutes. I want to have some more observations.
mayhem[m]
ok, np
bitmap[m]
reosarevok[m]: I was hoping for a way to reproduce the wide char one by just browsing to a page, but I think the `EditExternalLinks` one suffices too, since that goes through Catalyst
reosarevok[m]
Yeah, I didn't look further since that one did hit it already :)
ericd[m]
<ericd[m]> "I'll check" <- ah not a bug. it's just that MB has that many releases to return :D
mayhem[m]
ericd[m]: Then maybe limit the number of items we return?
ericd[m]
mayhem[m]: yeah, I will change it to some more reasonable amount, or this may confuse users
mayhem[m]
if the list is truncated, maybe we can add a link to where they can see the rest on the web?
ericd[m]
<mayhem[m]> "if the list is truncated..." <- makes sense. I will add a link in the feed content.
I got stuck on it for quite some time. I had written a lot of code for error handling when I realised I was just overcomplicating things, so I went with a straightforward impl
lucifer[m]
<mayhem[m]> "what was this in ref to?" <- Some GitHub action failure on a troi PR
mayhem[m]
ah.
djl has quit
djl joined the channel
yvanzo[m]
yellowhatpro: Ok, so are you using `this_error` atm?
theflash[m] joined the channel
theflash[m] uploaded an image: (407KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/matrix.org/DcGBWAPCriZamWJOFhWqglJe/IMG_7483.PNG >
yellowhatpro[m]
I haven't yet, but as rustynova suggested, I will explore and use it
reosarevok[m]
"We will rule over this error, and we will call it... this_error"
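(For context: the crate the chat calls `this_error` is published as `thiserror`. As a rough sketch of what its derive saves you from writing, here is a std-only version of a hypothetical `ArchivalError` type, with the `Display`/`Error` impls spelled out by hand; with `thiserror`, only the enum and `#[error("...")]` attributes would remain. The type and its variants are illustrative, not the project's actual error type.)

```rust
use std::fmt;

// Hypothetical error type for the archival worker. thiserror's
// #[derive(Error)] would generate the Display and Error impls below
// from #[error("...")] attributes on each variant.
#[derive(Debug)]
enum ArchivalError {
    RateLimited { retry_after_secs: u64 },
    Request(String),
}

impl fmt::Display for ArchivalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ArchivalError::RateLimited { retry_after_secs } => {
                write!(f, "rate limited, retry after {retry_after_secs}s")
            }
            ArchivalError::Request(msg) => write!(f, "request failed: {msg}"),
        }
    }
}

impl std::error::Error for ArchivalError {}

fn main() {
    let err = ArchivalError::RateLimited { retry_after_secs: 5 };
    println!("{err}");
}
```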
theflash[m]
akshaaatt[m]: hey, I have implemented pagination in the feed; when I am using LazyVStack, the duplicate events are not being loaded at once
yellowhatpro[m]
I will be focusing on these 2 points in the current pr:
- api mocking
- dealing with rate limiting
bitmap[m]
you will add some tests for `make_archival_network_request` using the API mocking, yes?
yellowhatpro[m]
Yes will add tests for this
bitmap[m]
besides the lack of tests and RustyNova's suggestions I think it looks pretty good
but I'd make sure to remove your API key too
yvanzo[m]
It should be retrieved from a configuration file instead.
yellowhatpro[m]
aah yes. Will remove it soon.
Also, should we use some MeB account for archiving?
yvanzo[m]
Maybe for deployment. Does the account matter for development?
yellowhatpro[m]
Nope, for dev I am using my own id and key
yvanzo[m]
Having a configuration file should probably be a priority, as there are a number of hard-coded values in the code that would fit there better too.
yellowhatpro[m]
yvanzo[m]: Yeah, I will add them in the env file itself. I thought to add them in the final commits of the PR. The current credentials don't matter much
bitmap[m]
for MBS we wrote a small service that mocks the IA's S3 API, you could also write something similar here that mocks the /save endpoint (for development)
yellowhatpro[m]
bitmap[m]: Oh, we are using the IA's API in MBS; are we dealing with rate limiting in that as well?
yvanzo[m]
yellowhatpro: `.env` might be too limited. A TOML file might be more appropriate. See the crate `config` for example.
yellowhatpro[m]
yvanzo[m]: ok will make it work soon
Okk, gonna explore this_error and config crates then
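(Since a TOML file plus the `config` crate is being suggested over `.env`, here is a sketch of what such a file could look like. Every section and key name below is illustrative, not the project's actual schema; the `config` crate can layer a file like this with environment-variable overrides.)

```toml
# Hypothetical config.toml layout -- all names are illustrative.
[wayback_machine]
# credentials currently hard-coded / in .env would move here
access_key = "..."
secret_key = "..."

[poller]
poll_interval_secs = 10
sleep_between_requests_secs = 1
```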
yvanzo[m]
No, we aren’t using the same API from MBS.
bitmap[m]
yellowhatpro[m]: not really, we just sleep 1s between each event (but each event may take 1-2s to process). if we hit the rate limit (which is rare), it's just retried later
yellowhatpro[m]
ohh alright.
bitmap[m]
but yes, it's a completely different API
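(The sleep-1s-between-events approach bitmap describes can be sketched in a few lines of Rust. `archive` here is a stand-in for the real request, with the rate limit simulated; rate-limited URLs are just queued for a later retry, as in MBS.)

```rust
use std::thread;
use std::time::Duration;

// Stand-in for the real archival request; in this sketch, anything
// containing "busy" is treated as having hit the rate limit.
fn archive(url: &str) -> bool {
    !url.contains("busy")
}

// Process one batch of polled URLs with a fixed gap between requests
// (the service described above uses 1s); rate-limited URLs are
// collected so they can be retried later.
fn process_events(urls: &[&str], gap: Duration) -> Vec<String> {
    let mut retry_later = Vec::new();
    for url in urls {
        if !archive(url) {
            retry_later.push(url.to_string());
        }
        thread::sleep(gap);
    }
    retry_later
}

fn main() {
    let retries = process_events(
        &["https://example.org/ok", "https://example.org/busy"],
        Duration::from_secs(1),
    );
    println!("to retry: {retries:?}");
}
```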
yellowhatpro[m]
bitmap[m]: Right. I should try something similar.
Maybe I should just apply some math; since I am mostly polling, the time can be configured
bitmap[m]
for importing existing edits we may want something that takes better advantage of the rate limit though, since that process will take a while
yvanzo[m]
It is okay to start with simplistic rate limiting indeed. It can be improved later on, once everything starts working together.
Ideally, it should be what bitmap mentioned: different threads or processes to handle polling and requesting.
yellowhatpro[m]
Ummm a doubt here
Ok nvm. I thought you meant we have to create multiple threads for requesting
yvanzo[m]: Yupp I am running polling and archiving in different threads
yvanzo[m]
Threading isn’t in the main goals, so at most just make a note about it in the stretch goals if you want a reminder of it.
Great if you have some kind of threading already. :)
Yes, multiple threads for requesting might be a thing if we can be allowed a higher rate limit.
yellowhatpro[m]
<bitmap[m]> "for importing existing edits..." <- Regarding this, if we are rate limited, then should I focus on maximizing the requests?
For example, if I don't have any URL to process in the current poll (while making a request), should I devote that time to archiving the existing ones?
yvanzo[m]: Ohh, did you mean archiving x URLs in parallel?
yvanzo[m]
Yes (as stretch goals)
yellowhatpro[m]
Got it ✅
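(The polling/archiving split in separate threads mentioned above can be sketched with std threads and a channel; one thread discovers URLs — in the real service, by polling the edit tables — and a second one drains them at its own pace. All names here are illustrative.)

```rust
use std::sync::mpsc;
use std::thread;

// Minimal poller/archiver split over an mpsc channel.
fn run_pipeline(urls: Vec<String>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<String>();

    // Poller: pushes each discovered URL into the channel.
    let poller = thread::spawn(move || {
        for url in urls {
            tx.send(url).expect("archiver hung up");
        }
        // tx is dropped here, which closes the channel
    });

    // Archiver: drains the channel until it closes; rate limiting
    // (e.g. the 1s gap between requests) would live in this loop.
    let archiver = thread::spawn(move || rx.into_iter().collect::<Vec<String>>());

    poller.join().unwrap();
    archiver.join().unwrap()
}

fn main() {
    let archived = run_pipeline(vec![
        "https://example.org/a".to_string(),
        "https://example.org/b".to_string(),
    ]);
    println!("archived {} URLs", archived.len());
}
```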
bitmap[m]
yellowhatpro[m]: > <@yellowhatpro:matrix.org> Regarding this, If we are rate limited, then should I focus on maximizing the requests.
> For ex, if I don't have any URL to process in current poll (while making a request), should I devote that time to archive the existing ones?
if there's still work to do you should maximize the requests you can do, ideally
but you can start with something simple as yvanzo said
yellowhatpro[m]
btw there has to be another task that will do the cleanup/re-archival part for URLs that couldn't get archived in the first place. That will also repeat after x amount of time. That has to be done after I am done with the archival part
bitmap[m]
not sure what you meant by "archive the existing ones" though, do you mean older edits?
yellowhatpro[m]
<bitmap[m]> "for importing existing edits..." <- yupp older ones. I thought you were referring to them when you said importing existing edits
bitmap[m]
yeah, I was, but I was under the impression that there was only one edit counter that is incremented; so it starts from the beginning, and doesn't process new edits until all previous ones have been processed
yellowhatpro[m]
I mean it's configurable, we can either have it start from the beginning, or from the latest one as well
I haven't really thought about which would be the better thing to do. But later, if we go with the trigger impl, we will have to start with the latest edits, which keep on incrementing.
But in any case, I will try to archive all the previous ones as well
yvanzo[m]
yellowhatpro: There is a feature in GitHub to mark your PRs as drafts if needed.
yellowhatpro[m]
ok, should I make the wip PR draft?
yvanzo[m]
It seems to be synonymous indeed :)
yellowhatpro[m]
Okii made it a draft one
bitmap[m]
<yellowhatpro[m]> "I mean its configurable, we..." <- I assumed "if I don't have any URL to process in current poll (while making a request), should I devote that time to archive the existing ones" was within a single process -- i.e. how are you keeping track of which edits have been processed in that case
yellowhatpro[m]
Edits processed means when I am adding them to the `internet_archive_urls` table, right?
I am just tracking the last edit in that case
Sorry I get suuper confused sometimes
fletchto99 has quit
fletchto99 joined the channel
yvanzo[m]
No worries, it should become more clear once you have API requests in the loop.
bitmap[m]
maybe I misunderstood you :) I thought you were talking about prioritizing the processing of new (recent) URLs, and then processing old (existing) URLs only if there are no recent ones polled -- which would require separate counters
btw, if the service is stopped, where are edit_note_start_idx and edit_data_start_idx read from such that it can continue from where it left off?
yellowhatpro[m]
Yeah, right. But `internet_archive_urls` is the only place for now where I can look for the data. Is there any other way where I can keep the latest edit data and edit note id?
yvanzo[m]
Probably a separate table last_processed_rows
bitmap[m]
you could introduce a new table to store them
yellowhatpro[m]
alright then, a new table coming right up ✅
bitmap[m]
that's why I was asking about prioritizing recent edits (so that they are archived right away such that the state of the page at the time the edit or note was entered is preserved) over older ones
yellowhatpro[m]
what do you refer to when you say the state of the page?
The recent rows ?
bitmap[m]
the content of the page being archived
yvanzo[m]
bitmap: Prioritizing recent edits certainly is a longer term goal.
bitmap[m]
with the last_processed_rows table you could potentially keep separate counters for recent vs. historical edits later on
yellowhatpro[m]
bitmap[m]: oh nice, now I am able to process things
yvanzo[m]
It will have to be as flexible as possible, but if we just start with one row pointer per table, that would be a good start.
yellowhatpro[m]
cool, each row in `last_processed_rows` pointing to the latest processed row of a different table (`edit_data` and `edit_note` currently)
latest processed during polling, regardless of whether it contains a URL or not
yvanzo[m]
yellowhatpro: At first glance, what columns do you imagine for this new table?
yellowhatpro[m]
id, latest_row_processed, table_name
as of now
yvanzo[m]
Yup, even though id is probably unneeded (or I’m missing the point).
yellowhatpro[m]
yeah it's not needed
yvanzo[m]
Or just use it to refer to the id in the other table?
pranav[m]
akshaaatt: I’ll try to get the stats page in soon before mid term evals
yellowhatpro[m]
yeah right
yvanzo[m]
You might also need a `column` column, as not every table has an `id` column.
(or `id_column` if it helps with clarity)
yellowhatpro[m]
`id_column` will refer to `id` in case of `edit_note` and `edit` in case of `edit_data`, right?
yvanzo[m]
That should work.
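(The columns agreed on above could look like this as Postgres DDL. This is only a sketch assembled from the conversation; the final migration may well differ.)

```sql
-- Sketch of the table discussed above: one row per polled source table,
-- pointing at the last row that was processed there.
CREATE TABLE last_processed_rows (
    table_name           TEXT PRIMARY KEY,  -- e.g. 'edit_data' or 'edit_note'
    id_column            TEXT NOT NULL,     -- 'edit' for edit_data, 'id' for edit_note
    latest_row_processed BIGINT NOT NULL
);
```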
discordbrainz
<05rustynova> bitmap: "not really, we just sleep 1s between each event..." That's the easy part. Now deal with an async and parallel environment and it starts messing itself up in .23 femtoseconds. Either you do it the clean way and use semaphores, holding permits until the next refresh window, or you do it the ugly way and just hold a mutex until `prev_request_start + 1`. I had to do the latter one for MB_RS as it doesn't have the http
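(The "ugly way" rustynova describes — holding a mutex until `prev_request_start + 1` — can be sketched with std only. The interval and type names are illustrative; holding the lock while sleeping is exactly what serializes concurrent callers.)

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Mutex-based rate limiter: the lock guards the start time of the
// previous request, and each caller sleeps out the remainder of the
// window with the lock held before stamping its own start time.
struct RateLimiter {
    interval: Duration,
    last_start: Mutex<Option<Instant>>,
}

impl RateLimiter {
    fn new(interval: Duration) -> Self {
        Self { interval, last_start: Mutex::new(None) }
    }

    // Blocks until `interval` has passed since the previous request
    // started; concurrent callers serialize on the mutex.
    fn wait(&self) {
        let mut last = self.last_start.lock().unwrap();
        if let Some(prev) = *last {
            let elapsed = prev.elapsed();
            if elapsed < self.interval {
                thread::sleep(self.interval - elapsed);
            }
        }
        *last = Some(Instant::now());
    }
}

fn main() {
    let limiter = Arc::new(RateLimiter::new(Duration::from_millis(100)));
    let start = Instant::now();
    let handles: Vec<_> = (0..3)
        .map(|_| {
            let l = Arc::clone(&limiter);
            thread::spawn(move || l.wait())
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // three acquisitions spaced 100ms apart take at least 200ms in total
    assert!(start.elapsed() >= Duration::from_millis(200));
    println!("ok");
}
```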