#metabrainz

      • strider has quit
      • strider joined the channel
      • Shubh joined the channel
      • strider has quit
      • strider joined the channel
      • MRiddickW joined the channel
      • lucifer
        alastairp: the timeout thing might be doable, not completely sure though. so, like, store the latest message in ts writer, and if it times out and is redelivered, do something about it? or do you mean rabbitmq could automatically do this for us?
      • chinmay joined the channel
      • skelly37 joined the channel
      • chinmay has quit
      • SothoTalKer_ joined the channel
      • tykling_ joined the channel
      • pprkut_ joined the channel
      • loujine_ joined the channel
      • pprkut has quit
      • loujine has quit
      • SothoTalKer has quit
      • tykling has quit
      • rdswift has quit
      • SothoTalKer_ is now known as SothoTalKer
      • rdswift joined the channel
      • Shubh has quit
      • rdswift is now known as Guest7859
      • bitmap has quit
      • chinmay joined the channel
      • bitmap joined the channel
      • everdred has quit
      • Etua joined the channel
      • everdred joined the channel
      • Etua has quit
      • trolley has quit
      • trolley joined the channel
      • BrainzGit
        [guidelines] zas opened pull request #19 (master…zas-patch-1): Typo fix Reccomended -> Recommended https://github.com/metabrainz/guidelines/pull/19
      • Shubh joined the channel
      • [guidelines] reosarevok merged pull request #19 (master…zas-patch-1): Typo fix Reccomended -> Recommended https://github.com/metabrainz/guidelines/pull/19
      • mayhem
        moooin!
      • lucifer
        morning
      • mayhem
        reosarevok: are you back today?
      • reosarevok
        Yes, in an hour or two max
      • mayhem
        ok, I'm forwarding a schema change question to support@ then
      • reosarevok
        Works for me
      • mayhem
        atj: akshaaatt yvanzo bitmap alastairp monkey outsidecontext and anyone with a @meb email address. If you'd like a metabrainz dropbox account with loads of storage, go ahead and use your @meb email address to sign up for the dropbox account and you'll be automatically added to the team.
      • akshaaatt
        Sounds cool mayhem!
      • mayhem
        yeah, go for it. our friend at dropbox gave us a free business account.
      • alastairp
        thanks mayhem
      • mayhem
        np!
      • alastairp
        lucifer: hmm, adding some extra checks via tswriter sounds like trouble and complexity
      • I was just thinking out loud that it'd be great if rmq had a "you're _about_ to be disconnected" callback, rather than just "you've been disconnected"
      • any thoughts on the comment that I added on LB#2009?
      • BrainzBot
        Allow a maximum number of listens per import: https://github.com/metabrainz/listenbrainz-serv...
      • lucifer
        ah i see. i don't think that thing exists.
      • alastairp
        the documentation about max listen payload and the check that we do is different
      • so we should update one or the other
      • any thoughts on how we can select the largest listen payload from the database? can you do strlen on a json column in pg?
      • mayhem
        alastairp: slowly, yes.
      • atj
        so is my understanding correct, in that the issue last night was someone submitting a request with 70k listens which effectively DoS'd the application?
      • alastairp
        atj: more or less, yes
      • mayhem
      • alastairp
        there is a timing issue where rmq expects an acknowledgement of the message within [timespan], and it took us longer than that to process the listens
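For illustration, a minimal pika-based sketch of the consumer contract being described (the queue name and the `process_listens` handler are hypothetical): the broker closes the channel if a delivered message is not acked within its configured delivery acknowledgement timeout, so the handler has to finish within that window.

```python
# Minimal pika consumer sketch; "incoming" and process_listens are
# hypothetical. If the work takes longer than the broker's delivery
# acknowledgement timeout, the channel is closed and the message is
# redelivered -- the failure mode discussed above.
import pika

def process_listens(body: bytes) -> None:
    ...  # placeholder for the actual listen-processing work

def on_message(ch, method, properties, body):
    process_listens(body)  # must finish before the broker's ack timeout
    ch.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="incoming", on_message_callback=on_message)
channel.start_consuming()
```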
      • mayhem
        `select id, pg_column_size(datab) from data.items;`
      • atj
        and the consumer was taking too long to process the request, so it timed out?
      • right
      • mayhem
        yes.
      • alastairp
        current fix in my PR ^ is to limit the number of items in a single payload
      • but I'm thinking out loud about how we could preempt this issue if we get into it again
      • atj
        yes, it sounds like a good idea to have some sort of limit :)
      • how big was the actual request?
      • I guess you can fit 70k listens in a few MB?
      • alastairp
        ahh, yesterday I could have told you that because I had the size in the rmq admin pane when I popped the message off of the queue
      • but I don't think we have that info any more
      • our "sample" listen is 700 bytes (https://listenbrainz.readthedocs.io/en/latest/u...)
      • we have a 10kb per listen upper limit, but our check is that the average size of all listens in the payload is less than 10kb, not the full message
      • lucifer
        re size of data, yeah pg_column_size may work. probably use it to find the largest listen and then take the actual length of it, because pg might be compressing jsonb and the reported size might be less than the actual one.
      • regarding the check, i am not sure why it was added. is it there to prevent overwhelming something api side or db side?
      • alastairp
        or maybe we can estimate
      • atj
        seems a weird check
      • "the average size of all listens in the payload is less than 10kb"
      • alastairp
        I think this may have come about because of our changes in technologies
      • lucifer
        if the former, then a per-document size limit makes sense, but if the latter, then a per-listen size limit is sensible.
      • alastairp
        that check may have been added before we started using rabbitmq (and so before we could have multiple listens per message)
      • lucifer
        we still have multiple listens per message
      • alastairp
        right, but did we beforehand?
      • lucifer
        ah sorry, i misread your message
      • alastairp
      • I guess it's just a simpler way of doing the check
      • of "each listen < 10k in size"
      • because otherwise you'd need to convert body json -> py dict, then iterate through it, then for each item convert back to json to check its size in bytes
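A hedged sketch of that trade-off (the 10 KB limit comes from the discussion; the function names and payload shape are illustrative):

```python
import json

MAX_LISTEN_SIZE = 10 * 1024  # the 10 KB per-listen limit from the docs

def average_size_ok(raw_body: bytes, num_listens: int) -> bool:
    # the existing-style check: average listen size over the whole payload,
    # computed without parsing anything
    return len(raw_body) / num_listens <= MAX_LISTEN_SIZE

def each_listen_ok(listens: list) -> bool:
    # the stricter per-listen check: requires the dict -> json round-trip
    # per item that is described above
    return all(len(json.dumps(listen).encode("utf-8")) <= MAX_LISTEN_SIZE
               for listen in listens)
```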
      • lucifer
        checks seems to have been added in this commit. https://github.com/metabrainz/listenbrainz-serv...
      • so multiple listens existed at that time too. probably done that way for easier checking, indeed
      • alastairp: `select *, pg_column_size(data) AS size from listen ORDER BY size DESC LIMIT 5;` to find largest listen. then `select length(data::text) from listen where listened_at = 1651504402 AND user_id = 8741;` for its length. length is 9331.
      • alastairp
        commit message by mayhem: "That is probably enough sanity checking for this minute."
      • I mean, at the time it probably was
      • lucifer: yes, multiple listens in an API payload existed. but I was talking about what happens after the API endpoint
      • I think that at that time we may have split the payload up into messages of 1 listen each
      • lucifer
        i see, makes sense.
      • yup it splits the payload into separate listens
      • MRiddickW has quit
      • atj
        seems like it might be a good idea to check the payload size before parsing it
      • alastairp
        atj: hmm, right. so we could say "per-listen 10k, max 1000 listens -> max message payload 10mb" and reject early if the body is over that
      • then reject again after parsing if there are >1000 listens (e.g. maybe the user submitted 10k small listens which were under the 10mb limit)
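A sketch of that two-stage rejection, assuming a Flask-style handler (the limits, names, and payload shape are illustrative, not the actual ListenBrainz values):

```python
import ujson
from flask import abort

MAX_LISTEN_SIZE = 10 * 1024                                # 10 KB per listen
MAX_LISTENS_PER_REQUEST = 1000                             # illustrative cap
MAX_BODY_SIZE = MAX_LISTENS_PER_REQUEST * MAX_LISTEN_SIZE  # ~10 MB

def validate_submission(raw_body: bytes) -> dict:
    # stage 1: reject on raw size before paying the JSON-parsing cost
    if len(raw_body) > MAX_BODY_SIZE:
        abort(413)
    data = ujson.loads(raw_body)
    # stage 2: reject again if there are too many (small) listens
    if len(data.get("payload", [])) > MAX_LISTENS_PER_REQUEST:
        abort(400)
    return data
```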
      • atj
        exactly
      • you could open yourself up to DoS if someone decides to send lots of large payloads
      • I doubt "ujson.loads()" is a particularly cheap call, CPU or memory wise
      • alastairp
        it's certainly cheaper than json.loads()!
      • atj
        that's probably a low bar though isn't it? :)
      • alastairp
      • but you're right - failing, if necessary, before doing the json parsing sounds like a good idea too
      • thanks
      • atj
        fail early and all that
      • isn't that just null?
      • ah, I see further down that Postgres doesn't like it and you have a performance check for it
      • alastairp
        I think you're right that it is just "null" check
      • I guess perhaps it's "null in a string" rather than "null anywhere else in the payload"
      • lucifer: did we ever check the performance of that span in sentry?
      • lucifer
        yup, it was minuscule impact.
      • for instance
      • alastairp
        cool
      • 100x faster than round-trip to redis
      • maybe we can remove the sentry span, then?
      • lucifer
        yes sounds good
      • alastairp
        that being said, great feature if we want to benchmark other stuff
      • lucifer: I'm looking into this requirements.txt -> setup.py install_requires change for BU
      • I think the least effort solution that'll work well is just to rewrite any lines that are git repos, right?
      • or we can try pyproject.toml :/
      • jesus2099 joined the channel
      • looks like this is still all a bit of a mess
      • lucifer
        alastairp: that's what the SO answer from yesterday did?
      • alastairp
        hi jesus2099!
      • lucifer: yes, the answer from yesterday suggested a new format for source dependencies (PEP 508 I understand), but that's still different from what requirements.txt needs
      • lucifer
        i see, makes sense to rewrite then
      • alastairp
        so... we either need to manage these lists manually, convert automatically, or switch to poetry/pyproject.toml for local development so that we can reuse the same file for local dev and remote installs
      • agreed that automatically converting when inserting into setup.py is probably the easiest idea for now - and it's only temporary until new mbdata is released, I guess
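One way that automatic conversion could look (a sketch; the helper name and regex are assumptions, and it only handles the common `#egg=` form of VCS requirement):

```python
import re

def to_pep508(req: str) -> str:
    # rewrite a requirements.txt-style VCS line, e.g.
    #   git+https://github.com/org/repo.git@ref#egg=name
    # into the PEP 508 direct-reference form that install_requires accepts:
    #   name @ git+https://github.com/org/repo.git@ref
    m = re.match(r"(git\+\S+?)#egg=([A-Za-z0-9_.-]+)", req)
    return f"{m.group(2)} @ {m.group(1)}" if m else req

with open("requirements.txt") as f:
    install_requires = [to_pep508(line.strip()) for line in f
                        if line.strip() and not line.startswith("#")]
```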
      • lucifer
        if poetry/pyproject.toml handle this better then makes sense to migrate to those in future.
      • CatQuest
        ..
      • lucifer
        indeed, the current situation should be temporary
      • CatQuest
        no. i refuse. you can't call a foss project "poetry" that's just .. how even
      • sigh
      • alastairp
        yeah, though I'm just worried that we'll move to a new dependency tool just as a new "better" one gets released
      • lol
      • skelly37
        outsidecontext, zas: FIFO seems to work for unix, you might want to take a look and review the bare protocol and ideas behind it: https://github.com/skelly37/pipethon. In the evening I'll take care of doing the same but with the Windows API, and then it's ready to be implemented in Picard.
      • alastairp
        skelly37: great work getting this far so soon!
      • of course, as they say the first 90% is easy, it's the second 90% that takes most of the time :)
      • atj
        pipes work on windows?
      • skelly37
        alastairp: thanks :)
      • alastairp
        atj: "but with windows API" - I guess not
      • ansh: by the way, I didn't confirm with you the other day, but as mayhem pointed out starting early is fine. let me know when you want to discuss this. I was chatting with monkey and we think that it makes sense that I work with you directly when you're working on CB parts, and he works with you when doing BB parts. perhaps the 3 of us could get together in the next week or so and go over your plan again
      • skelly37
        atj: It requires pywin32 module, os.mkfifo() is unfortunately unix only.
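For context, a sketch of that platform split (the paths and pipe parameters are illustrative; pywin32's win32pipe module provides the Windows side):

```python
import os
import sys

FIFO_PATH = "/tmp/picard_pipe"       # illustrative POSIX FIFO path
PIPE_NAME = r"\\.\pipe\picard_pipe"  # Windows named-pipe namespace

if sys.platform == "win32":
    # the pywin32 route mentioned above; buffer sizes etc. are illustrative
    import win32pipe
    handle = win32pipe.CreateNamedPipe(
        PIPE_NAME,
        win32pipe.PIPE_ACCESS_DUPLEX,
        win32pipe.PIPE_TYPE_MESSAGE | win32pipe.PIPE_READMODE_MESSAGE
        | win32pipe.PIPE_WAIT,
        1, 65536, 65536, 0, None)
else:
    os.mkfifo(FIFO_PATH)             # POSIX only, hence the Windows branch
```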