#metabrainz


      • julian45[m]
        if k8s ends up being too far to one side on the complexity scale, nomad (by hashicorp) has been brought up occasionally in here before and could be worth investigating
      • and finally, one last thing before i step off my soapbox and go focus on things i should be doing: this may be based on a naive understanding of how ansible is currently used within MeB, but it may be beneficial to us to use a tool like AWX^ to help centralize ansible usage & observability related to it. having one place from which to run roles/playbooks would make it easier to see which are currently being used/enforced for any given target server, as well as a place to trigger runs of playbooks, keep track of execution history, and provide a single consistent execution environment for ansible jobs. https://github.com/ansible/awx
      • ^ OSS upstream to red hat's ansible automation platform product (formerly known as ansible tower)
      • lucifer[m]
        mayhem: julian45 yup i have been thinking of upgrading the timescale PG db to postgres 16 for parity with MB; if we want to do a server upgrade, best to club both together.
      • a simple master/standby for timescale would be great, or at least adding barman backups to it. ideally without adding the complexity of new tools.
      • bitmap: done
      • BrainzGit
        [listenbrainz-server] 14amCap1712 opened pull request #3194 (03master…this-stats-to-date): Update this_(week/month/year) time ranges https://github.com/metabrainz/listenbrainz-serv...
      • [listenbrainz-server] 14amCap1712 merged pull request #3194 (03master…this-stats-to-date): Update this_(week/month/year) time ranges https://github.com/metabrainz/listenbrainz-serv...
      • mayhem[m]
        some MB data graph porn for your friday:
      • mayhem[m] uploaded an image: (60KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/URkmbpvGcvpiVNIFSIALufxk/image.png >
      • recording name lengths in MB.
      • monkey[m]
        What's that spike at 80 chars ?!
      • mayhem[m]
        everything longer than 80 characters
      • monkey[m]
        Ah, I see
      • monkey[m] wonders if it follows Zipf's law
      • mayhem[m]
        not sure, but it fits zeppelin's law...
      • monkey[m]
        Is it.... a stairway?
      • mayhem[m] uploaded an image: (810KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/ixawiNCOWJZwOPWBBqwXMyYN/image.png >
      • lucifer[m]
        mayhem: can you please also review LB#3193
      • BrainzBot
        Implement listen deletion in the spark cluster: https://github.com/metabrainz/listenbrainz-serv...
      • mayhem[m] uploaded an image: (22KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/siWDMtNVMpzDjZcQzhVIPWKJ/image.png >
      • mayhem[m]
        same, but for artists with a max of 40 chars.
      • lucifer: do you happen to be about?
      • lucifer[m]
        yup
      • mayhem[m]
        kewl, want to talk through something.
      • so, I got the nmslib indexes to persist to disk -- which is great!
      • someone implemented that feature in scikit learn and it works fine.
      • so I made an artists index that also loads all the tracks for each artist into ram.
      • and then persists the whole thing to disk.
      • lucifer[m]
        makes sense
      • mayhem[m]
        3.5G on disk, which isn't bad.
      • it takes 30 seconds to load from disk and all recording indexes are built at search time on demand.
      • so the first search might take 40ms, but a subsequent search takes about 4-10ms.
      • reosarevok[m]
        aerozol: another thing where your ideas would be appreciated: MBS-13945
      • BrainzBot
        MBS-13945: Include release link for edits made in the release relationship editor https://tickets.metabrainz.org/browse/MBS-13945
      • mayhem[m]
        which is all great, really.
      • lucifer[m]
        yup sounds good.
      • mayhem[m]
        but, the index only indexes the first x characters of each string. (which is why I was looking at the graphs)
      • lucifer[m]
        3.5G is with how many characters?
      • mayhem[m]
        I am now debating if I should store all the excess chars of each string in the index, or take the results and fetch them from PG.
      • 30 characters, currently. but most of the index data is literally all the strings and mbids residing in ram.
      • nmslib only has a single int that can be stored in the index.
      • lucifer[m]
        i think doing a subsequent PG query makes sense to me.
      • mayhem[m]
        so, secondary data needs to be stored in the index (which makes indexes huge and slow to load) OR we simply make the index as light as possible and then ask PG for the results.
      • lucifer[m]
        the biggest user of this index would be the mapper i think and that only needs a recording mbid.
      • mayhem[m]
        lucifer[m]: it does, but not if we use PG twice. right now we check for exact matches using PG and if nothing is found, we go to typesense.
      • lucifer[m]
        you could make an index in PG on the int and recording_mbid.
      • mayhem[m]
        we could also add recording_id to the canonical data.
      • lucifer[m]
        and it would never query the table for that query.
      • sure.
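        the covering index described above, sketched in SQL -- table and column names here are hypothetical, not the actual schema:

        ```sql
        -- Hypothetical names. A covering index on the integer id stored in the
        -- nmslib index, INCLUDE-ing the mbid so the lookup can be answered by
        -- an index-only scan without touching the table heap (PG 11+).
        CREATE INDEX canonical_recording_id_idx
            ON canonical_recording_data (id)
            INCLUDE (recording_mbid);

        -- the resolver's follow-up lookup would then be:
        -- SELECT recording_mbid FROM canonical_recording_data WHERE id = $1;
        ```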
      • mayhem[m]
        because this index is built off canonical data, not all MB data.
      • lucifer[m]
        sounds good to me.
      • mayhem[m]
        but it really comes down to one of two ways:
      • 1) Store everything in the index and have indexes be slow to load.
      • 2) store nothing in index and fetch everything from PG.
      • and I am leaning towards 2.
      • lucifer[m]
        yup same.
      • mayhem[m]
        cool.
      • lucifer[m]
        for mbid mapper you can even skip the pg query.
      • mayhem[m]
        now comes the question: how the fuck do we host this?
      • lucifer[m]
        just write the recording id to the table and let the recording mbid be resolved later.
      • mayhem[m]
        lucifer: maybe. if the query strings are shorter than the max chars, possibly.
      • lucifer[m]
        either way indexes would optimize it.
      • mayhem[m]
        but hosting, is a pita.
      • lucifer[m]
        mayhem[m]: how much ram is consumed when all recording indexes have been built?
      • mayhem[m]
        ideally it would be a single-process, multi-threaded app.
        lucifer[m]: don't know, and I realize that this is a bad pattern. we will never want all of MB resident in the index. that's nonsense. LB users are going to be listening to a subset of all the data, so we should only keep the active things in the index.
      • let's leave this for one sec. we'll get to it.
      • lucifer[m]
        so a lru cache?
      • mayhem[m]
        yes!
      • but we can't have a purely threaded app until the GIL is reliably gone.
      • lucifer[m]
        i think we could host it on a vm or a separate server with enough ram.
      • mayhem[m]
        it needs to be multi-process, multi-thread.
      • yes!
      • I am thinking of dividing the dataset into chunks.
      • lucifer[m]
        sharding based on names and hosting multiple instances of the app?
      • mayhem[m]
        on the first level, we'll decide to break the data into P chunks where P is the number of desired processes.
      • say strings that start with A-E in process #1, F-H in #2 and so on.
      • when a request comes in, it gets resolved to a backend process to handle.
      • the process carries it out and returns the results to the dispatcher.
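        the routing step described above could be sketched like this -- ranges, process count, and names are all hypothetical:

        ```python
        # Sketch of routing a query string to one of P shard processes by its
        # first character, as described above. The ranges are hypothetical.
        import string

        def build_shard_map(num_processes: int) -> dict:
            """Split the alphabet into contiguous ranges, one per process."""
            letters = string.ascii_lowercase
            chunk = -(-len(letters) // num_processes)  # ceiling division
            return {c: i // chunk for i, c in enumerate(letters)}

        def route(query: str, shard_map: dict) -> int:
            """Pick the backend process for a query; non-letters go to shard 0."""
            first = query[:1].lower()
            return shard_map.get(first, 0)

        shard_map = build_shard_map(4)
        print(route("Elvis", shard_map))  # 'e' falls in the a-g range -> 0
        ```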
      • but each process' data is further broken into MANY smaller chunks.
      • and the many smaller chunks are not all loaded at load time -- everything is lazy loaded.
      • but there is a flat file that has all the built indexes serialized to disk.
      • lucifer[m]
        have you tried multithreaded querying?
      • mayhem[m]
        and the index we load into ram, says: this index chunk for this query is in file X, offset O, length L.
      • fetch the index, unpickle, query.
      • and then we set an upper memory consumption limit. if the proc gets to that limit, it dumps LRU indexes.
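        the lazy-loading scheme described above could be sketched like this -- a catalog maps a chunk key to (offset, length) in one flat file, chunks are unpickled on demand, and LRU chunks are evicted past a memory budget. Class and field names are hypothetical:

        ```python
        # Sketch of the lazy chunk loader: seek/read/unpickle a chunk from a
        # flat file on demand, evict least-recently-used chunks over budget.
        import pickle
        from collections import OrderedDict

        class ChunkCache:
            def __init__(self, path, catalog, max_bytes):
                self.path = path            # flat file with all pickled chunks
                self.catalog = catalog      # key -> (offset, length)
                self.max_bytes = max_bytes  # upper memory consumption limit
                self.cache = OrderedDict()  # key -> (chunk, size), LRU order
                self.used = 0

            def get(self, key):
                if key in self.cache:
                    self.cache.move_to_end(key)  # mark as recently used
                    return self.cache[key][0]
                offset, length = self.catalog[key]
                with open(self.path, "rb") as f:
                    f.seek(offset)
                    chunk = pickle.loads(f.read(length))
                self.cache[key] = (chunk, length)
                self.used += length
                # dump LRU chunks once past the memory limit
                while self.used > self.max_bytes and len(self.cache) > 1:
                    _, (_, size) = self.cache.popitem(last=False)
                    self.used -= size
                return chunk
        ```

        (the real indexes would of course be nmslib objects rather than pickled lists; the file layout and eviction policy are the point here.)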
      • lucifer[m]
        like host the index behind a uwsgi flask app and make multiple concurrent requests and see if it works. it might handle itself automatically for you.
      • mayhem[m]
        lucifer[m]: the GIL will be in your way.
      • you won't get much beyond 100% CPU use
      • lucifer[m]
        according to this, the actual search code doesn't hold the GIL, only the glue code does, so it's possible it might work.
      • so worth a try if you haven't already.
      • mayhem[m]
        you're suggesting a single flask app, single process, and see what happens?
      • lucifer[m]
        actually even simpler. just a python process with a threadpoolexecutor to query the indexer.
      • test say 1k-10k items. on a single thread and two threads. and compare overall time to execute.
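        the suggested test could be sketched like this, with a pure-Python stand-in for the real index search (names and sizes are hypothetical; only a call that releases the GIL, as the nmslib search reportedly does, would show a real speedup):

        ```python
        # Sketch of the benchmark: time the same batch of queries on one
        # thread and on two, and compare overall time. query() is a
        # hypothetical CPU-bound stand-in for the real index search call.
        import time
        from concurrent.futures import ThreadPoolExecutor

        def query(i: int) -> int:
            return sum(j * j for j in range(5_000))  # stand-in for the search

        def timed(n_threads: int, n_items: int = 1_000) -> float:
            start = time.perf_counter()
            with ThreadPoolExecutor(max_workers=n_threads) as pool:
                list(pool.map(query, range(n_items)))
            return time.perf_counter() - start

        t1, t2 = timed(1), timed(2)
        print(f"1 thread: {t1:.3f}s, 2 threads: {t2:.3f}s")
        # if the search releases the GIL, the 2-thread run should approach
        # half the time; pure-Python work like this stand-in will not.
        ```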
      • mayhem[m]
        well, a flask app is our end goal, so let's express it in terms of that. :)
      • I suppose I can stand up a simple flask end point and try it.
      • lucifer[m]
        sure, a flask app, but the dev mode is limited to one thread, so use uwsgi with workers and enable-threads to use threads instead of processes.
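        the uwsgi settings mentioned above, as a sketch -- module name and worker/thread counts are arbitrary placeholders:

        ```ini
        [uwsgi]
        module = app:application
        processes = 2
        threads = 4
        ; without enable-threads, uwsgi does not initialize the Python GIL
        ; machinery, and app-spawned threads will not run
        enable-threads = true
        ```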
      • mayhem[m]
        yep, fer sure.
      • let me do that
      • lucifer[m]
        i think a threadpoolexecutor is a simpler and more accurate test though. and easier to debug too if needed.
      • mayhem[m]
        I detest threadpoolexecutor, I have to say. it's always mental gymnastics to get it to do what I need.
      • and I am not sure if accurate is the correct term. our goal is to run under uwsgi, so why not test there? testing in an artificial setup may not reflect reality.
      • yvanzo[m]
        julian45: About the SSO app in Jira provided by miniOrange, a faithful partner of Atlassian: the app is well maintained and closely follows new versions of Jira, including 10.x. Happy to replace it with Jira 10.x's native SSO feature if possible, but we can just keep using this app otherwise.
      • bitmap: Not sure. I would have guessed that it came from some Ansible repository but I couldn’t find anything. However, it matches the user `brainz` in the sshd container for fullexport.
      • monkey[m]
        ansh: I've been tweaking the mobile UI PR, and I think it's in a good state to get some feedback.
      • I'd like yours and aerozol's first if you have any, then perhaps we can deploy it to beta for a little while to get feedback from the community?
        Currently deployed to test.LB, BTW
      • BrainzGit
        [musicbrainz-server] 14reosarevok opened pull request #3484 (03master…MBS-13771): MBS-13771: Filter edit search by RG primary type https://github.com/metabrainz/musicbrainz-serve...
      • bitmap[m]
        <yvanzo[m]> "bitmap: Not sure. I would have..." <- ah right, the brainz user is also in sshd-musicbrainz-json-dumps-incremental (which is based on the same sshd image). thanks! I'll just add a comment then
      • suvid[m] joined the channel
      • suvid[m]
        I was planning on working on this ticket:... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
      • julian45[m]
        <yvanzo[m]> "julian45: About the SSO app in..." <- Got it, thanks! I was not aware of miniOrange's relationship with Atlassian, so this is good context for me to have.
      • monkey[m]
        <suvid[m]> "I was planning on working on..." <- suvid: For this, you would want a toggle to turn off BrainzPlayer entirely. We already have a page for BP settings at https://listenbrainz.org/settings/brainzplayer/
      • Then there will be some conditional rendering in a few places depending on the activation state of BP (for example hiding the play icon buttons on all the listencards, not rendering or loading the BrainzPlayer component, etc.)
      • jasje[m] joined the channel
      • jasje[m]
        Note for concerned: I won't be available (travelling) till the end of this month (28th Feb). Available for important stuff only :)
      • aerozol[m]
        Love the stats mayhem! I'll share them this weekend
      • reosarevok: re MBS-13945, I saw that the other day and it seemed relatively straightforward? The wording could use some work (I'm not sure "source MBID" will be universally understood) but I don't know if there's a downside. Whether it's done in the UI or via modbot I assume is a technical question. Let me know if you want me to comment on the ticket re. anything in particular
      • BrainzBot
        MBS-13945: Include release link for edits made in the release relationship editor https://tickets.metabrainz.org/browse/MBS-13945