#musicbrainz-devel

      • ianmcorvidae
        ianmcorvidae has changed the topic to: Self freezing week https://youtu.be/5T68TvdoSbI | http://musicbrainz.org/#devel | Agenda: Allowing murdos' bot to do more WD link edits (Freso), blog audience/dev blogging (ian)
      • JonnyJD joined the channel
      • ldmosquera joined the channel
      • ldmosquera
        hello all; question about the Virtualbox VM
      • I've downloaded the latest version from 2013-10-14, and followed the instructions to set it up
      • I'm using KVM instead of Virtualbox, but everything's good
      • derwin
        I think that won't actually work.
      • because there's been a schema change since then, and upgrading is apparently not feasible?
      • ianmcorvidae
        no, that's the post-schema-change VM
      • see the date :P
      • derwin
        oh, didn't realize that the schema change was so long ago.
      • ldmosquera
        in any case my problem is with the reindex script
      • it always crashes either because of SEGFAULTs or different Java exceptions in different places
      • always during tmp_track
      • anyone come across this?
      • derwin
        what exception? OOM?
      • ianmcorvidae
        are you running replication while reindexing? I believe it's not intended to work while replication is on
      • ldmosquera
        didn't run any replication commands, is it on by default?
      • ianmcorvidae
        it shouldn't be, no
      • ldmosquera
        example: Exception in thread "main" java.lang.IncompatibleClassChangeError at org.apache.lucene.document.Document.add(Document.java:64)
      • another was a NullPointerException in another place in the code
      • ianmcorvidae
        strange
      • ldmosquera
        it would seem random
      • ianmcorvidae
        I wonder if there's something with java versions going on?
      • ijabz is the search server dev and ruaok is the one who has had the biggest role in setting up the VM images
      • neither of them seem to be around at present
      • ldmosquera
        I guess that would affect anyone with the same VM version
      • btw I'm using 6GB RAM so memory is not a problem
      • alright, I'll look out for them
      • also, another question
      • I'm seeing practically the same level of performance using a desktop harddisk and an SSD
      • always inside KVM using a raw LVM partition
      • ianmcorvidae
        performance for what exactly? website, webservice, search?
      • ldmosquera
        the indexing, sorry
      • ianmcorvidae
        hm
      • ldmosquera
        I guess Postgres is the bottleneck
      • ianmcorvidae
        not sure what the bottlenecks there are, but yeah, that'd be my guess
      • it could be memory, I suppose, with postgres
      • ldmosquera
        going from 2GB to 6GB for the VM made a bit of difference, but not as big as I'd expect
      • ianmcorvidae
        search server indexing builds a variety of temporary tables, and the automatic tuning only accounts for memory, not anything like SSD tuning
      • with an SSD you want it to much less sharply penalize random seeks and disk read/write operations, as you'd imagine
      • ldmosquera
        I looked around for Postgres tuning tips for SSD, but couldn't find much
      • ianmcorvidae
        I don't remember the exact parameters for that though
      • ldmosquera
        I'm using the deadline IO scheduler instead of the default CFQ
      • ianmcorvidae
        I suspect this is higher, in the postgres query planner
      • e.g. for an SSD you'd want it to consider materializing a temporary table much more often than you would with a spinning disk
      • looks like it's seq_page_cost and random_page_cost
      • default is for seq_page_cost to be 1.0 and random_page_cost to be 4.0, with an SSD you might be able to squeeze some performance by kicking random_page_cost down some
      • ldmosquera
        I tried one of those, forgot which, but didn't see much change either; I'll read up more though
      • random_page_cost I believe
      • ianmcorvidae
        I don't necessarily know how much benefit you'll get from that though
      • ldmosquera
        thanks a lot! I'll do some tests
      • ianmcorvidae
        probably if you want to get more performance you'd need to ensure you know what the real bottleneck is and see what query plans it's getting
      • ldmosquera
        I'll just try some general tuning first
      • with a random_page_cost of 1.1 instead of the default 4.0, tmp_track took 144secs instead of 156secs
      • underwhelmed :P
      • ianmcorvidae
        heh
      • yeah, I don't really know -- it's possible the bottleneck is elsewhere, too
      • I haven't really played with any of this stuff running on SSDs, so :)
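A rough sketch of why lowering random_page_cost can change which plan looks cheaper, and why the measured gain above is small. This is a toy cost model, not PostgreSQL's actual planner; the page counts and the two plan names are made up for illustration, and only the defaults (seq_page_cost 1.0, random_page_cost 4.0) and the 156 s / 144 s timings come from the conversation.

```python
# Toy cost model: NOT PostgreSQL's real planner, only the weighting idea.

def estimated_cost(seq_pages, random_pages, seq_page_cost=1.0, random_page_cost=4.0):
    """Weight sequential and random page reads by their configured costs."""
    return seq_pages * seq_page_cost + random_pages * random_page_cost


def cheaper_plan(random_page_cost):
    """Compare two hypothetical plans under a given random_page_cost."""
    # Hypothetical numbers: a full scan of 1000 pages vs. 300 random index fetches.
    full_scan = estimated_cost(seq_pages=1000, random_pages=0,
                               random_page_cost=random_page_cost)
    index_scan = estimated_cost(seq_pages=0, random_pages=300,
                                random_page_cost=random_page_cost)
    return "index scan" if index_scan < full_scan else "full scan"


def relative_speedup(before_secs, after_secs):
    """Fraction of the original run time that was saved."""
    return (before_secs - after_secs) / before_secs


if __name__ == "__main__":
    print(cheaper_plan(4.0))   # default cost for random reads
    print(cheaper_plan(1.1))   # value tried in the chat
    print(f"{relative_speedup(156, 144):.1%}")
```

With these made-up page counts the preferred plan flips when random reads get cheaper, but the tmp_track timings from the chat work out to only about a 7.7% saving, which fits the suspicion that the bottleneck is somewhere else.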
      • derwin
        but cmon, you have that 501c(3) cheese
      • make it rain SSDs
      • ianmcorvidae
        heh
      • ldmosquera
        maybe (likely) the bottleneck is KVM / virtio
      • I'll try with different cache modes
      • ianmcorvidae
        I know that SSDs were considered for the new DB server we bought in 2011, but ultimately it was decided against, I think because the world of the internet wasn't quite sure how much benefit SSDs would bring
      • ldmosquera
        what hardware is MB running on nowadays?
      • ianmcorvidae
        we have a half-rack of servers doing various things; one DB server, hot-spare DB server, three machines running the website/webservice code, two machines running search servers, one machine building search indexes, frontend/gateway machines, and a variety of smaller things (e.g. our Xen host, we have VMs for the wiki, forums, and some other things)
      • derwin
        you could read the 2012 blog post..
      • ianmcorvidae
      • ldmosquera
        nice! thanks
      • ianmcorvidae
        we've moved stimpy, dexter, and tails out of the rack and have a couple of new ones, at least one of which isn't just shut down as an ostensible future spare
      • heh, and hobbes is, I believe, currently sitting in our colo's fridge until we have a chance to get over there and open it up to replace some failing disks
      • ldmosquera
        the traffic graph is brutal
      • ianmcorvidae
        I think that graph hasn't been adjusted for ratelimited traffic, not sure
      • ldmosquera
        what happened in mid-2011? Maybe a new client software release?
      • ianmcorvidae
        headphones happened
      • ldmosquera
        figures :P that's exactly how I got here
      • ianmcorvidae
        which is a piece of software that our API is spectacularly bad for
      • so we ratelimit it really severely, which presumably is why you're setting up your own server :)
      • ldmosquera
        I recently discovered headphones, then beets through it, then I decided I needed my own MB mirror
      • ianmcorvidae
        http://stats.musicbrainz.org/mrtg/drraw/drraw.c... -- we refuse 2/3 of requests that come to us, from headphones
      • well, a bit less than that, but we still let through fewer than we refuse :/
      • derwin
        oh ldmosquera I spoke with you last week!
      • ianmcorvidae
        http://stats.musicbrainz.org/mrtg/drraw/drraw.c... for pre-2012 traffic from headphones (by and large)
      • ldmosquera
        here? First time I hop in here
      • derwin
        in #musicbrainz..
      • or #beets :)
      • ldmosquera
        are you sure it was me? I haven't registered this nick, maybe someone else named like this (incredibly unlikely)
      • derwin
        guess it must have been someone with a similar path to musicbrainz
      • ianmcorvidae
        it's a pretty common one lately
      • especially for people setting up servers
      • ldmosquera
        I absolutely love MB; I built some scripts to "curate" my collections a few years ago, but it was a heap of manual work
      • the scripts inferred stuff and made suggestions, but I had to review everything
      • now I found beets and it manages to do 90% of it without input
      • derwin
        yeah, #beets exists btw, and is active, in case you need help :)
      • ldmosquera
        not so far; it's gloriously well made and I had no surprises
      • ianmcorvidae: how do you mean MB's API is bad for Headphones?
      • ianmcorvidae
        headphones tends to have to make a lot of requests in order to get the information it wants
      • ldmosquera
        one per track or something like that?
      • ianmcorvidae
        we don't really have much in the way of tools for synchronizing changes, as it were -- most of our API requires you to specify one entity at a time, and polling is really the only way to watch for changes to the data
      • ldmosquera
        oh I see
      • ianmcorvidae
        headphones has done some decent work improving that -- for example by using complicated hacks with things like search queries to get around the one-entity limits
      • but it's just still really not good for that
      • the MB API grew up around taggers, and that means that sometimes it's not good for things that don't match that pattern of usage
      • (with a tagger, it makes a lot of sense: you request one release at a time, and updating to account for changes is largely manual, not automated)
      • theoretically headphones could even do something semi-crazy like use replication packets, but that wouldn't help with the bits of headphones that are passing to beets and thus require a copy of the MB API
      • so the usual way of doing things seems to have become "set up a mirror"
      • it's at least gotten us to be better about releasing updated VMs :)
      • ldmosquera
        nice job downscaling everything into a single VM!
      • so basically Headphones operates on the entire collection instead of file by file like a tagger, and so ends up doing many requests for each file, right?
      • and thus would benefit from some kind of batch-mode API
      • ianmcorvidae
        well, a batch-mode API would mean that it could make fewer requests as a matter of polling
      • what would really help is if we had an effective way to push out changes
      • i.e., so headphones can watch something and then only make requests for things that have actually changed, rather than polling to see if there are changes
      • ldmosquera
        got it
      • ianmcorvidae
        we have a partially-done experiment in that, but it has a lot of weaknesses and we're perpetually short on resources to work on things, so
      • ldmosquera
        maybe if you could specify "releases newer than date XXX"
      • ianmcorvidae
        what would be fantastic for headphones is if it could just make a request every so often saying "hey, I care about these artist MBIDs, which ones have new releases/release groups?", get back a list of MBIDs, and then request only those
      • (where "new" would be defined in terms of some date, like you say)
      • our API also allows a lot of different representations/granularities to the data, though
      • which makes it hard; such a changed-entities thing would either have to assume that everyone only cares about one particular one of those resolutions/representations, or it needs a way to specify exactly what things a given client cares about (and then it needs to keep track of more data so it can accurately respond to those requests)
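A minimal sketch of the "which of these artists have new release groups since date X?" idea described above. No such endpoint exists according to the conversation; the function name, the in-memory data, and the placeholder identifiers are all hypothetical, and the sketch ignores the representation/granularity problem entirely.

```python
from datetime import date

# Hypothetical in-memory data: artist MBID -> list of (release group MBID, date added).
# The identifiers are placeholders, not real MBIDs.
RELEASE_GROUPS = {
    "artist-a": [("rg-1", date(2013, 9, 1)), ("rg-2", date(2013, 11, 20))],
    "artist-b": [("rg-3", date(2012, 5, 5))],
    "artist-c": [],
}


def artists_with_new_release_groups(artist_mbids, since, data=RELEASE_GROUPS):
    """Return the watched artists that gained a release group on or after `since`."""
    changed = []
    for mbid in artist_mbids:
        if any(added >= since for _, added in data.get(mbid, [])):
            changed.append(mbid)
    return changed


if __name__ == "__main__":
    watched = ["artist-a", "artist-b", "artist-c"]
    # A client would then request full data only for the artists returned here.
    print(artists_with_new_release_groups(watched, date(2013, 11, 1)))
```

The point of the design is that the client sends one small request for its whole watch list and then fetches details only for the artists that come back, instead of polling every artist.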
      • ldmosquera
        what are the resolutions, for example?
      • ianmcorvidae
        so if you look at http://wiki.musicbrainz.org/XML_Web_Service/Ver... and the following three sections, those are the various so-called 'inc parameters'
      • which specify which pieces of data to include
      • ldmosquera
        got it
      • ianmcorvidae
        and in some cases combining two inc parameters is not just a matter of merging the two, since sometimes one inc parameter will also affect the data returned by another
      • (especially those listed in "inc= arguments which affect subqueries", but)
      • (e.g. for a release, inc=artist-credits will include the release artist credit, inc=recordings will include the tracks on the release, but inc=artist-credits+recordings will include the release artist credit, the tracks, and all of the tracks' artist credits)
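The inc interaction in the release example above can be sketched as a toy model. This is not the server's code: the function names and the part labels are invented for illustration, and only the three cases described in the chat are modeled.

```python
def included_parts(incs):
    """Toy model of which parts of a release response a set of inc values yields."""
    incs = set(incs)
    parts = set()
    if "artist-credits" in incs:
        parts.add("release artist credit")
    if "recordings" in incs:
        parts.add("tracks")
    # The combination adds something neither parameter returns on its own.
    if {"artist-credits", "recordings"} <= incs:
        parts.add("track artist credits")
    return parts


def inc_query(incs):
    """Join inc values with '+', the form used in the example above."""
    return "inc=" + "+".join(incs)


if __name__ == "__main__":
    separate = included_parts(["artist-credits"]) | included_parts(["recordings"])
    combined = included_parts(["artist-credits", "recordings"])
    print(inc_query(["artist-credits", "recordings"]))
    print(sorted(combined - separate))  # what only the combination returns
```

This is why a changed-entities feed cannot simply merge per-parameter results: the response for a combination can contain data that no single inc value produces.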
      • we don't have very good internal caching/tracking of changes to data returned by the WS, too -- for HTTP caching stuff we can basically never avoid doing all the work, database-wise, before knowing if the response has changed
      • which, again, partly-finished experiments exist, but :)
      • ldmosquera
        so the problem would be to make it generic so it could satisfy any client without assuming things like what Headphones needs
      • ianmcorvidae
        yeah
      • also getting headphones to use it, which can sometimes be a struggle, but if it were well-made enough I guess we'd hope the benefits were self-evident :)
      • ldmosquera
        if Headphones is overwhelmingly more active than other clients, then maybe it'd pay to make just this API endpoint for it
      • then other clients would probably catch on
      • I see :)
      • I also use muspy, which I believe uses MusicBrainz too
      • how does it fare with the API?
      • ianmcorvidae
        muspy does it a bit better, because it essentially functions as an aggregator
      • ldmosquera
        right, centralized
      • ianmcorvidae
        if 3000 people all follow the same artist on muspy it still only has to make one request to us per day
      • yeah
      • muspy is something that it wouldn't be unreasonable for us to copy, in a rough sense, for the sort of changed-data API/feed I was talking about
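A small sketch of the aggregator arithmetic mentioned above, assuming one upstream request per followed artist per day. This is not muspy's actual implementation; the function names and the subscription data are hypothetical.

```python
def upstream_requests_per_day(subscriptions):
    """Aggregator: one request per distinct artist, however many users follow it."""
    return len({artist for artists in subscriptions.values() for artist in artists})


def direct_requests_per_day(subscriptions):
    """No aggregator: every user polls each artist they follow on their own."""
    return sum(len(set(artists)) for artists in subscriptions.values())


if __name__ == "__main__":
    # 3000 hypothetical users all following the same placeholder artist.
    subscriptions = {f"user-{n}": ["artist-a"] for n in range(3000)}
    print(upstream_requests_per_day(subscriptions))
    print(direct_requests_per_day(subscriptions))
```

With the aggregator the upstream load grows with the number of distinct artists, not with the number of users, which is the property a changed-data feed would want to copy.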
      • ldmosquera
        gotta run for a few hours, but I'll be back
      • ianmcorvidae
        cool, nice talking to you
      • hopefully you get your issues sorted
      • ldmosquera
        hopefully I can make myself useful :) I'm a developer and sysadmin