#metabrainz

      • alastairp
        see if you can understand the build code for essentia - for example, we split the code into a number of packages - a core essentia package, a package for docs, one for python bindings, and one for example files
      • I suggest we sign up for a PPA account on launchpad: https://launchpad.net/ubuntu/+ppas
      • kartikgupta0909
        and use flags for building each of them
      • yes we will have to do that
      • alastairp
        this way ubuntu will host the gaia packages for us. to install it we only need to add the ppa and apt-get install it
      • OK, cool. I think that's a good first milestone
      • do you want to keep talking about the next steps for building the dataset?
      • kartikgupta0909
        yes
      • a fair idea before the start of the project would help me plan it well
      • alastairp
        right
      • kartikgupta0909
        so is the GUI mockup fine?
      • is this how you want to give the option to the user?
      • zas
        alastairp: once you have working shell scripts, bitmap, Gentlecat, and I can help you write matching recipes for Chef. To start with, Chef is able to run shell scripts ;) Though Chef does much better when it comes to ensuring everything is fine, and to handling updates.
      • alastairp
        cool
      • Gentlecat
        one thing I noticed with chef is that it makes sure that the system is in the right state
      • alastairp
        kartikgupta0909: I've added a new component in jira for this project: http://tickets.musicbrainz.org/browse/AB/compon...
      • Gentlecat
        this can be hard to do with just shell scripts
      • zas
        Yes that's the main point
      • Gentlecat
        I guess that's kind of its thing, yeah
      • alastairp
        we should add tickets for at least each of your milestones. Perhaps we can add more detail if we need to
      • kartikgupta0909
        sure
      • so you want me to add the tickets right now or as and when we progress?
      • alastairp
        this way we can keep everyone in the MB community up to date with what we are doing
      • it doesn't need to be right now, but some time in the next week or so
      • kartikgupta0909
        sure, will do that.
      • alastairp
        we should have tickets for the things you will work on in the future too
      • not just the current item
      • kartikgupta0909
        cool
      • alastairp
        OK, back to the interface
      • kartikgupta0909
        yes
      • alastairp
        I think it's OK for now. The reason that I added the component in jira is so that we can talk about this kind of stuff
      • kartikgupta0909
        no problems.
      • alastairp
        another option is a checkbox [ ] I want to train this model on my own server
      • kartikgupta0909
        we can change it as and when we work on it.
      • Gentlecat
        alastairp: do you want to deploy API changes now or with the blog post?
      • alastairp
        Gentlecat: as long as the old API still exists, I don't mind
      • Gentlecat
        only submission part
      • alastairp
        the old /mbid/low-level is still there too, right?
      • GET, I mean
      • kartikgupta0909: do you have an idea about how you are going to implement this part in the code?
      • kartikgupta0909
        the API for getting the datasets?
      • alastairp
        we have a table for new jobs. should our external jobs be in a new table? or should we add a new column which says if they are internal or external?
      • kartikgupta0909
        i think a column would be fine
      • alastairp
        Gentlecat: ahhhh, I missed that. sorry
      • I thought that legacy had all existing methods
      • Gentlecat
        you think we should keep it for some time?
      • kartikgupta0909
        because everything else will be the same
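
As a rough illustration of the single-column approach being discussed, a hypothetical SQLAlchemy model might look like the sketch below; the table and column names are invented for the example and are not the actual AcousticBrainz schema.

```python
# Hypothetical sketch: one jobs table with a flag column instead of a
# separate table for external jobs. Names are illustrative only.
from sqlalchemy import Boolean, Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EvaluationJob(Base):
    __tablename__ = "dataset_eval_jobs"   # invented name

    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, nullable=False)
    status = Column(String, nullable=False, default="pending")
    # True when the job should be picked up by a user's own server
    # rather than the central AcousticBrainz worker.
    remote = Column(Boolean, nullable=False, default=False)
    created = Column(DateTime, server_default=func.now())
```
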
      • alastairp
        kartikgupta0909: and do you have a plan for how a remote server will get jobs? do you think that the script will download a specific job, or it will scan for all jobs submitted by the user?
      • Gentlecat: yes, definitely
      • Gentlecat
        now that I'm thinking about it, this might be a good idea
      • alastairp
        then we announce it, then give people 6 months or something to get off it
      • we can do something to make it less enticing (rate limiting?) so people notice that it doesn't work
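
One way the "less enticing" idea could look in practice is a crude per-client throttle plus a deprecation warning on the legacy endpoint. This is only a sketch under assumed names: the route, the limits, and fetch_low_level are placeholders, not the real AcousticBrainz code.

```python
# Sketch only: throttle the legacy endpoint and flag it as deprecated.
# Route, limits, and fetch_low_level are placeholders.
import time
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

_last_hit = {}        # naive per-IP throttle, illustration only
MIN_INTERVAL = 1.0    # at most one request per second per client

def fetch_low_level(mbid):
    return {"mbid": mbid}   # stand-in for the real database lookup

@app.route("/<mbid>/low-level")
def legacy_low_level(mbid):
    now = time.time()
    ip = request.remote_addr
    if now - _last_hit.get(ip, 0.0) < MIN_INTERVAL:
        abort(429)    # Too Many Requests: nudges clients to migrate
    _last_hit[ip] = now

    resp = jsonify(fetch_low_level(mbid))
    resp.headers["Warning"] = '299 - "deprecated endpoint, see the announcement"'
    return resp
```
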
      • Gentlecat
        ok, I'll add them back
      • alastairp
        one thing we don't have is webserver logs (zas?), so in fact I have no idea how many people get data from us
      • Gentlecat
        nah, if they don't watch the announcements it's their problem
      • alastairp
        that's not the right attitude to have in this case I think
      • zas
        alastairp: for acousticbrainz ?
      • Gentlecat
        especially if we are giving so much time to change
      • alastairp
        we should work out exactly how widely used it is
      • and base our decision on that
      • zas: yes
      • Gentlecat
        what decision?
      • alastairp
        how to remove the endpoint
      • zas
        We currently lack web stats, but I can do a rapid evaluation in hits/s
      • Gentlecat
        I still think that change is good. there aren't a lot of ways to do it without breaking something
      • and I don't think keeping dead endpoints around is a good idea
      • alastairp
        I'm not saying that I don't want to change it,
      • I'm saying that we should announce it, verify that people take up the announcement and change
      • and then work out what to do once we have more concrete data
      • zas: right. that's pretty much all I want
      • kartikgupta0909
        alastairp: if there can be multiple jobs to be evaluated by the same author, then first we send the list of jobs, and then the client asks for them one by one.
      • and runs jobs on them one by one
      • alastairp
        zas: how can we do that? client on spike which sends summary data to the main stats site? script on spike which generates stats/images itself?
      • kartikgupta0909: right. good idea. I was thinking the same
      • kartikgupta0909
        a dataset won't be downloaded before the previous one has been completely evaluated
      • alastairp
        I'd like to see this kind of detailed discussion in the tickets too.
      • kartikgupta0909
        sure.
      • alastairp
        when the client requests a dataset, will it also submit a status update?
      • zas
        alastairp: generate stats on spike, allow access through http api ?
      • alastairp
        so that the user can see that it's processing
      • zas: sure, that would be fine
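
A sketch of what "generate stats on spike" could mean in practice: a periodic script that counts hits per endpoint from the nginx access log and writes a JSON summary for an HTTP API (or the main stats site) to serve. The log path and line format are assumptions.

```python
# Sketch only: count hits per endpoint from an nginx access log and
# dump a JSON summary. Path and log format are assumptions.
import json
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"      # assumed location
ENDPOINT_RE = re.compile(r'"GET (/[^ ?"]*)')

def summarize(log_path=LOG_PATH):
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = ENDPOINT_RE.search(line)
            if match:
                hits[match.group(1)] += 1
    return dict(hits.most_common(20))

if __name__ == "__main__":
    with open("endpoint_stats.json", "w") as out:
        json.dump(summarize(), out, indent=2)
```
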
      • kartikgupta0909
        I think the server should handle this in the sense that when it receives the request for downloading a particular dataset, it automatically updates the status
      • because it will know that the client will now run the evaluation on that dataset
      • alastairp
        I'm not sure that's a good idea - because if the download fails, the dataset will now be "stuck"
      • it's probably a better idea for the client to send a status update once it knows that it has the whole dataset
      • kartikgupta0909
        oh, then maybe we can have the client send an acknowledgement
      • alastairp
        great
      • kartikgupta0909
        yes exactly
      • alastairp
        now, to actually get the dataset we need to plan some things
      • we decided that perhaps json is too large
      • kartikgupta0909
        yes
      • protobuf is an option
      • alastairp
        yes. I want to keep our options open here, because we have another requirement for smaller storage formats for another part of acousticbrainz: http://tickets.musicbrainz.org/browse/AB-101
      • another problem that we have is that the process of extracting lowlevel data from the database into files is quite slow
      • kartikgupta0909
        what exactly is frame-level data?
      • I am working on a paper which might be useful here
      • alastairp
        so if a client contacts the server and says "give me all of the files for an evaluation", this could take quite some time to get them
      • perhaps we don't want to leave a connection open for so long
      • do you know what an audio frame is?
      • kartikgupta0909
        yes
      • alastairp
        right
      • so the lowlevel data that we currently have in acousticbrainz only measures the mean and variance of particular values over the entire song
      • kartikgupta0909
        yeah
      • zas
        alastairp: I will upgrade a bunch of packages on spike; postgresql may need to be restarted, as well as nginx
      • alastairp
        we calculate this value for every frame in the audio, and then calculate these statistics
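
To make the mean/variance point concrete, here is a tiny numpy illustration of reducing a per-frame descriptor to the two summary statistics that the current low-level documents keep; the numbers are made up.

```python
# Illustration: reduce one per-frame descriptor to the mean and
# variance that the current low-level data keeps. Values are fake.
import numpy as np

def frame_statistics(frame_values):
    frames = np.asarray(frame_values, dtype=float)
    return {"mean": float(frames.mean()), "var": float(frames.var())}

# A few minutes of audio gives thousands of frame values per
# descriptor, but only these two numbers are stored per descriptor.
print(frame_statistics(np.random.rand(15000)))
```
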
      • zas: ok
      • kartikgupta0909
        yes, are we storing the details for each frame?
      • alastairp
        not yet
      • kartikgupta0909
        or just storing these calculated stats
      • alastairp
        but by the end of the summer we want to also store all these details
      • unfortunately, it's too big to fit in postgres, so we will have to store it on disk
      • kartikgupta0909
        oh, that is definitely going to be a problem then
      • alastairp
        but we don't know what format we want to use
      • protobufs is definitely something I want to try here
      • my point is, that we should only have 1 compact format in AB, even if we use it in different places
      • kartikgupta0909
        protobuf wouldn't help much here, because it helps when keys are repeated many times or there are a large number of keys
      • alastairp
        well, there will be lots of repeated keys
      • because we have 9 statistics for each frame
      • the other place it will help is that it has a floating-point type
      • we currently store floating point numbers in json as strings
      • which is very inefficient
      • anyway, this is getting a bit off topic
      • kartikgupta0909
        yes it has a lot of keys
      • alastairp
        for now let's assume we will have some kind of binary transmission format
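
The size argument above can be seen with a quick comparison: the same list of per-frame floats encoded as strings in a JSON document versus packed as raw float32 (a stand-in for protobuf or any other binary format).

```python
# Rough size comparison: floats-as-strings in JSON versus a packed
# binary array. The binary form stands in for protobuf or similar.
import json
import struct

values = [0.123456789 * i for i in range(10000)]          # fake frame data

as_json = json.dumps({"mfcc": [str(v) for v in values]})  # strings in JSON
as_binary = struct.pack(f"<{len(values)}f", *values)      # 4 bytes per value

print(len(as_json.encode()), "bytes as JSON strings")
print(len(as_binary), "bytes packed as float32")
# Before any general-purpose compression, the packed form is typically
# several times smaller.
```
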
      • kartikgupta0909
        any idea how Echo Nest did it?
      • alastairp
        they served json
      • I don't know if they have any other internal representation
      • kartikgupta0909
        but did they store it as json?
      • oh okay
      • alastairp
        I have some ideas for how we could do this client server talk
      • kartikgupta0909
        yeah sure
      • alastairp
        -client sends its username and API key and asks for any jobs to be done
      • -server returns a list
      • kartikgupta0909
        yes
      • alastairp
        -client asks for the first job
      • kartikgupta0909
        okay
      • alastairp
        -server says OK, and starts a background job (on the server) to collect all the data that the client needs [we can use celery here - http://www.celeryproject.org/]
      • -client keeps polling the server asking if its data is ready
      • -when the data is ready, server says yes, and client downloads it
      • kartikgupta0909
        seems reasonable, and recently I did a project like this
      • so something like a heartbeat
      • this seems okay
      • alastairp
        -client sends an update to the server to say it has it - server can now delete this temporary data (if the client doesn't send an update within x hours, we assume it's abandoned and delete it to save space)
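
Put together, the exchange above could look roughly like the client sketch below. Every endpoint path, parameter name, and response shape here is an assumption made for illustration, not the real AcousticBrainz API.

```python
# Hypothetical client for the polling flow sketched above.
import time
import requests

SERVER = "https://acousticbrainz.example"    # placeholder host
AUTH = {"user": "someuser", "api_key": "secret"}

def run_one_job():
    # 1. ask for pending jobs for this user
    jobs = requests.get(f"{SERVER}/api/jobs", params=AUTH).json()
    if not jobs:
        return
    job_id = jobs[0]["id"]

    # 2. ask the server to start packaging the dataset in the background
    requests.post(f"{SERVER}/api/jobs/{job_id}/prepare", params=AUTH)

    # 3. poll ("heartbeat") until the package is ready
    while True:
        status = requests.get(f"{SERVER}/api/jobs/{job_id}/status",
                              params=AUTH).json()
        if status["state"] == "ready":
            break
        time.sleep(30)

    # 4. download the package, then acknowledge receipt so the server
    #    can delete its temporary copy
    data = requests.get(f"{SERVER}/api/jobs/{job_id}/download",
                        params=AUTH).content
    with open(f"job_{job_id}.pkg", "wb") as f:
        f.write(data)
    requests.post(f"{SERVER}/api/jobs/{job_id}/ack", params=AUTH)
```
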
      • kartikgupta0909
        from memory, right?
      • and not the disk
      • alastairp
        the server should probably store the data package on disk
      • it could get very big
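
On the server side, the celery task mentioned above might look roughly like this. The export helper, status update, package directory, and broker URL are all stand-ins for whatever AcousticBrainz ends up with.

```python
# Hypothetical server-side celery tasks: package a dataset to disk,
# and clean up packages the client never acknowledged.
import os
import time
from celery import Celery

app = Celery("ab_eval", broker="redis://localhost:6379/0")  # assumed broker
PACKAGE_DIR = "/var/lib/acousticbrainz/job-packages"        # assumed path

def dump_dataset_to_file(job_id, path):
    with open(path, "wb") as f:
        f.write(b"")        # stand-in for the real export code

def mark_job_ready(job_id):
    pass                    # stand-in for the real status update

@app.task
def prepare_job_package(job_id):
    path = os.path.join(PACKAGE_DIR, "%s.pkg" % job_id)
    dump_dataset_to_file(job_id, path)
    mark_job_ready(job_id)

@app.task
def cleanup_stale_packages(max_age_hours=24):
    # packages not acknowledged within x hours are assumed abandoned
    cutoff = time.time() - max_age_hours * 3600
    for name in os.listdir(PACKAGE_DIR):
        path = os.path.join(PACKAGE_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
```
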