#metabrainz

      • alastairp
        see if you can understand the build code for essentia - for example, we split the code into a number of packages - a core essentia package, a package for docs, one for python bindings, and one for example files
      • I suggest we sign up for a PPA account on launchpad: https://launchpad.net/ubuntu/+ppas
      • kartikgupta0909
        and use flags for building each of them
      • yes we will have to do that
      • alastairp
        this way ubuntu will host the gaia packages for us. to install it we only need to add the ppa and apt-get install it
      • OK, cool. I think that's a good first milestone
      • do you want to keep talking about the next steps for building the dataset?
      • kartikgupta0909
        yes
      • a fair idea before the start of the project would help me plan it well
      • alastairp
        right
      • kartikgupta0909
        so is the GUI mockup fine?
      • is this how you want to give the option to the user?
      • zas
        alastairp: once you have working shell scripts, bitmap, Gentlecat, and I can help you write matching recipes for Chef. To start with, Chef is able to run shell scripts ;) Though Chef does much better when it comes to ensuring everything is fine, and to handling updates.
      • alastairp
        cool
      • Gentlecat
        one thing I noticed with chef is that it makes sure that the system is in the right state
      • alastairp
        kartikgupta0909: I've added a new component in jira for this project: http://tickets.musicbrainz.org/browse/AB/compon...
      • Gentlecat
        this can be hard to do with just shell scripts
      • zas
        Yes that's the main point
      • Gentlecat
        I guess that's kind of its thing, yeah
      • alastairp
        we should add tickets for at least each of your milestones. Perhaps we can add more detail if we need to
      • kartikgupta0909
        sure
      • so you want me to add the tickets right now or as and when we progress?
      • alastairp
        this way we can keep everyone in the MB community up to date with what we are doing
      • it doesn't need to be right now, but some time in the next week or so
      • kartikgupta0909
        sure, will do that.
      • alastairp
        we should have tickets for the things you will work on in the future too
      • not just the current item
      • kartikgupta0909
        cool
      • alastairp
        OK, back to the interface
      • kartikgupta0909
        yes
      • alastairp
        I think it's OK for now. The reason that I added the component in jira is so that we can talk about this kind of stuff
      • kartikgupta0909
        no problems.
      • alastairp
        another option is a checkbox [ ] I want to train this model on my own server
      • kartikgupta0909
        we can change it as and when we work on it.
      • Gentlecat
        alastairp: do you want to deploy API changes now or with the blog post?
      • alastairp
        Gentlecat: as long as the old API still exists, I don't mind
      • Gentlecat
        only submission part
      • alastairp
        the old /mbid/low-level is still there too, right?
      • GET, I mean
      • kartikgupta0909: do you have an idea about how you are going to implement this part in the code?
      • kartikgupta0909
        the API for getting the datasets?
      • alastairp
        we have a table for new jobs. should our external jobs be in a new table? or should we add a new column which says if they are internal or external?
      • kartikgupta0909
        i think a column would be fine
      • alastairp
        Gentlecat: ahhhh, I missed that. sorry
      • I thought that legacy had all existing methods
      • Gentlecat
        you think we should keep it for some time?
      • kartikgupta0909
        because everything else will be the same
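
As a rough illustration of the single-column approach being discussed, a hypothetical SQLAlchemy model might look like the sketch below; the table and column names are invented for the example and are not the actual AcousticBrainz schema.

```python
# Hypothetical sketch: one jobs table with a flag column instead of a
# separate table for external jobs. Names are illustrative only.
from sqlalchemy import Boolean, Column, DateTime, Integer, String, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EvaluationJob(Base):
    __tablename__ = "dataset_eval_jobs"   # invented name

    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, nullable=False)
    status = Column(String, nullable=False, default="pending")
    # True when the job should be picked up by a user's own server
    # rather than the central AcousticBrainz worker.
    remote = Column(Boolean, nullable=False, default=False)
    created = Column(DateTime, server_default=func.now())
```
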
      • alastairp
        kartikgupta0909: and do you have a plan for how a remote server will get jobs? do you think that the script will download a specific job, or it will scan for all jobs submitted by the user?
      • Gentlecat: yes, definitely
      • Gentlecat
        now that I'm thinking about it, this might be a good idea
      • alastairp
        then we announce it, then give people 6 months or something to get off it
      • we can do something to make it less enticing (rate limiting?) so people notice that it doesn't work
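
One way the "less enticing" idea could look in practice is a crude per-client throttle plus a deprecation warning on the legacy endpoint. This is only a sketch under assumed names: the route, the limits, and fetch_low_level are placeholders, not the real AcousticBrainz code.

```python
# Sketch only: throttle the legacy endpoint and flag it as deprecated.
# Route, limits, and fetch_low_level are placeholders.
import time
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

_last_hit = {}        # naive per-IP throttle, illustration only
MIN_INTERVAL = 1.0    # at most one request per second per client

def fetch_low_level(mbid):
    return {"mbid": mbid}   # stand-in for the real database lookup

@app.route("/<mbid>/low-level")
def legacy_low_level(mbid):
    now = time.time()
    ip = request.remote_addr
    if now - _last_hit.get(ip, 0.0) < MIN_INTERVAL:
        abort(429)    # Too Many Requests: nudges clients to migrate
    _last_hit[ip] = now

    resp = jsonify(fetch_low_level(mbid))
    resp.headers["Warning"] = '299 - "deprecated endpoint, see the announcement"'
    return resp
```
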
      • Gentlecat
        ok, I'll add them back
      • alastairp
        one thing we don't have is webserver logs (zas?), so in fact I have no idea how many people get data from us
      • Gentlecat
        nah, if they don't watch the announcements it's their problem
      • alastairp
        that's not the right attitude to have in this case I think
      • zas
        alastairp: for acousticbrainz ?
      • Gentlecat
        especially if we are giving so much time to change
      • alastairp
        we should work out exactly how widely used it is
      • and base our decision on that
      • zas: yes
      • Gentlecat
        what decision?
      • alastairp
        how to remove the endpoint
      • zas
        We currently lack web stats, but I can do a rapid evaluation in hits/s
      • Gentlecat
        I still think that change is good. there aren't a lot of ways to do it without breaking something
      • and I don't think keeping dead endpoints around is a good idea
      • alastairp
        I'm not saying that I don't want to change it,
      • I'm saying that we should announce it, verify that people take up the announcement and change
      • and then work out what to do once we have more concrete data
      • zas: right. that's pretty much all I want
      • kartikgupta0909
        alastairp: if there can be multiple jobs to be evaluated by the same author, then first we send the list of jobs, and then the client asks for them one by one.
      • and runs jobs on them one by one
      • alastairp
        zas: how can we do that? client on spike which sends summary data to the main stats site? script on spike which generates stats/images itself?
      • kartikgupta0909: right. good idea. I was thinking the same
      • kartikgupta0909
        a dataset won't be downloaded before the previous one has been completely evaluated
      • alastairp
        I'd like to see this kind of detailed discussion in the tickets too.
      • kartikgupta0909
        sure.
      • alastairp
        when the client requests a dataset, will it also submit a status update?
      • zas
        alastairp: generate stats on spike, allow access through http api ?
      • alastairp
        so that the user can see that it's processing
      • zas: sure, that would be fine
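
A sketch of what "generate stats on spike" could mean in practice: a periodic script that counts hits per endpoint from the nginx access log and writes a JSON summary for an HTTP API (or the main stats site) to serve. The log path and line format are assumptions.

```python
# Sketch only: count hits per endpoint from an nginx access log and
# dump a JSON summary. Path and log format are assumptions.
import json
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"      # assumed location
ENDPOINT_RE = re.compile(r'"GET (/[^ ?"]*)')

def summarize(log_path=LOG_PATH):
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = ENDPOINT_RE.search(line)
            if match:
                hits[match.group(1)] += 1
    return dict(hits.most_common(20))

if __name__ == "__main__":
    with open("endpoint_stats.json", "w") as out:
        json.dump(summarize(), out, indent=2)
```
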
      • kartikgupta0909
        I think the server should handle this in the sense that when it receives the request for downloading a particular dataset, it automatically updates the status
      • because it will know that the client will now run the evaluation on that dataset
      • alastairp
        I'm not sure that's a good idea - because if the download fails, the dataset will now be "stuck"
      • it's probably a better idea for the client to send a status update once it knows that it has the whole dataset
      • kartikgupta0909
        oh, then maybe we can have the client send an acknowledgement
      • alastairp
        great
      • kartikgupta0909
        yes exactly
      • alastairp
        now, to actually get the dataset we need to plan some things
      • we decided that perhaps json is too large
      • kartikgupta0909
        yes
      • protobuf is an option
      • alastairp
        yes. I want to keep our options open here, because we have another requirement for smaller storage formats for another part of acousticbrainz: http://tickets.musicbrainz.org/browse/AB-101
      • another problem that we have is that the process of extracting lowlevel data from the database into files is quite slow
      • kartikgupta0909
        what exactly is frame-level data?
      • I am working on a paper which might be useful here
      • alastairp
        so if a client contacts the server and says "give me all of the files for an evaluation", this could take quite some time to get them
      • perhaps we don't want to leave a connection open for so long
      • do you know what an audio frame is?
      • kartikgupta0909
        yes
      • alastairp
        right
      • so the lowlevel data that we currently have in acousticbrainz only measures the mean and variance of particular values over the entire song
      • kartikgupta0909
        yeah
      • zas
        alastairp: I will upgrade a bunch of packages on spike; postgresql may need to be restarted, as well as nginx
      • alastairp
        we calculate this value for every frame in the audio, and then calculate these statistics
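
To make the mean/variance point concrete, here is a tiny numpy illustration of reducing a per-frame descriptor to the two summary statistics that the current low-level documents keep; the numbers are made up.

```python
# Illustration: reduce one per-frame descriptor to the mean and
# variance that the current low-level data keeps. Values are fake.
import numpy as np

def frame_statistics(frame_values):
    frames = np.asarray(frame_values, dtype=float)
    return {"mean": float(frames.mean()), "var": float(frames.var())}

# A few minutes of audio gives thousands of frame values per
# descriptor, but only these two numbers are stored per descriptor.
print(frame_statistics(np.random.rand(15000)))
```
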
      • zas: ok
      • kartikgupta0909
        yes, are we storing the details for each frame?
      • alastairp
        not yet
      • kartikgupta0909
        or just storing these calculated stats
      • alastairp
        but by the end of the summer we want to also store all these details
      • unfortunately, it's too big to fit in postgres, so we will have to store it on disk
      • kartikgupta0909
        oh, that is definitely going to be a problem then
      • alastairp
        but we don't know what format we want to use
      • protobufs is definitely something I want to try here
      • my point is, that we should only have 1 compact format in AB, even if we use it in different places
      • kartikgupta0909
        protobuf wouldn't help much here, because it helps when keys are repeated many times or there are a large number of keys
      • alastairp
        well, there will be lots of repeated keys
      • because we have 9 statistics for each frame
      • the other place it will help is that it has a floating-point type
      • we currently store floating point numbers in json as strings
      • which is very inefficient
      • anyway, this is getting a bit off topic
      • kartikgupta0909
        yes it has a lot of keys
      • alastairp
        for now let's assume we will have some kind of binary transmission format
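
The size argument above can be seen with a quick comparison: the same list of per-frame floats encoded as strings in a JSON document versus packed as raw float32 (a stand-in for protobuf or any other binary format).

```python
# Rough size comparison: floats-as-strings in JSON versus a packed
# binary array. The binary form stands in for protobuf or similar.
import json
import struct

values = [0.123456789 * i for i in range(10000)]          # fake frame data

as_json = json.dumps({"mfcc": [str(v) for v in values]})  # strings in JSON
as_binary = struct.pack(f"<{len(values)}f", *values)      # 4 bytes per value

print(len(as_json.encode()), "bytes as JSON strings")
print(len(as_binary), "bytes packed as float32")
# Before any general-purpose compression, the packed form is typically
# several times smaller.
```
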
      • kartikgupta0909
        any idea how Echo Nest did it?
      • alastairp
        they served json
      • I don't know if they have any other internal representation
      • kartikgupta0909
        but did they store it as json?
      • oh okay
      • alastairp
        I have some ideas for how we could do this client server talk
      • kartikgupta0909
        yeah sure
      • alastairp
        -client sends its username and API key and asks for any jobs to be done
      • -server returns a list
      • kartikgupta0909
        yes
      • alastairp
        -client asks for the first job
      • kartikgupta0909
        okay
      • alastairp
        -server says OK, and starts a background job (on the server) to collect all the data that the client needs [we can use celery here - http://www.celeryproject.org/]
      • -client keeps polling the server asking if its data is ready
      • -when the data is ready, server says yes, and client downloads it
      • kartikgupta0909
        seems reasonable, and recently I did a project like this
      • so something like a heartbeat
      • this seems okay
      • alastairp
        -client sends an update to the server to say it has it - server can now delete this temporary data (if the client doesn't send an update within x hours, we assume it's abandoned and delete it to save space)
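
Put together, the exchange above could look roughly like the client sketch below. Every endpoint path, parameter name, and response shape here is an assumption made for illustration, not the real AcousticBrainz API.

```python
# Hypothetical client for the polling flow sketched above.
import time
import requests

SERVER = "https://acousticbrainz.example"    # placeholder host
AUTH = {"user": "someuser", "api_key": "secret"}

def run_one_job():
    # 1. ask for pending jobs for this user
    jobs = requests.get(f"{SERVER}/api/jobs", params=AUTH).json()
    if not jobs:
        return
    job_id = jobs[0]["id"]

    # 2. ask the server to start packaging the dataset in the background
    requests.post(f"{SERVER}/api/jobs/{job_id}/prepare", params=AUTH)

    # 3. poll ("heartbeat") until the package is ready
    while True:
        status = requests.get(f"{SERVER}/api/jobs/{job_id}/status",
                              params=AUTH).json()
        if status["state"] == "ready":
            break
        time.sleep(30)

    # 4. download the package, then acknowledge receipt so the server
    #    can delete its temporary copy
    data = requests.get(f"{SERVER}/api/jobs/{job_id}/download",
                        params=AUTH).content
    with open(f"job_{job_id}.pkg", "wb") as f:
        f.write(data)
    requests.post(f"{SERVER}/api/jobs/{job_id}/ack", params=AUTH)
```
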
      • kartikgupta0909
        from memory, right?
      • and not the disk
      • alastairp
        the server should probably store the data package on disk
      • it could get very big
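
On the server side, the celery task mentioned above might look roughly like this. The export helper, status update, package directory, and broker URL are all stand-ins for whatever AcousticBrainz ends up with.

```python
# Hypothetical server-side celery tasks: package a dataset to disk,
# and clean up packages the client never acknowledged.
import os
import time
from celery import Celery

app = Celery("ab_eval", broker="redis://localhost:6379/0")  # assumed broker
PACKAGE_DIR = "/var/lib/acousticbrainz/job-packages"        # assumed path

def dump_dataset_to_file(job_id, path):
    with open(path, "wb") as f:
        f.write(b"")        # stand-in for the real export code

def mark_job_ready(job_id):
    pass                    # stand-in for the real status update

@app.task
def prepare_job_package(job_id):
    path = os.path.join(PACKAGE_DIR, "%s.pkg" % job_id)
    dump_dataset_to_file(job_id, path)
    mark_job_ready(job_id)

@app.task
def cleanup_stale_packages(max_age_hours=24):
    # packages not acknowledged within x hours are assumed abandoned
    cutoff = time.time() - max_age_hours * 3600
    for name in os.listdir(PACKAGE_DIR):
        path = os.path.join(PACKAGE_DIR, name)
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
```
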