see if you can understand the build code for essentia - for example, we split the code into a number of packages - a core essentia package, a package for docs, one for python bindings, and one for example files
this way ubuntu will host the gaia packages for us. to install it we only need to add the ppa and apt-get install it
OK, cool. I think that's a good first milestone
do you want to keep talking about the next steps for building the dataset?
kartikgupta0909
yes
a fair idea before the start of the project would help me plan it well
alastairp
right
kartikgupta0909
so is the GUI mockup fine?
is this how you want to give the option to the user?
zas
alastairp: once you have working shell scripts, bitmap, Gentlecat, and I can help you write matching recipes for Chef. To start with, Chef is able to run shell scripts ;) Though Chef does much better when it comes to ensuring everything is fine, and to handling updates.
alastairp
cool
Gentlecat
one thing I noticed with chef is that it makes sure that the system is in the right state
kartikgupta0909: and do you have a plan for how a remote server will get jobs? do you think that the script will download a specific job, or it will scan for all jobs submitted by the user?
Gentlecat: yes, definitely
Gentlecat
now that I'm thinking about it, this might be a good idea
alastairp
then we announce it, then give people 6 months or something to get off it
we can do something to make it less enticing (rate limiting?) so people notice that it doesn't work
Gentlecat
ok, I'll add them back
alastairp
one thing we don't have is webserver logs (zas?), so in fact I have no idea how many people get data from us
Gentlecat
nah, if they don't watch the announcements it's their problem
alastairp
that's not the right attitude to have in this case I think
zas
alastairp: for acousticbrainz ?
Gentlecat
especially if we are giving so much time to change
alastairp
we should work out exactly how widely used it is
and base our decision on that
zas: yes
Gentlecat
what decision?
alastairp
how to remove the endpoint
zas
We currently lack web stats, but I can do a rapid evaluation in hits/s
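Absent aggregated web stats, a rough hits-per-endpoint count can be pulled straight from the access logs. A minimal sketch, assuming an nginx-style combined log format (the sample lines and the endpoint path are invented for illustration):

```python
import re
from collections import Counter

# Pull the request path out of a combined-format access log line,
# e.g. ... "GET /some/path HTTP/1.1" ...
LINE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[0-9.]+"')

def hits_by_path(lines):
    """Count requests per path, ignoring query strings."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m:
            counts[m.group("path").split("?")[0]] += 1
    return counts

# Invented sample lines standing in for a real access log.
sample = [
    '1.2.3.4 - - [01/Jan/2016:00:00:00 +0000] "GET /api/v1/low-level HTTP/1.1" 200 123',
    '1.2.3.4 - - [01/Jan/2016:00:00:01 +0000] "GET /api/v1/low-level HTTP/1.1" 200 123',
    '5.6.7.8 - - [01/Jan/2016:00:00:02 +0000] "GET /other HTTP/1.1" 200 50',
]
print(hits_by_path(sample))  # Counter({'/api/v1/low-level': 2, '/other': 1})
```

Dividing such counts by the time window covered by the log gives the rough hits/s figure mentioned above.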
Gentlecat
I still think that change is good. there aren't a lot of ways to do it without breaking something
and I don't think keeping dead endpoints around is a good idea
alastairp
I'm not saying that I don't want to change it,
I'm saying that we should announce it, verify that people take up the announcement and change
and then work out what to do once we have more concrete data
zas: right. that's pretty much all I want
kartikgupta0909
alastairp: if there can be multiple jobs to be evaluated by the same author, then first we send the list of jobs, and then the client asks for them one by one.
and run jobs on them one by one
alastairp
zas: how can we do that? client on spike which sends summary data to the main stats site? script on spike which generates stats/images itself?
kartikgupta0909: right. good idea. I was thinking the same
kartikgupta0909
a dataset won't be downloaded before the previous one has been completely evaluated
alastairp
I'd like to see this kind of detailed discussion in the tickets too.
kartikgupta0909
sure.
alastairp
when the client requests a dataset, will it also submit a status update?
zas
alastairp: generate stats on spike, allow access through http api ?
alastairp
so that the user can see that it's processing
zas: sure, that would be fine
kartikgupta0909
I think the server should handle this, in the sense that when it receives the request to download a particular dataset, it automatically updates the status
because it will know that client will now run evaluation on that dataset
alastairp
I'm not sure that's a good idea - because if the download fails, the dataset will now be "stuck"
it's probably a better idea for the client to send an update status once it knows that it has the whole dataset
kartikgupta0909
oh, then maybe the client can send an acknowledgement
alastairp
great
kartikgupta0909
yes exactly
alastairp
now, to actually get the dataset we need to plan some things
we decided that perhaps json is too large
kartikgupta0909
yes
protobuf is an option
alastairp
yes. I want to keep our options open here, because we have another requirement for smaller storage formats for another part of acousticbrainz: http://tickets.musicbrainz.org/browse/AB-101
another problem that we have is that the process of extracting lowlevel data from the database into files is quite slow
kartikgupta0909
what exactly is frame-level data?
I am working on a paper which might be useful here
alastairp
so if a client contacts the server and says "give me all of the files for an evaluation", this could take quite some time to get them
perhaps we don't want to leave a connection open for so long
do you know what an audio frame is?
kartikgupta0909
yes
alastairp
right
so the lowlevel data that we currently have in acousticbrainz only measures the mean and variance of particular values over the entire song
kartikgupta0909
yeah
zas
alastairp: i will upgrade a bunch of packages on spike, postgresql may need to be restarted, as well as nginx
alastairp
we calculate this value for every frame in the audio, and then calculate these statistics
zas: ok
kartikgupta0909
yes, are we storing the details for each frame?
alastairp
not yet
kartikgupta0909
or just storing these calculated stats
alastairp
but by the end of the summer we want to also store all these details
unfortunately, it's too big to fit in postgres, so we will have to store it on disk
kartikgupta0909
oh, that is definitely going to be a problem then
alastairp
but we don't know what format we want to use
protobufs is definitely something I want to try here
my point is, that we should only have 1 compact format in AB, even if we use it in different places
kartikgupta0909
protobuf wouldn't help much here, because it helps when keys are repeated a lot of times or when there are a large number of keys
alastairp
well, there will be lots of repeated keys
because we have 9 statistics for each frame
the other place it will help is that it has a floating-point type
we currently store floating point numbers in json as strings
which is very inefficient
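The size argument here can be sketched in-process. A rough comparison, with invented stat names, between per-frame JSON (keys repeated in every frame) and a fixed binary layout of nine doubles per frame, which is similar in spirit to what protobuf or any packed format buys:

```python
import json
import struct

# Hypothetical frame data: 9 statistics per audio frame.  The stat
# names are invented; the real AcousticBrainz keys differ.
frames = [{"stat%d" % i: 0.123456789 * i for i in range(9)} for _ in range(1000)]

# JSON repeats every key name in every frame.
as_json = json.dumps(frames).encode("utf-8")

# Binary alternative: 9 little-endian doubles per frame, keys stored
# once in a schema instead of once per frame.
as_binary = b"".join(
    struct.pack("<9d", *(f["stat%d" % i] for i in range(9))) for f in frames
)

print(len(as_json), len(as_binary))  # binary: exactly 1000 * 9 * 8 = 72000 bytes
```

The gap widens further if the JSON encodes floats as strings, as described above.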
anyway, this is getting a bit off topic
kartikgupta0909
yes it has a lot of keys
alastairp
for now let's assume we will have some kind of binary transmission format
kartikgupta0909
any idea how echo nest did it?
alastairp
they served json
I don't know if they have any other internal representation
kartikgupta0909
but did they store it as json?
oh okay
alastairp
I have some ideas for how we could do this client server talk
kartikgupta0909
yeah sure
alastairp
-client tells its username and api key and asks for any jobs to be done
-server returns a list
kartikgupta0909
yes
alastairp
-client asks for the first job
kartikgupta0909
okay
alastairp
-server says OK, and starts a background job (on the server) to collect all the data that the client needs [we can use celery here - http://www.celeryproject.org/]
-client keeps polling the server asking if its data is ready
-when the data is ready, server says yes, and client downloads it
kartikgupta0909
seems reasonable, and I recently did a project like this
so something like a heartbeat
this seems okay
alastairp
-client sends an update to the server to say it has it - server can now delete this temporary data (if the client doesn't send an update within x hours, we assume it's abandoned and delete it to save space)
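The request/poll/download/acknowledge flow above can be sketched as an in-process simulation. All class names, method names, and states here are invented for illustration; in the real design the server side would be HTTP endpoints with the packaging step handed off to celery:

```python
import time

class Server:
    """Toy stand-in for the dataset server."""

    def __init__(self, jobs):
        self.jobs = jobs       # job_id -> dataset payload
        self.prepared = {}     # job_id -> payload packaged and ready

    def list_jobs(self, api_key):
        # Client identifies itself and asks for its pending jobs.
        return list(self.jobs)

    def request_job(self, job_id):
        # Kick off packaging (a celery background task in the real
        # design); here it "finishes" immediately.
        self.prepared[job_id] = self.jobs[job_id]

    def poll(self, job_id):
        return job_id in self.prepared

    def download(self, job_id):
        return self.prepared[job_id]

    def acknowledge(self, job_id):
        # Client confirmed it has the whole dataset, so the server can
        # delete its temporary copy (or expire it after x hours).
        del self.prepared[job_id]

def run_client(server, api_key):
    results = []
    for job_id in server.list_jobs(api_key):
        server.request_job(job_id)
        while not server.poll(job_id):   # heartbeat-style polling
            time.sleep(0.01)
        data = server.download(job_id)
        server.acknowledge(job_id)
        results.append((job_id, data))   # evaluate before taking the next job
    return results

server = Server({"job-1": b"dataset-1", "job-2": b"dataset-2"})
print(run_client(server, api_key="not-a-real-key"))
```

Running the client drains `server.prepared`, mirroring the rule that acknowledged (or abandoned) temporary data gets deleted.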
kartikgupta0909
from memory, right?
and not from disk
alastairp
the server should probably store the data package on disk