alastairp: hi! I'm finally done with my midterms. Regarding my project, ruaok said you had a BigQuery guy over and it seems it might be super easy to switch to bigquery
2016-05-06 12703, 2016
cetko
you said you've already migrated some of the data but it's taking up a lot of memory to run a query
2016-05-06 12710, 2016
cetko
can you elaborate on that?
2016-05-06 12716, 2016
cetko
where should I start?
2016-05-06 12709, 2016
Major_Lurker joined the channel
2016-05-06 12710, 2016
MajorLurker has quit
2016-05-06 12753, 2016
Freso
cetko: Congrats on being done :)
2016-05-06 12728, 2016
alastairp
cetko: hi
2016-05-06 12733, 2016
alastairp
cool, let's talk
2016-05-06 12750, 2016
alastairp
can you give me an email address? I'll send you what we have
2016-05-06 12701, 2016
alastairp
or pm me
2016-05-06 12731, 2016
LordSputnik joined the channel
2016-05-06 12754, 2016
LordSputnik is now known as Guest82506
2016-05-06 12732, 2016
cetko
alastairp: pm!
2016-05-06 12756, 2016
cetko
thanks!
2016-05-06 12700, 2016
alastairp
got it. I forwarded you the email thread that we used when we were playing with bigquery
2016-05-06 12720, 2016
alastairp
it's quite easy to blindly add data to bigquery
2016-05-06 12736, 2016
alastairp
I made a file with one JSON document per line
2016-05-06 12743, 2016
alastairp
you upload them to google cloud storage
2016-05-06 12703, 2016
alastairp
and from there you can import them into bigquery
2016-05-06 12711, 2016
alastairp
we did this with Felipe, and it works
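A minimal sketch of that load path, assuming the google-cloud-bigquery Python client and hypothetical project/bucket/table names (none of these names come from the chat):

```python
from google.cloud import bigquery

# The input file is newline-delimited JSON (one low-level document per line)
# already uploaded to Google Cloud Storage. All names here are placeholders.
client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # fine for experimenting; a proper schema comes later
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/acousticbrainz-lowlevel.json",
    "my-project.acousticbrainz.lowlevel",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(client.get_table("my-project.acousticbrainz.lowlevel").num_rows)
```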
2016-05-06 12739, 2016
alastairp
you can see some sample queries in the email I sent
(if you use the redash demo you actually count against their monthly quota, if you go against the bigquery console you go against yours)
2016-05-06 12708, 2016
alastairp
for us, the easiest way to play with the data was to import it as strings representing the json data. this is quick, but it means you have to parse the *whole* document to get an item out of it
2016-05-06 12721, 2016
alastairp
and you use up your quota much more quickly. (see the email)
2016-05-06 12722, 2016
alastairp
so the first thing we should do is develop a schema which represents our lowlevel documents; then we can query just a specific field, and the data usage should be pretty small
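A rough illustration of that idea using the Python BigQuery client; the field names below are just a few picked from the AB low-level documents and are not a proposed final schema:

```python
from google.cloud import bigquery

# A handful of nested low-level fields expressed as real BigQuery columns,
# so a query can read a single column instead of parsing a whole JSON string.
schema = [
    bigquery.SchemaField("mbid", "STRING", mode="REQUIRED"),
    bigquery.SchemaField(
        "lowlevel", "RECORD",
        fields=[
            bigquery.SchemaField("average_loudness", "FLOAT"),
            bigquery.SchemaField("dynamic_complexity", "FLOAT"),
        ],
    ),
    bigquery.SchemaField(
        "rhythm", "RECORD",
        fields=[bigquery.SchemaField("bpm", "FLOAT")],
    ),
]

client = bigquery.Client(project="my-project")  # hypothetical project
client.create_table(bigquery.Table("my-project.acousticbrainz.lowlevel", schema=schema))
```

A query that selects only lowlevel.average_loudness then scans just that column, which is what keeps the bytes billed (and the quota usage) small.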
2016-05-06 12738, 2016
alastairp
so I see our plan as being something like this
2016-05-06 12745, 2016
alastairp
1. get familiar with all the tools
2016-05-06 12758, 2016
alastairp
2. work with Felipe (from google) on developing a schema
2016-05-06 12734, 2016
alastairp
3. load the current data into a metabrainz/acousticbrainz database (as opposed to Felipe's private one as it is currently)
2016-05-06 12758, 2016
alastairp
4. install redash so that we can do cool queries and graphs
2016-05-06 12723, 2016
alastairp
5. work out how to upload new items to BQ as they come in to AB (see the streaming sketch after this list)
2016-05-06 12753, 2016
alastairp
6. optionally use redash to make some neat graphs that we host on the AB website
2016-05-06 12754, 2016
alastairp
7. ???
2016-05-06 12732, 2016
Gentlecat
8. profit
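For step 5, one possible shape is a streaming insert from the AB submission path, sketched here with the Python BigQuery client; the table name and function are hypothetical, and real error handling would depend on how AB queues submissions:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials are configured on the AB server
TABLE_ID = "my-project.acousticbrainz.lowlevel"  # placeholder table name

def push_to_bigquery(mbid, lowlevel_doc):
    """Stream one freshly submitted low-level document into BigQuery."""
    row = dict(lowlevel_doc, mbid=mbid)
    errors = client.insert_rows_json(TABLE_ID, [row])
    if errors:
        # keep the submission queued locally and retry later rather than drop it
        raise RuntimeError("BigQuery streaming insert failed: %s" % errors)
```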
2016-05-06 12759, 2016
alastairp
We also talked in your proposal about looking at the more detailed frame-level data. If we have time, it'd be great to also integrate that into bigquery. the idea is that bq would be the main interface that other people consume this data from, as it's quite large
2016-05-06 12736, 2016
alastairp
the thing here is that this also requires development on the AB server, so I'm not sure how we would split up that work. It's also on my list for the summer, but perhaps we could do it together if we have time
2016-05-06 12709, 2016
Gentlecat
alastairp: do we have a license for datasets on AB?
alastairp: +1 to "forcing people's hand a bit". CC0 would still require attribution in most jurisdictions I believe, but CC-by is pretty close to being CC0 anyway, so might work too.
2016-05-06 12750, 2016
alastairp
we're already going with CC0 on the data itself
2016-05-06 12701, 2016
alastairp
and the explicit aim of AB is to open this stuff up
2016-05-06 12713, 2016
alastairp
I do like Gentlecat's suggestion of citation though
2016-05-06 12725, 2016
alastairp
so I'm tending towards ccby
2016-05-06 12708, 2016
Freso
(Or CC by-sa for the datasets, with a note that MetaBrainz can relicense as CC by for commercial entities that don't want their derived algorithms or whatever be CC'd.)
2016-05-06 12743, 2016
alastairp
remember that a dataset is just a list of mbids
2016-05-06 12749, 2016
Freso
Yeah, I know the data itself is CC0, only talking datasets here. :)
2016-05-06 12752, 2016
alastairp
attached to a label (e.g. genre class)
2016-05-06 12702, 2016
alastairp
so I don't think derived algorithms come into it
2016-05-06 12703, 2016
alastairp
right?
2016-05-06 12712, 2016
alastairp
hmm
2016-05-06 12714, 2016
Gentlecat
I've heard someone who I shall not name saying, "I license most of my work under ccby, but I don't mind if people don't credit me"