alastairp: hi! I'm finally done with my midterms. Regarding my project, ruaok said you had a BigQuery guy over and it seems it might be super easy to switch to bigquery
you said you've already migrated some of the data but it's taking up a lot of memory to run a query
can you elaborate on that?
where should I start?
Freso
cetko: Congrats on being done :)
alastairp
cetko: hi
cool, let's talk
can you give me an email address? I'll send you what we have
or pm me
cetko
alastairp: pm!
thanks!
alastairp
got it. I forwarded you the email thread that we used when we were playing with bigquery
it's quite easy to blindly add data to bigquery
I made a file with json documents, one per line
you upload them to google cloud storage
and from there you can import them into bigquery
we did this with Felipe, and it works
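in Python the whole flow might look roughly like this — a minimal sketch, assuming the google-cloud-storage and google-cloud-bigquery client libraries, with made-up bucket and table names:

    from google.cloud import bigquery, storage

    # upload the newline-delimited JSON file to Google Cloud Storage
    # ("ab-dumps" and the table id below are placeholder names)
    storage_client = storage.Client()
    bucket = storage_client.bucket("ab-dumps")
    blob = bucket.blob("lowlevel.ndjson")
    blob.upload_from_filename("lowlevel.ndjson")

    # import it into BigQuery from GCS
    bq = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or pass an explicit schema once we have one
    )
    job = bq.load_table_from_uri(
        "gs://ab-dumps/lowlevel.ndjson",
        "myproject.acousticbrainz.lowlevel",
        job_config=job_config,
    )
    job.result()  # block until the load job finishes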
you can see some sample queries in the email I sent
(if you use the redash demo you actually count against their monthly quota, if you go against the bigquery console you go against yours)
for us, the easiest way to play with the data was to import it as strings representing the json data. this is quick, but it means you have to parse the *whole* document to get an item out of it
and you use up your quota much more quickly. (see the email)
so the first thing we should do is develop a schema that represents our low-level documents; then we can query just a specific field, and the data usage should be pretty small
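to make that concrete, compare these two queries — a sketch with hypothetical table names, assuming a field like metadata.audio_properties.length from our lowlevel documents:

    from google.cloud import bigquery

    bq = bigquery.Client()

    # string approach: the whole `document` column is scanned
    # and parsed per row, so the full quota cost every time
    slow = """
        SELECT JSON_EXTRACT_SCALAR(document,
            '$.metadata.audio_properties.length') AS length
        FROM `myproject.acousticbrainz.lowlevel_strings`
    """

    # schema approach: only the one nested column is scanned,
    # so the quota cost is a tiny fraction of the table size
    fast = """
        SELECT metadata.audio_properties.length
        FROM `myproject.acousticbrainz.lowlevel`
    """

    for row in bq.query(fast).result():
        print(row["length"])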
so I see our plan as being something like this
1. get familiar with all the tools
2. work with Felipe (from google) on developing a schema
3. load the current data into a metabrainz/acousticbrainz database (as opposed to Felipe's private one as it is currently)
4. install redash so that we can do cool queries and graphs
5. work out how to upload new items to BQ as they come in to AB (sketch below)
6. optionally use redash to make some neat graphs that we host on the AB website
7. ???
Gentlecat
8. profit
alastairp
We also talked in your proposal about looking at the more detailed frame-level data. If we have time, it'd be great to also integrate that into bigquery. the idea is that bq would be the main interface that other people consume this data from, as it's quite large
the thing here is that this also requires development on the AB server, so I'm not sure how we would split up that work. It's also on my list for the summer, but perhaps we could do it together if we have time
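for step 5, bigquery's streaming insert API looks like the right fit — a rough sketch, again with a hypothetical table name and assuming the google-cloud-bigquery client; in practice this would hang off the AB submission code:

    from google.cloud import bigquery

    bq = bigquery.Client()

    def push_to_bigquery(lowlevel_doc):
        """Stream one newly submitted AB document into BigQuery."""
        # insert_rows_json streams rows directly, without going
        # through GCS; the table id is a placeholder
        errors = bq.insert_rows_json(
            "myproject.acousticbrainz.lowlevel",
            [lowlevel_doc],
        )
        if errors:
            raise RuntimeError("streaming insert failed: %s" % errors)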
Gentlecat
alastairp: do we have a license for datasets on AB?
alastairp: +1 to "forcing people's hand a bit". CC0 would still require attribution in most jurisdictions I believe, but CC-by is pretty close to being CC0 anyway, so might work too.
alastairp
we're already going with cc0 on the data itself
and the explicit aim of AB is to open this stuff up
I do like Gentlecat's suggestion of citation though
so I'm tending towards ccby
Freso
(Or CC by-sa for the datasets, with a note that MetaBrainz can relicense as CC by for commercial entities that don't want their derived algorithms or whatever to be CC'd.)
alastairp
remember that a dataset is just a list of mbids
Freso
Yeah, I know the data itself is CC0, only talking datasets here. :)
alastairp
attached to a label (e.g. genre class)
so I don't think derived algorithms come into it
right?
hmm
Gentlecat
I've heard someone who I shall not name saying, "I license most of my work under ccby, but I don't mind if people don't credit me"