by real time, do you mean updating whenever a recording is submitted?
alastairp
the problem with integration into the database is that it's too slow to query all of the data, even if we did it periodically
right, perhaps not that often, but say once a week
_lucifer
that's doable
alastairp
we have much of this information in the `similarity.similarity` table
and it's much smaller than the lowlevel table
_lucifer
that's nice!
alastairp
so perhaps we could have a periodic task that we run that summarises this table
if not, we could definitely also create another statistics table, although there is a question about what data we should add there
we could make some initial tables, and load data, and then if we need more data for more graphs, we add those at a later stage
for example - the similarity table doesn't have years, so we'd have to get that separately
then say for example we wanted to compare year to loudness, we'd need some kind of table that allowed us to join this info together
the genre or mood tables are much easier, because we just need categories and counts
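A minimal sketch of the two data shapes described above, using made-up rows: category counts for a genre or mood chart, and a year/loudness join keyed on a recording id. The field names (`gid`, genre label, `year`, loudness value) and the sample values are illustrative assumptions, not the actual AcousticBrainz columns.

```python
from collections import Counter

# Hypothetical rows: (recording id, genre label) and (recording id, loudness).
genre_rows = [("a", "rock"), ("b", "jazz"), ("c", "rock")]
loudness_rows = [("a", 0.91), ("b", 0.72), ("c", 0.88)]
# Year is assumed to come from a separate source, keyed by the same id.
year_by_id = {"a": 1999, "b": 1964}


def category_counts(rows):
    """Count how many recordings fall into each category."""
    return Counter(label for _, label in rows)


def join_year_loudness(loudness, years):
    """Pair each loudness value with its year, skipping ids with no known year."""
    return [(years[gid], value) for gid, value in loudness if gid in years]


genre_counts = category_counts(genre_rows)
year_loudness = join_year_loudness(loudness_rows, year_by_id)
```

The category chart only needs the counts, while the feature/year chart needs the second source to be joined in first, which is why it would need its own table.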
_lucifer
+1
alastairp
OK, so
let's focus on the following charts:
genre rosamerica, feature/genre, feature/year, key estimation, genre mood (at the end of mood)
_lucifer
awesome!
alastairp
on bono, you can `psql -U acousticbrainz acousticbrainz_big` (not inside docker)
that has a full lowlevel table, and full `similarity.similarity` table
I think we should create a new postgresql schema (call it `statistics`), and for each graph, make a new table in this schema that stores just the information that we need for that graph
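A rough sketch of the "one table per graph in a `statistics` schema" idea, refreshed by a periodic task. SQLite stands in for PostgreSQL so the example is self-contained; the column `genre_rosamerica`, the sample labels, and the function name `refresh_genre_rosamerica` are assumptions for illustration, not the real table layout.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; attaching in-memory databases under
# the names "similarity" and "statistics" mimics schema-qualified tables.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS similarity")
conn.execute("ATTACH DATABASE ':memory:' AS statistics")

# Hypothetical, heavily simplified version of the source table.
conn.execute(
    "CREATE TABLE similarity.similarity "
    "(id INTEGER PRIMARY KEY, genre_rosamerica TEXT)"
)
conn.executemany(
    "INSERT INTO similarity.similarity (genre_rosamerica) VALUES (?)",
    [("roc",), ("jaz",), ("roc",)],
)

# One small table per graph, holding only what that graph needs.
conn.execute(
    "CREATE TABLE statistics.genre_rosamerica "
    "(genre TEXT PRIMARY KEY, count INTEGER NOT NULL)"
)


def refresh_genre_rosamerica(connection):
    """Rebuild the summary table from scratch; meant to be run periodically."""
    connection.execute("DELETE FROM statistics.genre_rosamerica")
    connection.execute(
        "INSERT INTO statistics.genre_rosamerica (genre, count) "
        "SELECT genre_rosamerica, COUNT(*) FROM similarity.similarity "
        "GROUP BY genre_rosamerica"
    )


refresh_genre_rosamerica(conn)
summary = dict(conn.execute("SELECT genre, count FROM statistics.genre_rosamerica"))
```

In PostgreSQL the two `ATTACH` statements would be replaced by something like `CREATE SCHEMA statistics`, and the refresh could be run from a weekly scheduled job.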
_lucifer
yeah that's a great place to start
alastairp
then we can see how easy it is to 1) get the data from similarity.similarity, or 2) get the data from lowlevel as the data comes in
_lucifer
okay, will we need to use spark?
alastairp
I don't think so
this isn't really analysis, it's just loading and transforming data
if we wanted to use it, we'd have to load all of the necessary data into hdfs, which I suspect would be really annoying
_lucifer
okay, yeah right. the similarity table is smaller and we can probably process it directly
alastairp
that's what I'm hoping
_lucifer
we can use spark without hdfs but that's a thing to consider for afterwards
alastairp
oh? how would that work?
in some cases it might make sense to use spark for machine learning in AB, we should look into it as a future option
_lucifer
> Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
Spark home page says this
i had also read an article on the same but cannot find it right now
PostgreSQL provides a JDBC driver that allows spark to connect to it directly
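A hedged sketch of what reading a PostgreSQL table from Spark over JDBC might look like, with no HDFS involved. The helper names `postgres_jdbc_options` and `load_table` are hypothetical, and the host, port, and password are placeholders; only the `jdbc` data source and its `url`, `dbtable`, `user`, `password`, and `driver` options are standard Spark.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Build the option dict for Spark's built-in "jdbc" data source."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def load_table(spark, options):
    """Read a PostgreSQL table into a DataFrame; `spark` is a SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Placeholder connection details, not a real deployment.
options = postgres_jdbc_options(
    "localhost", 5432, "acousticbrainz_big", "similarity.similarity",
    "acousticbrainz", "change-me",
)
```

Calling `load_table(spark, options)` needs a running SparkSession with the PostgreSQL JDBC driver jar on its classpath (for example via `--jars`).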
BrainzGit
[mb-solr] yvanzo merged pull request #39 (master…SEARCH-611): SEARCH-628: 'primary-type-id' field is missing from JSON release group search results https://github.com/metabrainz/mb-solr/pull/39
I actually think this is an evil plot by pristine___ to get back at me
JoshDi
Hey quick question. I currently run a musicbrainz slave server via the docker image. Is there a way to turn off indexing completely so all local queries go directly to the database?
shivam-kapila
ruaok: save yourself
ruaok
K says: "Never gonna give listenbrainz up, never gonna let listenbrainz down, never gonna turn around and hurt listenbrainz!"
yvanzo: ^^ see JoshDi's query
ruaok waves at JoshDi
JoshDi
Hey
I find that even with SIR tweaks, live indexing of the slave's daily updates takes like 12 hours to finish, whereas a full reindex takes 3 hrs
shivam-kapila
ruaok: I introduced a lil of troi to ppl
And they were like damn. Dynamic playlists
JoshDi
I only use this server for some local processes so it's not like my musicbrainz server is very busy.
shivam-kapila
They felt really excited
JoshDi
Any ideas?
ruaok
I don't know, but yvanzo will. hang tight for him to return and he'll sort you out. (he is around)