MBS-5193: Regression : impossible to purposely set bad encoded alias (search hints)
antlarr2 has quit
antlarr joined the channel
D4RK-PH0ENiX has quit
aidanlw17
Freso: I'll be out right before the meeting but should be back on when it starts - I mailed in my review in case I'm late :)
alastairp
aidanlw17: good morning
I'm just heading off to lunch, but should be back in an hour or so
D4RK-PH0ENiX joined the channel
aidanlw17
alastairp: hi, sounds good, we can talk when you’re back?
I made some new comments on the metrics PR
alastairp
great. how's it going on the query optimisation?
aidanlw17
I think that it’s good. We now only need one query to select the data, and one to insert it
One select query per batch!
alastairp
awesome!
that's going to be so fast
I'll test it when I get back then
aidanlw17
~28 seconds to compute and insert one 10k recording batch on my machine
alastairp
compared to how long before?
aidanlw17
I’ll need to look back in my notes to report
I’ll tell you when you’re back from lunch! Haha.
alastairp
cool, talk soon
D4RK-PH0ENiX has quit
D4RK-PH0ENiX joined the channel
yvanzo: thanks for all of the feedback on my tickets!
ruaok has a slow start to the day
ruaok
but I really needed that ride, even if it was super hot.
alastairp
where did you go?
ruaok
just up besos, nothing fancy. I wanted to go all weekend, but I ended up getting distracted by everything.
alastairp
nice
yeah, we've started riding after work at 8ish, to get a bit of coolness in the day
almost any other time is impossible
ruaok
Mr_Monkey: back yet?
alastairp: yeah, 8pm would work, but there are too many other things going on then.
alastairp
sure, you fit stuff in whenever you can
Darkloke has quit
TOPIC: MetaBrainz Community and Development channel | MusicBrainz non-development: #musicbrainz | New GSoC students start here: https://goo.gl/7jsjG2 | Channel is logged; see https://musicbrainz.org/doc/IRC for details | Meeting agenda: Reviews, MB Summit (ruaok)
ruaok
pristine__: how are you doing?
pristine__
Hey
I am good. Sorry for being afk. Was travelling.
And shifting the room.
How are you?
ruaok
good, just checking in to see if you need anything.
I have a pile of metabrainz things to do today -- I might get around to doing some MSB stuff later.
pristine__
ruaok: what does DEFAULT now() mean? If we don't provide a timestamp then the current timestamp will be added, no?
ruaok
correct.
pristine__
Then why not a NOT NULL clause too?
So that no one can push a null value into the column?
ruaok
yes
pristine__
Okay. Thanks
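The `DEFAULT now()` plus `NOT NULL` combination discussed above can be sketched as follows (the table and column names here are invented for illustration): the column is auto-filled when omitted, but an explicit NULL is rejected.

```python
# Hypothetical DDL sketch of the pattern discussed above.
# Table and column names are made up; only the DEFAULT now() / NOT NULL
# combination comes from the conversation.
CREATE_TABLE = """
CREATE TABLE example_events (
    id      SERIAL PRIMARY KEY,
    created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
)
"""

# With this definition:
#   INSERT INTO example_events DEFAULT VALUES;           -- created = now()
#   INSERT INTO example_events (created) VALUES (NULL);  -- rejected by NOT NULL
```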
aidanlw17
alastairp: you could do 40 batches with the new query in the time it used to take to do only 1!
alastairp
great, sounds good
I'm just finishing up some reviews on another project and I'll take a look at this PR again
so you also fixed the query parameters?
aidanlw17
It took my machine ~19 minutes to do the old method for one batch.
Yes I did fix them!
alastairp
perfect, sounds good
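A quick sanity check of the figures quoted above: the old method took about 19 minutes per 10k-recording batch, the new one about 28 seconds, which is consistent with the "40 batches in the time of 1" estimate.

```python
# Sanity-check the speedup quoted above: ~19 min per batch before,
# ~28 s per batch after.
old_seconds = 19 * 60      # 1140 s for one 10k-recording batch
new_seconds = 28
speedup = old_seconds / new_seconds
print(round(speedup, 1))   # → 40.7, matching the "40 batches" estimate
```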
aidanlw17
Sort of related: Philip used arrays of NaN cast to double precision to represent rows with missing data. For annoy, we decided to use vectors of the form [0, ..., 0] to represent recordings that didn't have a submission instead. That makes more sense for us, so I started inserting rows of 0 rather than NaN when there is missing data for a metric as well.
alastairp
ok, cool. it makes sense that what we have in the database is exactly what we insert into annoy
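The zero-vector convention just described can be sketched in a few lines (the function name is hypothetical, not from the actual codebase):

```python
# Hypothetical sketch: represent missing metric data as a zero vector
# instead of an array of NaN, matching what gets inserted into Annoy.
def metric_vector(data, n_dims):
    """Return the feature vector, or a zero vector if data is missing."""
    if data is None:
        return [0.0] * n_dims
    return data

print(metric_vector(None, 4))         # [0.0, 0.0, 0.0, 0.0]
print(metric_vector([0.2, -0.5], 2))  # [0.2, -0.5]
```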
aidanlw17
I think so too. I found this interesting: if you add a vector containing the value `None` to an Annoy index, it converts that value to -1 when adding it to the index.
alastairp
ah, that's very interesting too
aidanlw17
We also have negative elements in our vectors though, so I still think it makes the most sense to use the value 0?
alastairp
that was about to be my next question -
what is the scale of our features? are they all normalised from 0-1?
aidanlw17
Almost all values range from -1 to 1, but looking closely they are not all < 1. Some have larger magnitudes.
I took the transformation functions directly from Philip, I should look closer to see about that.
alastairp
we have the NormalizedLowLevelMetric classes
what does that normalise to?
aidanlw17
Again my background on the transformation is weak; some of it I don't fully understand. For normalized lowlevel metrics, the values are: (value_from_lowlevel - mean_value) / std_dev
Then if it is a weighted normalized lowlevel metric, that value is multiplied afterwards by a weight factor `self.weight_vector = np.array([self.weight ** i for i in indices])`
Where self.weight is currently set to 0.95.
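The transformation described above can be sketched in plain Python (the helper names are invented; only the formulas come from the conversation): standardise each value as (value - mean) / std_dev, then for weighted metrics multiply element i by weight ** i with weight = 0.95.

```python
# Sketch of the normalization described above. Helper names are
# hypothetical; the formulas are the ones quoted in the discussion.
WEIGHT = 0.95

def normalize(values, mean, std_dev):
    """(value_from_lowlevel - mean_value) / std_dev for each element."""
    return [(v - mean) / std_dev for v in values]

def apply_weight(values, weight=WEIGHT):
    """Multiply element i by weight ** i, as in the weight_vector above."""
    return [v * weight ** i for i, v in enumerate(values)]

normed = normalize([2.0, 4.0, 6.0], mean=4.0, std_dev=2.0)
print(normed)                # [-1.0, 0.0, 1.0]
print(apply_weight(normed))  # last element scaled by 0.95**2 ≈ 0.9025
```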
alastairp
ok, cool
let's leave it as-is for now, perhaps we can modify it in the future
aidanlw17
Previously, we used a function get_data to extract the lowlevel data with a specific path, or the highlevel models
alastairp
a specific postgres query path, right?
lowlevel.data->'blah'->'foo'
aidanlw17
Yeah exactly. I wrote a new function get_feature_data, which takes that path and extracts it from the dictionary.
Then passes the value to transform.
alastairp
ahh, I see
aidanlw17
I left the paths as is, because I thought soon we may be able to select just the feature paths in the postgres query
rather than getting the whole document
yvanzo
alastairp: You’re welcome, musicbrainz-docker is currently sluggish until PR #106 can be updated/merged with a working SIR.
alastairp
mm, right. I agree that leaving the path is a good idea, I'm not sure I would have done it this way. especially `features = self.path[7:-1]` makes me a bit worried
yvanzo
There are two annoying bugs atm: sir reindex not always returning (which can be worked around by downloading prebuilt indexes) and sir reindex failing on some invalid characters (which is required to build indexes).
alastairp
I would have written specific methods (or perhaps some lambdas?) that explicitly select the items from the dictionary
ruaok
the point here is to store user specific output from the collaborative filtering system.
aidanlw17
Yes I agree that felt a little sketchy... I'll see about rewriting that in another way.
ruaok
and then to allow multiple recommender scripts to access these tables and keep a record of which script has used which tracks.
alastairp
yvanzo: no problem. I was looking at upgrading our mirror to new schema, but perhaps I'll just wait for all of this to be finished. we only use the server/api and no search, so for us it's a matter of updating the image and running upgrade
but I had some custom modifications to point to the external database server, so the fewer changes I have to make the better
aidanlw17: cool. it's true that it might become a bit more complex - perhaps we'll have to write a custom transformer per method?
otherwise - what about a list of dictionary keys? ['lowlevel', 'mfcc', 'mean']
in fact, we could then construct the path from this anyway
that way we can keep your method, but it won't involve messy string splitting
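The list-of-keys idea just suggested can be sketched like this (function names are hypothetical): walk the nested document with the key list, and build the Postgres JSON path string from the same list, with no string slicing like `self.path[7:-1]`.

```python
# Sketch of the "list of dictionary keys" idea above (names invented).
def get_feature(document, keys):
    """Walk a nested dict with e.g. ['lowlevel', 'mfcc', 'mean']."""
    for key in keys:
        document = document[key]
    return document

def postgres_path(column, keys):
    """Build e.g. data->'lowlevel'->'mfcc'->'mean' from the same key list."""
    return column + "".join("->'{}'".format(k) for k in keys)

doc = {"lowlevel": {"mfcc": {"mean": [1.0, 2.0]}}}
keys = ["lowlevel", "mfcc", "mean"]
print(get_feature(doc, keys))       # [1.0, 2.0]
print(postgres_path("data", keys))  # data->'lowlevel'->'mfcc'->'mean'
```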
ruaok: I'll have a look. while you're here, a good time to ask a question about pg schemas. it looks like you're splitting different parts of lb into separate schemas, which sounds like a great idea to me
we're making some more tables for the similarity stuff. it feels like we could put this in a schema too
ruaok
in AB?
alastairp
yes
ruaok
yea, please do.
in the end the AB similarity data ought to be copied to the LB recommendation schema.
the idea is to provide complete dumps of this schema for anyone willing to try writing a recommendation engine.
and it should have collaborative filtered tracks, similar tracks, artist-artist similarity.
alastairp
aidanlw17: sorry, so this is one more thing on this pr :)
let's put similarity tables in a schema. this is as easy as `create schema similarity` and prefix tables with the schema name when using them (`select x from similarity.similarity`)
it will help us to logically separate all of the tables
aidanlw17
alastairp: it makes sense to me to store the keys in a list like that, and I think we’ve already done something similar in AB-404.
alastairp
one nice thing you can do: `drop schema s cascade;` will drop all of the tables in the schema s, so you don't have to individually drop them in drop_tables
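The schema workflow discussed above, as a sketch (the statement strings are plain Postgres DDL; how they are executed through the app's database layer is assumed, not shown):

```python
# Sketch of the schema workflow discussed above: create a "similarity"
# schema, address tables with schema-qualified names, and tear everything
# down with one cascading drop.
SCHEMA = "similarity"

CREATE_SCHEMA = "CREATE SCHEMA IF NOT EXISTS {}".format(SCHEMA)
# CASCADE drops every table in the schema, so no per-table drops are needed.
DROP_SCHEMA = "DROP SCHEMA IF EXISTS {} CASCADE".format(SCHEMA)

def qualified(table):
    """Schema-qualify a table name, e.g. similarity.similarity."""
    return "{}.{}".format(SCHEMA, table)

print(qualified("similarity"))  # similarity.similarity
```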
aidanlw17
Okay thanks alastairp
Cool!! Sounds handy
alastairp: are the other data-related tables in AB already part of a different schema?
alastairp
no, we have no other schemas except the default
we should move some of them
aidanlw17
Okay. I can do that after I do these then
If you want!
alastairp
that's a larger process, since we have to move existing data