#metabrainz

9:34 AM
ruaok

https://github.com/metabrainz/listenbrainz-server…

2017-04-19 10950, 2017

9:34 AM
ruaok

the influx schema bits are here:

2017-04-19 10951, 2017

9:34 AM
ruaok

https://github.com/metabrainz/listenbrainz-server…

2017-04-19 10907, 2017

9:35 AM
ruaok

any feedback would be appreciated.

2017-04-19 10925, 2017

9:35 AM
ruaok

I used to have only one measurement called "listen" which turned out to be a bad idea.

2017-04-19 10944, 2017

9:35 AM
ruaok

each user has their own measurement now.

2017-04-19 10922, 2017

9:37 AM
zas

why was it a bad idea?

2017-04-19 10943, 2017

9:37 AM
ruaok

too many points in one measurement and the load on lemmy shot to 100.

2017-04-19 10958, 2017

9:37 AM
ruaok

spread it across many more measurements and influx is much happier.

2017-04-19 10905, 2017

9:39 AM
zas

yup, makes sense

2017-04-19 10937, 2017

9:39 AM
ruaok

that was hard to see when first using influx. like other non relational data stores, it takes a while to change your thinking.

2017-04-19 10959, 2017

9:41 AM
zas

well, in theory, using user_name as a tag is ok; using separate measurements for each user is obviously more complex when it comes to summing up listens

2017-04-19 10937, 2017

9:42 AM
ruaok

yeah, but summing up listens is old skool thinking. adding a new measurement to keep track of them is key.

2017-04-19 10951, 2017

9:42 AM
ruaok

write more, don't update/delete

2017-04-19 10956, 2017

9:44 AM
zas

yes, much faster approach

2017-04-19 10927, 2017
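
[Editor's sketch] The two schema layouts discussed above can be illustrated with InfluxDB line protocol strings. This is a minimal sketch, not the ListenBrainz schema: the field name track_name, the listen_count measurement, and all function names are hypothetical and chosen only for illustration. It shows a single "listen" measurement with user_name as a tag, a per-user measurement, and an append-only counter point in the spirit of "write more, don't update/delete".

```python
def escape_measurement(name):
    # Line protocol: commas and spaces in measurement names are backslash-escaped.
    return name.replace(",", "\\,").replace(" ", "\\ ")


def escape_tag(value):
    # Line protocol: commas, equals signs and spaces in tag values are escaped.
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_field_string(value):
    # String field values are double-quoted; escape backslashes and quotes.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def single_measurement_point(user_name, track_name, ts_ns):
    # Old layout: one "listen" measurement, the user stored as a tag.
    return 'listen,user_name=%s track_name="%s" %d' % (
        escape_tag(user_name), escape_field_string(track_name), ts_ns)


def per_user_point(user_name, track_name, ts_ns):
    # New layout: the measurement name is the user name itself.
    return '%s track_name="%s" %d' % (
        escape_measurement(user_name), escape_field_string(track_name), ts_ns)


def listen_count_point(user_name, count, ts_ns):
    # Hypothetical counter measurement: write a new point with the running
    # total instead of summing listens or updating an existing row.
    return "listen_count,user_name=%s count=%di %d" % (
        escape_tag(user_name), count, ts_ns)


if __name__ == "__main__":
    print(single_measurement_point("rob", "Song", 1492594500000000000))
    print(per_user_point("rob", "Song", 1492594500000000000))
    print(listen_count_point("rob", 42, 1492594500000000000))
```

With per-user measurements, a total across users has to come from somewhere else, which is why the counter is written as its own point here.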

9:45 AM
bochecha has quit

2017-04-19 10924, 2017

9:47 AM
zag has quit

2017-04-19 10918, 2017

9:48 AM
d4rkie has quit

2017-04-19 10945, 2017

9:48 AM
D4RK-PH0ENiX joined the channel

2017-04-19 10915, 2017

9:51 AM
samj1912 joined the channel

2017-04-19 10944, 2017

10:01 AM
ruaok wonders if logging into jira is something he could track as a fitness activity in Google Fit.

2017-04-19 10912, 2017

10:18 AM
agentsim joined the channel

2017-04-19 10949, 2017

10:22 AM
agentsim has quit

2017-04-19 10927, 2017

10:44 AM
SothoTalKer joined the channel

2017-04-19 10902, 2017

10:45 AM
Freso

Did anyone ever hear the complaint that "MB wants too much personal data during registration" before?

2017-04-19 10913, 2017

10:46 AM
Quesito hopes ruaok knows how unreliable the data is from wearables....

2017-04-19 10920, 2017

10:46 AM
SothoTalker_ has quit

2017-04-19 10954, 2017

10:48 AM
Quesito

Freso: that has to be a cop out for I'm lazy....

2017-04-19 10955, 2017

10:48 AM
ruaok

Quesito: yep, it is a suggestion at best.

2017-04-19 10911, 2017

10:49 AM
Quesito

;)

2017-04-19 10925, 2017

10:49 AM
SothoTalker_ joined the channel

2017-04-19 10942, 2017

10:49 AM
ruaok

yeah, pure BS, Freso.

2017-04-19 10917, 2017

10:51 AM
SothoTalKer has quit

2017-04-19 10947, 2017

10:52 AM
alastairp

Freso: kind of interesting that those releases are in mb at all!

2017-04-19 10938, 2017

10:53 AM
Freso

alastairp: :)

2017-04-19 10947, 2017

10:54 AM
reosarevok

Freso: I guess some people might somehow think that all the stuff in the profile is mandatory?

2017-04-19 10957, 2017

10:55 AM
reosarevok

I know a bunch of artists are confused about all the fields in the Add Artist page for example because they assume every single one is mandatory and they don't know what IPI is or something

2017-04-19 10921, 2017

10:56 AM
Freso

Yeah.

2017-04-19 10927, 2017

10:56 AM
reosarevok

(to be fair, "bolded label means mandatory" is not obvious nor explained)

2017-04-19 10948, 2017

10:56 AM
Freso

Will IPIs and ISNIs be moved to attributes with the schema change? Or will that have to be done later down the line?

2017-04-19 10932, 2017

11:13 AM
ruaok

crap. the metabrainz press corps realized that chocolate should be present in the office at all times.

2017-04-19 10942, 2017

11:13 AM
ruaok

anyone need chocolate?? come on down!

2017-04-19 10932, 2017

11:24 AM
naught101 joined the channel

2017-04-19 10954, 2017

11:28 AM
Rotab

y...yes

2017-04-19 10931, 2017

11:32 AM
alastairp

ruaok: LB-154, interesting

2017-04-19 10931, 2017

11:32 AM
BrainzBot

LB-154: Use sets rather than dicts to unique timestamps https://tickets.metabrainz.org/browse/LB-154

2017-04-19 10946, 2017

11:32 AM
alastairp

I thought that dict keys hash exactly the same as sets?

2017-04-19 10913, 2017

11:33 AM
ruaok

functionally they are the same, but I agree with zas that sets are faster for this.

2017-04-19 10922, 2017

11:33 AM
alastairp

If not, I've learned something

2017-04-19 10928, 2017

11:33 AM
ruaok

from what I know.

2017-04-19 10944, 2017

11:33 AM
ruaok

easy to find out. let me file this paper work and hack up a test.

2017-04-19 10911, 2017

11:35 AM
ruaok makes a calendar entry for 2019

2017-04-19 10903, 2017

11:41 AM
ruaok

alastairp, zas: https://gist.github.com/mayhem/26f54dde4ca27a3ab0…

2017-04-19 10907, 2017

11:41 AM
ruaok

not at all what I expected.

2017-04-19 10912, 2017

11:41 AM
ruaok

did I do that right?

2017-04-19 10919, 2017

11:44 AM
ruaok

1.3257920742

2017-04-19 10919, 2017

11:44 AM
ruaok

3.98488116264

2017-04-19 10952, 2017

11:44 AM
ruaok

on 10M rows, now with verifying that the same results are generated.

2017-04-19 10945, 2017

11:45 AM
colbydray joined the channel

2017-04-19 10949, 2017

11:46 AM
alastairp

Can you use https://docs.python.org/2/library/timeit.html?

2017-04-19 10933, 2017
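
[Editor's sketch] The dict-versus-set timing being discussed could be rerun with timeit along these lines. This is not the code from the gist; the function names, the input size, and the random test data are assumptions made for illustration. It checks that both approaches yield the same unique timestamps before timing them.

```python
import random
import timeit


def unique_with_dict(timestamps):
    # Use dict keys to collapse duplicate timestamps.
    seen = {}
    for ts in timestamps:
        seen[ts] = 1
    return seen


def unique_with_set(timestamps):
    # Use the built-in set type (not the deprecated sets.Set class).
    seen = set()
    for ts in timestamps:
        seen.add(ts)
    return seen


if __name__ == "__main__":
    random.seed(0)
    timestamps = [random.randint(0, 50000) for _ in range(100000)]

    # Verify that both approaches produce the same unique values.
    assert set(unique_with_dict(timestamps)) == unique_with_set(timestamps)

    for func in (unique_with_dict, unique_with_set):
        # Take the best of several repeats to reduce scheduling noise.
        best = min(timeit.repeat(lambda: func(timestamps), number=5, repeat=3))
        print("%s: %.4f s for 5 runs" % (func.__name__, best))
```

The numbers will vary by interpreter version and machine, so no expected output is given.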

11:47 AM
Gentlecat

https://wiki.python.org/moin/TimeComplexity

2017-04-19 10909, 2017

11:48 AM
alastairp

sets.Set is very deprecated. I'm not sure if it has the same behavior as 'set'

2017-04-19 10913, 2017

11:49 AM
alastairp

Interesting, I don't see an entry for set.add

2017-04-19 10922, 2017

11:49 AM
alastairp

But there is for dict insert

2017-04-19 10911, 2017

11:51 AM
Gentlecat

https://www.ics.uci.edu/~pattis/ICS-33/lectures/c…

2017-04-19 10913, 2017

11:51 AM
ruaok

really, measuring this differently isn't going to change anything for a simple case like this.

2017-04-19 10921, 2017

11:51 AM
Gentlecat

it's constant

2017-04-19 10921, 2017

11:52 AM
alastairp

Yeah, I thought that was the case

2017-04-19 10947, 2017

11:52 AM
alastairp

I wonder what the speed of appending to a list then calling set(list) is

2017-04-19 10954, 2017

11:52 AM
Gentlecat

you should use https://docs.python.org/2/library/stdtypes.html#s… though

2017-04-19 10930, 2017

11:53 AM
ruaok

ok, now using set (not Set):

2017-04-19 10935, 2017

11:53 AM
ruaok

1.37781596184

2017-04-19 10935, 2017

11:53 AM
ruaok

1.85512280464

2017-04-19 10940, 2017

11:53 AM
Gentlecat

however much time it takes to go through a list, I imagine

2017-04-19 10944, 2017

11:53 AM
ruaok

sets are still slower.

2017-04-19 10910, 2017

11:54 AM
ruaok

well, both do it the same way, so that doesn't factor into determining the difference between the two.

2017-04-19 10925, 2017

11:54 AM
alastairp

Wow, that's really interesting

2017-04-19 10939, 2017

11:54 AM
alastairp

That the set type is so much faster than the class

2017-04-19 10959, 2017

11:54 AM
alastairp

Which version of python?

2017-04-19 10910, 2017

11:55 AM
ruaok

https://gist.github.com/mayhem/2c09a454c436836851…

2017-04-19 10915, 2017

11:55 AM
ruaok

hipster classic, of course.

2017-04-19 10917, 2017

11:55 AM
ruaok

2.2!

2017-04-19 10918, 2017

11:55 AM
D4RK-PH0ENiX has quit

2017-04-19 10946, 2017

11:56 AM
alastairp

This is where a nerd will decompile those two statements and see exactly how many operations each is doing :)

2017-04-19 10913, 2017
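
[Editor's sketch] For anyone who does care enough, the standard-library dis module prints the bytecode for the two statements. The wrapper function names are made up for this example. On the CPython versions I have looked at, the dict assignment compiles to a single subscript-store instruction, while the set version looks up the add method and then calls it; treat that as something to confirm from the output, not as a given, and note that instruction counts alone do not settle which is faster.

```python
import dis


def add_to_dict(d, ts):
    # The dict approach: store a dummy value under the timestamp key.
    d[ts] = 1


def add_to_set(s, ts):
    # The set approach: call the add method with the timestamp.
    s.add(ts)


if __name__ == "__main__":
    print("dict assignment:")
    dis.dis(add_to_dict)
    print("set.add call:")
    dis.dis(add_to_set)
```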

11:57 AM
ruaok

thought about it. didn't care enough.

2017-04-19 10919, 2017

11:57 AM
ruaok

checking 2 vs 3 though

2017-04-19 10943, 2017

11:57 AM
alastairp

Cool

2017-04-19 10901, 2017

11:58 AM
ruaok

O_O

2017-04-19 10943, 2017

11:58 AM
ruaok

https://gist.github.com/mayhem/ab35ecea98da585421…

2017-04-19 10946, 2017

11:58 AM
ruaok

clear as mud. :)

2017-04-19 10949, 2017

11:58 AM
alastairp

How many items are we considering?

2017-04-19 10955, 2017

11:58 AM
alastairp

In lb

2017-04-19 10900, 2017

11:59 AM
ruaok

dicts got faster, sets got slower.

2017-04-19 10924, 2017

11:59 AM
ruaok

for that piece of code, we're talking about a batch which is limited to the size of a block of listens.

2017-04-19 10926, 2017

11:59 AM
ruaok

sub 100.

2017-04-19 10933, 2017

11:59 AM
alastairp

Hmm

2017-04-19 10937, 2017

11:59 AM
ruaok

ie. not worth talking about for the most part.

2017-04-19 10941, 2017

11:59 AM
alastairp

Yeah

2017-04-19 10910, 2017

12:00 PM
alastairp

Sounds like premature optimisation... My recommendation would be to optimise for readability not speed

2017-04-19 10923, 2017

12:00 PM
ruaok

the bigger problem we have is that we do too much serializing/deserializing the incoming listens. and passing over them too many times to sanity check them.

2017-04-19 10940, 2017

12:00 PM
alastairp

Especially considering the two timestamp conversions that you're doing in the same loop

2017-04-19 10945, 2017

12:00 PM
ruaok

well, I've done a pile of stress tests to see what is going to be a problem.

2017-04-19 10901, 2017

12:01 PM
alastairp

Yeah, we wondered about that when we first did it

2017-04-19 10908, 2017

12:01 PM
alastairp

Even with ujson?

2017-04-19 10915, 2017

12:01 PM
ruaok

the influx schema was a problem, but is much better now.

2017-04-19 10935, 2017

12:01 PM
ruaok

yes, even with ujson. we're just doing too much stuff that needs streamlining.

2017-04-19 10900, 2017

12:02 PM
ruaok

to the point where, when importing listens, the sanity checking of the incoming data is the major bottleneck now.

2017-04-19 10944, 2017

12:02 PM
ruaok

which is why I stuffed tons of fake listens directly into rabbitmq to find problems.

2017-04-19 10902, 2017

12:03 PM
alastairp

Right

2017-04-19 10918, 2017

12:03 PM
ruaok

I'm now happy that we'll avoid the obvious walls and i have other, less pressing issues on my radar.

2017-04-19 10943, 2017

12:03 PM
ruaok

but my focus still remains data integrity, which I'm almost happy with.

2017-04-19 10956, 2017

12:03 PM
alastairp

This is things like 'validate_listens'?

2017-04-19 10908, 2017

12:04 PM
alastairp

Or is it bigger than that?

2017-04-19 10914, 2017

12:04 PM
ruaok

I need to run a test and make sure that all data that enters correctly ends up at BQ.

2017-04-19 10921, 2017

12:04 PM
ruaok

yeah, mostly all that.

2017-04-19 10929, 2017

12:04 PM
alastairp

I'll see if I can put it through a profiler to see where the bad parts are

2017-04-19 10938, 2017

12:04 PM
alastairp

That's not very difficult

2017-04-19 10942, 2017

12:04 PM
ruaok

it is too early for a profiler.

2017-04-19 10948, 2017

12:04 PM
ruaok

there is just stupid shit going on. :)

2017-04-19 10913, 2017

12:05 PM
ruaok

we can find problems by inspection, much like shooting fish in a barrel.

2017-04-19 10926, 2017

12:05 PM
ruaok

but, this is easy to fix as opposed to getting the schema wrong.

2017-04-19 10928, 2017

12:05 PM
alastairp

Sure. So if you can point out a stupid thing, we can fix it. But the profiler will tell us the same thing without having to wait for you

2017-04-19 10952, 2017

12:05 PM
alastairp

We can inspect for loops and things

2017-04-19 10909, 2017

12:06 PM
ruaok

sure, if you want to do more premature optimization, go for it. :) :)

2017-04-19 10945, 2017

12:06 PM
ruaok

I think in passing the listens around we do more than one conversion of the data and that is dumb.

2017-04-19 10956, 2017

12:06 PM
Gentlecat

profiler actually helps you figure out if something is premature or not

2017-04-19 10922, 2017
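
[Editor's sketch] A profiler run of the kind suggested here might look like the following, using cProfile and pstats from the standard library. The pipeline is a stand-in: validate, dedup, process, and the listened_at field are hypothetical and only mimic the "parse, then parse again" pattern mentioned in the discussion, not the real ListenBrainz code.

```python
import cProfile
import io
import json
import pstats


def validate(raw):
    # First parse: only used to check the payload.
    listen = json.loads(raw)
    if "listened_at" not in listen:
        raise ValueError("missing listened_at")


def dedup(raw_listens):
    # Second parse of the same payloads, this time to drop duplicates.
    seen = {}
    for raw in raw_listens:
        listen = json.loads(raw)
        seen[listen["listened_at"]] = listen
    return list(seen.values())


def process(raw_listens):
    for raw in raw_listens:
        validate(raw)
    return dedup(raw_listens)


def profile_report(raw_listens, limit=8):
    # Profile one call to process() and return the report as text,
    # sorted by cumulative time so the expensive call paths come first.
    profiler = cProfile.Profile()
    profiler.enable()
    process(raw_listens)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    batch = [json.dumps({"listened_at": i % 500, "track_name": "t%d" % i})
             for i in range(20000)]
    print(profile_report(batch))
```

Sorting by cumulative time is what makes the report useful for reviewing code flow: it shows which top-level step the time is spent under, not just which leaf function is hot.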

12:07 PM
ruaok

do premature optimization with a profiler to determine if optimization is premature?

2017-04-19 10925, 2017

12:07 PM
ruaok

I like it, very meta. :)

2017-04-19 10933, 2017

12:07 PM
ruaok

and recursive.

2017-04-19 10908, 2017

12:08 PM
ruaok

in any case, I've got much more significant issues to tackle for the time being, so I am going to focus on those

2017-04-19 10919, 2017

12:08 PM
alastairp

Just had a look at validate. There are not many loops :(

2017-04-19 10920, 2017

12:08 PM
ruaok

I know what is slow and that is sufficient for now.

2017-04-19 10940, 2017

12:08 PM
ruaok

I think the slowness comes one level up.

2017-04-19 10941, 2017

12:08 PM
alastairp

Uuid validation. Perhaps that takes some time

2017-04-19 10945, 2017

12:08 PM
alastairp

Ah, ok

2017-04-19 10951, 2017

12:08 PM
ruaok

parse for validate, then store.

2017-04-19 10957, 2017

12:08 PM
D4RK-PH0ENiX joined the channel

2017-04-19 10959, 2017

12:08 PM
ruaok

parse/convert again for dedup

2017-04-19 10916, 2017
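
[Editor's sketch] The "parse for validate, then parse/convert again for dedup" flow could be collapsed into a single pass roughly like this. It is only an illustration under assumptions: the function name, the listened_at field, and the validation rule are hypothetical, and the real code may need to keep the steps separate for other reasons.

```python
import json


def parse_validate_dedup(raw_listens):
    # Parse each payload exactly once, validate it, and keep the first
    # listen seen for each timestamp.
    unique = {}
    for raw in raw_listens:
        listen = json.loads(raw)
        ts = listen.get("listened_at")
        if not isinstance(ts, int):
            raise ValueError("listened_at must be an integer timestamp")
        unique.setdefault(ts, listen)
    return list(unique.values())


if __name__ == "__main__":
    raws = [
        json.dumps({"listened_at": 1, "track_name": "a"}),
        json.dumps({"listened_at": 1, "track_name": "b"}),
        json.dumps({"listened_at": 2, "track_name": "c"}),
    ]
    for listen in parse_validate_dedup(raws):
        print(listen)
```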

12:09 PM
Gentlecat

if you don't know for sure what's slow then it's just guesswork

2017-04-19 10917, 2017

12:09 PM
alastairp

Right, now I see where you're coming from

2017-04-19 10937, 2017

12:09 PM
alastairp

That's definitely the right place to start, then

2017-04-19 10937, 2017

12:09 PM
Gentlecat

and there are tools to help with that, so we should use them

2017-04-19 10938, 2017

12:09 PM
ruaok

we need to review the code flow, not individual statements.

2017-04-19 10911, 2017

12:10 PM
Gentlecat

http://www.brendangregg.com/flamegraphs.html would actually help with code flow

2017-04-19 10926, 2017

12:10 PM
Gentlecat

if I understand what you mean by "code flow" correctly

2017-04-19 10932, 2017

12:10 PM
D4RK-PH0ENiX has quit

2017-04-19 10939, 2017

12:10 PM
D4RK-PH0ENiX joined the channel

2017-04-19 10938, 2017

12:11 PM
zas

ruaok: making a loop to add to a set() isn't the best idea when you can just pass the iterable as the set() argument: s = set(); for x in r: s.add(x) vs s = set(r)

2017-04-19 10957, 2017

12:11 PM
ruaok

zas: sure, but for this test it doesn't matter.

2017-04-19 10957, 2017

12:11 PM
zas

but i don't know if it is actually faster or how much it is
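
[Editor's sketch] Whether set(r) is actually faster than the explicit loop, and by how much, can be measured directly. The function names and the input data below are assumptions for illustration; the sketch first confirms that both forms build the same set and then times each with timeit.

```python
import timeit


def build_with_loop(items):
    # s = set(); for x in r: s.add(x)
    s = set()
    for x in items:
        s.add(x)
    return s


def build_with_constructor(items):
    # s = set(r)
    return set(items)


if __name__ == "__main__":
    # Each value appears twice so that deduplication actually happens.
    r = list(range(100000)) * 2

    assert build_with_loop(r) == build_with_constructor(r)

    for func in (build_with_loop, build_with_constructor):
        best = min(timeit.repeat(lambda: func(r), number=5, repeat=3))
        print("%s: %.4f s for 5 runs" % (func.__name__, best))
```

For batches of fewer than 100 listens, as mentioned earlier in the conversation, either form is unlikely to matter in practice.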