#metabrainz

15:13 PM
alastairp

and we have 2 scores, one for each recording

2019-08-21 23325, 2019

15:14 PM
alastairp

so the idea would be to make 2 more works, each linked to its respective recording, and both linked to the work. and then I can link the score to those new works?

2019-08-21 23312, 2019

15:15 PM
yvanzo

Right, reosarevok has probably some example from classical music that uses the same trick.

2019-08-21 23324, 2019

15:15 PM
reosarevok

I'd say that, yes

2019-08-21 23331, 2019

15:15 PM
reosarevok

Is this a thing that you have happen often?

2019-08-21 23329, 2019

15:16 PM
alastairp

we have 160 recordings/transcriptions

2019-08-21 23354, 2019

16:27 PM
gr0uch0mars joined the channel

2019-08-21 23319, 2019

16:40 PM
pristine__

ruaok: hey

2019-08-21 23329, 2019

16:42 PM
BrainzGit

[musicbrainz-server] reosarevok opened pull request #1176 (master…MBS-4644): MBS-4644: Show releases that have CAA images in lists https://github.com/metabrainz/musicbrainz-server/…

2019-08-21 23330, 2019

16:42 PM
BrainzBot

MBS-4644: Indicate which releases have CAA art in listings https://tickets.metabrainz.org/browse/MBS-4644

2019-08-21 23320, 2019

16:43 PM
ruaok

pristine__: hay is for horses.

2019-08-21 23343, 2019

16:43 PM
pristine__

Lol

2019-08-21 23309, 2019

16:44 PM
pristine__

So, you remember the to_date and from_date in create_dataframes?

2019-08-21 23331, 2019

16:45 PM
ruaok

Yes.

2019-08-21 23320, 2019

16:46 PM
pristine__

https://github.com/metabrainz/listenbrainz-labs/b…

2019-08-21 23321, 2019

16:46 PM
pristine__

nice

2019-08-21 23325, 2019

16:46 PM
pristine__

so,

2019-08-21 23350, 2019

16:46 PM
pristine__

we need to store these two values in model_metadata.parquet

2019-08-21 23314, 2019

16:47 PM
pristine__

https://github.com/metabrainz/listenbrainz-labs/b…

2019-08-21 23318, 2019

16:47 PM
pristine__

here

2019-08-21 23356, 2019

16:47 PM
pristine__

but what I plan to do is, when train_models.py completes its execution, we will append data to this dataframe

2019-08-21 23321, 2019

16:48 PM
pristine__

so how do we get the values of to_date and from_date?

2019-08-21 23352, 2019

16:49 PM
alastairp

hay is for horses, but hey is an interjection

2019-08-21 23304, 2019

16:50 PM
pristine__

alastairp: hi

2019-08-21 23310, 2019

16:50 PM
alastairp

hi pristine__

2019-08-21 23317, 2019

16:50 PM
alastairp

yes, I saw your note, I haven't got to it yet

2019-08-21 23321, 2019

16:50 PM
pristine__

<3

2019-08-21 23328, 2019

16:50 PM
pristine__

no problem :)

2019-08-21 23347, 2019

16:50 PM
oknozor has quit

2019-08-21 23331, 2019

16:51 PM
pristine__

alastairp: want to review the above problem?

2019-08-21 23301, 2019

16:52 PM
alastairp

I can't right now, sorry, but I'll come back to it when I look at the other one if you still haven't solved it

2019-08-21 23311, 2019

16:52 PM
pristine__

oh. sure :)

2019-08-21 23331, 2019

16:52 PM
alastairp

ah, actually. I just looked

2019-08-21 23311, 2019

16:53 PM
pristine__

Nice

2019-08-21 23315, 2019

16:53 PM
alastairp

so, when dealing with things like dates (and other ranges), it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible

2019-08-21 23351, 2019

16:53 PM
alastairp

this makes it easy to pass this information around different places (make a query, store the data in a parquet), and it also makes testing much easier

2019-08-21 23312, 2019

16:54 PM
alastairp

imagine that you wanted to test get_listens_for_training_model_window, this would give a different result depending on when you ran the tests

2019-08-21 23338, 2019

16:54 PM
pristine__

> it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible

2019-08-21 23341, 2019

16:54 PM
pristine__

like?

2019-08-21 23342, 2019

16:54 PM
alastairp

in this case, I'd definitely pass in to_date as an argument to get_listens_for_training_model_window

2019-08-21 23304, 2019

16:55 PM
alastairp

you already do something similar - the call to utils.get_listens already passes in the date range

2019-08-21 23320, 2019
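
A minimal sketch of the suggestion above, assuming a simplified stand-in for get_listens_for_training_model_window. The name `get_training_window` and the `days` parameter are hypothetical and not taken from the listenbrainz-labs code. The point is only that the function never reads the clock itself, so a test can pin the date.

```python
from datetime import datetime, timedelta, timezone


def get_training_window(to_date, days=30):
    """Return (from_date, to_date) for a window ending at to_date.

    Hypothetical stand-in: the caller supplies to_date, so the result
    does not depend on when this function happens to run.
    """
    from_date = to_date - timedelta(days=days)
    return from_date, to_date


def main():
    # The only place that looks at the current time.
    to_date = datetime.now(timezone.utc)
    from_date, to_date = get_training_window(to_date)
    print(from_date, to_date)


def test_window_is_deterministic():
    # A fixed date gives the same answer no matter when the test runs.
    from_date, to_date = get_training_window(datetime(2019, 8, 21), days=7)
    assert from_date == datetime(2019, 8, 14)
    assert to_date == datetime(2019, 8, 21)
```

Because the date is an argument, the same value can also be handed to whatever code later writes the metadata.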

16:55 PM
pristine__

yup. there are two scripts

2019-08-21 23330, 2019

16:55 PM
pristine__

one will be run at some point in time

2019-08-21 23346, 2019

16:55 PM
pristine__

we need to record that point in time and use it when the other script runs

2019-08-21 23351, 2019

16:55 PM
alastairp

so you could make to_date in main(), or even better, consider passing it in as an argument

2019-08-21 23354, 2019

16:55 PM
pristine__

is it possible?

2019-08-21 23323, 2019

16:56 PM
pristine__

passing as argument to main?

2019-08-21 23325, 2019

16:56 PM
alastairp

I don't really like it when scripts use the time that they're run at as a value - this is important for things like error recovery

2019-08-21 23334, 2019

16:56 PM
pristine__

right

2019-08-21 23353, 2019

16:56 PM
pristine__

so I calculate the time somewhere else, and use that value in all places

2019-08-21 23356, 2019

16:56 PM
alastairp

e.g. imagine if you run every day, but one day for some reason it failed... machine was broken, unexpected power cut, bug in the code

2019-08-21 23305, 2019

16:57 PM
pristine__

and not calculate it while running the script?

2019-08-21 23312, 2019

16:57 PM
pristine__

true.

2019-08-21 23319, 2019

16:57 PM
alastairp

I do this kind of thing in a different project. I have a database table that contains a list of dates

2019-08-21 23334, 2019

16:57 PM
alastairp

and so when I run it, I look at the latest date that was run, and then do the next one

2019-08-21 23303, 2019

16:58 PM
pristine__

can I have a link to the project?

2019-08-21 23346, 2019

16:58 PM
alastairp

there are some special cases here - e.g. if there are no dates in the table, you can definitely choose 'today' as the date to do it. also, do a check to make sure what today actually is, so that you can make sure that you run the script twice if needed (if for example you skipped a day)

2019-08-21 23313, 2019

16:59 PM
alastairp

it's not very important what the project is, just the concept. it won't translate very cleanly to what you're doing, I think

2019-08-21 23302, 2019

17:00 PM
pristine__

> do a check to make sure what today is so that you can run the script twice

2019-08-21 23314, 2019

17:00 PM
pristine__

Store today's date you mean?

2019-08-21 23327, 2019

17:00 PM
pristine__

If there are no dates.

2019-08-21 23341, 2019

17:00 PM
pristine__

Use today's date and store it.

2019-08-21 23346, 2019

17:00 PM
pristine__

no?

2019-08-21 23350, 2019

17:01 PM
alastairp

imagine if you run this script daily. you start the script today (aug 21), but you see that the last time that you ran it was aug 18. you need to decide what you want to do. do you just run it once (and so you calculate data for aug 19), or do you run it twice (19th/20th?)

2019-08-21 23338, 2019
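
The catch-up decision described above could be sketched roughly like this. It assumes a daily job and assumes that "today" is skipped when there is history, since the 19th/20th example implies only completed days are processed. The function name `pending_run_dates` is hypothetical.

```python
from datetime import date, timedelta


def pending_run_dates(last_run, today):
    """Return the dates that still need processing, oldest first.

    last_run is the latest date recorded as processed, or None when the
    table of run dates is empty. With no history we start from today,
    as suggested in the discussion; otherwise we return every day
    strictly between last_run and today.
    """
    if last_run is None:
        return [today]
    gap = (today - last_run).days
    return [last_run + timedelta(days=n) for n in range(1, gap)]
```

With a last run of Aug 18 and a start on Aug 21 this returns the 19th and 20th, so the caller can loop over the list and run once per date.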

17:02 PM
alastairp

looking at your code, I have a few other questions - you get a datetime at utcnow. this includes hours, minutes, seconds, microseconds. then you go back in time a month, then to the first of that month

2019-08-21 23355, 2019

17:02 PM
pristine__

Yup

2019-08-21 23304, 2019

17:03 PM
alastairp

but this doesn't appear to be midnight on the first of that month, instead it's from whatever hour/minute/second you run the script

2019-08-21 23347, 2019

17:03 PM
pristine__

Yes. But would that affect anything, since we only need the month?

2019-08-21 23351, 2019

17:03 PM
pristine__

Oh.

2019-08-21 23300, 2019

17:04 PM
alastairp

I don't know. are you using this value to select listens from the database?

2019-08-21 23310, 2019

17:04 PM
pristine__

Yes

2019-08-21 23318, 2019

17:04 PM
alastairp

or do you discard hour information at some point?

2019-08-21 23326, 2019

17:04 PM
pristine__

That 1.parquet and so on.

2019-08-21 23340, 2019

17:04 PM
pristine__

We don't discard the hour info but just use the month value

2019-08-21 23358, 2019

17:04 PM
pristine__

So do you mean that this inconsistency in hours can add up over time and lead to errors?

2019-08-21 23339, 2019

17:05 PM
alastairp

oh, I see. you pass in a datetime, and then use the year and month to read a file from hdfs

2019-08-21 23346, 2019

17:05 PM
pristine__

Yup

2019-08-21 23357, 2019

17:05 PM
alastairp

is there any reason why you don't use a `date`, then?

2019-08-21 23332, 2019

17:06 PM
pristine__

Not really.

2019-08-21 23341, 2019
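
To illustrate the `date` question, here is a small sketch of "go back a month, then to the first of that month" that returns a `date`, so no hour/minute/second from the run time can leak into the result. The helper name `first_of_previous_month` is made up for this example.

```python
from datetime import date, datetime


def first_of_previous_month(d):
    """Return the first day of the month before d, as a date.

    Accepts a date or a datetime; only the year and month are used.
    """
    if d.month == 1:
        return date(d.year - 1, 12, 1)
    return date(d.year, d.month - 1, 1)


# A datetime keeps its time of day when only month and day are replaced:
run_time = datetime(2019, 8, 21, 17, 2, 55)
shifted = run_time.replace(month=7, day=1)      # 2019-07-01 17:02:55
as_date = first_of_previous_month(run_time)     # 2019-07-01, no time part
```

If the value is only ever used for its year and month, a `date` says so in the type.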

17:06 PM
alastairp

what's the behaviour if I run it right now? (Aug 21)

2019-08-21 23307, 2019

17:07 PM
alastairp

it's inclusive, according to the docs, so... it'll process data from July, and data from August up til now?

2019-08-21 23310, 2019

17:07 PM
pristine__

Suppose

2019-08-21 23333, 2019

17:07 PM
alastairp

does that make sense?

2019-08-21 23339, 2019

17:07 PM
pristine__

It will take dataframes of these 2 months.

2019-08-21 23339, 2019

17:07 PM
pristine__

Yes

2019-08-21 23341, 2019

17:07 PM
alastairp

(perhaps it does, I'm just trying to understand it)

2019-08-21 23349, 2019

17:07 PM
alastairp

but the data for August is incomplete

2019-08-21 23352, 2019

17:07 PM
alastairp

is that a problem?

2019-08-21 23355, 2019

17:07 PM
pristine__

Yes.

2019-08-21 23356, 2019

17:07 PM
pristine__

No

2019-08-21 23301, 2019

17:08 PM
pristine__

I will tell you

2019-08-21 23311, 2019

17:08 PM
pristine__

Data in hdfs is monthwise

2019-08-21 23319, 2019

17:08 PM
pristine__

One parquet for every month

2019-08-21 23322, 2019

17:08 PM
alastairp

sorry, I don't have much time, I'm in the middle of some other stuff

2019-08-21 23340, 2019

17:08 PM
pristine__

So if I want a week's data I need to get the parquet of the whole month

2019-08-21 23355, 2019
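
Since the data is stored one parquet per month, a date range first has to be turned into the list of months it touches. A rough sketch follows. The path layout in `parquet_path` is purely an assumption for illustration, and both function names are hypothetical.

```python
from datetime import date


def months_covering(start, end):
    """Yield (year, month) for every month touched by [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def parquet_path(year, month, root="/data/listens"):
    # Assumed layout, not the real one: <root>/<year>/<month>.parquet
    return "{}/{}/{}.parquet".format(root, year, month)


# A week that straddles a month boundary needs two monthly files.
week = list(months_covering(date(2019, 7, 29), date(2019, 8, 4)))
paths = [parquet_path(y, m) for y, m in week]
```

After loading the monthly files, the rows would still need to be filtered down to the exact start and end dates.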

17:08 PM
alastairp

in response to the original question, I'd move the call to datetime.utcnow() to main, or use it as an argument to the script

2019-08-21 23300, 2019
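
One way to "use it as an argument to the script" is a command-line option that falls back to the current UTC date only when nothing is given. This is a sketch under assumptions: the `--to-date` flag and both function names are invented here, not part of the existing scripts.

```python
import argparse
from datetime import date, datetime, timezone


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical wrapper script")
    parser.add_argument(
        "--to-date",
        type=date.fromisoformat,  # expects YYYY-MM-DD, e.g. 2019-08-21
        default=None,
        help="End of the window; defaults to today's date in UTC.",
    )
    return parser.parse_args(argv)


def resolve_to_date(args, today=None):
    """Prefer the explicit argument; otherwise fall back to today."""
    if args.to_date is not None:
        return args.to_date
    if today is not None:
        return today
    return datetime.now(timezone.utc).date()
```

Re-running a failed day then only needs the date passed explicitly, for example `--to-date 2019-08-18`, instead of relying on when the script happens to start.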

17:09 PM
pristine__

No prob. Whenever you are free, just ping.

2019-08-21 23302, 2019

17:10 PM
pristine__

Umm... We cannot pass it as an argument to the other script because the scripts are independent. So maybe we can calculate the dates somewhere else and use those values everywhere

2019-08-21 23322, 2019

17:10 PM
pristine__

(maybe)

2019-08-21 23300, 2019

17:19 PM
pristine__

ruaok: I guess, I will make another script just to calculate dates and store them centrally

2019-08-21 23311, 2019

17:19 PM
pristine__

so that they can be used by any file

2019-08-21 23304, 2019

18:41 PM
pristine__

ruaok: alastairp suggested the idea of tables, but our case is different. We just need to store when create_dataframes was last run, so that when train_models is run we can fetch the last to_date and from_date values and store them in model_metadata. Note that model_metadata would already have info about previous runs.

2019-08-21 23356, 2019

18:41 PM
alastairp

how often do you run create_dataframes?

2019-08-21 23303, 2019

18:42 PM
alastairp

and how often do you run train_models?

2019-08-21 23325, 2019

18:42 PM
pristine__

Not decided yet. But weekly most probably

2019-08-21 23349, 2019

18:42 PM
pristine__

First create_dataframes, then train_models then recommend

2019-08-21 23320, 2019

18:43 PM
pristine__

Like pre-process data, train data, generate recommendations

2019-08-21 23341, 2019

18:45 PM
alastairp

ok, sure. it's not important

2019-08-21 23301, 2019

18:46 PM
alastairp

but you'll run it periodically?

2019-08-21 23319, 2019

18:46 PM
alastairp

did you decide how you'll run it yet? something like cron?

2019-08-21 23332, 2019

18:48 PM
pristine__

Yup. Every week LB server would request recommendations of certain users and we will run these scripts

2019-08-21 23306, 2019

18:49 PM
pristine__

On new data. (Data added over the past week)

2019-08-21 23324, 2019

18:52 PM
pristine__

alastairp: what is your time zone?

2019-08-21 23303, 2019

18:53 PM
alastairp

I'm the same as ruaok

2019-08-21 23324, 2019

18:53 PM
alastairp

(when he's not travelling...)

2019-08-21 23345, 2019

18:53 PM
alastairp

so, some of these scripts depend on knowing what users to run for, and some don't require this information yet, right?

2019-08-21 23308, 2019

18:54 PM
ruaok

I'm still in the same time zone. :-)

2019-08-21 23320, 2019

18:54 PM
alastairp

from the names, I'm guessing that train_models depends on user information, but create_dataframes runs all the time regardless?

2019-08-21 23341, 2019

18:57 PM
pristine__

create_dataframes will fetch listens of X months from hdfs, pre-process them, and save them back to hdfs. train_models fetches this pre-processed data from hdfs and trains on it.

2019-08-21 23315, 2019

18:58 PM
pristine__

This data will include all the users covered in last X months

2019-08-21 23337, 2019

18:59 PM
pristine__

But we may require recommendations for only a subset of users, so candidate_sets.py will generate sets only for those users and recommd.py will generate recommendations using these sets.

2019-08-21 23317, 2019

19:00 PM
pristine__

( sorry for the hard time understanding this workflow. It is not well defined yet )

2019-08-21 23333, 2019

19:02 PM
alastairp

I think that once you understand the workflow a little more, the additional stuff you'll need will make more sense

2019-08-21 23352, 2019

19:02 PM
alastairp

for example how you want to keep track of how often you run this script

2019-08-21 23304, 2019

19:03 PM
alastairp

be it a database table, or ..., or ,...

2019-08-21 23302, 2019

19:05 PM
Gazooo has quit

2019-08-21 23300, 2019

19:06 PM
pristine__

Right now, we just wish to store the dates of the X months of data that was used to train the model. So for example, 1-06-2019 and 1-08-2019

2019-08-21 23347, 2019

19:06 PM
Gazooo joined the channel

2019-08-21 23301, 2019

19:07 PM
pristine__

We just need to keep track of the dates whose data was used to generate the dataframes.

2019-08-21 23337, 2019

19:07 PM
gr0uch0mars has quit

2019-08-21 23306, 2019

19:08 PM
pristine__

When train_model is run, we need these dates, because after train_model finishes a call will be made to store these dates, along with the model id and other info, in a table that keeps model metadata

2019-08-21 23325, 2019

19:09 PM
alastairp

do you have code to store this data in the table?

2019-08-21 23335, 2019

19:09 PM
alastairp

is this part of the train_model script?

2019-08-21 23345, 2019

19:10 PM
pristine__

https://github.com/metabrainz/listenbrainz-labs/b…

2019-08-21 23311, 2019

19:11 PM
pristine__

I am writing the code, got stuck with this date thing. Yes it will be part of train_model script.

2019-08-21 23318, 2019

19:11 PM
travis-ci joined the channel

2019-08-21 23318, 2019

19:11 PM
travis-ci

metabrainz/picard#4836 (master - 462add3 : Philipp Wolfer): The build passed.

2019-08-21 23318, 2019

19:11 PM
travis-ci

Change view : https://github.com/metabrainz/picard/compare/ea70…

2019-08-21 23318, 2019

19:11 PM
travis-ci

Build details : https://travis-ci.org/metabrainz/picard/builds/57…

2019-08-21 23318, 2019

19:11 PM
travis-ci has left the channel

2019-08-21 23354, 2019

19:14 PM
reosarevok

ruaok: make sure you are away from bears :/ https://www.bbc.com/news/world-us-canada-49412385

2019-08-21 23308, 2019

19:15 PM
alastairp

ah, that's great then if you're already going to add it to this table

2019-08-21 23345, 2019

19:15 PM
alastairp

my recommendation is to read the date of the last time you ran it from the table, and then use that to calculate the date ranges

2019-08-21 23358, 2019

19:15 PM
alastairp

pass those ranges into whatever functions you need, to get the data out of it

2019-08-21 23311, 2019

19:16 PM
alastairp

and then you have the dates to add the new data to the table

2019-08-21 23303, 2019

19:21 PM
pristine__

What if the range changes? Last time I ran from 1-05-19 (to_date) to 1-06-19 (from_date), but this time I want to run on two months of data. So probably just fetch from_date from the table, which will become the to_date of the new run, and then add the number of months to this date, so from_date now becomes 1-08-19

2019-08-21 23310, 2019

19:21 PM
pristine__

and continue likewise.

2019-08-21 23327, 2019
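
The flow discussed above (read the last stored date, compute the next range with a possibly different length, pass it on, then store it) might look roughly like this. It is a sketch under assumptions: the rows are plain dicts standing in for the model metadata table, the keys `start` and `end` are chosen here to avoid the to_date/from_date naming, and both helper names are hypothetical.

```python
from datetime import date


def add_months(first_of_month, months):
    """Shift a first-of-month date forward by a whole number of months."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def next_window(metadata_rows, months, default_start):
    """Pick the next (start, end) window from previously stored rows.

    The new window starts where the latest stored one ended and may have
    a different length than earlier runs. With no rows yet, it starts at
    default_start.
    """
    start = max((row["end"] for row in metadata_rows), default=default_start)
    return start, add_months(start, months)


# Previous run covered 1-05-19 to 1-06-19; the next run uses two months.
rows = [{"start": date(2019, 5, 1), "end": date(2019, 6, 1)}]
start, end = next_window(rows, months=2, default_start=date(2019, 1, 1))
rows.append({"start": start, "end": end})  # stored alongside the model info
```

Keeping the window length as a parameter means the range can be widened or narrowed between runs without changing how the starting point is found.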

19:22 PM
alastairp

why might you want to change the range?

2019-08-21 23328, 2019

19:23 PM
pristine__

We might start with training on a week's data but then realise that the recommendations are not good enough, so maybe next time we change the range to include more listens

2019-08-21 23346, 2019

19:24 PM
pristine__

I did not consider the range to be constant as of now, because it all depends on the quality of the recommendations. We might need to increase/decrease the range and play with the recommendations