so the idea would be to make 2 more works, each linked to its respective recording, and both linked to the work. and then I can link the score to those new works?
2019-08-21 23312, 2019
yvanzo
Right, reosarevok has probably some example from classical music that uses the same trick.
but what I plan to do is, when train_models.py completes its execution, we will append data to this dataframe
2019-08-21 23321, 2019
pristine__
so how do we get the values of to_date and from_date?
2019-08-21 23352, 2019
alastairp
hay is for horses, but hey is an interjection
2019-08-21 23304, 2019
pristine__
alastairp: hi
2019-08-21 23310, 2019
alastairp
hi pristine__
2019-08-21 23317, 2019
alastairp
yes, I saw your note, I haven't got to it yet
2019-08-21 23321, 2019
pristine__
<3
2019-08-21 23328, 2019
pristine__
no problem :)
2019-08-21 23347, 2019
oknozor has quit
2019-08-21 23331, 2019
pristine__
alastairp: want to review the above problem?
2019-08-21 23301, 2019
alastairp
I can't right now, sorry, but I'll come back to it when I look at the other one if you still haven't solved it
2019-08-21 23311, 2019
pristine__
oh. sure :)
2019-08-21 23331, 2019
alastairp
ah, actually. I just looked
2019-08-21 23311, 2019
pristine__
Nice
2019-08-21 23315, 2019
alastairp
so, when dealing with things like dates (and other ranges), it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible
2019-08-21 23351, 2019
alastairp
this makes it easy to pass this information around different places (make a query, store the data in a parquet), and it also makes testing much easier
2019-08-21 23312, 2019
alastairp
imagine that you wanted to test get_listens_for_training_model_window, this would give a different result depending on when you ran the tests
2019-08-21 23338, 2019
pristine__
> it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible
2019-08-21 23341, 2019
pristine__
like?
2019-08-21 23342, 2019
alastairp
in this case, I'd definitely pass in to_date as an argument to get_listens_for_training_model_window
2019-08-21 23304, 2019
alastairp
you already do something similar - the call to utils.get_listens already passes in the date range
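A minimal sketch of the point above. The real signature of get_listens_for_training_model_window is not shown in this log, so `training_window` and its 30-day default are hypothetical stand-ins; the only idea illustrated is that the caller supplies to_date.

```python
from datetime import datetime, timedelta


def training_window(to_date, days=30):
    """Return (from_date, to_date) for a window that ends at to_date.

    Hypothetical helper: because to_date is an argument, the result does
    not depend on when the function happens to be called.
    """
    from_date = to_date - timedelta(days=days)
    return from_date, to_date


# A test can pin the date and get the same answer on every run.
fixed = datetime(2019, 8, 21)
print(training_window(fixed))
```

With the date chosen by the caller, the same value can also be reused for the query and for whatever gets stored afterwards.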
2019-08-21 23320, 2019
pristine__
yup. there are two scripts
2019-08-21 23330, 2019
pristine__
one will be run at some point in time
2019-08-21 23346, 2019
pristine__
we need to record that point in time and use it when the other script runs
2019-08-21 23351, 2019
alastairp
so you could make to_date in main(), or even better, consider passing it in as an argument
2019-08-21 23354, 2019
pristine__
is it possible?
2019-08-21 23323, 2019
pristine__
passing as argument to main?
2019-08-21 23325, 2019
alastairp
I don't really like it when scripts use the time that they're run at as a value - this is important for things like error recovery
2019-08-21 23334, 2019
pristine__
right
2019-08-21 23353, 2019
pristine__
so I calculate the time somewhere else, and use that value in all places
2019-08-21 23356, 2019
alastairp
e.g. imagine if you run every day, but one day for some reason it failed... machine was broken, unexpected power cut, bug in the code
2019-08-21 23305, 2019
pristine__
and not calculate it while running the script?
2019-08-21 23312, 2019
pristine__
true.
2019-08-21 23319, 2019
alastairp
I do this kind of thing in a different project. I have a database table that contains a list of dates
2019-08-21 23334, 2019
alastairp
and so when I run it, I look at the latest date that was run, and then do the next one
2019-08-21 23303, 2019
pristine__
can i have link of the project?
2019-08-21 23346, 2019
alastairp
there are some special cases here - e.g. if there are no dates in the table, you can definitely choose 'today' as the date to do it. also, do a check to make sure what today actually is, so that you can make sure that you run the script twice if needed (if for example you skipped a day)
2019-08-21 23313, 2019
alastairp
it's not very important what the project is, just the concept. it won't translate very cleanly to what you're doing, I think
2019-08-21 23302, 2019
pristine__
> do a check to make sure what today is so that you can run the script twice
2019-08-21 23314, 2019
pristine__
Store today's date you mean?
2019-08-21 23327, 2019
pristine__
If there are no dates.
2019-08-21 23341, 2019
pristine__
Use today's date and store it.
2019-08-21 23346, 2019
pristine__
no?
2019-08-21 23350, 2019
alastairp
imagine if you run this script daily. you start the script today (aug 21), but you see that the last time that you ran it was aug 18. you need to decide what you want to do. do you just run it once (and so you calculate data for aug 19), or do you run it twice (19th/20th?)
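The decision described above could be sketched like this. It assumes the stored date is the last day that was processed and that today is never processed because it is still incomplete; `dates_to_process` and `catch_up` are hypothetical names, not part of any existing script.

```python
from datetime import date, timedelta


def dates_to_process(last_run, today, catch_up=True):
    """List the days that still need a run.

    last_run: last day already processed, or None if the table is empty.
    catch_up: if False, only the oldest missed day is returned.
    """
    if last_run is None:
        # Special case from the discussion: nothing recorded yet, use today.
        return [today]
    missed = []
    day = last_run + timedelta(days=1)
    while day < today:
        missed.append(day)
        day += timedelta(days=1)
    return missed if catch_up else missed[:1]


print(dates_to_process(date(2019, 8, 18), date(2019, 8, 21)))
```

For the Aug 18 / Aug 21 example this yields the 19th and 20th, or just the 19th with `catch_up=False`.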
2019-08-21 23338, 2019
alastairp
looking at your code, I have a few other questions - you get a datetime at utcnow. this includes hours, minutes, seconds, microseconds. then you go back in time a month, then to the first of that month
2019-08-21 23355, 2019
pristine__
Yup
2019-08-21 23304, 2019
alastairp
but this doesn't appear to be midnight on the first of that month, instead it's from whatever hour/minute/second you run the script
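The code under review is not shown in this log, so the following is only a sketch of one way to land on midnight; `first_of_previous_month` is a hypothetical helper.

```python
from datetime import datetime


def first_of_previous_month(now):
    """Return midnight on the first day of the month before `now`."""
    if now.month == 1:
        year, month = now.year - 1, 12
    else:
        year, month = now.year, now.month - 1
    # Building a fresh datetime leaves hour/minute/second/microsecond at zero.
    return datetime(year, month, 1)


run_time = datetime(2019, 8, 21, 23, 41, 7, 123456)
print(first_of_previous_month(run_time))  # 2019-07-01 00:00:00
```

By contrast, something like `now.replace(day=1)` keeps the time of day that the script was started at.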
2019-08-21 23347, 2019
pristine__
Yes. But would that affect anything, since we only need the month?
2019-08-21 23351, 2019
pristine__
Oh.
2019-08-21 23300, 2019
alastairp
I don't know. are you using this value to select listens from the database?
2019-08-21 23310, 2019
pristine__
Yes
2019-08-21 23318, 2019
alastairp
or do you discard hour information at some point?
2019-08-21 23326, 2019
pristine__
That 1.parquet and so on.
2019-08-21 23340, 2019
pristine__
We don't discard the hour info but just use the month value
2019-08-21 23358, 2019
pristine__
So do you mean that this inconsistency in hours can add up over time and lead to errors?
2019-08-21 23339, 2019
alastairp
oh, I see. you pass in a datetime, and then use the year and month to read a file from hdfs
2019-08-21 23346, 2019
pristine__
Yup
2019-08-21 23357, 2019
alastairp
is there any reason why you don't use a `date`, then?
2019-08-21 23332, 2019
pristine__
Not really.
2019-08-21 23341, 2019
alastairp
what's the behaviour if I run it right now? (Aug 21)
2019-08-21 23307, 2019
alastairp
it's inclusive, according to the docs, so... it'll process data from July, and data from August up til now?
2019-08-21 23310, 2019
pristine__
Suppose
2019-08-21 23333, 2019
alastairp
does that make sense?
2019-08-21 23339, 2019
pristine__
It will take dataframes of these 2 months.
2019-08-21 23339, 2019
pristine__
Yes
2019-08-21 23341, 2019
alastairp
(perhaps it does, I'm just trying to understand it)
2019-08-21 23349, 2019
alastairp
but the data for August is incomplete
2019-08-21 23352, 2019
alastairp
is that a problem?
2019-08-21 23355, 2019
pristine__
Yes.
2019-08-21 23356, 2019
pristine__
No
2019-08-21 23301, 2019
pristine__
I will tell you
2019-08-21 23311, 2019
pristine__
Data in hdfs is monthwise
2019-08-21 23319, 2019
pristine__
One parquet for every month
2019-08-21 23322, 2019
alastairp
sorry, I don't have much time, I'm in the middle of some other stuff
2019-08-21 23340, 2019
pristine__
So if I want a week's data I need to get the parquet for the whole month
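A rough sketch of the month-wise layout described above: given an inclusive date range, list the (year, month) pairs whose parquet files would have to be read. `months_in_range` is a hypothetical helper, and the actual file paths in hdfs are left out because they are not given here.

```python
from datetime import date


def months_in_range(from_date, to_date):
    """Return the (year, month) pairs touched by an inclusive date range."""
    months = []
    year, month = from_date.year, from_date.month
    while (year, month) <= (to_date.year, to_date.month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


# One week inside August still needs the whole August file.
print(months_in_range(date(2019, 8, 14), date(2019, 8, 21)))  # [(2019, 8)]
```

Rows outside the requested week would then have to be filtered out after loading, if only that week is wanted.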
2019-08-21 23355, 2019
alastairp
in response to the original question, I'd move the call to datetime.utcnow() to main, or use it as an argument to the script
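One way this could look, as a sketch only: the `--to-date` flag and the shape of `main` are assumptions, not the project's actual interface. The clock is read in exactly one place, and only when no date is passed in.

```python
import argparse
from datetime import date, datetime, timezone


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--to-date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        default=None,
        help="end of the window as YYYY-MM-DD (defaults to today in UTC)",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The only place that looks at the clock; everything below gets to_date passed in.
    to_date = args.to_date or datetime.now(timezone.utc).date()
    return to_date


print(main(["--to-date", "2019-08-21"]))  # 2019-08-21
```

Re-running a failed day is then a matter of passing that day's date explicitly.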
2019-08-21 23300, 2019
pristine__
No prob. Whenever you are free (whenever) , just ping.
2019-08-21 23302, 2019
pristine__
Umm....We cannot pass it as an argument to the other script because the scripts are independent. So maybe we can calculate the date somewhere else and use those values everywhere
2019-08-21 23322, 2019
pristine__
(maybe)
2019-08-21 23300, 2019
pristine__
ruaok: I guess, I will make another script just to calculate dates and store them centrally
2019-08-21 23311, 2019
pristine__
so that they can be used by any file
2019-08-21 23304, 2019
pristine__
ruaok: alastairp suggested the idea of tables, but our case is different. We just need to store when create_dataframes was last run, so that when train_models is run we can fetch the last to_date and from_date values and store them in model_metadata. Note that model_metadata would already have info about previous runs.
2019-08-21 23356, 2019
alastairp
how often do you run create_dataframes?
2019-08-21 23303, 2019
alastairp
and how often do you run train_models?
2019-08-21 23325, 2019
pristine__
Not decided yet. But weekly most probably
2019-08-21 23349, 2019
pristine__
First create_dataframes, then train_models then recommend
2019-08-21 23320, 2019
pristine__
Like pre-process data, train data, generate recommendations
2019-08-21 23341, 2019
alastairp
ok, sure. it's not important
2019-08-21 23301, 2019
alastairp
but you'll run it periodically?
2019-08-21 23319, 2019
alastairp
did you decide how you'll run it yet? something like cron?
2019-08-21 23332, 2019
pristine__
Yup. Every week LB server would request recommendations of certain users and we will run these scripts
2019-08-21 23306, 2019
pristine__
On new data. (Data added over the past week)
2019-08-21 23324, 2019
pristine__
alastairp: what is your time zone?
2019-08-21 23303, 2019
alastairp
I'm the same as ruaok
2019-08-21 23324, 2019
alastairp
(when he's not travelling...)
2019-08-21 23345, 2019
alastairp
so, some of these scripts depend on knowing what users to run for, and some don't require this information yet, right?
2019-08-21 23308, 2019
ruaok
I'm still in the same time zone. :-)
2019-08-21 23320, 2019
alastairp
from the names, I'm guessing that train_models depends on user information, but create_dataframes runs all the time regardless?
2019-08-21 23341, 2019
pristine__
create_dataframes will fetch X months of listens from hdfs, preprocess them, and save them back to hdfs. train_models fetches this preprocessed data from hdfs and trains on it.
2019-08-21 23315, 2019
pristine__
This data will include all the users covered in last X months
2019-08-21 23337, 2019
pristine__
But we may require recommendations for only a subset of users, so candidate_sets.py will generate sets only for those users and recommd.py will generate recommendations using these sets.
2019-08-21 23317, 2019
pristine__
( sorry for the hard time understanding this workflow. It is not well defined yet )
2019-08-21 23333, 2019
alastairp
I think that once you understand the workflow a little more, the additional stuff you'll need will make more sense
2019-08-21 23352, 2019
alastairp
for example how you want to keep track of how often you run this script
2019-08-21 23304, 2019
alastairp
be it a database table, or ..., or ,...
2019-08-21 23302, 2019
Gazooo has quit
2019-08-21 23300, 2019
pristine__
Right now, we just want to store the dates of the X months of data that were used to train the model. So, e.g., 1-06-2019 and 1-08-2019
2019-08-21 23347, 2019
Gazooo joined the channel
2019-08-21 23301, 2019
pristine__
We just need to keep track of the dates whose data was used to generate the dataframes.
2019-08-21 23337, 2019
gr0uch0mars has quit
2019-08-21 23306, 2019
pristine__
When train_model is run, we need these dates, because after train model has executed a call will be made to store them, along with the model id and other info, in a table that keeps model metadata
ah, that's great then if you're already going to add it to this table
2019-08-21 23345, 2019
alastairp
my recommendation is to read the date of the last time you ran it from the table, and then use that to calculate the date ranges
2019-08-21 23358, 2019
alastairp
pass those ranges into whatever functions you need, to get the data out of it
2019-08-21 23311, 2019
alastairp
and then you have the dates to add the new data to the table
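The three steps above (read the last date, compute the range, store the new dates) might be sketched as follows. An in-memory SQLite table stands in for the real metadata store; the two-column schema, the 7-day fallback, and the helper names are all assumptions, and from_date is taken to be the earlier end of the range.

```python
import sqlite3
from datetime import date, timedelta


def next_range(conn, today, days=7):
    """Start where the previous run stopped, or `days` before today if there is none."""
    row = conn.execute("SELECT MAX(to_date) FROM model_metadata").fetchone()
    if row[0]:
        from_date = date.fromisoformat(row[0])
    else:
        from_date = today - timedelta(days=days)
    return from_date, today


def record_run(conn, from_date, to_date):
    """Store the range that was just used, as ISO strings so MAX() sorts correctly."""
    conn.execute(
        "INSERT INTO model_metadata (from_date, to_date) VALUES (?, ?)",
        (from_date.isoformat(), to_date.isoformat()),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_metadata (from_date TEXT, to_date TEXT)")
```

A run would call `next_range`, pass the two dates into the functions that fetch and train, and finish with `record_run`.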
2019-08-21 23303, 2019
pristine__
What if the range changes? Last time I ran from 1-05-19 (to_date) to 1-06-19 (from_date) but this time I want to run on two months of data. So probably just fetch from_date from the table, which will become to_date of your new run, and then add the number of months to this date so from_date now becomes 1-08-19
2019-08-21 23310, 2019
pristine__
and continue likewise.
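The month arithmetic in that proposal can be sketched with the standard library alone. `add_months` is a hypothetical helper and assumes the boundaries always fall on the first of a month, as in the dates quoted above.

```python
from datetime import date


def add_months(start, months):
    """Shift a first-of-month date forward by a whole number of months."""
    index = start.year * 12 + (start.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


# Previous run ended at 1-06-19; a two-month window now ends at 1-08-19.
previous_end = date(2019, 6, 1)
print(add_months(previous_end, 2))  # 2019-08-01
```

Because the number of months is an argument, the window length can change between runs without touching the stored dates.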
2019-08-21 23327, 2019
alastairp
why might you want to change the range?
2019-08-21 23328, 2019
pristine__
We might start with training on a week's data but then realise that the recommendations are not good enough, so maybe next time we change the range to include more listens
2019-08-21 23346, 2019
pristine__
I did not consider range to be constant as of now because it all depends on quality of recommendation. We might need to increase/decrease range and play with recommendations