#metabrainz

      • alastairp
        and we have 2 scores, one for each recording
      • 2019-08-21 23325, 2019

      • alastairp
        so the idea would be to make 2 more works, each linked to its respective recording, and both linked to the work. and then I can link the score to those new works?
      • 2019-08-21 23312, 2019

      • yvanzo
        Right, reosarevok has probably some example from classical music that uses the same trick.
      • 2019-08-21 23324, 2019

      • reosarevok
        I'd say that, yes
      • 2019-08-21 23331, 2019

      • reosarevok
        Is this a thing that you have happen often?
      • 2019-08-21 23329, 2019

      • alastairp
        we have 160 recordings/transcriptions
      • 2019-08-21 23354, 2019

      • gr0uch0mars joined the channel
      • 2019-08-21 23319, 2019

      • pristine__
        ruaok: hey
      • 2019-08-21 23329, 2019

      • BrainzGit
        [musicbrainz-server] reosarevok opened pull request #1176 (master…MBS-4644): MBS-4644: Show releases that have CAA images in lists https://github.com/metabrainz/musicbrainz-server/…
      • 2019-08-21 23330, 2019

      • BrainzBot
        MBS-4644: Indicate which releases have CAA art in listings https://tickets.metabrainz.org/browse/MBS-4644
      • 2019-08-21 23320, 2019

      • ruaok
        pristine__: hay is for horses.
      • 2019-08-21 23343, 2019

      • pristine__
        Lol
      • 2019-08-21 23309, 2019

      • pristine__
        So, you remember the to_date and from_date in create_dataframes?
      • 2019-08-21 23331, 2019

      • ruaok
        Yes.
      • 2019-08-21 23320, 2019

      • pristine__
      • 2019-08-21 23321, 2019

      • pristine__
        nice
      • 2019-08-21 23325, 2019

      • pristine__
        so,
      • 2019-08-21 23350, 2019

      • pristine__
        we need to store these two values in model_metadata.parquet
      • 2019-08-21 23314, 2019

      • pristine__
      • 2019-08-21 23318, 2019

      • pristine__
        here
      • 2019-08-21 23356, 2019

      • pristine__
        but what I plan to do is, when train_models.py completes its execution, we will append data to this dataframe
      • 2019-08-21 23321, 2019

      • pristine__
        so how do we get the values of to_date and from_date?
      • 2019-08-21 23352, 2019

      • alastairp
        hay is for horses, but hey is an interjection
      • 2019-08-21 23304, 2019

      • pristine__
        alastairp: hi
      • 2019-08-21 23310, 2019

      • alastairp
        hi pristine__
      • 2019-08-21 23317, 2019

      • alastairp
        yes, I saw your note, I haven't got to it yet
      • 2019-08-21 23321, 2019

      • pristine__
        <3
      • 2019-08-21 23328, 2019

      • pristine__
        no problem :)
      • 2019-08-21 23347, 2019

      • oknozor has quit
      • 2019-08-21 23331, 2019

      • pristine__
        alastairp: want to review the above problem?
      • 2019-08-21 23301, 2019

      • alastairp
        I can't right now, sorry, but I'll come back to it when I look at the other one if you still haven't solved it
      • 2019-08-21 23311, 2019

      • pristine__
        oh. sure :)
      • 2019-08-21 23331, 2019

      • alastairp
        ah, actually. I just looked
      • 2019-08-21 23311, 2019

      • pristine__
        Nice
      • 2019-08-21 23315, 2019

      • alastairp
        so, when dealing with things like dates (and other ranges), it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible
      • 2019-08-21 23351, 2019

      • alastairp
        this makes it easy to pass this information around different places (make a query, store the data in a parquet), and it also makes testing much easier
      • 2019-08-21 23312, 2019

      • alastairp
        imagine that you wanted to test get_listens_for_training_model_window, this would give a different result depending on when you ran the tests
      • 2019-08-21 23338, 2019

      • pristine__
        > it's almost always the best idea to move the selection of the dates as close to the user/calling script as possible
      • 2019-08-21 23341, 2019

      • pristine__
        like?
      • 2019-08-21 23342, 2019

      • alastairp
        in this case, I'd definitely pass in to_date as an argument to get_listens_for_training_model_window
      • 2019-08-21 23304, 2019

      • alastairp
        you already do something similar - the call to utils.get_listens already passes in the date range
      • 2019-08-21 23320, 2019
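
A minimal sketch of what alastairp is describing, assuming a hypothetical signature for get_listens_for_training_model_window: the caller supplies to_date, so the function is deterministic and easy to test.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# Hypothetical sketch: the reference date is chosen by the caller, not inside
# the function, so tests can pass a fixed date and get a repeatable result.
def get_listens_for_training_model_window(to_date, months=1):
    from_date = (to_date + relativedelta(months=-months)).replace(day=1)
    # ... fetch listens between from_date and to_date from HDFS ...
    return from_date, to_date

# the calling script (e.g. main()) owns the decision of what "now" is
from_date, to_date = get_listens_for_training_model_window(datetime.utcnow())
```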

      • pristine__
        yup. there are two scripts
      • 2019-08-21 23330, 2019

      • pristine__
        one will be run at some point in time
      • 2019-08-21 23346, 2019

      • pristine__
        we need to record that point in time and use it when the other script runs
      • 2019-08-21 23351, 2019

      • alastairp
        so you could make to_date in main(), or even better, consider passing it in as an argument
      • 2019-08-21 23354, 2019

      • pristine__
        is it possible?
      • 2019-08-21 23323, 2019

      • pristine__
        passing as argument to main?
      • 2019-08-21 23325, 2019

      • alastairp
        I don't really like it when scripts use the time that they're run at as a value - this is important for things like error recovery
      • 2019-08-21 23334, 2019
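
One way to avoid baking "the time the script ran" into the logic is to accept the date on the command line and only fall back to utcnow; a sketch with argparse (the option name is illustrative, not the real interface):

```python
import argparse
from datetime import datetime

# Sketch: an explicit --to-date makes re-runs after a failure reproducible,
# because the operator can supply the date the run *should* have used.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--to-date",
    type=lambda s: datetime.strptime(s, "%Y-%m-%d"),
    default=None,
    help="reference date for the training window (YYYY-MM-DD)",
)
args = parser.parse_args()
to_date = args.to_date or datetime.utcnow()
print(to_date)
```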

      • pristine__
        right
      • 2019-08-21 23353, 2019

      • pristine__
        so I calculate the time somewhere else, and use that value in all places
      • 2019-08-21 23356, 2019

      • alastairp
        e.g. imagine if you run every day, but one day for some reason it failed... machine was broken, unexpected power cut, bug in the code
      • 2019-08-21 23305, 2019

      • pristine__
        and not calculate it while running the script?
      • 2019-08-21 23312, 2019

      • pristine__
        true.
      • 2019-08-21 23319, 2019

      • alastairp
        I do this kind of thing in a different project. I have a database table that contains a list of dates
      • 2019-08-21 23334, 2019

      • alastairp
        and so when I run it, I look at the latest date that was run, and then do the next one
      • 2019-08-21 23303, 2019

      • pristine__
        can i have link of the project?
      • 2019-08-21 23346, 2019

      • alastairp
        there are some special cases here - e.g. if there are no dates in the table, you can definitely choose 'today' as the date to do it. also, do a check to make sure what today actually is, so that you can make sure that you run the script twice if needed (if for example you skipped a day)
      • 2019-08-21 23313, 2019

      • alastairp
        it's not very important what the project is, just the concept. it won't translate very cleanly to what you're doing, I think
      • 2019-08-21 23302, 2019

      • pristine__
        > do a check to make sure what today is so that you can run the script twice
      • 2019-08-21 23314, 2019

      • pristine__
        Store today's date you mean?
      • 2019-08-21 23327, 2019

      • pristine__
        If there are no dates.
      • 2019-08-21 23341, 2019

      • pristine__
        Use today's date and store it.
      • 2019-08-21 23346, 2019

      • pristine__
        no?
      • 2019-08-21 23350, 2019

      • alastairp
        imagine if you run this script daily. you start the script today (aug 21), but you see that the last time that you ran it was aug 18. you need to decide what you want to do. do you just run it once (and so you calculate data for aug 19), or do you run it twice (19th/20th?)
      • 2019-08-21 23338, 2019
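
A sketch of the catch-up logic alastairp describes, assuming the "table" boils down to the most recent processed date: if the last run was Aug 18 and today is Aug 21, it returns Aug 19 and Aug 20 so the script can be run once per missed day.

```python
from datetime import date, timedelta

def dates_to_process(last_run, today=None):
    """Return every day after last_run up to (but not including) today."""
    today = today or date.today()
    if last_run is None:
        return [today]            # nothing recorded yet: start from today
    pending = []
    day = last_run + timedelta(days=1)
    while day < today:
        pending.append(day)
        day += timedelta(days=1)
    return pending

print(dates_to_process(date(2019, 8, 18), date(2019, 8, 21)))
# -> [datetime.date(2019, 8, 19), datetime.date(2019, 8, 20)]
```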

      • alastairp
        looking at your code, I have a few other questions - you get a datetime at utcnow. this includes hours, minutes, seconds, microseconds. then you go back in time a month, then to the first of that month
      • 2019-08-21 23355, 2019

      • pristine__
        Yup
      • 2019-08-21 23304, 2019

      • alastairp
        but this doesn't appear to be midnight on the first of that month, instead it's from whatever hour/minute/second you run the script
      • 2019-08-21 23347, 2019

      • pristine__
        Yes. But would that affect anything, since we only need the month?
      • 2019-08-21 23351, 2019

      • pristine__
        Oh.
      • 2019-08-21 23300, 2019

      • alastairp
        I don't know. are you using this value to select listens from the database?
      • 2019-08-21 23310, 2019

      • pristine__
        Yes
      • 2019-08-21 23318, 2019

      • alastairp
        or do you discard hour information at some point?
      • 2019-08-21 23326, 2019

      • pristine__
        That 1.parquet and so on.
      • 2019-08-21 23340, 2019

      • pristine__
        We don't discard the hour info but just use the month value
      • 2019-08-21 23358, 2019

      • pristine__
        So do you mean that this inconsistency in hours can add up over time and lead to errors?
      • 2019-08-21 23339, 2019

      • alastairp
        oh, I see. you pass in a datetime, and then use the year and month to read a file from hdfs
      • 2019-08-21 23346, 2019

      • pristine__
        Yup
      • 2019-08-21 23357, 2019

      • alastairp
        is there any reason why you don't use a `date`, then?
      • 2019-08-21 23332, 2019
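
A sketch of the difference being discussed: a datetime keeps the time of day the script happened to run at, while a plain date has nothing to leak if only the year and month are ever used.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

now = datetime.utcnow()

# with a datetime, "first of last month" still carries the hour/minute/second
# the script was started at:
from_dt = (now + relativedelta(months=-1)).replace(day=1)

# if only year and month are used to pick the monthly parquet, a plain date
# has no time-of-day component to worry about:
from_date = (now.date() + relativedelta(months=-1)).replace(day=1)

print(from_dt)    # e.g. 2019-07-01 14:32:05.123456
print(from_date)  # e.g. 2019-07-01
```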

      • pristine__
        Not really.
      • 2019-08-21 23341, 2019

      • alastairp
        what's the behaviour if I run it right now? (Aug 21)
      • 2019-08-21 23307, 2019

      • alastairp
        it's inclusive, according to the docs, so... it'll process data from July, and data from August up til now?
      • 2019-08-21 23310, 2019

      • pristine__
        Suppose
      • 2019-08-21 23333, 2019

      • alastairp
        does that make sense?
      • 2019-08-21 23339, 2019

      • pristine__
        It will take dataframes of these 2 months.
      • 2019-08-21 23339, 2019

      • pristine__
        Yes
      • 2019-08-21 23341, 2019

      • alastairp
        (perhaps it does, I'm just trying to understand it)
      • 2019-08-21 23349, 2019

      • alastairp
        but the data for August is incomplete
      • 2019-08-21 23352, 2019

      • alastairp
        is that a problem?
      • 2019-08-21 23355, 2019

      • pristine__
        Yes.
      • 2019-08-21 23356, 2019

      • pristine__
        No
      • 2019-08-21 23301, 2019

      • pristine__
        I will tell you
      • 2019-08-21 23311, 2019

      • pristine__
        Data in hdfs is monthwise
      • 2019-08-21 23319, 2019

      • pristine__
        One parquet for every month
      • 2019-08-21 23322, 2019

      • alastairp
        sorry, I don't have much time, I'm in the middle of some other stuff
      • 2019-08-21 23340, 2019

      • pristine__
        So if I want a week's data I need to get the parquet of the whole month
      • 2019-08-21 23355, 2019
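
A sketch of why a one-week window still means reading whole months, given one parquet per month in HDFS (the path layout below is illustrative, not the real one):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def monthly_parquet_paths(from_date, to_date, base="/data/listenbrainz"):
    """List every monthly parquet that the [from_date, to_date] range touches."""
    paths = []
    current = from_date.replace(day=1)
    while current <= to_date:
        paths.append(f"{base}/{current.year}/{current.month}.parquet")
        current += relativedelta(months=1)
    return paths

# a one-week window in August still pulls the whole August file
print(monthly_parquet_paths(date(2019, 8, 14), date(2019, 8, 21)))
# -> ['/data/listenbrainz/2019/8.parquet']
```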

      • alastairp
        in response to the original question, I'd move the call to datetime.utcnow() to main, or use it as an argument to the script
      • 2019-08-21 23300, 2019

      • pristine__
        No prob. Whenever you are free (whenever) , just ping.
      • 2019-08-21 23302, 2019

      • pristine__
        Umm... We cannot pass it as an argument to the other script because the scripts are independent. So maybe we can calculate the date somewhere else and use those values everywhere
      • 2019-08-21 23322, 2019

      • pristine__
        (maybe)
      • 2019-08-21 23300, 2019

      • pristine__
        ruaok: I guess, I will make another script just to calculate dates and store them centrally
      • 2019-08-21 23311, 2019

      • pristine__
        so that they can be used by any file
      • 2019-08-21 23304, 2019

      • pristine__
        ruaok: like alastairp suggested the idea of tables, but our case is different. We just need to store when create_dataframes was last run, so that when train_models is run we can fetch the last to_date and from_date values and store them in model_metadata. Note that model_metadata would already have info about previous runs.
      • 2019-08-21 23356, 2019
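
A sketch of that idea, using a small pandas parquet as the shared place for the dates; the real pipeline lives on Spark/HDFS, and the file name and columns here are illustrative only.

```python
import pandas as pd

def save_dataframe_window(from_date, to_date, path="dataframe_metadata.parquet"):
    """Called at the end of create_dataframes: record the window that was used."""
    pd.DataFrame([{"from_date": from_date, "to_date": to_date}]).to_parquet(path)

def load_dataframe_window(path="dataframe_metadata.parquet"):
    """Called by train_models: fetch the window of the most recent run."""
    row = pd.read_parquet(path).iloc[-1]
    return row["from_date"], row["to_date"]
```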

      • alastairp
        how often do you run create_dataframes?
      • 2019-08-21 23303, 2019

      • alastairp
        and how often do you run train_models?
      • 2019-08-21 23325, 2019

      • pristine__
        Not decided yet. But weekly most probably
      • 2019-08-21 23349, 2019

      • pristine__
        First create_dataframes, then train_models then recommend
      • 2019-08-21 23320, 2019

      • pristine__
        Like pre-process data, train data, generate recommendations
      • 2019-08-21 23341, 2019

      • alastairp
        ok, sure. it's not important
      • 2019-08-21 23301, 2019

      • alastairp
        but you'll run it periodically?
      • 2019-08-21 23319, 2019

      • alastairp
        did you decide how you'll run it yet? something like cron?
      • 2019-08-21 23332, 2019

      • pristine__
        Yup. Every week the LB server would request recommendations for certain users and we will run these scripts
      • 2019-08-21 23306, 2019

      • pristine__
        On new data. (Data added over the past week)
      • 2019-08-21 23324, 2019

      • pristine__
        alastairp: what is your time zone?
      • 2019-08-21 23303, 2019

      • alastairp
        I'm the same as ruaok
      • 2019-08-21 23324, 2019

      • alastairp
        (when he's not travelling...)
      • 2019-08-21 23345, 2019

      • alastairp
        so, some of these scripts depend on knowing what users to run for, and some don't require this information yet, right?
      • 2019-08-21 23308, 2019

      • ruaok
        I'm still in the same time zone. :-)
      • 2019-08-21 23320, 2019

      • alastairp
        from the names, I'm guessing that train_models depends on user information, but create_dataframes runs all the time regardless?
      • 2019-08-21 23341, 2019

      • pristine__
        create_dataframes will fetch listens of X months from HDFS, pre-process them, and save them back to HDFS. train_models fetches this pre-processed data from HDFS and trains on it.
      • 2019-08-21 23315, 2019

      • pristine__
        This data will include all the users covered in last X months
      • 2019-08-21 23337, 2019

      • pristine__
        But we may require recommendations for only a subset of users, so candidate_sets.py will generate sets only for those users and recommend.py will generate recommendations using these sets.
      • 2019-08-21 23317, 2019

      • pristine__
        ( sorry for the hard time understanding this workflow. It is not well defined yet )
      • 2019-08-21 23333, 2019

      • alastairp
        I think that once you understand the workflow a little more, the additional stuff you'll need will make more sense
      • 2019-08-21 23352, 2019

      • alastairp
        for example how you want to keep track of how often you run this script
      • 2019-08-21 23304, 2019

      • alastairp
        be it a database table, or ..., or ,...
      • 2019-08-21 23302, 2019

      • Gazooo has quit
      • 2019-08-21 23300, 2019

      • pristine__
        Right now, we just wish to store the X months of data that was used to train the model. So for e.g. 1-06-2019 and 1-08-2019
      • 2019-08-21 23347, 2019

      • Gazooo joined the channel
      • 2019-08-21 23301, 2019

      • pristine__
        We just need to keep track of the dates whose data was used to generate the dataframes.
      • 2019-08-21 23337, 2019

      • gr0uch0mars has quit
      • 2019-08-21 23306, 2019

      • pristine__
        When train_models is run, we need these dates because after its execution a call will be made to store these dates, along with the model id and other info, in a table that keeps model metadata
      • 2019-08-21 23325, 2019

      • alastairp
        do you have code to store this data in the table?
      • 2019-08-21 23335, 2019

      • alastairp
        is this part of the train_model script?
      • 2019-08-21 23345, 2019

      • pristine__
      • 2019-08-21 23311, 2019

      • pristine__
        I am writing the code, got stuck with this date thing. Yes, it will be part of the train_models script.
      • 2019-08-21 23318, 2019

      • travis-ci joined the channel
      • 2019-08-21 23318, 2019

      • travis-ci
        metabrainz/picard#4836 (master - 462add3 : Philipp Wolfer): The build passed.
      • 2019-08-21 23318, 2019

      • travis-ci has left the channel
      • 2019-08-21 23354, 2019

      • reosarevok
        ruaok: make sure you are away from bears :/ https://www.bbc.com/news/world-us-canada-49412385
      • 2019-08-21 23308, 2019

      • alastairp
        ah, that's great then if you're already going to add it to this table
      • 2019-08-21 23345, 2019

      • alastairp
        my recommendation is to read the date of the last time you ran it from the table, and then use that to calculate the date ranges
      • 2019-08-21 23358, 2019

      • alastairp
        pass those ranges into whatever functions you need, to get the data out of it
      • 2019-08-21 23311, 2019

      • alastairp
        and then you have the dates to add the new data to the table
      • 2019-08-21 23303, 2019
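
Putting alastairp's suggestion together as a sketch: the last recorded run drives the next date range, the range is passed into the data functions, and the same dates are appended back afterwards. A list of dicts stands in for the metadata table, and all names are illustrative.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def next_window(metadata_rows, months=1, now=None):
    """Derive the next [from_date, to_date] range from the last recorded run."""
    now = now or datetime.utcnow()
    if metadata_rows:
        from_date = metadata_rows[-1]["to_date"]          # continue from last run
    else:
        from_date = (now + relativedelta(months=-months)).replace(day=1)
    return from_date, now

metadata = []                                             # stand-in for the table
from_date, to_date = next_window(metadata)
# ... pass from_date/to_date into the dataframe and training functions ...
metadata.append({"from_date": from_date, "to_date": to_date, "model_id": "xyz"})
```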

      • pristine__
        What if the range changes? Last time I ran from 1-05-19 (to_date) to 1-06-19 (from_date), but this time I want to run on two months' data. So probably just fetch from_date from the table, which will become the to_date of your new run, and then add the number of months to this date so from_date now becomes 1-08-19
      • 2019-08-21 23310, 2019

      • pristine__
        and continue likewise.
      • 2019-08-21 23327, 2019

      • alastairp
        why might you want to change the range?
      • 2019-08-21 23328, 2019

      • pristine__
        We might start with training on a week's data but then realise that the recommendations are not good enough, so maybe for the next time we change the range to include more listens
      • 2019-08-21 23346, 2019

      • pristine__
        I did not consider the range to be constant as of now because it all depends on the quality of the recommendations. We might need to increase/decrease the range and play with the recommendations
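
If the window length stays a parameter rather than a constant, widening or narrowing it between runs is a one-argument change; a sketch with illustrative names:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

def training_range(end_date, months=1):
    """Return the [start, end] range of listens to train on."""
    start_date = (end_date + relativedelta(months=-months)).replace(day=1)
    return start_date, end_date

print(training_range(date(2019, 8, 1), months=1))  # one month of listens
print(training_range(date(2019, 8, 1), months=2))  # widen to two months
```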