#metabrainz

      • Pratha-Fish
        So what's the 1 single command that sets up everything? A shell script I assume?
      • 2022-09-02 24551, 2022

      • BrainzGit joined the channel
      • 2022-09-02 24521, 2022

      • alastairp
        one advantage of interactive tools is that it could do validation before it writes the config file (e.g. check if a database url is valid, or a directory exists), but in my opinion, I'd prefer to do that test in the code itself
      • 2022-09-02 24532, 2022

      • Pratha-Fish
        I see
      • 2022-09-02 24537, 2022

      • alastairp
        they use a tool called ansible, which is designed for automated system administration
      • 2022-09-02 24544, 2022

      • Pratha-Fish takes notes
      • 2022-09-02 24509, 2022

      • Pratha-Fish
        Do we use ansible at MetaBrainz?
      • 2022-09-02 24509, 2022

      • alastairp
        anyway, let's not get distracted
      • 2022-09-02 24518, 2022

      • Pratha-Fish
        Right :)
      • 2022-09-02 24524, 2022

      • alastairp
        mm, which command are you talking about?
      • 2022-09-02 24531, 2022

      • alastairp
        LB? or metabrainz setup
      • 2022-09-02 24539, 2022

      • Pratha-Fish
        Both
      • 2022-09-02 24541, 2022

      • alastairp
        LB is a shell script, develop.sh, you ran it yourself
      • 2022-09-02 24548, 2022

      • alastairp
        metabrainz setup uses ansible
      • 2022-09-02 24503, 2022

      • alastairp
        btw, double-check your setup instructions. you got the syntax of the pip install command wrong
      • 2022-09-02 24544, 2022

      • Pratha-Fish
        forgot the ```-r``` flag
      • 2022-09-02 24551, 2022

      • alastairp
        yep
      • 2022-09-02 24555, 2022

      • Pratha-Fish
        ```pip install -r requirements.txt```
      • 2022-09-02 24504, 2022

      • Pratha-Fish
        that's better
      • 2022-09-02 24533, 2022

      • alastairp
        (btw, running your instructions from scratch in a new checkout of your code is a great way of testing that everything works)
      • 2022-09-02 24556, 2022

      • Pratha-Fish
        yep. Adding it to the to-do list
      • 2022-09-02 24511, 2022

      • alastairp
        some comments about gen_tables.py:
      • 2022-09-02 24543, 2022

      • alastairp
        I always set up a main guard in my files. because if you don't and you import the file for whatever reason, it's going to run that code
      • 2022-09-02 24514, 2022

      • Pratha-Fish
        How do I set up that guard? 👀
      • 2022-09-02 24503, 2022
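
The main guard alastairp describes is Python's `if __name__ == "__main__":` idiom. A minimal sketch follows; `generate_tables` and `main` are placeholder names chosen for illustration, not taken from the actual gen_tables.py:

```python
def generate_tables():
    """Placeholder for the real work a script such as gen_tables.py would do."""
    return "tables generated"


def main():
    # Anything with side effects lives here, not at module top level.
    print(generate_tables())


# __name__ equals "__main__" only when the file is launched directly
# (for example `python gen_tables.py`). When the file is imported,
# __name__ is the module name instead, so main() is skipped.
if __name__ == "__main__":
    main()
```

With this layout, importing the file only defines the functions; nothing runs until `main()` is called.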

      • Pratha-Fish
        And while we're at it, could you also touch upon writing libraries in Python?
      • 2022-09-02 24503, 2022

      • Pratha-Fish
        I was really confused by the __init__.py stuff for module declaration, etc
      • 2022-09-02 24544, 2022

      • Pratha-Fish
        (Check the lib folder for the same)
      • 2022-09-02 24546, 2022

      • alastairp
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        Great! That explains some of it
      • 2022-09-02 24518, 2022

      • alastairp
        in short, "if you want to import a file in a folder, the folder needs to have __init__.py in it to tell python that it's a module"
      • 2022-09-02 24535, 2022

      • alastairp
        it gets a lot more complicated than that, but that's the basics
      • 2022-09-02 24552, 2022

      • alastairp
        Pratha-Fish: OK, I've run gen_tables.py. what's next?
      • 2022-09-02 24500, 2022

      • Pratha-Fish
        Do I have to include anything in the __init__.py file? I've kept it empty for now, and that seems to do the trick
      • 2022-09-02 24516, 2022

      • alastairp
        that's part of the "it gets more complicated"
      • 2022-09-02 24536, 2022

      • alastairp
        in short, if you're always doing "from lib import x" or "import lib.x" then no, you don't have to include anything in it
      • 2022-09-02 24509, 2022
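
The point about an empty `__init__.py` can be tried in isolation. The sketch below builds a throwaway package in a temporary directory; the names `demo_lib`, `helpers`, and `double` are invented for the example and do not come from the repository under discussion:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Throwaway layout (all names are hypothetical):
#   <tmp>/demo_lib/__init__.py   (empty)
#   <tmp>/demo_lib/helpers.py
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "demo_lib"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")  # an empty file is enough
(package_dir / "helpers.py").write_text("def double(x):\n    return 2 * x\n")

# Make the project root importable, as it is when a script is run from it.
sys.path.insert(0, str(project_root))
importlib.invalidate_caches()

from demo_lib import helpers   # the "from lib import x" style
import demo_lib.helpers        # the "import lib.x" style

print(helpers.double(21))  # prints 42
```

Both import styles resolve to the same module object. Python 3 can also import from folders that have no `__init__.py` at all (namespace packages), which is one part of the "more complicated" story.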

      • Pratha-Fish
        Got it
      • 2022-09-02 24520, 2022

      • Pratha-Fish
        alastairp: after running the gen_tables.py:
      • 2022-09-02 24531, 2022

      • Pratha-Fish
        Just run any required script.
      • 2022-09-02 24540, 2022

      • Pratha-Fish
        e.g. ```python mapper.py```
      • 2022-09-02 24516, 2022

      • alastairp
        what does mapper.py do? I see no comment at the top of the file describing its purpose (and nothing in the readme!)
      • 2022-09-02 24530, 2022

      • Pratha-Fish
        I'll add it asap :)
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        It basically grabs a random set of MLHD files
      • 2022-09-02 24551, 2022

      • Pratha-Fish
        Oh wait.
      • 2022-09-02 24503, 2022

      • Pratha-Fish
        To generate the random set of files, you'll have to run another module
      • 2022-09-02 24528, 2022

      • Pratha-Fish
        ```python lib/gen_test_paths.py 100 'warehouse/samples/random_file_paths.txt'```
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        The above command generates a list of 100 random files from MLHD, upon which the mapper.py works (for now)
      • 2022-09-02 24535, 2022

      • alastairp
        perfect, I have this file now
      • 2022-09-02 24546, 2022

      • Pratha-Fish
        Did it run in the first attempt?
      • 2022-09-02 24549, 2022

      • alastairp
        yes
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        Oh great
      • 2022-09-02 24514, 2022

      • alastairp
        Pratha-Fish: great to see that gen_test_paths uses argparse.
      • 2022-09-02 24523, 2022

      • alastairp
        not so great to see that mapper.py has interactive prompts :)
      • 2022-09-02 24528, 2022

      • alastairp
        this comes back to the previous discussion
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        Started using arg parse since the last time you introduced me to it :)
      • 2022-09-02 24510, 2022

      • Pratha-Fish
        Not proud of the prompts in mapper.py haha. It was just for my personal testing purposes. I'll replace it ASAP
      • 2022-09-02 24513, 2022

      • alastairp
        imagine if I wanted to run this 5 times each with a different configuration option (e.g. once with 1000 items, once with 10000, not running the cache, maybe with a different set of input files)
      • 2022-09-02 24523, 2022

      • alastairp
        yeah, and you already know how to use argparse!
      • 2022-09-02 24543, 2022

      • Pratha-Fish
        🎉
      • 2022-09-02 24547, 2022

      • alastairp
        the great thing about argparse, is that you can fill in the help= argument to add_argument, and suddenly you have documentation
      • 2022-09-02 24554, 2022

      • alastairp
        with no additional effort
      • 2022-09-02 24502, 2022
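
A sketch of how interactive prompts could become command-line options with `help=` text. The option names (`--num-items`, `--use-cache`, `--paths-file`) and their defaults are assumptions made for illustration; the real mapper.py may need different ones:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for a non-interactive mapper.py."
    )
    # Hypothetical options replacing the interactive prompts.
    parser.add_argument(
        "--num-items", type=int, default=1000,
        help="number of rows to process (default: %(default)s)",
    )
    parser.add_argument(
        "--use-cache", action="store_true",
        help="reuse a previously generated cache instead of rebuilding it",
    )
    parser.add_argument(
        "--paths-file", default="warehouse/samples/random_file_paths.txt",
        help="text file listing the input files to process",
    )
    return parser


parser = build_parser()
# Passing a list mimics `python mapper.py --num-items 10000 --use-cache`.
args = parser.parse_args(["--num-items", "10000", "--use-cache"])
print(args.num_items, args.use_cache)

# The help= strings are what `--help` would print.
print(parser.format_help())
```

Because every setting is an option, running the script several times with different configurations is just a matter of changing the arguments, with no prompts to answer by hand.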

      • Pratha-Fish
        Oooo
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        I'll check that one out
      • 2022-09-02 24528, 2022

      • alastairp
      • 2022-09-02 24534, 2022

      • alastairp
        Pratha-Fish: got this error when running mapper
      • 2022-09-02 24509, 2022

      • Pratha-Fish
        You probably tried using cached data
      • 2022-09-02 24515, 2022

      • alastairp
        I did
      • 2022-09-02 24539, 2022

      • Pratha-Fish
        I was about to touch that one. The script can't use any cached data on the first attempt because it has to run and create the cache first before using it
      • 2022-09-02 24556, 2022

      • Pratha-Fish
        If you run it without cache once, and then use cache from the next time, it will run
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        I'll add some documentation about it
      • 2022-09-02 24511, 2022

      • alastairp
        so that prompt isn't "generate a cache", it's "use an existing cache"?
      • 2022-09-02 24520, 2022

      • alastairp
        but the cache is automatically generated?
      • 2022-09-02 24523, 2022

      • alastairp
        what is this cache of?
      • 2022-09-02 24554, 2022

      • Pratha-Fish
        The cache is basically a cleaned MLHD file with artist credit name and recording names along with clean canonical rec mbids
      • 2022-09-02 24518, 2022

      • alastairp
        I'm reading the code now
      • 2022-09-02 24521, 2022

      • alastairp
        btw, another error:
      • 2022-09-02 24538, 2022

      • alastairp
      • 2022-09-02 24545, 2022

      • style- has quit
      • 2022-09-02 24557, 2022

      • alastairp
        looks like you're missing an `os.makedirs("warehouse/mapper_outputs", exist_ok=True)`
      • 2022-09-02 24523, 2022
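
A small sketch of the suggested `os.makedirs` call in context. A temporary directory stands in for the project root, and the output file name is made up:

```python
import os
import tempfile

# A temporary directory stands in for the project root in this sketch.
project_root = tempfile.mkdtemp()
output_dir = os.path.join(project_root, "warehouse", "mapper_outputs")

# Creates intermediate directories too; exist_ok=True makes reruns harmless.
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)  # second call is a no-op, not an error

# Hypothetical output file, written only after the directory is known to exist.
output_path = os.path.join(output_dir, "example_output.csv")
with open(output_path, "w") as handle:
    handle.write("recording_mbid\n")
```

Calling `os.makedirs` right before the first write means a fresh checkout works without any manual directory setup.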

      • Pratha-Fish
        The big idea is,
      • 2022-09-02 24523, 2022

      • Pratha-Fish
        loading MLHD files, and their respective MB tables (even from parquet) takes ages. So if you just want to test the mapping process, it's better to use previously cleaned MLHD data and then run updated mapper functions on that data instead of creating that data every time (which takes a few minutes at least)
      • 2022-09-02 24527, 2022
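
One way to avoid the first-run error is a "load the cache if it exists, otherwise build it" pattern. This is a sketch under assumptions, not the behaviour of the actual mapper.py: `expensive_clean` and `load_or_build` are hypothetical names, and JSON is used only to keep the example dependency-free:

```python
import json
import tempfile
from pathlib import Path


def expensive_clean():
    """Stand-in for the slow step of loading and cleaning MLHD data."""
    return [{"artist_credit_name": "example artist", "recording_name": "example track"}]


def load_or_build(cache_path, use_cache):
    """Return (rows, source), where source is "cache" or "built"."""
    cache_path = Path(cache_path)
    if use_cache and cache_path.exists():
        return json.loads(cache_path.read_text()), "cache"
    # No usable cache: do the slow work and save the result for next time.
    rows = expensive_clean()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(rows))
    return rows, "built"


cache_file = Path(tempfile.mkdtemp()) / "cache" / "cleaned.json"
first = load_or_build(cache_file, use_cache=True)   # no cache yet, so it builds
second = load_or_build(cache_file, use_cache=True)  # now the cache is reused
print(first[1], second[1])  # prints: built cache
```

With this shape, asking for the cache on a first run falls back to building it instead of failing.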

      • alastairp
        as I mentioned - great idea to run your instructions on a fresh checkout of the code to ensure that everything works as expected. for now I'm just making the directory
      • 2022-09-02 24541, 2022

      • Pratha-Fish
        alastairp: Noted that one as well
      • 2022-09-02 24543, 2022

      • alastairp
        right, got it
      • 2022-09-02 24508, 2022

      • alastairp
        so this is definitely a testing thing, rather than something that will be needed for running on the full dataset?
      • 2022-09-02 24544, 2022

      • Pratha-Fish
        yes exactly
      • 2022-09-02 24552, 2022

      • Pratha-Fish
        I also have a full production script ready BTW
      • 2022-09-02 24500, 2022

      • Pratha-Fish
        the one that we used to convert MLHD to zstd
      • 2022-09-02 24531, 2022

      • Pratha-Fish
        The rec_track_checker.py script was used for the same
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        But the script is basically redundant now, since the original MLHD data has been deleted, and we won't need to run the script ever again
      • 2022-09-02 24524, 2022

      • alastairp
      • 2022-09-02 24546, 2022

      • Pratha-Fish
        Looks like I hardcoded the path
      • 2022-09-02 24504, 2022

      • alastairp
        Pratha-Fish: sure, but that's definitely a big piece of work that you did and so we should keep the script and document it
      • 2022-09-02 24554, 2022

      • Pratha-Fish
        alastairp: yes. In fact it's gonna be the base for our final script :)
      • 2022-09-02 24554, 2022

      • Pratha-Fish
        Also, it's not an interactive file afaik, and uses a config.json file for its settings :)
      • 2022-09-02 24542, 2022

      • alastairp
        good
      • 2022-09-02 24554, 2022

      • alastairp
        remember though, that json doesn't allow for comments
      • 2022-09-02 24518, 2022

      • alastairp
        so config.py would be much better, because 1) you can write comments in it, 2) you don't have to explicitly load the file and read data from it, you can just import it
      • 2022-09-02 24546, 2022
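
Both points can be checked in a few lines. In this sketch the module name `demo_config` and the setting `NUM_FILES` are invented for illustration; only the `warehouse/mapper_outputs` path appears in the conversation:

```python
import importlib
import json
import sys
import tempfile
from pathlib import Path

# 1) JSON has no comment syntax, so an annotated config fails to parse.
try:
    json.loads('{\n  // number of files to sample\n  "NUM_FILES": 100\n}')
    json_accepts_comments = True
except json.JSONDecodeError:
    json_accepts_comments = False

# 2) The same settings as a Python module (hypothetical names and values).
config_source = (
    "# Number of MLHD files to sample for a test run.\n"
    "NUM_FILES = 100\n"
    "# Where mapper results are written.\n"
    'OUTPUT_DIR = "warehouse/mapper_outputs"\n'
)
config_dir = Path(tempfile.mkdtemp())
(config_dir / "demo_config.py").write_text(config_source)
sys.path.insert(0, str(config_dir))
importlib.invalidate_caches()

import demo_config  # no explicit file loading or parsing needed

print(json_accepts_comments, demo_config.NUM_FILES)  # prints: False 100
```

In a real project the config file would simply sit next to the scripts, so a plain `import config` would be enough; the temporary directory is here only to keep the example self-contained.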

      • Pratha-Fish
        Definitely. I'll switch it up to .py in the upcoming iteration
      • 2022-09-02 24510, 2022

      • Pratha-Fish
        Do I also include the "main guard" in this config.py file?
      • 2022-09-02 24529, 2022

      • alastairp
        only in files that you launch from the commandline
      • 2022-09-02 24502, 2022

      • Pratha-Fish
        _what if someone runs config.py from the commandline_
      • 2022-09-02 24520, 2022

      • Pratha-Fish
        I mean, no sane person would. Just curious lol
      • 2022-09-02 24525, 2022

      • alastairp
        good question
      • 2022-09-02 24529, 2022

      • alastairp
        it'll "execute" the file
      • 2022-09-02 24548, 2022

      • alastairp
        however, what does the file include? 17 statements of the form "CONFIG_ITEM=1"
      • 2022-09-02 24556, 2022

      • alastairp
        so this has no effect on your system
      • 2022-09-02 24503, 2022

      • Pratha-Fish
        indeed
      • 2022-09-02 24505, 2022

      • alastairp
        it's not going to load a file, delete a file, start a 3 hour computation, etc...
      • 2022-09-02 24533, 2022

      • Pratha-Fish
        Yep
      • 2022-09-02 24557, 2022

      • Pratha-Fish
        Which also makes me wonder if I should include this main guard in the modules
      • 2022-09-02 24532, 2022

      • Pratha-Fish
        Because the module isn't supposed to be run on its own. We just need to import functions from it ig
      • 2022-09-02 24509, 2022

      • alastairp
        what do you think the behaviour of the module would be if you ran it?
      • 2022-09-02 24521, 2022

      • alastairp
        if you had a main guard, what would you run in this case?
      • 2022-09-02 24533, 2022

      • Pratha-Fish
        well none of them run anything TBH
      • 2022-09-02 24543, 2022

      • Pratha-Fish
        these modules are just function/class declarations
      • 2022-09-02 24551, 2022

      • alastairp
        right, so running the module isn't going to do anything
      • 2022-09-02 24500, 2022

      • alastairp
        and you don't have anything to put inside the main anyway
      • 2022-09-02 24519, 2022

      • Pratha-Fish
        okie
      • 2022-09-02 24531, 2022

      • Pratha-Fish
        So what else should we discuss
      • 2022-09-02 24542, 2022

      • Pratha-Fish
        Maybe I should mention the fact that, as you might have noticed, most of the repo is pretty "bootstrappy", i.e. the scripts and notebooks help ME get answers and test out new stuff.
      • 2022-09-02 24543, 2022

      • Pratha-Fish
        I am kinda confused about what purpose this code might fulfill for its end user. We could ponder that question and maybe restructure the complete repo accordingly
      • 2022-09-02 24555, 2022

      • alastairp
        this is definitely an interesting project because the result of running it is a new dataset and then we never need the code again
      • 2022-09-02 24531, 2022

      • alastairp
        the notebooks are great for the exploration, and they've helped you understand the code well. I see two possible options
      • 2022-09-02 24509, 2022

      • alastairp
        one is that we just "archive" the notebooks. Put them in a folder, make sure that they run, and that there is a description at the top of the notebook explaining what it was for
      • 2022-09-02 24531, 2022

      • alastairp
        the other option is that we start thinking about what our final output of the project is in addition to the new dataset
      • 2022-09-02 24540, 2022

      • alastairp
        I'm thinking reports/blog posts, etc
      • 2022-09-02 24506, 2022

      • alastairp
        do you think it would be possible to turn each notebook into a small post explaining the key things that we learned by running it? I see things like pandas/arrow/csv tests. We also did some zst tests, right? This would be great to turn into a few small posts with code examples and images
      • 2022-09-02 24520, 2022

      • alastairp
        then we'd have a great set of guidelines for other people to learn from
      • 2022-09-02 24542, 2022

      • Pratha-Fish
        That's exactly what I had in mind too :))
      • 2022-09-02 24517, 2022

      • Pratha-Fish
        Which is why I've already cleaned and archived some test notebooks that demonstrate different benchmarks
      • 2022-09-02 24510, 2022

      • alastairp
        excellent
      • 2022-09-02 24547, 2022

      • alastairp
        keep this in mind with the timeline for your project too - it's OK to spend a week writing up these notebooks. this is just as valid as "coding". We don't want to rush this at the end of the project
      • 2022-09-02 24504, 2022

      • alastairp
        btw, did you get notifications of the changed dates? Did we let you know?
      • 2022-09-02 24539, 2022

      • Pratha-Fish
        I did receive a copy of the mail you forwarded to mayhem regarding date extension
      • 2022-09-02 24558, 2022

      • Pratha-Fish
        I also checked my GSoC dashboard, which is currently showing my project end date as 24th Oct
      • 2022-09-02 24516, 2022

      • alastairp
        yes, that's what we set it to. good. just checking that you have that date in mind too
      • 2022-09-02 24523, 2022

      • Pratha-Fish
        Yes, I did have that one in mind
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        alastairp: Given the remaining work, how long do you think the rest of the project should take?
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        Maybe even stuff that I should improve upon
      • 2022-09-02 24521, 2022

      • alastairp
        well, work expands to fill the time available
      • 2022-09-02 24552, 2022

      • Pratha-Fish
        *Maybe even let me know what I can improve upon, given my work schedule since early August
      • 2022-09-02 24514, 2022

      • alastairp
        I think that your todo list that you mentioned earlier is a good start. As I said at the beginning of the week we should start writing the final scripts with the assumption that the mapping is working, while we keep debugging the mapping
      • 2022-09-02 24518, 2022

      • alastairp
        the other things that we can add to the list are your final written deliverables (be that blog posts about your experiments, or just a final post describing the entire project), and also a possible task about looking at the release mbids
      • 2022-09-02 24526, 2022

      • alastairp
        like we discussed earlier this week
      • 2022-09-02 24525, 2022

      • Pratha-Fish
        Great
      • 2022-09-02 24504, 2022

      • Pratha-Fish
        I'll recollect everything we discussed in the past few days, and update the To-Do list
      • 2022-09-02 24513, 2022

      • alastairp
        that would be great, thanks
      • 2022-09-02 24521, 2022

      • Pratha-Fish
        That would make it easier to set time bound goals and all
      • 2022-09-02 24532, 2022

      • alastairp
        you could also write a rough timeline, of when you think you'll be able to complete each of these steps
      • 2022-09-02 24547, 2022

      • Pratha-Fish
        Yes I'll try