So what's the 1 single command that sets up everything? A shell script I assume?
2022-09-02 24551, 2022
BrainzGit joined the channel
2022-09-02 24521, 2022
alastairp
one advantage of interactive tools is that it could do validation before it writes the config file (e.g. check if a database url is valid, or a directory exists), but in my opinion, I'd prefer to do that test in the code itself
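A minimal sketch of doing that validation in the code itself rather than in an interactive setup tool. The function name `validate_config` and both parameters are hypothetical, and the URL check only looks at the overall shape of a network-style URL; it does not try to connect.

```python
import os
from urllib.parse import urlparse


def validate_config(database_url, data_dir):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []

    # Shape check only: a network database URL should have a scheme and a host part.
    # (Assumption: URLs such as sqlite:///file.db, which have no host, are not in use.)
    parsed = urlparse(database_url)
    if not parsed.scheme or not parsed.netloc:
        problems.append(f"database url looks malformed: {database_url!r}")

    # The directory must already exist on disk.
    if not os.path.isdir(data_dir):
        problems.append(f"directory does not exist: {data_dir!r}")

    return problems


if __name__ == "__main__":
    for problem in validate_config("postgresql://user@localhost/example", os.getcwd()):
        print(problem)
```

Calling something like this at the start of a script means a bad value is reported when the script runs, regardless of how the config file was written.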
2022-09-02 24532, 2022
Pratha-Fish
I see
2022-09-02 24537, 2022
alastairp
they use a tool called ansible, which is designed for automated system administration
2022-09-02 24544, 2022
Pratha-Fish takes notes
2022-09-02 24509, 2022
Pratha-Fish
Do we use ansible at MetaBrainz?
2022-09-02 24509, 2022
alastairp
anyway, let's not get distracted
2022-09-02 24518, 2022
Pratha-Fish
Right :)
2022-09-02 24524, 2022
alastairp
mm, which command are you talking about?
2022-09-02 24531, 2022
alastairp
LB? or metabrainz setup
2022-09-02 24539, 2022
Pratha-Fish
Both
2022-09-02 24541, 2022
alastairp
LB is a shell script, develop.sh, you ran it yourself
2022-09-02 24548, 2022
alastairp
metabrainz setup uses ansible
2022-09-02 24503, 2022
alastairp
btw, double-check your setup instructions. you got the syntax of the pip install command wrong
2022-09-02 24544, 2022
Pratha-Fish
forgot the ```-r``` flag
2022-09-02 24551, 2022
alastairp
yep
2022-09-02 24555, 2022
Pratha-Fish
```pip install -r requirements.txt```
2022-09-02 24504, 2022
Pratha-Fish
that's better
2022-09-02 24533, 2022
alastairp
(btw, running your instructions from scratch in a new checkout of your code is a great way of testing that everything works)
2022-09-02 24556, 2022
Pratha-Fish
yep. Adding it to the to-do list
2022-09-02 24511, 2022
alastairp
some comments about gen_tables.py:
2022-09-02 24543, 2022
alastairp
I always set up a main guard in my files, because if you don't and you import the file for whatever reason, it's going to run that code
2022-09-02 24514, 2022
Pratha-Fish
How do I set up that guard?
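For reference, a main guard usually looks like the sketch below. The function `generate_tables` is a hypothetical stand-in for whatever a script such as gen_tables.py really does; only the `if __name__ == "__main__":` check is the standard Python idiom.

```python
def generate_tables():
    """Hypothetical stand-in for the real work of the script."""
    return "tables generated"


def main():
    print(generate_tables())


if __name__ == "__main__":
    # __name__ is "__main__" only when this file is launched directly
    # (e.g. `python gen_tables.py`). When the file is imported, __name__ is
    # the module name, so main() does not run as a side effect of the import.
    main()
```

With this in place, another file can import `generate_tables` without triggering the work.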
2022-09-02 24503, 2022
Pratha-Fish
And while we're at it, could you also touch upon writing libraries in Python?
2022-09-02 24503, 2022
Pratha-Fish
I was really confused by the __init__.py stuff for module declaration, etc
The above command generates a list of 100 random files from MLHD, upon which the mapper.py works (for now)
2022-09-02 24535, 2022
alastairp
perfect, I have this file now
2022-09-02 24546, 2022
Pratha-Fish
Did it run in the first attempt?
2022-09-02 24549, 2022
alastairp
yes
2022-09-02 24502, 2022
Pratha-Fish
Oh great
2022-09-02 24514, 2022
alastairp
Pratha-Fish: great to see that gen_test_paths uses argparse.
2022-09-02 24523, 2022
alastairp
not so great to see that mapper.py has interactive prompts :)
2022-09-02 24528, 2022
alastairp
this comes back to the previous discussion
2022-09-02 24547, 2022
Pratha-Fish
Started using argparse after you introduced me to it last time :)
2022-09-02 24510, 2022
Pratha-Fish
Not proud of the prompts in mapper.py haha. It was just for my personal testing purposes. I'll replace it ASAP
2022-09-02 24513, 2022
alastairp
imagine if I wanted to run this 5 times each with a different configuration option (e.g. once with 1000 items, once with 10000, not running the cache, maybe with a different set of input files)
2022-09-02 24523, 2022
alastairp
yeah, and you already know how to use argparse!
2022-09-02 24543, 2022
Pratha-Fish
:)
2022-09-02 24547, 2022
alastairp
the great thing about argparse is that you can fill in the help= argument to add_argument, and suddenly you have documentation
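A rough sketch of how the interactive prompts could become documented command-line options. The option names (`paths_file`, `--num-items`, `--no-cache`) are assumptions made for illustration, not the actual interface of mapper.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Map a sample of MLHD files (illustrative sketch)."
    )
    # Every help= string shows up in the output of `--help`.
    parser.add_argument(
        "paths_file",
        help="text file listing the MLHD files to process",
    )
    parser.add_argument(
        "--num-items",
        type=int,
        default=1000,
        help="number of items to map in this run (default: 1000)",
    )
    parser.add_argument(
        "--no-cache",
        action="store_true",
        help="rebuild the cleaned data instead of reading an existing cache",
    )
    return parser


if __name__ == "__main__":
    # An explicit argument list keeps the demo deterministic; a real script
    # would call parse_args() with no arguments to read sys.argv.
    args = build_parser().parse_args(["paths.txt", "--num-items", "10000", "--no-cache"])
    print(args.paths_file, args.num_items, args.no_cache)
```

Running the same script five times with different settings then only means changing the command line, and `--help` prints the descriptions without any extra documentation work.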
I was about to touch that one. The script can't use any cached data on the first attempt because it has to run and create the cache first before using it
2022-09-02 24556, 2022
Pratha-Fish
If you run it without cache once, and then use cache from the next time, it will run
2022-09-02 24502, 2022
Pratha-Fish
I'll add some documentation about it
2022-09-02 24511, 2022
alastairp
so that prompt isn't "generate a cache", it's "use an existing cache"?
2022-09-02 24520, 2022
alastairp
but the cache is automatically generated?
2022-09-02 24523, 2022
alastairp
what is this cache of?
2022-09-02 24554, 2022
Pratha-Fish
The cache is basically a cleaned MLHD file with artist credit name and recording names along with clean canonical rec mbids
looks like you're missing an `os.makedirs("warehouse/mapper_outputs", exist_ok=True)`
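A small sketch of that fix. The helper name `ensure_output_dir` is hypothetical, and the demo uses a temporary directory so it does not touch a real checkout.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create the directory (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        out_dir = ensure_output_dir(os.path.join(root, "warehouse", "mapper_outputs"))
        # Calling it a second time is safe because of exist_ok=True.
        ensure_output_dir(out_dir)
        print(os.path.isdir(out_dir))
```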
2022-09-02 24523, 2022
Pratha-Fish
The big idea is,
2022-09-02 24523, 2022
Pratha-Fish
loading MLHD files, and their respective MB tables (even from parquet) takes ages. So if you just want to test the mapping process, it's better to use previously cleaned MLHD data and then run updated mapper functions on that data instead of creating that data every time (which takes a few minutes at least)
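One way to sketch this caching idea so that the first run also works: read the cache if it exists, otherwise build the cleaned data and write the cache. The name `load_cleaned_data`, the `build` callback, and the use of `pickle` are assumptions for illustration; the real script may store its cache differently (for example as parquet).

```python
import os
import pickle
import tempfile


def load_cleaned_data(cache_path, build, use_cache=True):
    """Return cleaned data, reading it from cache_path when possible.

    `build` is a zero-argument function that does the slow cleaning step.
    """
    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # No usable cache yet (or caching disabled): do the slow work once...
    data = build()

    # ...and store the result so the next run can skip it.
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache", "cleaned.pkl")
        first = load_cleaned_data(cache, lambda: [("artist", "recording")])
        second = load_cleaned_data(cache, lambda: [])  # served from the cache
        print(first == second)
```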
2022-09-02 24527, 2022
alastairp
as I mentioned - great idea to run your instructions on a fresh checkout of the code to ensure that everything works as expected. for now I'm just making the directory
2022-09-02 24541, 2022
Pratha-Fish
alastairp: Noted that one as well
2022-09-02 24543, 2022
alastairp
right, got it
2022-09-02 24508, 2022
alastairp
so this is definitely a testing thing, rather than something that will be needed for running on the full dataset?
2022-09-02 24544, 2022
Pratha-Fish
yes exactly
2022-09-02 24552, 2022
Pratha-Fish
I also have a full production script ready BTW
2022-09-02 24500, 2022
Pratha-Fish
the one that we used to convert MLHD to zstd
2022-09-02 24531, 2022
Pratha-Fish
The rec_track_checker.py script was used for the same
2022-09-02 24502, 2022
Pratha-Fish
But the script is basically redundant now, since the original MLHD data has been deleted, and we won't need to run the script ever again
Pratha-Fish: sure, but that's definitely a big piece of work that you did and so we should keep the script and document it
2022-09-02 24554, 2022
Pratha-Fish
alastairp: yes. In fact it's gonna be the base for our final script :)
2022-09-02 24554, 2022
Pratha-Fish
Also, it's not an interactive file afaik, and uses a config.json file for its settings :)
2022-09-02 24542, 2022
alastairp
good
2022-09-02 24554, 2022
alastairp
remember though, that json doesn't allow for comments
2022-09-02 24518, 2022
alastairp
so config.py would be much better, because 1) you can write comments in it, 2) you don't have to explicitly load the file and read data from it, you can just import it
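A sketch of the difference. The module name `example_config` and its two settings are hypothetical. To stay self-contained, the demo writes the config module to a temporary directory before importing it; in a real repo the file would simply sit next to the scripts and be imported directly.

```python
import importlib
import json
import os
import sys
import tempfile

# Contents of a hypothetical config module: plain assignments plus comments.
CONFIG_SOURCE = '''\
# Number of MLHD files to sample for a test run (hypothetical setting).
NUM_FILES = 100
# Where mapper outputs are written (hypothetical setting).
OUTPUT_DIR = "warehouse/mapper_outputs"
'''

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example_config.py"), "w") as f:
        f.write(CONFIG_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # In a normal project this is just `import example_config`.
        example_config = importlib.import_module("example_config")
    finally:
        sys.path.remove(tmp)

print(example_config.NUM_FILES, example_config.OUTPUT_DIR)

# The same content as JSON stops parsing as soon as a comment is added.
try:
    json.loads('{"NUM_FILES": 100 // files to sample\n}')
except json.JSONDecodeError as err:
    print("JSON rejects comments:", err)
```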
2022-09-02 24546, 2022
Pratha-Fish
Definitely. I'll switch it up to .py in the upcoming iteration
2022-09-02 24510, 2022
Pratha-Fish
Do I also include the "main guard" in this config.py file?
2022-09-02 24529, 2022
alastairp
only in files that you launch from the commandline
2022-09-02 24502, 2022
Pratha-Fish
_what if someone runs config.py from the commandline_
2022-09-02 24520, 2022
Pratha-Fish
I mean, no sane person would. Just curious lol
2022-09-02 24525, 2022
alastairp
good question
2022-09-02 24529, 2022
alastairp
it'll "execute" the file
2022-09-02 24548, 2022
alastairp
however, what does the file include? 17 statements of the form "CONFIG_ITEM=1"
2022-09-02 24556, 2022
alastairp
so this has no effect on your system
2022-09-02 24503, 2022
Pratha-Fish
indeed
2022-09-02 24505, 2022
alastairp
it's not going to load a file, delete a file, start a 3 hour computation, etc...
2022-09-02 24533, 2022
Pratha-Fish
Yep
2022-09-02 24557, 2022
Pratha-Fish
Which also makes me wonder if I should include this main guard in the modules
2022-09-02 24532, 2022
Pratha-Fish
Because the module isn't supposed to be run on its own. We just need to import functions from it ig
2022-09-02 24509, 2022
alastairp
what do you think the behaviour of the module would be if you ran it?
2022-09-02 24521, 2022
alastairp
if you had a main guard, what would you run in this case?
2022-09-02 24533, 2022
Pratha-Fish
well none of them run anything TBH
2022-09-02 24543, 2022
Pratha-Fish
these modules are just function/class declarations
2022-09-02 24551, 2022
alastairp
right, so running the module isn't going to do anything
2022-09-02 24500, 2022
alastairp
and you don't have anything to put inside the main anyway
2022-09-02 24519, 2022
Pratha-Fish
okie
2022-09-02 24531, 2022
Pratha-Fish
So what else should we discuss
2022-09-02 24542, 2022
Pratha-Fish
Maybe I should mention the fact that as you might have noticed, most of the repo is pretty "bootstrappy". i.e. The scripts and notebooks help ME get answers, and test out new stuff.
2022-09-02 24543, 2022
Pratha-Fish
I am kinda confused about what purpose this code might fulfill for its end user. We could ponder that question and maybe restructure the complete repo accordingly
2022-09-02 24555, 2022
alastairp
this is definitely an interesting project because the result of running it is a new dataset and then we never need the code again
2022-09-02 24531, 2022
alastairp
the notebooks are great for the exploration, and they've helped you understand the code well. I see two possible options
2022-09-02 24509, 2022
alastairp
one is that we just "archive" the notebooks. Put them in a folder, make sure that they run, and that there is a description at the top of the notebook explaining what it was for
2022-09-02 24531, 2022
alastairp
the other option is that we start thinking about what our final output of the project is in addition to the new dataset
2022-09-02 24540, 2022
alastairp
I'm thinking reports/blog posts, etc
2022-09-02 24506, 2022
alastairp
do you think it would be possible to turn each notebook into a small post explaining the key things that we learned by running it? I see things like pandas/arrow/csv tests. We also did some zst tests, right? This would be great to turn into a few small posts with code examples and images
2022-09-02 24520, 2022
alastairp
then we'd have a great set of guidelines for other people to learn from
2022-09-02 24542, 2022
Pratha-Fish
That's exactly what I had in mind too :))
2022-09-02 24517, 2022
Pratha-Fish
Which is why I've already cleaned and archived some test notebooks that demonstrate different benchmarks
2022-09-02 24510, 2022
alastairp
excellent
2022-09-02 24547, 2022
alastairp
keep this in mind with the timeline for your project too - it's OK to spend a week writing up these notebooks. this is just as valid as "coding". We don't want to rush this at the end of the project
2022-09-02 24504, 2022
alastairp
btw, did you get notifications of the changed dates? Did we let you know?
2022-09-02 24539, 2022
Pratha-Fish
I did receive a copy of the mail you forwarded to mayhem regarding date extension
2022-09-02 24558, 2022
Pratha-Fish
I also checked my GSoC dashboard, which is currently showing my project end date as 24th Oct
2022-09-02 24516, 2022
alastairp
yes, that's what we set it to. good. just checking that you have that date in mind too
2022-09-02 24523, 2022
Pratha-Fish
Yes, I did have that one in mind
2022-09-02 24547, 2022
Pratha-Fish
alastairp: Given the remaining work, how long do you think the rest of the project should take?
2022-09-02 24547, 2022
Pratha-Fish
Maybe even stuff that I should improve upon
2022-09-02 24521, 2022
alastairp
well, work expands to fill the time available
2022-09-02 24552, 2022
Pratha-Fish
*Maybe even let me know what I can improve upon in my work schedule since early August
2022-09-02 24514, 2022
alastairp
I think that your todo list that you mentioned earlier is a good start. As I said at the beginning of the week we should start writing the final scripts with the assumption that the mapping is working, while we keep debugging the mapping
2022-09-02 24518, 2022
alastairp
the other things that we can add to the list are your final written deliverables (be that blog posts about your experiments, or just a final post describing the entire project), and also a possible task about looking at the release mbids
2022-09-02 24526, 2022
alastairp
like we discussed earlier this week
2022-09-02 24525, 2022
Pratha-Fish
Great
2022-09-02 24504, 2022
Pratha-Fish
I'll recollect everything we discussed in the past few days, and update the To-Do list
2022-09-02 24513, 2022
alastairp
that would be great, thanks
2022-09-02 24521, 2022
Pratha-Fish
That would make it easier to set time bound goals and all
2022-09-02 24532, 2022
alastairp
you could also write a rough timeline, of when you think you'll be able to complete each of these steps