#metabrainz

/

16:29 PM
Pratha-Fish

So what's the 1 single command that sets up everything? A shell script I assume?

2022-09-02 24551, 2022

16:29 PM
BrainzGit joined the channel

2022-09-02 24521, 2022

16:30 PM
alastairp

one advantage of interactive tools is that it could do validation before it writes the config file (e.g. check if a database url is valid, or a directory exists), but in my opinion, I'd prefer to do that test in the code itself

2022-09-02 24532, 2022

16:30 PM
Pratha-Fish

I see

2022-09-02 24537, 2022

16:30 PM
alastairp

they use a tool called ansible, which is designed for automated system administration

2022-09-02 24544, 2022

16:30 PM
Pratha-Fish takes notes

2022-09-02 24509, 2022

16:31 PM
Pratha-Fish

Do we use ansible at MetaBrainz?

2022-09-02 24509, 2022

16:31 PM
alastairp

anyway, let's not get distracted

2022-09-02 24518, 2022

16:31 PM
Pratha-Fish

Right :)

2022-09-02 24524, 2022

16:31 PM
alastairp

mm, which command are you talking about?

2022-09-02 24531, 2022

16:31 PM
alastairp

LB? or metabrainz setup

2022-09-02 24539, 2022

16:31 PM
Pratha-Fish

Both

2022-09-02 24541, 2022

16:31 PM
alastairp

LB is a shell script, develop.sh, you ran it yourself

2022-09-02 24548, 2022

16:31 PM
alastairp

metabrainz setup uses ansible

2022-09-02 24503, 2022

16:32 PM
alastairp

btw, double-check your setup instructions. you got the syntax of the pip install command wrong

2022-09-02 24544, 2022

16:32 PM
Pratha-Fish

forgot the ```-r``` flag

2022-09-02 24551, 2022

16:32 PM
alastairp

yep

2022-09-02 24555, 2022

16:32 PM
Pratha-Fish

```pip install -r requirements.txt```

2022-09-02 24504, 2022

16:33 PM
Pratha-Fish

that's better

2022-09-02 24533, 2022

16:33 PM
alastairp

(btw, running your instructions from scratch in a new checkout of your code is a great way of testing that everything works)

2022-09-02 24556, 2022

16:33 PM
Pratha-Fish

yep. Adding it to the to-do list

2022-09-02 24511, 2022

16:35 PM
alastairp

some comments about gen_tables.py:

2022-09-02 24543, 2022

16:35 PM
alastairp

I always set up a main guard in my files. because if you don't and you import the file for whatever reason, it's going to run that code

2022-09-02 24514, 2022

16:36 PM
Pratha-Fish

How do I setup that guard? 👀

2022-09-02 24503, 2022

16:37 PM
Pratha-Fish

And while we're at it, could you also touch upon writing libraries in Python?

2022-09-02 24503, 2022

16:37 PM
Pratha-Fish

I was really confused by the __init__.py stuff for module declaration, etc

2022-09-02 24544, 2022

16:37 PM
Pratha-Fish

(Check the lib folder for the same)

2022-09-02 24546, 2022

16:37 PM
alastairp

https://stackoverflow.com/questions/19578308/what… has a small description of it

2022-09-02 24502, 2022

16:39 PM
Pratha-Fish

Great! That explains some of it

2022-09-02 24518, 2022

16:39 PM
alastairp

in short, "if you want to import a file in a folder, the folder needs to have __init__.py in it to tell python that it's a module"

2022-09-02 24535, 2022

16:39 PM
alastairp

it gets a lot more complicated than that, but that's the basics

2022-09-02 24552, 2022

16:39 PM
alastairp

Pratha-Fish: OK, I've run gen_tables.py. what's next?

2022-09-02 24500, 2022

16:40 PM
Pratha-Fish

Do I have to include anything in the __init__.py file? I've kept it empty for now. and that seems to do the trick

2022-09-02 24516, 2022

16:40 PM
alastairp

that's part of the "it gets more complicated"

2022-09-02 24536, 2022

16:40 PM
alastairp

in short, if you're always doing "from lib import x" or "import lib.x" then no, you don't have to include anything in it

2022-09-02 24509, 2022

16:41 PM
Pratha-Fish

Got it

2022-09-02 24520, 2022

16:41 PM
Pratha-Fish

alastairp: after running hte gen_tables.py:

2022-09-02 24531, 2022

16:41 PM
Pratha-Fish

Just run any required script.

2022-09-02 24540, 2022

16:41 PM
Pratha-Fish

e.g. ```python mapper.py```

2022-09-02 24516, 2022

16:42 PM
alastairp

what does mapper.py do? I see no comment at the top of the file describing its purpose (and nothing in the readme!)

2022-09-02 24530, 2022

16:42 PM
Pratha-Fish

I'll add it asap :)

2022-09-02 24547, 2022

16:42 PM
Pratha-Fish

It basically grabs a random set of MLHD files

2022-09-02 24551, 2022

16:42 PM
Pratha-Fish

Oh wait.

2022-09-02 24503, 2022

16:43 PM
Pratha-Fish

To generate the random set of files, you'll have to run another module

2022-09-02 24528, 2022

16:46 PM
Pratha-Fish

```python lib/gen_test_paths.py 100 'warehouse/samples/random_file_paths.txt'```

2022-09-02 24502, 2022

16:47 PM
Pratha-Fish

The above command generates a list of 100 random files from MLHD, upon which the mapper.py works (for now)

2022-09-02 24535, 2022

16:47 PM
alastairp

perfect, I have this file now

2022-09-02 24546, 2022

16:47 PM
Pratha-Fish

Did it run in the first attempt?

2022-09-02 24549, 2022

16:47 PM
alastairp

yes

2022-09-02 24502, 2022

16:48 PM
Pratha-Fish

Oh great

2022-09-02 24514, 2022

16:48 PM
alastairp

Pratha-Fish: great to see that gen_test_paths uses argparse.

2022-09-02 24523, 2022

16:48 PM
alastairp

not so great to see that mapper.py has interactive prompts :)

2022-09-02 24528, 2022

16:48 PM
alastairp

this comes back to the previous discussion

2022-09-02 24547, 2022

16:48 PM
Pratha-Fish

Started using arg parse since the last time you introduced me to it :)

2022-09-02 24510, 2022

16:49 PM
Pratha-Fish

Not proud of the prompts in mapper.py haha. It was just for my personal testing purposes. I'll replace it ASAP

2022-09-02 24513, 2022

16:49 PM
alastairp

imagine if I wanted to run this 5 times each with a different configuration option (e.g. once with 1000 items, once with 10000, not running the cache, maybe with a different set of input files)

2022-09-02 24523, 2022

16:49 PM
alastairp

yeah, and you already know how to use argparse!

2022-09-02 24543, 2022

16:49 PM
Pratha-Fish

🎉

2022-09-02 24547, 2022

16:49 PM
alastairp

the great thing about argparse, is that you can fill in the help= argument to add_argument, and suddenly you have documentation

2022-09-02 24554, 2022

16:49 PM
alastairp

with no additional effort

2022-09-02 24502, 2022

16:50 PM
Pratha-Fish

Oooo

2022-09-02 24547, 2022

16:50 PM
Pratha-Fish

I'll check that one out

2022-09-02 24528, 2022

16:51 PM
alastairp

https://www.irccloud.com/pastebin/pSjMmGiV/

2022-09-02 24534, 2022

16:51 PM
alastairp

Pratha-Fish: got this error when running mapper

2022-09-02 24509, 2022

16:52 PM
Pratha-Fish

You probably tried using cached data

2022-09-02 24515, 2022

16:52 PM
alastairp

I did

2022-09-02 24539, 2022

16:52 PM
Pratha-Fish

I was about to touch that one. The script can't use any cached data on the first attempt because it has to run and create the cache first before using it

2022-09-02 24556, 2022

16:52 PM
Pratha-Fish

If you run it without cache once, and then use cache from the next time, it will run

2022-09-02 24502, 2022

16:53 PM
Pratha-Fish

I'll add some documentation about it

2022-09-02 24511, 2022

16:53 PM
alastairp

so that prompt isn't "generate a cache", it's "use an existing cache"?

2022-09-02 24520, 2022

16:53 PM
alastairp

but the cache is automatically generated?

2022-09-02 24523, 2022

16:53 PM
alastairp

what is this cache of?

2022-09-02 24554, 2022

16:53 PM
Pratha-Fish

The cache is basically a cleaned MLHD file with artist credit name and recording names along with clean canonical rec mbids

2022-09-02 24518, 2022

16:54 PM
alastairp

I'm reading the code now

2022-09-02 24521, 2022

16:54 PM
alastairp

btw, another error:

2022-09-02 24538, 2022

16:54 PM
alastairp

https://www.irccloud.com/pastebin/nTuvSB0r/

2022-09-02 24545, 2022

16:54 PM
style- has quit

2022-09-02 24557, 2022

16:54 PM
alastairp

looks like you're missing an `os.makedirs("warehouse/mapper_outputs", exist_ok=True)`

2022-09-02 24523, 2022

16:55 PM
Pratha-Fish

The big idea is,

2022-09-02 24523, 2022

16:55 PM
Pratha-Fish

loading MLHD files, and their respective MB tables (even from parquet) takes ages. So if you just want to test the mapping process, it's better to use previously cleaned MLHD data and then run updated mapper functions on that data instead of creating that data everytime (which takes a few minutes at leaset)

2022-09-02 24527, 2022

16:55 PM
alastairp

as I mentioned - great idea to run your instructions on a fresh checkout of the code to ensure that everything works as expected. for now I'm just making the directory now

2022-09-02 24541, 2022

16:55 PM
Pratha-Fish

alastairp: Noted that one as well

2022-09-02 24543, 2022

16:55 PM
alastairp

right, got it

2022-09-02 24508, 2022

16:56 PM
alastairp

so this is definitely a testing thing, rather than something that will be needed for running on the full dataset?

2022-09-02 24544, 2022

16:56 PM
Pratha-Fish

yes exactly

2022-09-02 24552, 2022

16:56 PM
Pratha-Fish

I also have a full production script ready BTW

2022-09-02 24500, 2022

16:57 PM
Pratha-Fish

the one that we used to convert MLHD to zstd

2022-09-02 24531, 2022

16:57 PM
Pratha-Fish

The rec_track_checker.py script was used for the same

2022-09-02 24502, 2022

16:59 PM
Pratha-Fish

But the script is basically redundant now, since the original MLHD data has been deleted. and we won't need to run the script ever again

2022-09-02 24524, 2022

16:59 PM
alastairp

https://www.irccloud.com/pastebin/U4LbQEa3/

2022-09-02 24546, 2022

16:59 PM
Pratha-Fish

Looks like I hardcoded the path

2022-09-02 24504, 2022

17:00 PM
alastairp

Pratha-Fish: sure, but that's definitely a big piece of work that you did and so we should keep the script and document it

2022-09-02 24554, 2022

17:00 PM
Pratha-Fish

alastairp: yes. In fact it's gonna be the base for our final script :)

2022-09-02 24554, 2022

17:00 PM
Pratha-Fish

Also, it's not an interactive file afaik, and uses a config.json file for it's settings :)

2022-09-02 24542, 2022

17:06 PM
alastairp

good

2022-09-02 24554, 2022

17:06 PM
alastairp

remember though, that json doesn't allow for comments

2022-09-02 24518, 2022

17:07 PM
alastairp

so config.py would be much better, because 1) you can write comments in it, 2) you don't have to explicitly load the file and read data from it, you can just import it

2022-09-02 24546, 2022

17:07 PM
Pratha-Fish

Definitely. I'll switch it up to .py in the upcoming iteration

2022-09-02 24510, 2022

17:08 PM
Pratha-Fish

Do I also include the "main guard" in this config.py file?

2022-09-02 24529, 2022

17:08 PM
alastairp

only in files that you launch from the commandline

2022-09-02 24502, 2022

17:09 PM
Pratha-Fish

_what if someone runs config.py from the commandline_

2022-09-02 24520, 2022

17:09 PM
Pratha-Fish

I mean, no sane person would. Just curious lol

2022-09-02 24525, 2022

17:09 PM
alastairp

good question

2022-09-02 24529, 2022

17:09 PM
alastairp

it'll "execute" the file

2022-09-02 24548, 2022

17:09 PM
alastairp

however, what does the file include? 17 statements of the form "CONFIG_ITEM=1"

2022-09-02 24556, 2022

17:09 PM
alastairp

so this has no effect on your system

2022-09-02 24503, 2022

17:10 PM
Pratha-Fish

indeed

2022-09-02 24505, 2022

17:10 PM
alastairp

it's not going to load a file, delete a file, start a 3 hour computation, etc...

2022-09-02 24533, 2022

17:10 PM
Pratha-Fish

Yep

2022-09-02 24557, 2022

17:10 PM
Pratha-Fish

Which also makes me wonder if I should include this main guard in the modules

2022-09-02 24532, 2022

17:11 PM
Pratha-Fish

Because the module isn't supposed to be run on it's own. We just need to import functions from it ig

2022-09-02 24509, 2022

17:13 PM
alastairp

what do you think the behaviour of the module would be if you ran it?

2022-09-02 24521, 2022

17:13 PM
alastairp

if you had a main guard, what would you run in this case?

2022-09-02 24533, 2022

17:15 PM
Pratha-Fish

well none of them run anything TBH

2022-09-02 24543, 2022

17:15 PM
Pratha-Fish

these modules are just function/class declarations

2022-09-02 24551, 2022

17:18 PM
alastairp

right, so running the module isn't going to do anything

2022-09-02 24500, 2022

17:19 PM
alastairp

and you don't have anything to put inside the main anyway

2022-09-02 24519, 2022

17:20 PM
Pratha-Fish

okie

2022-09-02 24531, 2022

17:20 PM
Pratha-Fish

So what else should we discuss

2022-09-02 24542, 2022

17:22 PM
Pratha-Fish

Maybe I should mention the fact that as you might have noticed, most of the repo is pretty "bootstrappy". i.e. The scripts and notebooks help ME get answer, and test out new stuff.

2022-09-02 24543, 2022

17:22 PM
Pratha-Fish

I am kinda confused, what purpose this code might fulfill to it's end user? We could ponder upon that question and maybe re structure the complete repo accordingly

2022-09-02 24555, 2022

17:25 PM
alastairp

this is definitely an interesting project because the result of running it is a new dataset and then we never need the code again

2022-09-02 24531, 2022

17:26 PM
alastairp

the notebooks are great for the exploration, and they've helped you understand the code well. I see two possible options

2022-09-02 24509, 2022

17:27 PM
alastairp

one is that we just "archive" the notebooks. Put them in a folder, make sure that they run, and that there is a description at the top of the notebook explaining what it was for

2022-09-02 24531, 2022

17:27 PM
alastairp

the other option is that we start thinking about what our final output of the project is in addition to the new dataset

2022-09-02 24540, 2022

17:27 PM
alastairp

I'm thinking reports/blog posts, etc

2022-09-02 24506, 2022

17:29 PM
alastairp

do you think it would be possible to turn each notebook into a small post explaining the key things that we learned by running it? I see things like pandas/arrow/csv tests. We also did some zst tests, right? This would be great to turn into a few small posts with code examples and images

2022-09-02 24520, 2022

17:29 PM
alastairp

then we'd have a great set of guidelines for other people to learn from

2022-09-02 24542, 2022

17:29 PM
Pratha-Fish

That's exactly what I had in mind too :))

2022-09-02 24517, 2022

17:30 PM
Pratha-Fish

Which is why I've already cleaned and archived some test notebooks that demonstrate different benchmarks

2022-09-02 24510, 2022

17:31 PM
alastairp

excellent

2022-09-02 24547, 2022

17:31 PM
alastairp

keep this in mind with the timeline for your project too - it's OK to spend a week writing up these notebooks. this is just as valid as "coding". We don't want to rush this at the end of the project

2022-09-02 24504, 2022

17:32 PM
alastairp

btw, did you get notifications of the changed dates? Did we let you know?

2022-09-02 24539, 2022

17:32 PM
Pratha-Fish

I did recieve a copy of the mail you forwarded to mayhem regarding date extension

2022-09-02 24558, 2022

17:32 PM
Pratha-Fish

I also checked my GSoC dashboard, which is currently showing my project end date as 24th Oct

2022-09-02 24516, 2022

17:33 PM
alastairp

yes, that's what we set it to. good. just checking that you have that date in mind too

2022-09-02 24523, 2022

17:34 PM
Pratha-Fish

Yes, I did have that one in mind

2022-09-02 24547, 2022

17:35 PM
Pratha-Fish

alastairp: Given the remaining work, how long do you think the rest of the project should take?

2022-09-02 24547, 2022

17:35 PM
Pratha-Fish

Maybe even stuff that I should improve upon

2022-09-02 24521, 2022

17:36 PM
alastairp

well, work fills to expand the time available

2022-09-02 24552, 2022

17:36 PM
Pratha-Fish

*Maybe even let me know what I can improve upon since my work time schedule since early august

2022-09-02 24514, 2022

17:37 PM
alastairp

I think that your todo list that you mentioned earlier is a good start. As I said at the beginning of the week we should start writing the final scripts with the assumption that the mapping is working, while we keep debugging the mapping

2022-09-02 24518, 2022

17:38 PM
alastairp

the other things that we can add to the list are your final written deliverables (be that blog posts about your experiments, or just a final post describing the entire project), and also a possible task about looking at the release mbids

2022-09-02 24526, 2022

17:38 PM
alastairp

like we discussed earlier this week

2022-09-02 24525, 2022

17:39 PM
Pratha-Fish

Great

2022-09-02 24504, 2022

17:40 PM
Pratha-Fish

I'll recollect everything we discussed in the past few days, and update the To-Do list

2022-09-02 24513, 2022

17:40 PM
alastairp

that would be great, thanks

2022-09-02 24521, 2022

17:40 PM
Pratha-Fish

That would make it easier to set time bound goals and all

2022-09-02 24532, 2022

17:40 PM
alastairp

you could also write a rough timeline, of when you think you'll be able to complete each of these steps

2022-09-02 24547, 2022

17:40 PM
Pratha-Fish

Yes I'll try