[@holycow23:matrix.org](https://matrix.to/#/@holycow23:matrix.org) what postgres queries and which metadata files?
holycow23[m]
lucifer[m]: Let's say I want to write a postgres query to fetch the listens with their year of release, so I need to use the metadata from the HDFS query, and for the listens I am using the function as defined in the gist
lucifer[m]
[@holycow23:matrix.org](https://matrix.to/#/@holycow23:matrix.org) you won't need to write a postgres query, it would be a spark sql query. the distinction is also important because for some things the syntax of spark sql is different from postgres. for the hdfs metadata you can read the dataframe as in the example gist, using either the dataframe api or a spark sql query.
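A rough sketch of that flow; the path, table and column names below are placeholders, not the real ones from the gist:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# read the metadata cache from HDFS with the dataframe api
# (the path is a placeholder, not the actual production location)
metadata_df = spark.read.parquet("/data/release_metadata_cache")

# register it as a temp view so it can be joined in spark sql
metadata_df.createOrReplaceTempView("release_metadata")

# join listens with their release year via spark sql
listens_with_year = spark.sql("""
    SELECT l.user_id, l.recording_name, m.first_release_date_year
      FROM listens l
      JOIN release_metadata m
        ON l.release_mbid = m.release_mbid
""")
```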
You shouldn't need to check the MB dumps in any case. Any metadata that you need should come from the MB db. The format of the MB dumps is different from the MB db, so using the former would create issues.
I'll check the release year YIM queries in a while and confirm if all the data you need for that one is available or not.
I have added release_metadata_cache to the sample dumps already. (Might need to update the codebase/container to import it successfully though).
holycow23[m]
<lucifer[m]> "You shouldn't need to check..." <- Where can I see the formt of the MD db?
lucifer[m]
This is the list of metadata files imported into spark
holycow23[m]
sorry this is the metadata part right?
Yeah my bad
So, I gotta use this right?
No need for MB db?
lucifer[m]
Yes this data already exists and you can use it as needed
If there is some metadata that doesn't exist in these files, then you would need to write queries to create those files by reading data from the MB db
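One possible shape for generating such a file, purely illustrative (the connection string, query and output path are all assumptions):

```python
import pandas as pd
import psycopg2

# pull the needed metadata out of the MB db (placeholder connection string)
conn = psycopg2.connect("dbname=musicbrainz_db user=musicbrainz host=localhost")
query = """
    SELECT r.gid AS release_mbid,
           rgm.first_release_date_year
      FROM release r
      JOIN release_group rg ON r.release_group = rg.id
      JOIN release_group_meta rgm ON rgm.id = rg.id
"""
df = pd.read_sql(query, conn)

# write it out as parquet so it can be uploaded to HDFS and read in spark
df.to_parquet("release_year_cache.parquet", index=False)
```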
I'll update the setup on wolf later today to add release_metadata_cache to the table.
*to spark.
holycow23[m]
lucifer[m]: I didn't get this?
lucifer[m]
There is one more metadata file available in production that is missing from your local setup because I only added it to sample dumps last week.
saumon has quit
I'll update your spark setup to add that file.
julian45[m]
reosarevok: a while back we talked about the continued need to mass mail auto-editors for election notifications, even in a post SSO implementation future...... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
not urgent, just a few thoughts i had while heading towards bed :)
<julian45[m]> "reosarevok: a while back we..." <- > <@julian45:julian45.net> reosarevok: a while back we talked about the continued need to mass mail auto-editors for election notifications, even in a post SSO implementation future...... (full message at <https://matrix.chatbrainz.org/_matrix/media/v3/...>)
saumon joined the channel
Maxr1998 joined the channel
Maxr1998_ has quit
mayhem[m]
<mayhem[m]> "lucifer: labs.api is running..." <- Did you take a look to see if anything was amiss with the data?
dabeglavins60721 has quit
lucifer[m]
mayhem: missed that message yesterday, will take a look now.
<holycow23[m]> "lucifer: I wrote this small..." <- you can assume it will work fine in production without limiting, we have bigger queries that work fine there. do you still run out of memory with --driver-memory 8g?
pite_ has quit
pite joined the channel
holycow23[m]
Yes I did run out of memory
mayhem[m] uploaded an image: (23KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/QxFQIEVftryIukGiTwUJvmZA/image.png >
mayhem[m]
lucifer: my LB instance is throwing this error on login
keys verified, so without an error message, I am unsure how to proceed.
<lucifer[m]> "you can assume it will work fine..." <- I did run out of memory, also does such type of querying work or do I need SQL queries, that's what I wrote in mock queries in the proposal so either I will have to use that or just pandas filtering
lucifer[m]
holycow23: that type of querying works, but i think for consistency's sake it's best to use SQL queries only.
holycow23[m]
lucifer[m]: Okay, that's what I thought too, but how do I test those?
lucifer[m]
you can test those by passing the query to spark.sql(query)
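i.e. something like this in a pyspark shell, where `spark` is already defined (the query itself is only an example):

```python
query = """
    SELECT user_id, count(*) AS listen_count
      FROM listens
     GROUP BY user_id
"""
result = spark.sql(query)
result.show(10)
```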
for running out of memory, i'll take a look at it. there are different kinds of memory configurations in spark and it's possible another one needs to be increased to avoid the issue.
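For reference, a sketch of the main memory knobs (the values are examples, not recommendations; note that driver memory generally has to be set before the JVM starts, e.g. at submit time):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")         # what --driver-memory sets
    .config("spark.executor.memory", "4g")       # per-executor heap
    .config("spark.driver.maxResultSize", "2g")  # cap on data collected back to the driver
    .getOrCreate()
)
```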
mayhem[m]
<lucifer[m]> "to confirm the OAUTH_CLIENT_ID..." <- yes, both match
lucifer[m]
i'll try to reproduce the issue and fix it
mayhem[m]
let me know if you need help.
holycow23[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/NwWeXkNYsOPcWVKEASzafweS
holycow23[m]
This script worked quite well for mapping the songs to their genres
outsidecontext[m]
reosarevok: is the tagger link fix for taglookup supposed to be deployed on beta? Because I still get the issue there
It no longer opens a localhost tab for me at least...
(and I get the same error on the console as on search)
reosarevok[m] sent a code block: https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/UWGxJbpzaowymWnFeQTNlXwt
outsidecontext[m]
it does for me (well, it is the same tab for me, but it navigates to http://127.0.0.1:8001)
ok, sorry. was a cache issue. cleared the cache and now it works
holycow23[m]
<lucifer[m]> "for running out of memory, i'..." <- I just wrote a query for count of listens per genre grouped by user, that worked well without any limits
<lucifer[m]> "rayyan_seliya123, suvid, m...." <- hey lucifer gentle reminder can u please review this commit https://github.com/metabrainz/listenbrainz-serv... as we discussed tp get move ahead whats pending or something !!
lucifer[m]
this will be used to execute your query and generate the results.
holycow23[m]
Okay will look into it
mayhem[m]
lucifer: was this dataset processed with the Beatles fix in place?
lucifer[m]
mayhem: nope.
holycow23[m]
lucifer[m]: Actually I did go through this in the early days but how do I test this?
lucifer[m]
mayhem: i don't have the link to those video recordings with the lfm guys. can you share it again?
mayhem[m]
lucifer[m]: the data looks really nice, from the spot checks I've made. but artists like the beatles are featuring quite prominently in some results. so I would very much love to see this fixed for all of our similar data sets.
lucifer[m]
fwiw, it might not be easily applicable here; to the best of my recollection, their suggestion was to scale items in the collaborative filtering model.
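The idea, roughly, is to shrink the factor vectors of very popular items so they stop dominating similarity scores; a toy sketch with made-up data and exponent:

```python
import numpy as np

# toy item-factor matrix from a collaborative filtering model
item_factors = np.random.rand(1000, 64)
play_counts = np.random.randint(1, 10_000, size=1000)

# damp very popular items by scaling their factors by popularity^-alpha,
# so ubiquitous artists rank less prominently in similarity results
alpha = 0.5
scaled_factors = item_factors / (play_counts[:, None] ** alpha)
```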
mayhem[m]
ah, yes. ok, in that case, I think this is workable for the start. I can't see any problems from my spot checking, but eventually others might. so, let's keep our ability to regenerate this data alive for the time being.
holycow23: above are the changes needed to add a new stat to spark.
holycow23[m]
Okay will go through them thanks
lucifer[m]
once all of this is in place, you will be able to run the command created in step 5 to send a request to the spark cluster (like we do for requesting existing stats or creating a new dump)
for testing purposes, when your class in step 1 is ready, you can import it in pyspark and run it directly.
the code will be similar to the function linked in step 2.
when you have written step 2, you can just import that function and call it with the desired arguments to test steps 1 and 2 together, and so on.
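Concretely, the test loop could look something like this in a pyspark shell (the module and function names are placeholders for whatever is created in steps 1 and 2):

```python
# placeholder import path, not the real module name
from listenbrainz_spark.stats.user import my_new_stat

# call the entry point directly with the desired arguments and
# inspect the messages it would send back to the web server
for message in my_new_stat.get_new_stat(year=2024):
    print(message)
```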
holycow23[m]
Okay, will go through them, and if anything comes up I will get back to you
lucifer[m] posted a file: Debugging.ipynb (3703KiB) < https://matrix.chatbrainz.org/_matrix/media/v3/download/chatbrainz.org/ulXxuIqCvyTUGaSAkIwhAMel >
lucifer[m]
this is a notebook that i use for similar testing and debugging of the spark cluster. i'll try to clean it up later and share it with you, but for now you can look at the raw version and see if it helps.
rayyan_seliya123: you can combine the changes from both PRs into a single one and close the other one.
rayyan_seliya123
lucifer[m]: The one in which I have added SQL files or tables? I should close that one, and merge it into the one with the seeder file and indexer script?
lucifer[m]
sure sounds good to me
rayyan_seliya123
lucifer[m]: Ok, will do it, and do let me know after that review so I can move further
lucifer[m]
suvid: i took a look at your PR, there are some changes needed but i don't see any blockers so you can start working on implementing the processing of zip for imports.
also, tests will be needed for every view including file uploads.
lusciouslover has quit
lusciouslover joined the channel
BobSwift[m] joined the channel
BobSwift[m]
<reosarevok[m]> "> <@julian45:julian45.net..." <- And the mailing tool I put together was a simple hack to help support that stopgap measure. It was never intended to be a full-on production type mass mailer.
reosarevok[m]
Tons of thanks for that, by the way! :)
mayhem[m]
lucifer: we had a user discover that playlists belonging to a deleted LB user gave a 500 error when trying to load those pages. I've made a deleted_lb_user that all deleted playlists are ascribed to, like this:
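Presumably something along these lines; a hypothetical sketch, with the table and column names assumed rather than taken from the actual change:

```python
import psycopg2

conn = psycopg2.connect("dbname=listenbrainz user=listenbrainz")
with conn, conn.cursor() as cur:
    # look up the placeholder account that deleted playlists are ascribed to
    cur.execute('SELECT id FROM "user" WHERE musicbrainz_id = %s', ("deleted_lb_user",))
    deleted_user_id = cur.fetchone()[0]
    # point orphaned playlists at it so their pages stop returning 500s
    cur.execute(
        'UPDATE playlist.playlist SET creator_id = %s '
        'WHERE creator_id NOT IN (SELECT id FROM "user")',
        (deleted_user_id,),
    )
```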