#metabrainz

      • ruaok
        I'd like to get this giant PR merged sooner rather than later to avoid messes down the road.
      • 2020-07-30 21244, 2020

      • ruaok
        (the PR is at least easy to read. :) )
      • 2020-07-30 21245, 2020

      • yvanzo
        alastairp: yes
      • 2020-07-30 21208, 2020

      • alastairp
        yvanzo: perfect. my previous tickets are already there, so nothing more to do for me. thanks
      • 2020-07-30 21214, 2020

      • alastairp
        setup worked well last night
      • 2020-07-30 21241, 2020

      • ruaok
        iliekcomputers: did you just do something? builds passed now. heh.
      • 2020-07-30 21243, 2020

      • iliekcomputers
        i didn't
      • 2020-07-30 21207, 2020

      • ruaok
        how the??
      • 2020-07-30 21225, 2020

      • ruaok
        ok, well, the PR is happy.
      • 2020-07-30 21238, 2020

      • iliekcomputers
        >+3 −1,187
      • 2020-07-30 21240, 2020

      • iliekcomputers
        nice
      • 2020-07-30 21220, 2020

      • alastairp
        who should I talk to about spark setup/scripts? iliekcomputers? ruaok? pristine___?
      • 2020-07-30 21234, 2020

      • ruaok
        not i
      • 2020-07-30 21248, 2020

      • alastairp
        the setup instructions for local dev as they stand don't work, so we'll need to work on that
      • 2020-07-30 21217, 2020

      • iliekcomputers
        they don't?
      • 2020-07-30 21219, 2020

      • alastairp
        and the various scripts lying around the repo? run.sh, config.sh.sample, docker/*.sh ?
      • 2020-07-30 21247, 2020

      • alastairp
        are these all for production? or unused?
      • 2020-07-30 21224, 2020

      • iliekcomputers
        most are for production.
      • 2020-07-30 21252, 2020

      • iliekcomputers
        docker/*.sh could definitely be consolidated, but they don't matter for local dev
      • 2020-07-30 21202, 2020

      • alastairp
        would be nice to use a similar trick that yvanzo has in musicbrainz-docker to add overlay docker-compose files to make the spark reader container
      • 2020-07-30 21206, 2020
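
The overlay idea mentioned above could look roughly like this: a minimal sketch that builds a `docker-compose` command line from a base file plus extra overlay files, each passed with its own `-f` flag so that later files extend or override earlier ones. The helper name and the file names are hypothetical and not taken from either repository.

```python
def compose_command(base_file, overlay_files, *args):
    """Build a docker-compose command that layers overlay files on a base file."""
    command = ["docker-compose", "-f", base_file]
    for overlay in overlay_files:
        # Each additional -f file is merged on top of the ones before it.
        command += ["-f", overlay]
    return command + list(args)


# Hypothetical file names, for illustration only.
spark_reader_up = compose_command(
    "docker-compose.yml", ["docker-compose.spark-reader.yml"], "up", "-d"
)
```

The resulting list can be passed to `subprocess.run(...)` as is; keeping it as a list avoids shell quoting problems.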

      • alastairp
        mmm, ok. I'll look again
      • 2020-07-30 21222, 2020

      • alastairp
        iliekcomputers: did you see my comment about dumps? I recall that you mentioned something a few days ago
      • 2020-07-30 21249, 2020

      • iliekcomputers
        right, i tried importing the public dump using the manage command 2-3 years ago
      • 2020-07-30 21204, 2020

      • iliekcomputers
        and surprise surprise, it didn't work :P
      • 2020-07-30 21223, 2020

      • iliekcomputers
        i think manual COPY FROMs would still work, but I have a ticket open for fixing this.
      • 2020-07-30 21247, 2020

      • alastairp
        so, that is to say it's not possible to load a public dump at the moment?
      • 2020-07-30 21257, 2020

      • alastairp
        2-3 _years_ ?
      • 2020-07-30 21206, 2020

      • iliekcomputers
        a public postgres dump
      • 2020-07-30 21247, 2020

      • iliekcomputers
        yeah, i wrote the initial data dump code at least 2 years ago
      • 2020-07-30 21206, 2020

      • iliekcomputers
        i tried to import a few weeks ago
      • 2020-07-30 21213, 2020

      • iliekcomputers
        sorry, me english bad
      • 2020-07-30 21220, 2020

      • alastairp
        got it
      • 2020-07-30 21201, 2020

      • alastairp
        so, if I'm going to be working through this in the same environment, it'd be great to try and fill in these gaps
      • 2020-07-30 21229, 2020

      • alastairp
        make sure that an external person can do the setup, import some data, push it into local spark, build some models, etc
      • 2020-07-30 21217, 2020

      • ruaok
        ishaanshah: iliekcomputers : https://labs.api.listenbrainz.org
      • 2020-07-30 21230, 2020

      • iliekcomputers
        alastairp: i'm pretty sure that should be possible right now.
      • 2020-07-30 21234, 2020

      • alastairp
        so, I have an lb env set up now. next step is data. can you give a brief description of what data is available, and the process for loading it?
      • 2020-07-30 21235, 2020

      • ruaok
        please update your code to use this URL from now on.
      • 2020-07-30 21241, 2020

      • alastairp
        I'll fill in some missing docs if necessary
      • 2020-07-30 21244, 2020

      • iliekcomputers
        the listens data can get imported into spark.
      • 2020-07-30 21249, 2020

      • alastairp
        (take your time, this evening after work would be fine)
      • 2020-07-30 21212, 2020

      • ishaanshah
        ruaok: noice!
      • 2020-07-30 21218, 2020

      • ishaanshah
        I will update the code
      • 2020-07-30 21257, 2020

      • ishaanshah
        ruaok: we need msid->mbid for artists too
      • 2020-07-30 21259, 2020

      • iliekcomputers
        alastairp: i'd suggest going through the steps here once: https://listenbrainz.readthedocs.io/en/production…, if something doesn't work, we can fix the docs. but that should set you up with a valid data dump with listens in spark.
      • 2020-07-30 21214, 2020

      • ruaok
        ishaanshah: ah, ok.
      • 2020-07-30 21219, 2020

      • ishaanshah
      • 2020-07-30 21228, 2020

      • ishaanshah
        ^ this is not needed
      • 2020-07-30 21235, 2020

      • alastairp
        right, so the process is to load a public spark dump into spark?
      • 2020-07-30 21236, 2020

      • iliekcomputers
        at that point, this explains how we send requests to spark: https://listenbrainz.readthedocs.io/en/production…
      • 2020-07-30 21241, 2020

      • iliekcomputers
        alastairp: yes.
      • 2020-07-30 21251, 2020

      • alastairp
        rather than load a public data dump into timescale and then ship to spark?
      • 2020-07-30 21256, 2020

      • iliekcomputers
        yes.
      • 2020-07-30 21216, 2020

      • alastairp
        ok. I'll try with this process first then
      • 2020-07-30 21226, 2020

      • iliekcomputers
        ishaanshah will also be able to help if you have questions.
      • 2020-07-30 21201, 2020

      • iliekcomputers
        pointing out things missing in the docs etc would be very much appreciated :)
      • 2020-07-30 21232, 2020

      • MajorLurker joined the channel
      • 2020-07-30 21227, 2020

      • ruaok
        zas: another one, matching the previous one: https://github.com/metabrainz/docker-server-confi…
      • 2020-07-30 21245, 2020

      • MajorLurker has quit
      • 2020-07-30 21237, 2020

      • zas
        ruaok: reviewed, lgtm
      • 2020-07-30 21245, 2020

      • ruaok
        thanks!
      • 2020-07-30 21216, 2020

      • alastairp
        iliekcomputers: what creates the metabrainz/hadoop-yarn, metabrainz/spark-master, and metabrainz/spark-worker images?
      • 2020-07-30 21225, 2020

      • alastairp
        ah, hadoop-cluster-docker
      • 2020-07-30 21218, 2020

      • ruaok
        shit. :(
      • 2020-07-30 21256, 2020

      • ruaok
        ishaanshah: I didn't know you need the artist_msid lookup to be in production. that's the messybrainz mapping which is a lot harder to put into production.
      • 2020-07-30 21226, 2020

      • ruaok
        for now, keep using the one on bono until I figure out what to do.
      • 2020-07-30 21215, 2020

      • ishaanshah
        ruaok: sure will do
      • 2020-07-30 21246, 2020

      • BrainzGit
        [listenbrainz-server] mayhem opened pull request #996 (master…update-dump-docs): Update dump docs to reflect new timescale based dumps https://github.com/metabrainz/listenbrainz-server…
      • 2020-07-30 21228, 2020

      • diru1100
        Morning!!!
      • 2020-07-30 21220, 2020

      • diru1100
        yvanzo: i have changed the draft name to v-0.1. The files needed are stored as assets
      • 2020-07-30 21254, 2020

      • alastairp
        anyone seen the follow server fail to start?
      • 2020-07-30 21254, 2020

      • alastairp
        follow_server_1 | 2020-07-30 11:22:03,996 CRITICAL Could not get addresses to use: [Errno -3] Lookup timed out (rabbitmq)
      • 2020-07-30 21212, 2020

      • alastairp
        looks like it can't lookup `rabbitmq` hostname, but this service is running
      • 2020-07-30 21203, 2020

      • alastairp
        loading a new shell and pinging rabbitmq works as expected... :/
      • 2020-07-30 21213, 2020
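
One way to narrow down a failure like the `rabbitmq` lookup above is to retry the name resolution from inside the affected container and see whether it is only failing during startup. The following is a rough diagnostic sketch, not code from listenbrainz-server; the function name, the retry count and the delay are assumptions. On Linux, error `-3` usually corresponds to `EAI_AGAIN`, a temporary resolution failure.

```python
import socket
import time


def resolve_with_retry(hostname, port, attempts=5, delay=1.0):
    """Resolve hostname, retrying on failure; return the sorted unique addresses."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            last_error = err
            print(f"attempt {attempt}/{attempts}: cannot resolve {hostname!r}: {err}")
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"could not resolve {hostname!r}") from last_error
```

Calling `resolve_with_retry("rabbitmq", 5672)` from the follow server's container (5672 being the default AMQP port) would show whether the name becomes resolvable after a short wait.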

      • sumedh joined the channel
      • 2020-07-30 21250, 2020

      • BrainzGit
        [listenbrainz-server] alastair merged pull request #994 (master…develop-improvements): develop.sh improvements https://github.com/metabrainz/listenbrainz-server…
      • 2020-07-30 21202, 2020

      • alastairp
        thanks for the review ruaok
      • 2020-07-30 21236, 2020

      • ruaok
        Thanks for the PR!
      • 2020-07-30 21212, 2020

      • v6lur has quit
      • 2020-07-30 21201, 2020

      • yvanzo
        diru1100: nice
      • 2020-07-30 21235, 2020

      • Major_Lurker has quit
      • 2020-07-30 21254, 2020

      • BrainzGit
        [listenbrainz-server] alastair opened pull request #997 (master…spark-hdfs): Improve HDFS setup and startup scripts https://github.com/metabrainz/listenbrainz-server…
      • 2020-07-30 21201, 2020

      • yvanzo
        diru1100: is bio_tokenizer.pickle still needed? I don't see any reference to it in pr #2.
      • 2020-07-30 21204, 2020

      • supersandro2000 has quit
      • 2020-07-30 21208, 2020

      • alastairp
        iliekcomputers:
      • 2020-07-30 21209, 2020

      • alastairp
        hadoop-master_1 | 2020-07-30 11:53:40,816 WARN hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /temp to /data/listenbrainz because destination's parent does not exist
      • 2020-07-30 21214, 2020

      • alastairp
        does this look familiar?
      • 2020-07-30 21219, 2020

      • supersandro2000 joined the channel
      • 2020-07-30 21243, 2020

      • iliekcomputers
        ishaanshah: ^
      • 2020-07-30 21257, 2020

      • alastairp
        when running the spark data importer. I don't see any reference to /data in the docker-compose.spark file
      • 2020-07-30 21205, 2020

      • alastairp
        thanks :)
      • 2020-07-30 21201, 2020

      • diru1100
        yvanzo: it's not needed. I have removed all pickle files in pr #2
      • 2020-07-30 21200, 2020

      • travis-ci joined the channel
      • 2020-07-30 21200, 2020

      • travis-ci
        Project bookbrainz-site build #3281: passed in 4 min 26 sec: https://travis-ci.org/bookbrainz/bookbrainz-site/…
      • 2020-07-30 21200, 2020

      • travis-ci has left the channel
      • 2020-07-30 21218, 2020

      • yvanzo
        diru1100: so does it still need to be in v-0.1 assets?
      • 2020-07-30 21241, 2020

      • diru1100
        yvanzo: it's needed if they want to generate data.
      • 2020-07-30 21254, 2020

      • diru1100
        Not needed to run the model
      • 2020-07-30 21202, 2020

      • diru1100
        We can remove all *_tokenizers actually
      • 2020-07-30 21206, 2020

      • yvanzo
        diru1100: It would probably be more useful to just explain how to use these tokenizer files as the goal is to allow reproducing tests and derivative works.
      • 2020-07-30 21251, 2020

      • yvanzo
        diru1100: For example, how can one use bio_tokenizer.pickle (step by step)?
      • 2020-07-30 21252, 2020

      • diru1100
        yvanzo: Yes, that should help. But in the dataset_generation notebook we aren't using the pickle file at all. We are directly using the Keras Tokenizer class to do the job.
      • 2020-07-30 21228, 2020

      • diru1100
        I think it is kept maybe to store the tokenizers once we use it in production, but due to online it might change.
      • 2020-07-30 21247, 2020

      • diru1100
        *online learning
      • 2020-07-30 21207, 2020

      • diru1100
        I think the starting paragraph explains the model well. https://github.com/diru1100/spambrainz_ml/blob/gs…
      • 2020-07-30 21258, 2020

      • yvanzo
        my bad, I probably just mistyped the filename
      • 2020-07-30 21259, 2020

      • sumedh has quit
      • 2020-07-30 21203, 2020

      • ishaanshah
        alastairp: Is this the first time you're running the import?
      • 2020-07-30 21247, 2020

      • alastairp
        ishaanshah: yes
      • 2020-07-30 21202, 2020

      • ishaanshah
        Hmm, I think I know why it's happening
      • 2020-07-30 21213, 2020

      • ishaanshah
        the /data folder hasn't been created
      • 2020-07-30 21233, 2020

      • ishaanshah
        It worked for me because it had already been created before
      • 2020-07-30 21241, 2020

      • alastairp
        yes, makes sense
      • 2020-07-30 21243, 2020

      • alastairp
        how did you create it?
      • 2020-07-30 21255, 2020

      • alastairp
        should this happen automatically as part of the import script?
      • 2020-07-30 21201, 2020

      • ishaanshah
        The moving part is a recent addition
      • 2020-07-30 21211, 2020

      • ishaanshah
        > should this happen automatically as part of the import script?
      • 2020-07-30 21211, 2020

      • ishaanshah
        yep
      • 2020-07-30 21219, 2020

      • ishaanshah
        I'll open a PR
      • 2020-07-30 21216, 2020

      • alastairp
        thanks. is there something I can do now to be able to continue without waiting for the PR?
      • 2020-07-30 21221, 2020

      • alastairp
        or will it only take a few minutes?
      • 2020-07-30 21230, 2020

      • ishaanshah
        it should take a few minutes
      • 2020-07-30 21242, 2020

      • ishaanshah
        Otherwise you could use the hdfs cli to create the directory
      • 2020-07-30 21225, 2020

      • ishaanshah
        iliekcomputers may be able to help you with that
      • 2020-07-30 21236, 2020

      • ishaanshah
        I am not familiar with the hdfs cli
      • 2020-07-30 21258, 2020

      • kieto joined the channel
      • 2020-07-30 21233, 2020

      • alastairp
        hdfs cli sounds like something that we should have at least some basic documentation for if it doesn't exist
      • 2020-07-30 21207, 2020

      • iliekcomputers
        alastairp: hmm, i think something like `hdfs dfs -mkdir /data` from inside the hadoop container should fix the issue
      • 2020-07-30 21217, 2020
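
The rename warning earlier says the destination's parent does not exist, so the workaround is to create that parent before the import moves `/temp` into place. A minimal sketch, assuming it runs somewhere the `hdfs` CLI is on the path (for example inside the hadoop container); the helper names are hypothetical and this is not the fix from the pull request.

```python
import posixpath
import subprocess


def mkdir_parent_command(destination):
    """Build the `hdfs dfs` command that creates the parent directory of destination."""
    parent = posixpath.dirname(destination.rstrip("/")) or "/"
    # -p creates missing intermediate directories and does not fail if the path exists.
    return ["hdfs", "dfs", "-mkdir", "-p", parent]


def ensure_parent_exists(destination):
    """Run the command; raises CalledProcessError if `hdfs` reports a failure."""
    subprocess.run(mkdir_parent_command(destination), check=True)
```

For the path in the warning, `mkdir_parent_command("/data/listenbrainz")` yields the equivalent of `hdfs dfs -mkdir -p /data`.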

      • ishaanshah
        alastairp: can you wait for 2 mins
      • 2020-07-30 21231, 2020

      • ishaanshah
        I am just about to open a PR
      • 2020-07-30 21245, 2020

      • alastairp
        I can wait, I'm just eating lunch :)
      • 2020-07-30 21228, 2020

      • Mr_Monkey
        Hi ruaok ! Do you have a few minutes to talk about LB's search_larger_time_range mechanism?
      • 2020-07-30 21258, 2020

      • shivam-kapila
        Mr_Monkey: hi. I have some idea about it. I may be able to help in case you are in a hurry
      • 2020-07-30 21247, 2020

      • Mr_Monkey
        Not in a hurry per se, no, but you can probably help me understand a bit better. In short, I'm trying to figure out what should be changed now that pagination is done in react
      • 2020-07-30 21212, 2020

      • BrainzGit
        [listenbrainz-server] ishaanshah opened pull request #998 (master…import_fix): Fix bug in spark import code https://github.com/metabrainz/listenbrainz-server…
      • 2020-07-30 21220, 2020

      • Mr_Monkey
        As far as I can tell there's no search_larger_time_range param for the API /user/XXXXX/listens endpoint, so i wonder if that's now needed
      • 2020-07-30 21221, 2020

      • ishaanshah
        iliekcomputers: I created a branch in the metabrainz repo by mistake, I'll delete it after the PR gets merged
      • 2020-07-30 21252, 2020

      • shivam-kapila
        Mr_Monkey: oh yes you are right
      • 2020-07-30 21257, 2020

      • shivam-kapila
        We need that
      • 2020-07-30 21208, 2020

      • shivam-kapila
        I may do it for you
      • 2020-07-30 21220, 2020

      • shivam-kapila
        And we may remove it from other page
      • 2020-07-30 21227, 2020

      • shivam-kapila
        route*
      • 2020-07-30 21235, 2020

      • Mr_Monkey
        What's the idea behind that mechanism again?
      • 2020-07-30 21203, 2020

      • jmp_music_
        @alastair after eating your lunch can we do a small meeting?
      • 2020-07-30 21219, 2020

      • shivam-kapila
        Actually it compares if the length of listens fetched is less than the minimum no. of listens we set as a threshold