#metabrainz

      • antgel has quit
      • CatQuest
        hmm.. listenbrainz: Listen Count: 1,156,786 last.fm Scrobbles 1,157,888
      • what is getting eaten?
      • even if you take into account the stuff playing while I import (last thing is about 6 hours ago) it's still over 1000 songs
      • doing a reimport of the few pages after the fact
      • yields Listen Count: 1,156,866
      • so still over 1000 songs :/
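
For reference, the gap between the two counters can be checked with plain arithmetic on the numbers quoted above. Nothing here is queried from either service; the variable names are only illustrative.

```python
# Counts as quoted in the log above.
LASTFM_SCROBBLES = 1_157_888
LB_BEFORE_REIMPORT = 1_156_786
LB_AFTER_REIMPORT = 1_156_866

# Listens present on last.fm but not counted by ListenBrainz.
missing_before = LASTFM_SCROBBLES - LB_BEFORE_REIMPORT
missing_after = LASTFM_SCROBBLES - LB_AFTER_REIMPORT

# Listens picked up by re-importing the most recent pages.
recovered = LB_AFTER_REIMPORT - LB_BEFORE_REIMPORT

print(missing_before, missing_after, recovered)
```

That is 1,102 missing before the reimport and 1,022 after it, with 80 listens recovered.
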
      • CatQuest tries reimport
      • Nyanko-sensei joined the channel
      • D4RK-PH0ENiX has quit
      • Nyanko-sensei has quit
      • D4RK-PH0ENiX joined the channel
      • dragonzeron joined the channel
      • dragonzeron
        how do I add the legal name of somebody using an alias?
      • gcilou has quit
      • agentsim has quit
      • agentsim joined the channel
      • agentsim has quit
      • agentsim joined the channel
      • Gazooo has quit
      • Mineo joined the channel
      • agentsim has quit
      • Mineo has quit
      • chirlu has quit
      • bedlore joined the channel
      • ruaok
        CatQuest: that is a bug and we need to look more closely. can you please file a bug?
      • drsaunders has quit
      • reosarevok
        dragonzeron: just set the "legal name" alias type when adding a new alias from the aliases tab
      • agentsim joined the channel
      • Kode_ is now known as Kode
      • Kode is now known as KodeStar
      • agentsim has quit
      • Nyanko-sensei joined the channel
      • arbenina joined the channel
      • D4RK-PH0ENiX has quit
      • ruaok
        oh CatQuest, I just thought of one thing.
      • some of the data in last.fm is buggy and incomplete, for whatever reason.
      • and when it is incomplete, that data will get rejected by LB.
      • I could try and do an import onto my own machine and see where it complains.
      • iliekcomputers
        some of the listens are getting detected as duplicates, it seems
      • ruaok
        oh, did you see some error messages?
      • iliekcomputers
        even in my import, there are 9-10 listens less
      • no errors, just the duplicate count in influx-writer is sometimes >0
      • ruaok
        it would be good to dig into this further and verify what is going on.
      • iliekcomputers
        yep
      • doing just that :)
      • ruaok
        <3
      • henadel joined the channel
      • agentsim joined the channel
      • Nyanko-sensei has quit
      • alastairp
        hi ferbncode. what are you up to?
      • ferbncode
        alastairp: hi :), working on the PR, will update it for review by today
      • alastairp
        ok. any problems with mbdata?
      • ferbncode
        no, no problems, the delay is just the result of my many stupid doubts I cleared yesterday :)
      • alastairp
        OK. remember that if you have questions you can ask them here too
      • listenbrainz_bigquery_1 exited with code 0
      • after leaving it running all night. not sure when it stopped
      • ferbncode
        sure, will ask, thanks :)
      • yokel has quit
      • iliekcomputers
        alastairp: if you don't have write_to_bigquery on in config.py, it sleeps for a long time and then quits https://github.com/metabrainz/listenbrainz-serv...
      • alastairp
        that will be it
      • why?
      • (i.e., why not quit immediately?)
      • D4RK-PH0ENiX joined the channel
      • iliekcomputers
        I think ruaok added a sleep because it helped debug some docker issue where credentials weren't getting loaded as a volume correctly
      • yokel joined the channel
      • ruaok
        my goal was to prevent the container from restarting all the time if no big query credentials were present.
      • or having to edit the compose file.
      • agentsim has quit
      • this way it just quietly sits and shuts up.
      • I'm certainly not married to this method.
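
A minimal sketch of the behaviour described here, assuming a plain dict in place of config.py. The function name, the `WRITE_TO_BIGQUERY` key, and the one-hour idle period are all hypothetical; the actual listenbrainz-server code is not shown in this log.

```python
import time


def run_bigquery_writer(config, start_writer, sleep=time.sleep, idle_seconds=3600):
    """Start the writer, or idle and then exit cleanly if it is disabled.

    `config` is a dict standing in for config.py, and `start_writer` is a
    placeholder for whatever actually starts writing to BigQuery.
    """
    if not config.get("WRITE_TO_BIGQUERY", False):
        # Sleep before returning so a restart policy does not respawn
        # the container in a tight loop when no credentials are present.
        sleep(idle_seconds)
        return 0  # matches the "exited with code 0" seen above
    start_writer()
    return 0
```

Quitting immediately would be simpler, but with an always-restart policy the container would come straight back up, which is the situation the sleep is meant to avoid.
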
      • alastairp
        yeah, I've fought with docker restart policies before too :)
      • Sophist-UK has quit
      • D4RK-PH0ENiX has quit
      • Sophist-UK joined the channel
      • Sophist-UK has quit
      • Sophist-UK joined the channel
      • D4RK-PH0ENiX joined the channel
      • iliekcomputers
        caught a bunch of listens we consider duplicates in my last.fm import
      • ruaok
        interesting. they look like valid duplicates, no?
      • how did you generate this? what import/reimport steps did you take?
      • I'd love some help getting me talk title/desc up to scracth
      • scratch, even.
      • reosarevok
        ruaok: you namedrop me in an interview and don't tell me? :D
      • ruaok
        reosarevok: hey you! I dropped your name to get some street cred. Did you hear???
      • :-P
      • reosarevok
        (just got a message from a guy I know "hey this post at the top of the subreddit I follow mentions you wtf" :D )
      • ruaok
        lol
      • iliekcomputers
        ruaok: I did a simple last.fm import, just added a print to influx_writer wherever it thought a listen to be a duplicate
      • ruaok
        did you start with an empty DB?
      • iliekcomputers
        yes
      • ruaok
        oh. well, that is a problem, then.
      • clearly there shouldn't be duplicates on a single import.
      • iliekcomputers
        yeah, I dunno what the problem is, though
      • iliekcomputers goes to take a look
      • ruaok
        do we have an edge condition somewhere
      • ?
      • >= vs > ?
      • which might cause listens at the end of a block to be duplicated/omitted?
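
To illustrate the kind of boundary bug being asked about, here is a small sketch with made-up timestamps. Whether the importer actually splits listens into blocks this way is an assumption; `listens_in_block` is a hypothetical helper, not a function from the codebase.

```python
def listens_in_block(timestamps, start, end, include_end):
    """Return the timestamps in [start, end] if include_end, else [start, end)."""
    if include_end:
        return [t for t in timestamps if start <= t <= end]
    return [t for t in timestamps if start <= t < end]


timestamps = [100, 110, 120, 130, 140]

# Both blocks include their end point (<=): the listen on the shared
# boundary, 120, is returned by both blocks and so appears twice.
closed = (listens_in_block(timestamps, 100, 120, True)
          + listens_in_block(timestamps, 120, 140, True))

# Both blocks exclude their end point (<): nothing is duplicated, but the
# very last listen, 140, is never returned by any block.
half_open = (listens_in_block(timestamps, 100, 120, False)
             + listens_in_block(timestamps, 120, 140, False))

print(closed)
print(half_open)
```
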
      • alastairp
        I failed to import 2
      • out of 400 pages
      • iliekcomputers
        could be, I haven't looked at the scraper code in much detail. The data is a bit weird, though: I don't understand how a difference of 2 seconds between listens is possible. We couldn't be the ones doing that, because the last.fm API returns the timestamps itself. Maybe the last.fm data has listens less than 30 seconds apart.
      • alastairp
        I would have been more suspicious if I had failed to import 400 ;)
      • iliekcomputers: yeah, it would be worth doing a scrape of an API and checking the values of the timestamps
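
One way to run that check, sketched under the assumption that the scraped timestamps are already available as a list of Unix seconds. The 30-second threshold is the figure mentioned above; `suspicious_pairs` is a hypothetical name.

```python
def suspicious_pairs(timestamps, min_gap=30):
    """Return adjacent pairs of timestamps less than min_gap seconds apart.

    Pairs with a gap of 0 are exact duplicates; small positive gaps are
    listens that are implausibly close together.
    """
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < min_gap]


# Made-up sample data: one 2-second gap and one exact duplicate.
sample = [1000, 1002, 1200, 1200, 1500]
print(suspicious_pairs(sample))
```
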
      • ruaok
        alastairp: it failed to import more than 1000 for CatQuest
      • iliekcomputers
        alastairp: yup, that's what I was gonna do :)
      • alastairp
        CatQuest: how many pages do you have?
      • agentsim joined the channel
      • ferbncode
        alastairp: I did a force push to the PR https://github.com/metabrainz/critiquebrainz/pu... (still WIP), it would be great to know if I am headed in the correct direction. :)
      • agentsim has quit
      • reosarevok
        inb4 flame war: I've been using XFCE forever but I'm getting a bit tired of some of its issues. Out of the flavours Ubuntu does ship with, what's the one I should install when I reinstall it? :p
      • Galaverna joined the channel
      • alastairp
        ferbncode: cool, I'll look this afternoon. thanks
      • honestly, gnome3 is way better than it was 6 years ago
      • Galaverna has quit
      • I don't think anything annoys me on a day-to-day basis, but I installed quite a number of extensions to make it better
      • agentsim joined the channel
      • ferbncode
        alastairp: great, thanks :). I used gnome a year ago, it was a heavy one for me, but then I switched to i3wm, superlight :P
      • agentsim has quit
      • ruaok
        weird feeling of the day: logging into quickbooks and having new companies appear there that I've never heard of or dealt with.
      • I <3 you Quesito!
      • iliekcomputers
        so the weird data is from lastfm itself
      • Quesito
        Lol! Lots of new ones ruaok!
      • ruaok
        Quesito: :)
      • iliekcomputers: I kinda thought so. you and i had been scouring all aspects of data ingestion.
      • iliekcomputers
        there is this one edge case that I think we might have missed in influx-writer, if there are duplicates in the same rabbitmq batch, it might not find out because the timestamps dict here doesn't contain that timestamp?
      • should we do a `timestamps[t] = result` here also? https://github.com/metabrainz/listenbrainz-serv...
      • ruaok
        interesting thought.
      • reosarevok
        haha
      • !m Quesito
      • BrainzBot
        You're doing good work, Quesito!
      • reosarevok
        (and thanks alastairp, maybe it is time to give gnome a chance again)
      • reosarevok guesses Leftmost would agree
      • ruaok
        iliekcomputers: not sure I fully understand your suggested solution, but yes we should cover that edge case.
      • iliekcomputers
        ruaok: just adding every new timestamp in the rabbitmq to the dict which contains all the timestamps is what I suggested
      • that way, if there is another listen in the batch with the same timestamp it'll get detected and thrown out
      • *the rabbitmq batch
      • ruaok
        k, sounds good
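
The suggestion can be sketched roughly as follows. This is not the influx-writer code, which is not shown in the log: `filter_duplicates`, the `listened_at` key, and the shape of the `timestamps` dict are assumptions made for illustration.

```python
def filter_duplicates(batch, timestamps):
    """Split a batch into listens to write and a count of duplicates.

    `timestamps` stands in for the dict of timestamps already known to be
    stored. It is updated in place, so a second listen with the same
    timestamp later in the same batch is also caught.
    """
    unique = []
    duplicate_count = 0
    for listen in batch:
        t = listen["listened_at"]
        if t in timestamps:
            duplicate_count += 1
            continue
        # The proposed step: remember timestamps seen in this batch too.
        timestamps[t] = listen
        unique.append(listen)
    return unique, duplicate_count


batch = [
    {"listened_at": 1, "track": "a"},
    {"listened_at": 1, "track": "a"},  # duplicate within the same batch
    {"listened_at": 2, "track": "b"},
]
unique, duplicate_count = filter_duplicates(batch, {})
print(len(unique), duplicate_count)
```

Without the `timestamps[t] = listen` line, both listens at timestamp 1 would pass the check, because the dict would only hold timestamps that were stored before the batch arrived.
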
      • Slurpee has quit
      • Quesito
        :) thanks reosarevok!
      • drsaunders joined the channel
      • arbenina has quit
      • is there a schema available for the json dumps?
      • ruaok
        the data contained in the JSON dumps is the same format as the data returned by our API:
      • github joined the channel
      • github
        [listenbrainz-server] paramsingh opened pull request #204: LB-180: Account for duplicates in same RabbitMQ batch for influx-writer (master...influx-writer/same-batch-dup) https://git.io/vQL0v
      • github has left the channel
      • ruaok
        with the fmt=json flag, of course.