this is interesting, because I have no idea exactly how unique it's going to be
different build of ffmpeg, same file? maybe different
mp3, flac, definitely different
mp3, different mp3? maybe different
ianmcorvidae
well, the notion is that if the data's completely the same it's not worth keeping both
if there's more that should go into that calculation then that's also fine
alastairp
yes, true
ianmcorvidae
just trying to do better than "keep the last 5 that happened to be submitted"
alastairp
right
ianmcorvidae
(or "keep only the first, or the first lossless")
alastairp
it's just that "the same" in terms of features can be different
I agree that "exactly the same" is useless
ianmcorvidae
well, sure, though this is on the whole JSON data
alastairp
but it's a big change (and potentially computationally expensive) for just that
ianmcorvidae
so it should also change for things like build differences etc., with what I have
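A minimal sketch of the "hash the whole JSON data" idea, assuming the submitted document is plain JSON and that SHA-256 over a canonical serialisation is an acceptable fingerprint. The field names and the `document_hash` helper are hypothetical, not the actual server code.

```python
import hashlib
import json


def document_hash(doc):
    """Return a SHA-256 hex digest of a JSON-serialisable document."""
    # sort_keys and fixed separators give a canonical string, so key order
    # and whitespace do not affect the digest.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical submissions; the field names are illustrative only.
first = {"metadata": {"version": {"build": "a"}}, "lowlevel": {"average_loudness": 0.5}}
same = {"lowlevel": {"average_loudness": 0.5}, "metadata": {"version": {"build": "a"}}}
rebuilt = {"metadata": {"version": {"build": "b"}}, "lowlevel": {"average_loudness": 0.5}}

print(document_hash(first) == document_hash(same))     # True: only key order differs
print(document_hash(first) == document_hash(rebuilt))  # False: the build field changed
```

With this approach any difference anywhere in the document, including build information, yields a new hash, which matches the behaviour described in the message above.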
alastairp
right, I suspect that this will only dedup the exact same person running it twice
yeah, it'll change on build
it makes me feel a little funny, so I'll wait for rob to weigh in
(thanks though!)
I'm just fixing the json exporter for you
ianmcorvidae
just in terms of it not creating much uniqueness?
alastairp
yeah
ianmcorvidae
fair enough -- I think splitting things up a bit might make sense eventually, so that smaller pieces are stored individually (each version block only needs storing once, for example, and likewise the whole lowlevel category if it comes out the same) -- not sure exactly where to break it up
which is the way to make this catch things better, I think: isolate "the tags changed" from "the build changed" from "the features changed"
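One way the splitting idea might look, as a sketch only: hash each top-level section on its own so that a change in tags can be told apart from a change in build or features. The section names (`tags`, `version`, `lowlevel`) and both helper functions are assumptions made for illustration.

```python
import hashlib
import json


def section_hashes(doc):
    """Map each top-level section name to a digest of that section alone."""
    hashes = {}
    for name, section in doc.items():
        canonical = json.dumps(section, sort_keys=True, separators=(",", ":"))
        hashes[name] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return hashes


def changed_sections(old, new):
    """Return the sorted names of sections whose digests differ."""
    a, b = section_hashes(old), section_hashes(new)
    return sorted(name for name in set(a) | set(b) if a.get(name) != b.get(name))


# Hypothetical documents: same build and features, different tags.
old = {"tags": {"title": "x"}, "version": {"build": "a"}, "lowlevel": {"loudness": 0.5}}
new = {"tags": {"title": "y"}, "version": {"build": "a"}, "lowlevel": {"loudness": 0.5}}

print(changed_sections(old, new))  # ['tags']
```

Sections whose digest is already known would only need storing once; where exactly to draw the section boundaries is left open, as in the discussion.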
I understand why he didn't escape keys - they're supposed to all come from internal code and you should never name a pool (essentia term for a key) with that
ianmcorvidae
yeah, makes sense
alastairp
but in the case of tags it just gets everything from taglib and dumps it there
ianmcorvidae
I figured it was something like that XD
alastairp
ok, fixed in another branch, you can merge it if you want
I need to redo the branches, one for each of my fixes and one combining everything for us guys - it's because I want dmitry to be able to pull what he wants into master
unfortunately it might mean we end up with hashids in abz that are no longer in the tree. that'll be annoying
next thing I want to do when we have more data is meta-stats over the mb database
how many complete albums, how much of an artist's collection
then meta-meta stats. how many pop albums as determined by lastfm tags (when are we getting genres?)
ijabz1
alastairp: is the only mb data you are storing the mbrecordingid, or are you storing acoustid as well?
in acousticbrainz?
alastairp
only recordingid
in the case where there is no recordingid tag in the file I want to do an acoustid lookup, it would be fine to add that as additional metadata
ijabz1
I just wonder because one acoustid can match multiple recordingids, and when that is the case there is the chance that the mapping to recordingid is wrong
alastairp
once I add that functionality we can also add an option to always submit acoustid if the person has it installed
right
ok, so if we find recordingid by fingerprinting it's a good idea to submit acoustid too
ijabz1
having the acoustid would allow you to postcheck bad data at a later date
imho, matching by acoustid isn't a good idea: it would mean some kind of autotagging, which will lead to many errors (acoustid associated with an incorrect recording, acoustid matching multiple recordings, etc.). you should just encourage people to tag their files using Picard (which uses acoustid, but the user can check whether it is correct)
ijabz1
Picard does autotag, I don't think people are going to want to start retagging their collection in order to contribute to acousticbrainz
I'm just saying that if their files already contain an acoustid it's useful to send that, as it helps verify that the data is correct or indeed incorrect
alastairp
agreed. I would propose matching with acoustid but marking the data as such
e.g. "I'm happy to deal with potentially bad files" or "I only want almost certain files"
ijabz1
Maybe, that's not really what I'm saying though, I'll try again.
If the songs have an mbrecordingid then they have already been tagged by some method, be that Picard, SongKong, etc.
You just need the mbrecordingid to serve as the key, but if the user has additional metadata such as acoustid already in the file then they should send that as well; this helps verify at some later stage if we see bad data
e.g. the Acoustid for that MBRecordingid matches many MBRecordingIds, higher risk
or vice versa, none of the Acoustids known for that MBRecordingid match the one sent by the user, higher risk
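The two risk cases just described could be checked after the fact roughly as follows. This is a sketch under stated assumptions: the two lookup tables are hypothetical in-memory dicts standing in for whatever AcoustID/MusicBrainz mapping data is available, and treating "more than one recording" as "many" is an arbitrary threshold.

```python
def risk_flags(recording_id, submitted_acoustid,
               acoustids_for_recording, recordings_for_acoustid):
    """Return a list of reasons a submission looks risky (empty if none)."""
    flags = []

    # Case 1: the submitted acoustid is linked to several recordings.
    linked = recordings_for_acoustid.get(submitted_acoustid, set())
    if len(linked) > 1:
        flags.append("acoustid_matches_many_recordings")

    # Case 2: acoustids are known for this recording, but none of them
    # is the one the user sent.
    known = acoustids_for_recording.get(recording_id, set())
    if known and submitted_acoustid not in known:
        flags.append("acoustid_not_known_for_recording")

    return flags


# Hypothetical mapping data with placeholder identifiers.
acoustids_for_recording = {"rec-1": {"ac-1"}, "rec-2": {"ac-2"}}
recordings_for_acoustid = {"ac-1": {"rec-1"}, "ac-2": {"rec-2", "rec-3"}}

print(risk_flags("rec-1", "ac-1", acoustids_for_recording, recordings_for_acoustid))
print(risk_flags("rec-2", "ac-2", acoustids_for_recording, recordings_for_acoustid))
print(risk_flags("rec-1", "ac-2", acoustids_for_recording, recordings_for_acoustid))
```

The first call returns no flags, the second flags the ambiguous acoustid, and the third raises both flags.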
zas
it reminds me of http://musicbrainz.org/release/09186fe9-18af-47... where the acoustids on both discs are the same; the second disc has the tracks without main voices... ;) acoustids are totally messed up on this one, kinda expected
i wonder how track 1-1 and track 2-2 can share the same acoustid (now)
alastairp
ijabz1: we send every tag that taglib finds
zas
i mean track 1-2 and 2-1
alastairp
if taglib tells us there is a tag for acoustid, it'll get sent (I'm not sure if this means that taglib needs to know how to parse an acoustid tag)
alastairp: Always storing AcoustIDs wouldn't be bad either, since recordings do sometimes need to be split up.
Also, caught up with back log: nvm. ;)
Man. Those show stopper bugs are really annoying. :(
yeeeargh
i'm a bit curious what kind of music you guys are scanning. i didn't encounter that chord-bug once yet. the only errors i got were a bunch of replaygain/silence bugs with tracks which were either literally silence or tracks with a large amount of silence between two songs (hidden tracks)
Freso
I've scanned some hip hop, R&B, dancehall, Christmas music stuff, pop, folk/trad., ...
I have one thread hanging right now, but have had several bailing out on a "IOError: [Errno 2] No such file or directory: '/tmp/tmpy5hQly.json'"
alastairp
yeah, that'll be because the extractor fails to write the file, and the submitter tries to blindly open it
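A sketch of how the submitter side might guard against this, assuming the extractor output is a JSON file at a temporary path. `load_features` is a hypothetical helper, not the real submitter code.

```python
import json
import os
import tempfile


def load_features(path):
    """Return parsed extractor output, or None if it is missing or invalid."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (IOError, ValueError) as error:
        # IOError covers the missing-file case from the traceback above;
        # ValueError covers a file that exists but is not valid JSON.
        print("skipping %s: %s" % (path, error))
        return None


# The extractor "failed", so this path was never written.
missing = os.path.join(tempfile.mkdtemp(), "features.json")
print(load_features(missing))  # prints a "skipping" line, then None
```

Returning None lets the caller record the file as failed instead of aborting with an unhandled exception.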
Freso
Yep.
alastairp
bug fix for that will be coming in the weekend
Freso
And it's consistent for the files it happens to.
...
Which, in retrospect, I should have probably collected somewhere for easy re-submission...
alastairp
yeah, we have no way of marking a file as submitted, or bad for submitting
also, it'd be nice to know how the extractor failed on those ones
to report bugs if needed
Freso
Yep. But I figure the ones I run into are the ones already reported last night, so I'll wait until those are sorted out before reporting new stuff. :)
It would also be nice if it would continue with the rest of the queued files and then report at the end which ones didn't work...
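The "keep going and report at the end" behaviour could be sketched like this, assuming each file is handled by a single callable that raises on failure. `submit_all` and `fake_process` are hypothetical names used only for illustration.

```python
def submit_all(paths, process):
    """Run process(path) for every path; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for path in paths:
        try:
            process(path)
        except Exception as error:
            # Remember the failure and move on to the next queued file.
            failed.append((path, str(error)))
        else:
            succeeded.append(path)
    return succeeded, failed


def fake_process(path):
    # Stand-in for "extract then submit"; fails for one hypothetical file.
    if path == "b.flac":
        raise IOError("extractor produced no output")


succeeded, failed = submit_all(["a.mp3", "b.flac", "c.ogg"], fake_process)
print("submitted:", succeeded)
for path, reason in failed:
    print("failed:", path, "-", reason)
```

The `failed` list doubles as the record of files to retry or report once the extractor bugs are fixed.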
alastairp
code. patch. etc
seriously, I have about 4 different things going on here