I was looking into ListenBrainz as I was considering switching from Last.fm. Then I took a look at what a database dump looked like out of curiosity...I am very surprised that the listen data is not anonymized. Do not feel great about that at all, really.
Perhaps I had a false impression, but I had a belief that the actual tracks listened to would not be associated with individual users, at least in a public dump. I do think that releasing a listens database could be useful for statistical analysis on recordings but having users attached to that is not really necessary.
>By signing into ListenBrainz, you grant the MetaBrainz Foundation permission to include your listening history in data dumps we make publicly available under the CC0 license. None of your private information from your user profile will be included in these data dumps.
At least for me, I did not get the impression that this public dump of listening data would have my profile attached to it. This is because, you know, often times data that is made available for public research is anonymized so that said data cannot be traced back to users. At least easily.
In this case, I am not sure what the value is for including user information in the dump aside from spooking users and onlookers.
crism joined the channel
binzy joined the channel
aerozol[m] joined the channel
aerozol[m]
Techman: What do you mean by anonymized? ListenBrainz is an open database. And without attaching/grouping the listens to some sort of user entity the data is useless.
Basically, if you don’t want your listens linked together, I would not recommend using ListenBrainz or last.fm. That’s a decision a lot of MetaBrainz contributors make - we are generally pretty privacy conscious.
If you have ideas for something along the lines of replacing explicit user names (or something else?) with random strings, you could open a ticket for that. But AFAIK it would be easy to resolve it to a user on the site, with the same stats. But take my input with a grain of salt, I am a layperson when it comes to the data dumps :)
Techman
aerozol[m]: anonymized as in the data was generated by users but the link between the user and the data is not in the output.
It is not foolproof as this is a public site, but do you kinda get what I mean?
Why would the data be useless if it could not be grouped to a user? I feel like you could still gain useful insights without a link. Clients used, popularity of songs, etc.
I do not mind my listening data being available for research but I would not want my username and ID to be linked to it in a public dump of all data. I feel like people should be using a profile page if they care about listens from a specific person. The way it is now, it feels creepy.
aerozol[m]
How would you make the data meaningful, for instance be able to say that x individual users have listened to an artist, without linking all of a user's listens together?
Or generate recommendations?
You could calculate total listens of a song or artist, sure. But we'd have to remove user profile pages
rbatty joined the channel
Techman
The way the data is now, I can build a profile for every person on the site whether they know it or not, without requiring me to do anything. I am sure that I am not the only one who may not have a firm grasp on what is going on here.
aerozol[m]
Building a profile for every person on the site is the point of ListenBrainz. You are making a profile of all of your listens on an open source site. It's what last.fm does as well
(they're not open source, but they have an API and anyone can grab your data)
Techman
I am not a data expert when it comes to anonymization so I will defer for a real solution but I am sure there is a way to have some uniqueness in the data without easily tracing it back to a particular person
aerozol[m]
I guess the general idea is that if you don't attach identifiable information to your username you are 'anonymous' in terms of linking your account to your real life person
Techman
I am treating having someone's username (e.g. looking at someone's profile page) different from public data dumps. If someone looks at my Last.fm page, then I would expect them to know my specific history. However I would not expect it to be in a public dump for research as there should be no need for my account to be identifiable in that.
aerozol[m]
I don't really understand what you mean - public dump, public page, what's the difference? FYI I can plug your last.fm username into lots of places to scrape out interesting data
I guess I'm not really disagreeing with your sentiment... Just that I don't see how ListenBrainz (even more so than other sites, given we are open source) can work around it
Maybe someone else can think of a middle ground, but I can't 😔
rbatty has quit
If there's anywhere you think we can clarify the language re. what will be public, that's something I can make a ticket for btw
Techman
IRC is perhaps not the best place to articulate thoughts but maybe I can try to make what I am thinking as clear as possible. Or make a forum post.
aerozol[m]
Forum is always good for more discussion and input 👍
Then tickets if something actionable comes out of it!
Techman
I guess I will take it from the top. I was originally going to migrate to ListenBrainz as I generally like open source stuff and I think Last.fm charging for reports is kind of bogus, but then I stopped because I checked out the database dumps and realized that identifiable info for a user is in the dump. I do not think that public data dumps should be traceable to users, at least directly.
For research purposes (often what this kind of data is made available for), there is no need to really link data to user accounts. There should be a way to anonymize the users in the output. I consider this different from visiting someone's profile page because the intent is different. Listening data for one particular user is very specific vs the public data dump which currently includes everyone, fully traceable.
If the data were to be anonymized, then the people who grab the data dump can do research while not being able to trace it back to individuals unless they then looked up a person to connect it to that output.
The way the dump is now, everyone who has ever contributed a listen is exposed...even if they would rather only be found organically through other users. As someone looking at the data, I do not think I should have everyone's identifiable listening history. It feels creepy and unnecessary. Anonymous user IDs or some generated substitute could take the place of the usernames and user IDs and the data could be pretty much as useful. If I wanted to know a specific person's history, I can always look them up.
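The "generated substitute" idea above could look roughly like the following. This is only a minimal sketch, not how ListenBrainz builds its dumps: the record fields (`user_name`, `recording`) and the `pseudonymize` helper are hypothetical, and a keyed hash (HMAC) is just one possible way to generate stable substitutes. Listens from the same user still share one pseudonym, so per-user grouping survives while the username itself is left out.

```python
import hashlib
import hmac
import secrets


def pseudonymize(listens, key):
    """Return copies of the listens with user_name replaced by a keyed-hash pseudonym.

    The same user_name always maps to the same pseudonym under a given key,
    so listens stay grouped per user without exposing the username.
    """
    out = []
    for listen in listens:
        digest = hmac.new(
            key, listen["user_name"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        out.append({"user": digest[:16], "recording": listen["recording"]})
    return out


# Hypothetical listen records; the real dump format is not assumed here.
listens = [
    {"user_name": "alice", "recording": "Song A"},
    {"user_name": "bob", "recording": "Song B"},
    {"user_name": "alice", "recording": "Song C"},
]

# The key would have to stay private to whoever produces the dump.
key = secrets.token_bytes(32)
pseudo = pseudonymize(listens, key)
```

Note that this is pseudonymization rather than anonymization: the grouped listening history is itself still present in the output.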
Island_ has quit
ApeKattQuest
oh! I think I get where aerozol's and Techman's miscommunication is
basically aerozol thought that users' data would be removed completely, but the idea was to just put in something that's not a username
(a sentiment I kinda get tbh)
outsidecontext has quit
outsidecontext joined the channel
G0d joined the channel
aerozol[m]
Not quite, I just think that replacing a username with a random number or something is an option, but doesn’t anonymise the data at all. It might just give users a false impression and be even worse, tbh. But we could, I guess. Could just remove usernames and replace them with random strings, like MBIDs (but I don’t see how this is any more anonymous, since listens are still ‘grouped’)
theracermaster has quit
SigHunter has quit
SigHunter joined the channel
rbatty joined the channel
ApeKattQuest
aerozol[m]: I think it means that if someone wants to find a user's username they'd have to make like an effort for it, rather than just having it there, maybe?
idk
i don't super care, but i can also kinda see the point too
slydacyfa has quit
zer0bitz has quit
zer0bitz joined the channel
trolley has quit
trolley joined the channel
trolley has quit
binzy has quit
trolley joined the channel
aerozol[m] has quit
chris8 joined the channel
MeatPupp3t has quit
MeatPupp3t joined the channel
SigHunter has quit
atj
providing a false sense of anonymity is worse than not providing it at all
SigHunter joined the channel
Techman
If the listen data is disassociated from usernames in the public dump, the only way to trace it back to someone would be to...look up that specific user's history.
nobiz joined the channel
Perhaps a better way to describe what I am saying is pseudonymization as opposed to strict anonymization.
ApeKattQuest
I mean I think that's fine, as long as it's specified that you are not actually *anonymised*, as in "if someone tries they can look up just about any user by comparing listen data to the listen website", but yeah, that'd have to be an active, not passive, thing
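The "comparing listen data to the listen website" step can be sketched as follows, to show why replacing usernames does not by itself anonymise grouped listens. Everything here is an assumption for illustration: the `reidentify` helper, the record fields (`user`, `recording`) and the use of Jaccard similarity are hypothetical, and a real attempt would be noisier than this toy data.

```python
from collections import defaultdict


def reidentify(dump, profile_recordings):
    """Find the pseudonym in the dump whose listens best match a public profile.

    dump: iterable of {"user": pseudonym, "recording": title} rows.
    profile_recordings: recordings seen on one user's public profile page.
    Returns (best_pseudonym, jaccard_similarity).
    """
    by_user = defaultdict(set)
    for row in dump:
        by_user[row["user"]].add(row["recording"])

    target = set(profile_recordings)

    def jaccard(recordings):
        union = recordings | target
        return len(recordings & target) / len(union) if union else 0.0

    best = max(by_user, key=lambda user: jaccard(by_user[user]))
    return best, jaccard(by_user[best])


# Hypothetical pseudonymized dump rows.
dump = [
    {"user": "u1", "recording": "Song A"},
    {"user": "u1", "recording": "Song C"},
    {"user": "u2", "recording": "Song B"},
]

# Recordings scraped from one public profile page (hypothetical).
match, score = reidentify(dump, ["Song A", "Song C"])
```

This is an active lookup against one known profile, which is the distinction being drawn in the discussion: the link is recoverable with effort, but is not handed out in the dump itself.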
rbatty has quit
carbolymer has quit
minimal joined the channel
Sciencentistguy
do we have a way to encode "<artist> did mastering on track X of <release>" with the current relationship system?
Sciencentistguy: it sucks because sometimes there are legitimate reasons to add a mastering credit to a recording. however the underlying reason for the deprecation is that changes in mastering don't require separate recordings...so you can see the issue.