#musicbrainz-devel


      • navap joined the channel
      • MightyJay joined the channel
      • luks joined the channel
      • ijabz joined the channel
      • ijabz
        luks: Hi, I'm trying again to get my head round your single-index Lucene search indexing
      • Why do you think it would be quicker to get a match if artist_index, release_index ... were all put into a single index?
      • luks
        ijabz: because the indexes would be smaller
      • ijabz: I expect they would be so small that they can fit completely in RAM
      • ijabz
        Why would they be smaller, I don't get it
      • luks
        because you don't store any literal strings
      • do you know more or less how libraries like Lucene index text?
      • ijabz
        That's a different point, not storing the values (which I was going to come to)
      • luks
        you can ignore the post if you want to store values
      • that completely changes the situation
      • they would be still some size savings
      • ijabz
        I'm just trying to work through this, let's put it another way then. Why didn't you suggest not storing values but keeping the separate indexes?
      • luks
        but I don't expect searches to be faster
      • because we already have a framework for loading this data
      • most of the frequently used data we also have available on memcached servers
      • oh
      • well, because I want the indexes to share data
      • let's say you have artist index with artist names
      • you need the same data in track index
      • so basically the way it currently is: track index = track data + release index + artist index
      • there are several advantages to this
      • ijabz
        I see, so you're thinking artist name is interned() and can be used by track and artist, but I would think this would apply only to stored data, not to the searched data
      • luks
        like you can use artist aliases in track searches for almost no cost
      • it's not "interned"
      • it's indexes
      • these systems work like this: they tokenize the input, and build a dictionary
      • then the index is a list of pointers to the dictionary
      • lucene is not trying to optimize stored data
      • just look at the track index, how many times you will find duplicated artist names there
      • seriously, postgresql is a better database than lucene
      • the tickets about not having data available in search results could be trivially solved by this
      • you could even do things like /ws/1/artist?query=Foo&inc=tags
      • there are several advantages, and I can't think of a disadvantage
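The indexing scheme luks describes above (tokenize the input, build a dictionary, keep the index as pointers into the dictionary) can be sketched as a toy inverted index; this is illustrative only, not Lucene's actual data structures:

```python
# Toy inverted index: tokenize each document, intern every distinct term in
# a dictionary, and keep postings as lists of integer doc ids.  No literal
# string is stored more than once, which is why such indexes stay small.
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    dictionary = {}               # term -> term id (each term stored once)
    postings = defaultdict(list)  # term id -> sorted list of doc ids
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            term_id = dictionary.setdefault(token, len(dictionary))
            if not postings[term_id] or postings[term_id][-1] != doc_id:
                postings[term_id].append(doc_id)
    return dictionary, postings

docs = ["Faye Wong", "Faye Wong Sky", "Sky"]
dictionary, postings = build_index(docs)
print(postings[dictionary["faye"]])  # -> [0, 1]
```

A search is then just a dictionary lookup plus an intersection of postings lists, with no need to touch the original strings.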
      • ijabz
        hang on, I'm not onto the storing of fields yet
      • luks
        if you are concerned about speed, I meant to use filters, not search queries
      • these filters are kept in memory and optimized specifically for cases like this
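The in-memory filters mentioned here are, conceptually, precomputed sets of matching documents that get intersected with each query's hits. A rough sketch of the idea (not Lucene's actual Filter API; all names here are illustrative):

```python
# Sketch of a cached filter: the set of matching doc ids is computed once,
# kept in RAM as a bitset, and intersected with each query's hits, so the
# filtering step never touches the disk.
class CachedFilter:
    def __init__(self, num_docs, predicate):
        self.bits = [predicate(d) for d in range(num_docs)]  # built once

    def apply(self, doc_ids):
        return [d for d in doc_ids if self.bits[d]]

# Hypothetical example: restrict a query's hits to "track" documents.
doc_types = ["artist", "track", "track", "release"]
track_filter = CachedFilter(len(doc_types), lambda d: doc_types[d] == "track")
print(track_filter.apply([0, 1, 3]))  # -> [1]
```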
      • ijabz
        If you take your example, yes, an artist may be used by 50 tracks, but if each track can reuse the artist token it's only in the track index once, the same as if the artist and track index were merged.
      • luks
        if you mean indexed data, then yes it does
      • but you still have the same data in the track index and the artist index
      • ijabz
        But so what
      • luks
        it could de-duplicate even stored data, but it doesn't
      • you are wasting RAM
      • why keep duplicate data in cache?
      • the track index needs dictionary from artist and release indexes, the release index needs dictionary from the artist index
      • why not have it only once?
      • the idea is to have almost everything in RAM
      • so you only rarely touch the disk
      • the track index with stored values is 3GB, as you mentioned in a trac ticket
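The RAM-duplication argument can be made concrete with a back-of-the-envelope calculation: with separate indexes, every track document carries its own copy of the artist's name, while a shared dictionary stores the name once and each track holds only a small integer reference. A hypothetical sketch (the artist, track count, and 4-byte term id are all assumptions):

```python
# Hypothetical back-of-the-envelope comparison of stored bytes.
artist_name = "Wolfgang Amadeus Mozart"
num_tracks = 50

# Separate indexes: the name's bytes are repeated in every track document.
duplicated_bytes = len(artist_name.encode()) * num_tracks

# Shared dictionary: the name is stored once; each track stores a 4-byte id.
shared_bytes = len(artist_name.encode()) + 4 * num_tracks

print(duplicated_bytes, shared_bytes)  # -> 1150 223
```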
      • ijabz
        and without?
      • luks
        I'm sure the whole combined index without stored values would be smaller
      • ijabz
        And you think this is going to improve search speed or not
      • luks
        :)
      • I'm pretty sure that if everything is in RAM, it's going to improve search speed
      • ijabz
        Possibly, I thought that with my testing done a few months ago, but it didn't seem to be the case
      • luks
        the stored values mostly cost you disk seek times
      • they are spread around the disk
      • ijabz
        I suppose with 64-bit, there's no practical RAM limit
      • luks
        are you seriously suggesting to keep duplicate data in RAM as a better solution?
      • I mean, I'm not asking anybody to do any work on this, but it seems pretty clear to me that the less duplication you have in RAM, the better
      • ijabz
        No I'm not, I'm only suggesting that this might not make much difference
      • luks
        I'd be ok with that
      • if the performance doesn't change, and I have more features available, it's a win situation anyway
      • ijabz
        It's worth an experiment, but you really need to benchmark it against the existing system, because I think you're underplaying the cost of the database retrievals
      • luks
        ijabz: I'm pretty sure loading data from a replicated DB slave and a memcached server is a more scalable solution than having the data on the same machine
      • the indexes are updated daily, one hour off data on the DB slave is fine
      • ijabz
        So what's in memcache?
      • luks
        in NGS: artists, labels, artist credits and ARs
      • release groups will be probably added too
      • there are other things, but these are easily reusable
      • ijabz
        But these are small things; the big things are releases, tracks and relationship details, these aren't going to fit, are they?
      • luks
        these are small, but very frequently used things
      • note that for search results you don't need that much data
      • in release search results, all you have to do is load the releases from the DB and load artist credits and labels from memcache
      • that's a single DB query, 2 memcached queries
      • both are on separate servers, optimally using all kinds of local caches
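That flow, one batched DB query plus two memcached multi-gets, can be sketched as follows; all class and function names here (FakeDB, FakeCache, assemble_release_results) are hypothetical stand-ins, not actual mb_server code:

```python
# Hypothetical sketch of assembling release search results from a DB slave
# plus a memcached server.  FakeDB and FakeCache are illustrative stubs.
class FakeDB:
    def __init__(self, rows):
        self.rows = rows
    def fetch_releases(self, ids):   # stands in for 1 batched SQL query
        return [dict(self.rows[i]) for i in ids]

class FakeCache:
    def __init__(self, data):
        self.data = data
    def get_multi(self, keys):       # stands in for 1 memcached multi-get
        return {k: self.data[k] for k in keys}

def assemble_release_results(release_ids, db, cache):
    releases = db.fetch_releases(release_ids)            # 1 DB query
    credit_keys = ["ac:%d" % r["artist_credit"] for r in releases]
    label_keys = ["label:%d" % r["label"] for r in releases]
    credits = cache.get_multi(credit_keys)               # memcached query 1
    labels = cache.get_multi(label_keys)                 # memcached query 2
    for r in releases:
        r["artist_credit"] = credits["ac:%d" % r["artist_credit"]]
        r["label"] = labels["label:%d" % r["label"]]
    return releases

db = FakeDB({1: {"title": "Fable", "artist_credit": 7, "label": 3}})
cache = FakeCache({"ac:7": "Faye Wong", "label:3": "EMI"})
print(assemble_release_results([1], db, cache))
```

The point of the batched lookups is that the per-result cost stays at one query per data source, regardless of how many results are returned.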
      • ijabz
        As opposed to 1 Lucene query, and if you allow free searches (as you suggested) it could get a lot more complex
      • luks
        note that the 1 lucene query is more expensive
      • because you need to do 1 lucene query, and then seek all over the index to get the document data
      • if you have 5 lucene indexes on a single machine, they will all fight for IO resources and disk caches
      • ijabz
        Maybe, neither of us knows for sure, the new Lucene is much faster than the version you were using before. I don't oppose your idea, but don't you think a sensible first stage would be to get some figures for the work I've just finished, then make the changes as you suggested, and get some comparisons?
      • luks
        I was not suggesting you do anything :)
      • (I'm pretty sure actually)
      • I posted the mail a long time ago, and it was just a "would be nice" kind of idea
      • I have lots of those, but no time to implement them :)
      • but murdos wants to work on NGS support in the search server, and that will add even more duplicate data this way
      • ijabz
        Murdos is looking at it, and I'm trying to continue working on NGS, but I'm very confused about the direction we are heading
      • Nobody having much time was one of my points, I thought this would be better looked at after we have a workable NGS solution
      • luks
        I was really not trying to suggest that anybody should try to implement this
      • ijabz
        lol, that's exactly what I thought you were suggesting
      • So what do you think should be the approach for NGS?
      • luks
        I was asking murdos for an opinion on this, because I know he knows the NGS schema well enough
      • I'm not sure about the right approach
      • ijabz
        luks: tell you what, I've got to go out and we've been chatting for an hour now, perhaps we could talk a bit more late afternoon?
      • luks
        ok, I should be here
      • nikki wonders if ijabz is back yet
      • ijabz
        nikki: hi
      • nikki
        hey
      • I was wondering how the search handles punctuation
      • ijabz
        I haven't looked at this in any detail, I was concentrating on replicating the existing systems
      • What in particular?
      • nikki
        ah right
      • well when I was testing I noticed http://musicbrainz.org/search/textsearch.html?q... doesn't work when I would expect it to
      • doesn't work on the live one either
      • ijabz
        What's this, Hebrew, Arabic?
      • nikki
        but ・ is punctuation that should be treated like a space, hence why I wondered
      • japanese
      • katakana to be specific
      • ijabz
        is that character ONLY used in Katakana Japanese?
      • nikki
        it says "won fei" and I'd expect it to find faye wong, since she has "フェイ・ウォン" (fei won) as an alias
      • as far as I'm aware, yes
      • ijabz
        Because I think it could be converted like the i and ı http://bugs.musicbrainz.org/ticket/3916 fix
      • But Lucene itself won't out of the box do special processing for Japanese unless we tell it we are using Japanese and use a Japanese analyser, which we don't
      • nikki
        hmm... I would have expected it to treat all punctuation the same unless told to do something special with it :/
      • seems like a total pita to manually define how every punctuation character should work
      • ijabz
        Actually, you could be right
      • nikki
        hm, unless for some reason it's not actually defined as punctuation...
      • nikki checks out unicode
      • luks
        unac would be the wrong place to fix this
      • it should be fixed in the tokenizer
      • nikki
        ah, "punctuation, other", good to know
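nikki's finding can be verified programmatically: U+30FB (KATAKANA MIDDLE DOT) is in Unicode category Po, "Punctuation, other", so a tokenizer that splits on punctuation categories would separate the alias into two tokens:

```python
import unicodedata

ch = "\u30fb"  # the katakana middle dot discussed above
print(unicodedata.category(ch))  # -> Po ("Punctuation, other")
print(unicodedata.name(ch))      # -> KATAKANA MIDDLE DOT

# A naive tokenizer that splits on any Unicode punctuation category ("P*")
# would split the alias into two searchable tokens:
def split_on_punct(text):
    tokens, cur = [], ""
    for c in text:
        if unicodedata.category(c).startswith("P"):
            if cur:
                tokens.append(cur)
            cur = ""
        else:
            cur += c
    if cur:
        tokens.append(cur)
    return tokens

print(split_on_punct("フェイ・ウォン"))  # -> ['フェイ', 'ウォン']
```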
      • ijabz
        nikki: OK, can you raise a separate issue and I'll look at it when I can
      • nikki
        sure
      • ijabz
        luks: Can you point me to some code in NGS that converts info from the database into XML?
      • luks
        there isn't such code yet
      • ijabz
        So what has been done wrt this Thrift/JSON idea?
      • luks
        nothing
      • ijabz
        is this aCID2's remit?
      • luks
        well, before mb_server can support anything, it must be implemented in the search server
      • ijabz
        What do you want in the search_server, please step me through the concept
      • luks
        that's the part I'm not sure about
      • the obvious starting point would be updating the index code to work with the NGS database
      • I'm not sure how you work with the mmd package for generating XML, so I can't suggest whether it's easier to adapt the code or return raw data (in any format)
      • ijabz
        Yes, that would be my first step, that is completely uncontroversial and can be done once murdos has created a new branch
      • mmd is VERY easy, if I had a Relax NG schema for the new entities I could generate classes that adhere to the new schema within an hour
      • You've gone quiet, so I'll continue. I would create a series of classes from the schema using JAXB; these would be loaded into the project as MMD2, then the XML code would use these classes and the search results to populate a Metadata class that is then marshalled as XML
      • The issue was you still need code on mb_server for converting output to XML when looking up a single entity, because mb_server doesn't use the search server in this case
      • and if you didn't store data in Lucene then the search server wouldn't have the data to create the XML, UNLESS the search server could talk to the database
      • luks
        of course we need code to produce XML on the search server
      • ijabz
        which is one reason why I posted the idea about the search_server actually being expanded to become the webservice
      • luks: I don't get your comment
      • luks
        "The issue was you still need code on mb_server for converting output to xml ..."
      • the xml webservice does more than static xml per entity
      • you can ask it to include various data
      • the code doesn't belong to the search server
      • I understand it would be useful to have an application to run a standalone WS server on a DB slave