luks: Hi, I'm trying again to get my head round your single-index Lucene search indexing
Why do you think it would be quicker to get a match if artist_index, release_index ... were all put into a single index?
luks
ijabz: because the indexes would be smaller
ijabz: I expect they would be so small that they can fit completely in RAM
ijabz
Why would they be smaller, I don't get it
luks
because you don't store any literal strings
do you know more or less how libraries like Lucene index text?
ijabz
That's a different point, not storing the values (which I was going to come to)
luks
you can ignore the post if you want to store values
that completely changes the situation
there would still be some size savings
ijabz
I'm just trying to work through this, let's put it another way then. Why didn't you suggest not storing values but keeping the separate indexes?
luks
but I don't expect searches to be faster
because we already have framework for loading this data
most of the frequently used data we have also available on memcached servers
oh
well, because I want the indexes to share data
let's say you have artist index with artist names
you need the same data in track index
so basically the way it currently is: track index = track data + release index + artist index
there are several advantages to this
ijabz
I see, so you're thinking artist name is interned() and can be used by track and artist, but I would think this would apply only to stored data, not on the searched data
luks
like you can use artist aliases in track searches for almost no cost
it's not "interned"
it's indexes
these systems work like this: they tokenize the input, and build a dictionary
then the index is a list of pointers to the dictionary
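[The tokenize-then-dictionary scheme luks describes can be sketched in a few lines of plain Java. This is a conceptual toy, not Lucene's actual implementation (which adds positions, compression, segment files, and much more); the class, field and document names are made up for illustration:]

```java
import java.util.*;

public class InvertedIndexSketch {
    // term dictionary: token -> posting list of document ids
    static Map<String, List<Integer>> index = new HashMap<>();

    static void add(int docId, String text) {
        // trivially tokenize on whitespace, lowercased
        for (String token : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
        }
    }

    static List<Integer> search(String token) {
        return index.getOrDefault(token.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        add(1, "The Beatles");       // hypothetical artist document
        add(2, "Beatles For Sale");  // hypothetical release document
        System.out.println(search("beatles")); // [1, 2]
    }
}
```

[The point of the dictionary is that the literal string "beatles" exists once, no matter how many documents contain it; the postings are just small integers pointing back at documents.]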
lucene is not trying to optimize stored data
just look at the track index, how many times you will find duplicated artist names there
seriously, postgresql is a better database than lucene
the tickets about not having data available in search results could be trivially solved by this
you could even do things like /ws/1/artist?query=Foo&inc=tags
there are several advantages, and I can't think of a disadvantage
ijabz
hang on, I'm not onto the storing of fields yet
luks
if you are concerned about speed, I meant to use filters, not search query
these filters are kept in memory and optimized specifically for cases like this
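[The in-memory filters luks mentions can be illustrated without Lucene internals: conceptually, a filter is a cached bitset of doc ids matching a fixed constraint, ANDed with the raw query hits. A sketch under that assumption (doc ids and the "artist" constraint are invented for the example; real Lucene has its own filter classes):]

```java
import java.util.BitSet;

public class FilterSketch {
    // cached once, kept in memory: pretend docs 2, 5 and 7 are artists
    static BitSet artistFilter = new BitSet();
    static {
        artistFilter.set(2); artistFilter.set(5); artistFilter.set(7);
    }

    // restrict raw query hits to the cached filter
    static BitSet apply(BitSet queryHits) {
        BitSet out = (BitSet) queryHits.clone();
        out.and(artistFilter);   // cheap in-memory bitwise intersection
        return out;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(1); hits.set(5); hits.set(7); hits.set(9);
        System.out.println(apply(hits)); // {5, 7}
    }
}
```

[Because the bitset is computed once and reused, applying the constraint costs a word-wise AND rather than a fresh term lookup per query.]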
ijabz
If you take your example, yes an artist may be used by 50 tracks, but if each track can reuse the artist token, it is only in the track index once, the same as if the artist and track index were merged.
luks
if you mean indexed data, then yes it does
but you still have the same data in the track index and the artist index
ijabz
But so what
luks
it could de-duplicate even stored data, but it doesn't
you are wasting RAM
why keep duplicate data in cache?
the track index needs dictionary from artist and release indexes, the release index needs dictionary from the artist index
why not have it only once?
the idea is to have almost everything in RAM
so you only rarely touch the disk
the track index with stored values is 3GB, as you mentioned in a trac ticket
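[The RAM-saving idea luks is arguing for is that a term dictionary shared by all the indexes stores each string once, with the per-index postings holding only small ids. A minimal sketch of that sharing (names and terms invented for illustration):]

```java
import java.util.*;

public class SharedDictionarySketch {
    // one term dictionary shared by every index: term -> small int id
    static Map<String, Integer> dict = new HashMap<>();

    static int termId(String term) {
        // the string is stored at most once, regardless of how many
        // indexes or documents reference it
        return dict.computeIfAbsent(term, t -> dict.size());
    }

    public static void main(String[] args) {
        int fromArtistIndex = termId("beatles"); // referenced by the artist index
        int fromTrackIndex  = termId("beatles"); // referenced by the track index
        System.out.println(fromArtistIndex == fromTrackIndex); // true
        System.out.println(dict.size()); // 1
    }
}
```

[With separate indexes, each keeps its own copy of "beatles" in its dictionary; with one combined index the string sits in RAM once.]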
ijabz
and without ?
luks
I'm sure the whole combined index without stored values would be smaller
ijabz
And you think this is going to improve search speed or not
luks
:)
I'm pretty sure that if everything is in RAM, it's going to improve search speed
ijabz
Possibly, I thought that with my testing done a few months ago but it didn't seem to be the case
luks
the stored values mostly cost you disk seek times
they are spread around the disk
ijabz
I suppose with 64-bit, there's no practical RAM limit
luks
are you seriously suggesting to keep duplicate data in RAM as a better solution?
I mean, I'm not asking anybody to do any work on this, but it seems pretty clear to me that the less duplication you have in RAM, the better
ijabz
No I'm not, I'm only suggesting that this might not make much difference
luks
I'd be ok with that
if the performance doesn't change, and I have more features available, it's a win situation anyway
ijabz
It's worth an experiment, but you really need to benchmark it against the existing system, because I think you're underplaying the cost of the database retrievals
luks
ijabz: I'm pretty sure loading data from a replicated DB slave and a memcached server is a more scalable solution than having the data on the same machine
the indexes are updated daily, one hour off data on the DB slave is fine
ijabz
So what's in memcache?
luks
in NGS: artists, labels, artist credits and ARs
release groups will be probably added too
there are other things, but these are easily reusable
ijabz
But these are small things, the big things are releases, tracks and relationship details, these aren't going to fit, are they?
luks
these are small, but very frequently used things
note that for search results you don't need that much data
in release search results, all you have to do is load the releases from the DB and load artist credits and labels from memcache
that's a single DB query, 2 memcached queries
both are on separate servers, optimally using all kinds of local caches
ijabz
As opposed to 1 Lucene query, and if you allow free searches (as you suggested) it could get a lot more complex
luks
note that the 1 lucene query is more expensive
because you need to do 1 lucene query, and then seek all over the index to get the document data
if you have 5 lucene indexes on a single machine, they will all fight for IO resources and disk caches
ijabz
Maybe, neither of us knows for sure, the new Lucene is much faster than the version you were using before. I don't oppose your idea, but don't you think a sensible first stage would be to get some figures for the work I've just finished, then make changes as you suggested, and get some comparisons?
luks
I was not suggesting you do anything :)
(I'm pretty sure actually)
I posted the mail a long time ago, and it was just a "would be nice" kind of idea
I have lots of those, but no time to implement them :)
but murdos wants to work on NGS support in the search server, and that will add even more duplicate data this way
ijabz
Murdos is looking at it, and I'm trying to continue working on NGS but I'm very confused about the direction we are heading
Nobody having much time was one of my points, I thought this would be better looked at after we have a workable NGS solution
luks
I was really not trying to suggest that anybody should try to implement this
ijabz
lol, that's exactly what I thought you were suggesting
So what do you think should be the approach for NGS ?
luks
I was asking murdos for an opinion on this, because I know he knows the NGS schema well enough
I'm not sure about the right approach
ijabz
luks: tell you what, I've got to go out and we've been chatting for an hour now, perhaps we could talk a bit more late afternoon?
luks
ok, I should be here
nikki wonders if ijabz is back yet
ijabz
nikki: hi
nikki
hey
I was wondering how the search handles punctuation
ijabz
I haven't looked at this in any detail, I was concentrating on replicating the existing systems
But Lucene itself won't out of the box do special processing for Japanese unless we tell it we are using Japanese and use a Japanese analyser, which we don't
nikki
hmm... I would have expected it to treat all punctuation the same unless told to do something special with it :/
seems like a total pita to manually define how every punctuation character should work
ijabz
Actually, you could be right
nikki
hm, unless for some reason it's not actually defined as punctuation...
nikki checks out unicode
luks
unac would be the wrong place to fix this
it should be fixed in the tokenizer
nikki
ah, "punctuation, other", good to know
ijabz
nikki: OK, can you raise a separate issue and I'll look at it when I can
nikki
sure
ijabz
luks: Can you point to some code in NGS that converts info from the database into XML?
luks
there isn't such code yet
ijabz
So what is done wrt this thrift/json idea?
luks
nothing
ijabz
is this aCID2's remit?
luks
well, before mb_server can support anything, it must be implemented in the search server
ijabz
What do you want in the search_server, please step me through the concept
luks
that's the part I'm not sure about
the obvious starting point would be updating the index code to work with the NGS database
I'm not sure how you work with the mmd package for generating XML, so I can't suggest whether it's easier to adapt the code or return raw data (in any format)
ijabz
Yes, that would be my first step, that is completely uncontroversial and can be done once murdos has created a new branch
mmd is VERY easy, if I had a Relax NG schema for the new entities I could generate classes that adhered to the new schema within an hour
You've gone quiet, so I'll continue. I would create a series of classes from the schema using JAXB; these would be loaded into the project as MMD2, then the XML code would use these classes and the search results to populate a Metadata class that is then marshalled as XML
The issue was you still need code on mb_server for converting output to XML when looking up a single entity, because mb_server doesn't use the search server in this case
and if you didn't store data in Lucene then the search server wouldn't have the data to create the XML, UNLESS the search server could talk to the database
luks
of course we need code to produce XML on the search server
ijabz
which is one reason why I posted the idea about the search_server actually being expanded to become the webservice
luks: I don't get your comment
luks
"The issue was you still need code on mb_server for converting output to xml ..."
the xml webservice does more than static xml per entity
you can ask it to include various data
the code doesn't belong to the search server
I understand it would be useful to have an application to run a standalone WS server on a DB slave