luks: Hi, I'm trying again to get my head round your single-index Lucene search indexing
2009-09-01 24433, 2009
ijabz
Why do you think it would be quicker to get a match if artist_index, release_index ... were all put into a single index?
2009-09-01 24434, 2009
luks
ijabz: because the indexes would be smaller
2009-09-01 24456, 2009
luks
ijabz: I expect they would be so small that they can fit completely in RAM
2009-09-01 24435, 2009
ijabz
Why would they be smaller, I don't get it
2009-09-01 24448, 2009
luks
because you don't store any literal strings
2009-09-01 24403, 2009
luks
do you know more or less how libraries like Lucene index text?
2009-09-01 24418, 2009
ijabz
That's a different point, not storing the values (which I was going to come to)
2009-09-01 24403, 2009
luks
you can ignore the post if you want to store values
2009-09-01 24413, 2009
luks
that completely changes the situation
2009-09-01 24446, 2009
luks
there would still be some size savings
2009-09-01 24451, 2009
ijabz
I'm just trying to work through this, let's put it another way then. Why didn't you suggest not storing values but keeping the separate indexes?
2009-09-01 24451, 2009
luks
but I don't expect searches to be faster
2009-09-01 24412, 2009
luks
because we already have framework for loading this data
2009-09-01 24440, 2009
luks
most of the frequently used data we also have available on memcached servers
2009-09-01 24400, 2009
luks
oh
2009-09-01 24408, 2009
luks
well, because I want the indexes to share data
2009-09-01 24418, 2009
luks
let's say you have artist index with artist names
2009-09-01 24425, 2009
luks
you need the same data in track index
2009-09-01 24456, 2009
luks
so basically the way it currently is: track index = track data + release index + artist index
2009-09-01 24424, 2009
luks
there are several advantages to this
2009-09-01 24426, 2009
ijabz
I see, so you're thinking artist name is interned() and can be used by track and artist, but I would think this would apply only to stored data, not to the searched data
2009-09-01 24440, 2009
luks
like you can use artist aliases in track searches for almost no cost
2009-09-01 24455, 2009
luks
it's not "interned"
2009-09-01 24458, 2009
luks
it's indexes
2009-09-01 24421, 2009
luks
these systems work like this: they tokenize the input, and build a dictionary
2009-09-01 24429, 2009
luks
then the index is a list of pointers to the dictionary
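
[Editor's sketch] A toy model of the tokenize / dictionary / pointer structure luks describes here. It is a simplification under stated assumptions, not Lucene's actual format: `TinyIndex`, `tokenize`, `term_ids` and `postings` are hypothetical names, and the "pointers" are plain integer term ids.

```python
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer for illustration: lowercase and split on whitespace.
    return text.lower().split()


class TinyIndex:
    """Toy inverted index: a term dictionary plus postings keyed by term id."""

    def __init__(self):
        self.term_ids = {}                # dictionary: term -> integer id
        self.postings = defaultdict(set)  # term id -> ids of documents containing it

    def add(self, doc_id, text):
        for token in tokenize(text):
            # Each distinct term is stored once, however many documents use it.
            term_id = self.term_ids.setdefault(token, len(self.term_ids))
            self.postings[term_id].add(doc_id)

    def search(self, term):
        term_id = self.term_ids.get(term.lower())
        if term_id is None:
            return []
        return sorted(self.postings[term_id])
```

Adding fifty tracks credited to the same artist grows the postings for that artist's terms, but the dictionary still holds each term once. That is the sense in which indexed data is de-duplicated while stored field values are not.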
2009-09-01 24450, 2009
luks
lucene is not trying to optimize stored data
2009-09-01 24404, 2009
luks
just look at the track index how many time you will find duplicated artist names there
2009-09-01 24409, 2009
luks
+s
2009-09-01 24447, 2009
luks
seriously, postgresql is a better database than lucene
2009-09-01 24433, 2009
luks
the tickets about not having data available in search results could be trivially solved by this
2009-09-01 24450, 2009
luks
you could even do things like /ws/1/artist?query=Foo&inc=tags
2009-09-01 24420, 2009
luks
there are several advantages, and I can't think of a disadvantage
2009-09-01 24433, 2009
ijabz
hang on, I'm not onto the storing of fields yet
2009-09-01 24445, 2009
luks
if you are concerned about speed, I meant to use filters, not search query
2009-09-01 24401, 2009
luks
these filters are kept in memory and optimized specifically for cases like this
2009-09-01 24403, 2009
ijabz
If you take your example, yes, an artist may be used by 50 tracks, but if each track can reuse the artist token, it is only in the track index once, the same as if the artist and track indexes were merged.
2009-09-01 24400, 2009
luks
if you mean indexed data, then yes it does
2009-09-01 24418, 2009
luks
but you still have the same data in the track index and the artist index
2009-09-01 24432, 2009
ijabz
But so what
2009-09-01 24433, 2009
luks
it could de-duplicate even stored data, but it doesn't
2009-09-01 24445, 2009
luks
you are wasting RAM
2009-09-01 24452, 2009
luks
why keep duplicate data in cache?
2009-09-01 24431, 2009
luks
the track index needs dictionary from artist and release indexes, the release index needs dictionary from the artist index
2009-09-01 24436, 2009
luks
why not have it only once?
2009-09-01 24435, 2009
luks
the idea is to have almost everything in RAM
2009-09-01 24443, 2009
luks
so you only rarely touch the disk
2009-09-01 24424, 2009
luks
the track index with stored values is 3GB, as you mentioned in a trac ticket
2009-09-01 24439, 2009
ijabz
and without ?
2009-09-01 24439, 2009
luks
I'm sure the whole combined index without stored values would be smaller
2009-09-01 24454, 2009
ijabz
And you think this is going to improve search speed or not
2009-09-01 24417, 2009
luks
:)
2009-09-01 24436, 2009
luks
I'm pretty sure that if everything is in RAM, it's going to improve search speed
2009-09-01 24457, 2009
ijabz
Possibly, I thought that when I did my testing a few months ago, but it didn't seem to be the case
2009-09-01 24458, 2009
luks
the stored values mostly cost you disk seek times
2009-09-01 24410, 2009
luks
they are spread around the disk
2009-09-01 24425, 2009
ijabz
I suppose with 64-bit, there's no practical RAM limit
2009-09-01 24402, 2009
luks
are you seriously suggesting to keep duplicate data in RAM as a better solution?
2009-09-01 24440, 2009
luks
I mean, I'm not asking anybody to do any work on this, but it seems pretty clear to me that the less duplication you have in RAM, the better
2009-09-01 24403, 2009
ijabz
No I'm not, I'm only suggesting that this might not make much difference
2009-09-01 24447, 2009
luks
I'd be ok with that
2009-09-01 24418, 2009
luks
if the performance doesn't change, and I have more features available, it's a win situation anyway
2009-09-01 24430, 2009
ijabz
It's worth an experiment, but you really need to benchmark it against the existing system because I think you're underplaying the cost of the database retrievals
2009-09-01 24427, 2009
luks
ijabz: I'm pretty sure loading data from a replicated DB slave and a memcached server is a more scalable solution than having the data on the same machine
2009-09-01 24456, 2009
luks
the indexes are updated daily, one hour off data on the DB slave is fine
2009-09-01 24444, 2009
ijabz
So what's in memcache?
2009-09-01 24426, 2009
luks
in NGS: artists, labels, artist credits and ARs
2009-09-01 24434, 2009
luks
release groups will probably be added too
2009-09-01 24456, 2009
luks
there are other things, but these are easily reusable
2009-09-01 24425, 2009
ijabz
But these are small things, the big things are releases, tracks and relationship details, these aren't going to fit, are they?
2009-09-01 24445, 2009
luks
these are small, but very frequently used things
2009-09-01 24453, 2009
luks
note that for search results you don't need that much data
2009-09-01 24429, 2009
luks
in release search results, all you have to do is load the releases from the DB and load artist credits and labels from memcache
2009-09-01 24449, 2009
luks
that's a single DB query, 2 memcached queries
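
[Editor's sketch] A rough illustration of the "one DB query, two memcached queries" pattern for release search results. Everything here is an assumption made for the example: `FakeDB` and `FakeCache` are in-memory stand-ins that only count round trips, and the key formats and field names are invented rather than taken from mb_server or NGS.

```python
class FakeDB:
    """Hypothetical stand-in for a replicated DB slave; counts round trips."""

    def __init__(self, releases):
        self.releases = releases
        self.queries = 0

    def fetch_releases(self, ids):
        self.queries += 1  # one batched query for all matched releases
        return [self.releases[i] for i in ids if i in self.releases]


class FakeCache:
    """Hypothetical stand-in for a memcached client with a multi-get."""

    def __init__(self, data):
        self.data = data
        self.queries = 0

    def get_multi(self, keys):
        self.queries += 1  # one round trip per multi-get
        return {k: self.data[k] for k in keys if k in self.data}


def load_release_results(release_ids, db, cache):
    # release_ids are what the search index returned; the index stores no values.
    rows = db.fetch_releases(release_ids)
    credits = cache.get_multi({"artist_credit:%d" % r["artist_credit"] for r in rows})
    labels = cache.get_multi({"label:%d" % r["label"] for r in rows})
    return [
        {
            "id": r["id"],
            "name": r["name"],
            "artist_credit": credits.get("artist_credit:%d" % r["artist_credit"]),
            "label": labels.get("label:%d" % r["label"]),
        }
        for r in rows
    ]
```

The round-trip count stays at three regardless of how many releases matched, because each lookup is batched.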
2009-09-01 24419, 2009
luks
both are on separate servers, optimally using all kinds of local caches
2009-09-01 24446, 2009
ijabz
As opposed to 1 lucene query, and if you allow free searches (as you suggested) it could get a lot more complex
2009-09-01 24403, 2009
luks
note that the 1 lucene query is more expensive
2009-09-01 24415, 2009
luks
because you need to do 1 lucene query, and then seek all over the index to get the document data
2009-09-01 24412, 2009
luks
if you have 5 lucene indexes on a single machine, they will all fight for IO resources and disk caches
2009-09-01 24412, 2009
ijabz
Maybe, neither of us knows for sure, the new Lucene is much faster than the version you were using before. I don't oppose your idea but don't you think a sensible first stage would be to get some figures for the work I've just finished, then make changes as you suggested, and get some comparisons?
2009-09-01 24447, 2009
luks
I was not suggesting you do anything :)
2009-09-01 24405, 2009
luks
(I'm pretty sure actually)
2009-09-01 24442, 2009
luks
I posted the mail a long time ago, and it was just a "would be nice" kind of idea
2009-09-01 24455, 2009
luks
I have lots of those, but no time to implement them :)
2009-09-01 24429, 2009
luks
but murdos wants to work on NGS support in the search server, and that will add even more duplicate data this way
2009-09-01 24448, 2009
ijabz
Murdos is looking at it, and I'm trying to continue working on NGS but I'm very confused about the direction we are heading
2009-09-01 24440, 2009
ijabz
Nobody having much time was one of my points, I thought this would be better looked at after we have a workable NGS solution
2009-09-01 24420, 2009
luks
I was really not trying to suggest that anybody should try to implement this
2009-09-01 24409, 2009
ijabz
lol, that's exactly what I thought you were suggesting
2009-09-01 24431, 2009
ijabz
So what do you think should be the approach for NGS ?
2009-09-01 24441, 2009
luks
I was asking murdos for an opinion on this, because I know he knows the NGS schema well enough
2009-09-01 24401, 2009
luks
I'm not sure about the right approach
2009-09-01 24422, 2009
ijabz
luks: tell you what, I've got to go out and we've been chatting for an hour now, perhaps we could talk a bit more late afternoon?
2009-09-01 24454, 2009
luks
ok, I should be here
2009-09-01 24426, 2009
nikki wonders if ijabz is back yet
2009-09-01 24459, 2009
ijabz
nikki:hi
2009-09-01 24408, 2009
nikki
hey
2009-09-01 24443, 2009
nikki
I was wondering how the search handles punctuation
2009-09-01 24442, 2009
ijabz
I haven't looked at this in any detail, I was concentrating on replicating the existing systems
But lucene itself won't do special processing for Japanese out of the box unless we tell it we are using Japanese and use a Japanese Analyzer, which we don't
2009-09-01 24448, 2009
nikki
hmm... I would have expected it to treat all punctuation the same unless told to do something special with it :/
2009-09-01 24415, 2009
nikki
seems like a total pita to manually define how every punctuation character should work
2009-09-01 24435, 2009
ijabz
Actually, you could be right
2009-09-01 24403, 2009
nikki
hm, unless for some reason it's not actually defined as punctuation...
2009-09-01 24409, 2009
nikki checks out unicode
2009-09-01 24440, 2009
luks
unac would be the wrong place to fix this
2009-09-01 24449, 2009
luks
it should be fixed in the tokenizer
2009-09-01 24409, 2009
nikki
ah, "punctuation, other", good to know
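
[Editor's sketch] One way a tokenizer could treat all punctuation uniformly, as nikki expected, is to split on every character whose Unicode general category starts with "P" (which includes "Po", "punctuation, other"). This is a minimal illustration using Python's standard `unicodedata` module; it is not the search server's tokenizer, and `tokenize` is a hypothetical name.

```python
import unicodedata


def tokenize(text):
    """Split on whitespace and on any character in a Unicode punctuation category."""
    tokens, current = [], []
    for ch in text:
        # Categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch.lower())
    if current:
        tokens.append("".join(current))
    return tokens
```

With this rule no punctuation character needs to be listed by hand; a character only slips through if Unicode does not classify it as punctuation at all.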
2009-09-01 24437, 2009
ijabz
nikki: OK, can you raise a separate issue and I'll look at it when I can
2009-09-01 24443, 2009
nikki
sure
2009-09-01 24446, 2009
ijabz
luks: Can you point to some code in NGS that converts info from the database into XML?
2009-09-01 24413, 2009
luks
there isn't such code yet
2009-09-01 24428, 2009
ijabz
So what has been done wrt this thrift/json idea?
2009-09-01 24451, 2009
luks
nothing
2009-09-01 24407, 2009
ijabz
is this aCID2's remit?
2009-09-01 24453, 2009
luks
well, before mb_server can support anything, it must be implemented in the search server
2009-09-01 24405, 2009
ijabz
What do you want in the search_server, please step me through the concept
2009-09-01 24409, 2009
luks
that's the part I'm not sure about
2009-09-01 24431, 2009
luks
the obvious starting point would be updating the index code to work with the NGS database
2009-09-01 24420, 2009
luks
I don't know enough about how you work with the mmd package for generating XML to suggest whether it's easier to adapt the code or return raw data (in any format)
2009-09-01 24424, 2009
ijabz
Yes, that would be my first step, that is completely uncontroversial and can be done once murdos has created a new branch
2009-09-01 24423, 2009
ijabz
mmd is VERY easy, if I had a Relax NG schema for the new entities I could generate classes that adhered to the new schema within an hour
2009-09-01 24441, 2009
ijabz
You've gone quiet, so I'll continue. I would create a series of classes from the schema using JAXB, these would be loaded into the project as MMD2, then the xml code would use these classes and the search results to populate a Metadata class that is then marshalled as xml
2009-09-01 24418, 2009
ijabz
The issue was you still need code on mb_server for converting output to xml when looking up a single entity, because mb_server doesn't use the search server in this case
2009-09-01 24436, 2009
ijabz
and if you didn't store data in lucene then the search server wouldn't have the data to create the xml, UNLESS the search server could talk to the database
2009-09-01 24445, 2009
luks
of course we need code to produce XML on the search server
2009-09-01 24447, 2009
ijabz
which is one reason why I posted the idea about the search_server actually being expanded to become the webservice
2009-09-01 24455, 2009
ijabz
luks: I don't get your comment
2009-09-01 24412, 2009
luks
"The issue was you still need code on mb_server for converting output to xml ..."
2009-09-01 24415, 2009
luks
the xml webservice does more than static xml per entity
2009-09-01 24425, 2009
luks
you can ask it to include various data
2009-09-01 24435, 2009
luks
the code doesn't belong to the search server
2009-09-01 24406, 2009
luks
I understand it would be useful to have an application to run a standalone WS server on a DB slave