luks: Hi, I'm trying again to get my head round your single-index Lucene search indexing
Why do you think it would be quicker to get a match if artist_index, release_index ... were all put into a single index?
luks
ijabz: because the indexes would be smaller
ijabz: I expect they would be so small that they can fit completely in RAM
ijabz
Why would they be smaller, I don't get it
luks
because you don't store any literal strings
do you know more or less how libraries like Lucene index text?
ijabz
That's a different point, not storing the values (which I was going to come to)
luks
you can ignore the post if you want to store values
that completely changes the situation
there would still be some size savings
ijabz
I'm just trying to work through this, let's put it another way then. Why didn't you suggest not storing values but keeping the separate indexes?
luks
but I don't expect searches to be faster
because we already have framework for loading this data
most of the frequently used data we have also available on memcached servers
oh
well, because I want the indexes to share data
let's say you have artist index with artist names
you need the same data in track index
so basically the way it currently is: track index = track data + release index + artist index
there are several advantages to this
ijabz
I see, so you're thinking artist name is interned() and can be used by track and artist, but I would think this would apply only to stored data, not on the searched data
luks
like you can use artist aliases in track searches for almost no cost
it's not "interned"
it's indexes
these systems work like this: they tokenize the input, and build a dictionary
then the index is a list of pointers to the dictionary
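[The tokenize-then-dictionary scheme luks describes can be sketched in a few lines of plain Java. This is a conceptual toy, not Lucene's actual implementation (which adds positions, compression, segment files, and much more); the class, field and document names are made up for illustration:]

```java
import java.util.*;

public class InvertedIndexSketch {
    // term dictionary: token -> posting list of document ids
    static Map<String, List<Integer>> index = new HashMap<>();

    static void add(int docId, String text) {
        // trivially tokenize on whitespace, lowercased
        for (String token : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(token, t -> new ArrayList<>()).add(docId);
        }
    }

    static List<Integer> search(String token) {
        return index.getOrDefault(token.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        add(1, "The Beatles");       // hypothetical artist document
        add(2, "Beatles For Sale");  // hypothetical release document
        System.out.println(search("beatles")); // [1, 2]
    }
}
```

[The point of the dictionary is that the literal string "beatles" exists once, no matter how many documents contain it; the postings are just small integers pointing back at documents.]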
lucene is not trying to optimize stored data
just look at the track index, how many times you will find duplicated artist names there
seriously, postgresql is a better database than lucene
the tickets about not having data available in search results could be trivially solved by this
you could even do things like /ws/1/artist?query=Foo&inc=tags
there are several advantages, and I can't think of a disadvantage
ijabz
hang on, I'm not onto the storing of fields yet
luks
if you are concerned about speed, I meant to use filters, not search query
these filters are kept in memory and optimized specifically for cases like this
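[The in-memory filters luks mentions can be illustrated without Lucene internals: conceptually, a filter is a cached bitset of doc ids matching a fixed constraint, ANDed with the raw query hits. A sketch under that assumption (doc ids and the "artist" constraint are invented for the example; real Lucene has its own filter classes):]

```java
import java.util.BitSet;

public class FilterSketch {
    // cached once, kept in memory: pretend docs 2, 5 and 7 are artists
    static BitSet artistFilter = new BitSet();
    static {
        artistFilter.set(2); artistFilter.set(5); artistFilter.set(7);
    }

    // restrict raw query hits to the cached filter
    static BitSet apply(BitSet queryHits) {
        BitSet out = (BitSet) queryHits.clone();
        out.and(artistFilter);   // cheap in-memory bitwise intersection
        return out;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(1); hits.set(5); hits.set(7); hits.set(9);
        System.out.println(apply(hits)); // {5, 7}
    }
}
```

[Because the bitset is computed once and reused, applying the constraint costs a word-wise AND rather than a fresh term lookup per query.]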
ijabz
If you take your example, yes an artist may be used by 50 tracks, but if each track can reuse the artist token, it is only in the track index once, the same as if the artist and track index were merged.
luks
if you mean indexed data, then yes it does
but you still have the same data in the track index and the artist index
ijabz
But so what
luks
it could de-duplicate even stored data, but it doesn't
you are wasting RAM
why keep duplicate data in cache?
the track index needs dictionary from artist and release indexes, the release index needs dictionary from the artist index
why not have it only once?
the idea is to have almost everything in RAM
so you only rarely touch the disk
the track index with stored values is 3GB, as you mentioned in a trac ticket
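[The RAM-saving idea luks is arguing for is that a term dictionary shared by all the indexes stores each string once, with the per-index postings holding only small ids. A minimal sketch of that sharing (names and terms invented for illustration):]

```java
import java.util.*;

public class SharedDictionarySketch {
    // one term dictionary shared by every index: term -> small int id
    static Map<String, Integer> dict = new HashMap<>();

    static int termId(String term) {
        // the string is stored at most once, regardless of how many
        // indexes or documents reference it
        return dict.computeIfAbsent(term, t -> dict.size());
    }

    public static void main(String[] args) {
        int fromArtistIndex = termId("beatles"); // referenced by the artist index
        int fromTrackIndex  = termId("beatles"); // referenced by the track index
        System.out.println(fromArtistIndex == fromTrackIndex); // true
        System.out.println(dict.size()); // 1
    }
}
```

[With separate indexes, each keeps its own copy of "beatles" in its dictionary; with one combined index the string sits in RAM once.]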
ijabz
and without ?
luks
I'm sure the whole combined index without stored values would be smaller
ijabz
And you think this is going to improve search speed or not
luks
:)
I'm pretty sure that if everything is in RAM, it's going to improve search speed
ijabz
Possibly, I thought that with my testing done a few months ago but it didn't seem to be the case
luks
the stored values mostly cost you disk seek times
they are spread around the disk
ijabz
I suppose with 64-bit, there's no practical RAM limit
luks
are you seriously suggesting to keep duplicate data in RAM as a better solution?
I mean, I'm not asking anybody to do any work on this, but it seems pretty clear to me that the less duplication you have in RAM, the better
ijabz
No I'm not, I'm only suggesting that this might not make much difference
luks
I'd be ok with that
if the performance doesn't change, and I have more features available, it's a win situation anyway
ijabz
It's worth an experiment, but you really need to benchmark it against the existing system, because I think you're underplaying the cost of the database retrievals
luks
ijabz: I'm pretty sure loading data from a replicated DB slave and a memcached server is a more scalable solution than having the data on the same machine
the indexes are updated daily, one hour off data on the DB slave is fine
ijabz
So what's in memcache?
luks
in NGS: artists, labels, artist credits and ARs
release groups will be probably added too
there are other things, but these are easily reusable
ijabz
But these are small things, the big things are releases, tracks and relationship details, these aren't going to fit, are they?
luks
these are small, but very frequently used things
note that for search results you don't need that much data
in release search results, all you have to do is load the releases from the DB and load artist credits and labels from memcache
that's a single DB query, 2 memcached queries
both are on separate servers, optimally using all kinds of local caches
ijabz
As opposed to 1 Lucene query, and if you allow free searches (as you suggested) it could get a lot more complex
luks
note that the 1 lucene query is more expensive
because you need to do 1 lucene query, and then seek all over the index to get the document data
if you have 5 lucene indexes on a single machine, they will all fight for IO resources and disk caches
ijabz
Maybe, neither of us knows for sure, the new Lucene is much faster than the version you were using before. I don't oppose your idea, but don't you think a sensible first stage would be to get some figures for the work I've just finished, then make changes as you suggested, and get some comparisons?
luks
I was not suggesting you do anything :)
(I'm pretty sure actually)
I posted the mail a long time ago, and it was just a "would be nice" kind of idea
I have lots of those, but no time to implement them :)
but murdos wants to work on NGS support in the search server, and that will add even more duplicate data this way
ijabz
Murdos is looking at it, and I'm trying to continue working on NGS but I'm very confused about the direction we are heading
Nobody having much time was one of my points, I thought this would be better looked at after we have a workable NGS solution
luks
I was really not trying to suggest that anybody should try to implement this
ijabz
lol, that's exactly what I thought you were suggesting
So what do you think should be the approach for NGS ?
luks
I was asking murdos for an opinion on this, because I know he knows the NGS schema well enough
I'm not sure about the right approach
ijabz
luks: tell you what, I've got to go out and we've been chatting for an hour now, perhaps we could talk a bit more late afternoon?
luks
ok, I should be here
nikki wonders if ijabz is back yet
ijabz
nikki: hi
nikki
hey
I was wondering how the search handles punctuation
ijabz
I haven't looked at this in any detail, I was concentrating on replicating the existing systems
But Lucene itself won't out of the box do special processing for Japanese unless we tell it we are using Japanese and use a Japanese analyser, which we don't
nikki
hmm... I would have expected it to treat all punctuation the same unless told to do something special with it :/
seems like a total pita to manually define how every punctuation character should work
ijabz
Actually, you could be right
nikki
hm, unless for some reason it's not actually defined as punctuation...
nikki checks out unicode
luks
unac would be the wrong place to fix this
it should be fixed in the tokenizer
nikki
ah, "punctuation, other", good to know
ijabz
nikki: OK, can you raise a separate issue and I'll look at it when I can
nikki
sure
ijabz
luks: Can you point to some code in NGS that converts info from the database into XML?
luks
there isn't such code yet
ijabz
So what is done wrt this thrift/json idea?
luks
nothing
ijabz
is this aCID2's remit?
luks
well, before mb_server can support anything, it must be implemented in the search server
ijabz
What do you want in the search_server, please step me through the concept
luks
that's the part I'm not sure about
the obvious starting point would be updating the index code to work with the NGS database
I'm not sure how you work with the mmd package for generating XML, so I can't suggest whether it's easier to adapt the code or return raw data (in any format)
ijabz
Yes, that would be my first step, that is completely uncontroversial and can be done once murdos has created a new branch
mmd is VERY easy, if I had a Relax NG schema for the new entities I could generate classes that adhered to the new schema within an hour
You've gone quiet, so I'll continue. I would create a series of classes from the schema using JAXB; these would be loaded into the project as MMD2, then the XML code would use these classes and the search results to populate a Metadata class that is then marshalled as XML
The issue was you still need code on mb_server for converting output to XML when looking up a single entity, because mb_server doesn't use the search server in this case
and if you didn't store data in Lucene then the search server wouldn't have the data to create the XML, UNLESS the search server could talk to the database
luks
of course we need code to produce XML on the search server
ijabz
which is one reason why I posted the idea about the search_server actually being expanded to become the webservice
luks: I don't get your comment
luks
"The issue was you still need code on mb_server for converting output to xml ..."
the xml webservice does more than static xml per entity
you can ask it to include various data
the code doesn't belong to the search server
I understand it would be useful to have an application to run a standalone WS server on a DB slave