spill tree beats best-bin-first in toy data

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

spill tree beats best-bin-first in toy data

liuliu_0503
1. will there be any difference in real world data?
2. just notice that cvFindFeatures can find KNN with several points in
one-time, but further comparison didn't show that can do any good to
speed up (as for two points knn search, it takes 1400 ms, for one it
takes 700ms), is it just a convenience way or there should be any
speed-up?

following is my console output, I wrote cvSpillTree myself, the
best-bin-first search is the cvFeatureTree which implemented in OpenCV
1.1 pre.

--------console output--------

data generated
spill tree created in 28761 ms.
test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
-80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
-69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
-22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
-7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
-68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
spill tree approximation result in 672 ms:
---sim: 677.365960---user data: 152510---
---sim: 689.908410---user data: 66521---
---sim: 690.705895---user data: 479280---
---sim: 699.213164---user data: 488635---
---sim: 701.736654---user data: 58338---
---sim: 703.449575---user data: 390538---
---sim: 705.601652---user data: 185668---
---sim: 705.801436---user data: 498081---
---sim: 708.643109---user data: 22241---
---sim: 710.035490---user data: 74977---
kd tree created in 54792 ms.
best-bin-first approximation result in 706 ms:
---sim: 710.035490---user data: 74977---
---sim: 708.643109---user data: 22241---
---sim: 705.801436---user data: 498081---
---sim: 701.736654---user data: 58338---
---sim: 705.601652---user data: 185668---
---sim: 699.213164---user data: 488635---
---sim: 690.705895---user data: 479280---
---sim: 689.908410---user data: 66521---
---sim: 677.365960---user data: 152510---
---sim: 703.449575---user data: 390538---

--------console output end--------

Reply | Threaded
Open this post in threaded view
|

Re: spill tree beats best-bin-first in toy data

Xavier Delacour
On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
<[hidden email]> wrote:

Cool!

> 1. will there be any difference in real world data?

I can't say without having a closer look at how spill trees work. Off
hand it seems that performance on random data should approximate real
world performance pretty well, but of course that may not be true with
a particular feature generator.

> 2. just notice that cvFindFeatures can find KNN with several points in
> one-time, but further comparison didn't show that can do any good to
> speed up (as for two points knn search, it takes 1400 ms, for one it
> takes 700ms), is it just a convenience way or there should be any
> speed-up?

If you're talking about input points, then yes it is just for
convenience and time will grow linearly with their number. If output
points, then no, time should not vary with what you specify for k--
only with emax and the number of items in the tree.

>
> following is my console output, I wrote cvSpillTree myself, the
> best-bin-first search is the cvFeatureTree which implemented in OpenCV
> 1.1 pre.

It would be great if you could contribute your implementation, and/or
an LSH implementation if you have one.

Can you explain a bit more the numbers below, and what input data
you're using/generating? (what is test drive, sim, user data? are the
result times an aggregation of sim?)

Xavier

>
> --------console output--------
>
> data generated
> spill tree created in 28761 ms.
> test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
> 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
> -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
> 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
> 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
> -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
> -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
> 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
> -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
> 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
> 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
> -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
> spill tree approximation result in 672 ms:
> ---sim: 677.365960---user data: 152510---
> ---sim: 689.908410---user data: 66521---
> ---sim: 690.705895---user data: 479280---
> ---sim: 699.213164---user data: 488635---
> ---sim: 701.736654---user data: 58338---
> ---sim: 703.449575---user data: 390538---
> ---sim: 705.601652---user data: 185668---
> ---sim: 705.801436---user data: 498081---
> ---sim: 708.643109---user data: 22241---
> ---sim: 710.035490---user data: 74977---
> kd tree created in 54792 ms.
> best-bin-first approximation result in 706 ms:
> ---sim: 710.035490---user data: 74977---
> ---sim: 708.643109---user data: 22241---
> ---sim: 705.801436---user data: 498081---
> ---sim: 701.736654---user data: 58338---
> ---sim: 705.601652---user data: 185668---
> ---sim: 699.213164---user data: 488635---
> ---sim: 690.705895---user data: 479280---
> ---sim: 689.908410---user data: 66521---
> ---sim: 677.365960---user data: 152510---
> ---sim: 703.449575---user data: 390538---
>
> --------console output end--------
>
>
Reply | Threaded
Open this post in threaded view
|

Re: spill tree beats best-bin-first in toy data

liuliu_0503
"sim" means the distance of the two vector.
"test drive" is the vector to search with.
"user data" is the index of similar vector to verify the result.

p.s. I hope that cvFeatureTree bundle can be a collection of several
different tree-based feature search methods, ideally, we can do such:

CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
CvFeatureTree* bbf = cvCreateKDTree( lalalallala );

cvFindFeatures( sp, desc, sp_results, sp_dist, k );
cvFindFeatures( bbf, desc, sp_results, sp_dist, k );

/* do comparison then */

If that implemented in OpenCV would be great.

However, if someone come up with something like cvMergeFeatureTree(),
how to decide which put into "results" in cvFindFeatures? (another
words, how to decide the index which point to the original matrix)

happy to have a discussion with cvFeatureTree original contributor.

--- In [hidden email], "Xavier Delacour" <xavier.delacour@...>
wrote:

>
> On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
> <liuliu.1987+opencv@...> wrote:
>
> Cool!
>
> > 1. will there be any difference in real world data?
>
> I can't say without having a closer look at how spill trees work. Off
> hand it seems that performance on random data should approximate real
> world performance pretty well, but of course that may not be true with
> a particular feature generator.
>
> > 2. just notice that cvFindFeatures can find KNN with several points in
> > one-time, but further comparison didn't show that can do any good to
> > speed up (as for two points knn search, it takes 1400 ms, for one it
> > takes 700ms), is it just a convenience way or there should be any
> > speed-up?
>
> If you're talking about input points, then yes it is just for
> convenience and time will grow linearly with their number. If output
> points, then no, time should not vary with what you specify for k--
> only with emax and the number of items in the tree.
>
> >
> > following is my console output, I wrote cvSpillTree myself, the
> > best-bin-first search is the cvFeatureTree which implemented in OpenCV
> > 1.1 pre.
>
> It would be great if you could contribute your implementation, and/or
> an LSH implementation if you have one.
>
> Can you explain a bit more the numbers below, and what input data
> you're using/generating? (what is test drive, sim, user data? are the
> result times an aggregation of sim?)
>
> Xavier
>
> >
> > --------console output--------
> >
> > data generated
> > spill tree created in 28761 ms.
> > test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
> > 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
> > -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
> > 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
> > 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
> > -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
> > -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
> > 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
> > -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
> > 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
> > 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
> > -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
> > spill tree approximation result in 672 ms:
> > ---sim: 677.365960---user data: 152510---
> > ---sim: 689.908410---user data: 66521---
> > ---sim: 690.705895---user data: 479280---
> > ---sim: 699.213164---user data: 488635---
> > ---sim: 701.736654---user data: 58338---
> > ---sim: 703.449575---user data: 390538---
> > ---sim: 705.601652---user data: 185668---
> > ---sim: 705.801436---user data: 498081---
> > ---sim: 708.643109---user data: 22241---
> > ---sim: 710.035490---user data: 74977---
> > kd tree created in 54792 ms.
> > best-bin-first approximation result in 706 ms:
> > ---sim: 710.035490---user data: 74977---
> > ---sim: 708.643109---user data: 22241---
> > ---sim: 705.801436---user data: 498081---
> > ---sim: 701.736654---user data: 58338---
> > ---sim: 705.601652---user data: 185668---
> > ---sim: 699.213164---user data: 488635---
> > ---sim: 690.705895---user data: 479280---
> > ---sim: 689.908410---user data: 66521---
> > ---sim: 677.365960---user data: 152510---
> > ---sim: 703.449575---user data: 390538---
> >
> > --------console output end--------
> >
> >
>


Reply | Threaded
Open this post in threaded view
|

Re: Re: spill tree beats best-bin-first in toy data

Xavier Delacour
On Fri, Jan 9, 2009 at 6:16 AM, liuliu_0503
<[hidden email]> wrote:

> "sim" means the distance of the two vector.
> "test drive" is the vector to search with.
> "user data" is the index of similar vector to verify the result.
>
> p.s. I hope that cvFeatureTree bundle can be a collection of several
> different tree-based feature search methods, ideally, we can do such:
>
> CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
> CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
>
> cvFindFeatures( sp, desc, sp_results, sp_dist, k );
> cvFindFeatures( bbf, desc, sp_results, sp_dist, k );
>
> /* do comparison then */
>
> If that implemented in OpenCV would be great.

Yes that was the original idea, and also to provide some extra
functions/structures to permit external storage of the vectors
themselves (which becomes relevant with large databases).

I can do the integration (I am the author of these functions) if you
send me your files or publish them online somewhere. If you want to do
it, you should just prepare a patch and email it to the
opencvlibrary-devel list.

>
> However, if someone come up with something like cvMergeFeatureTree(),
> how to decide which put into "results" in cvFindFeatures? (another
> words, how to decide the index which point to the original matrix)

You mean chain the results together? so

CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
CvFeatureTree* m = cvMergeFeatureTree( sp, bbf );
cvFindFeatures( m, desc, sp_results, sp_dist, k );

in that case we could just sort the results by distance. Is that
really a useful function in practice?

Xavier

>
> happy to have a discussion with cvFeatureTree original contributor.
>
> --- In [hidden email], "Xavier Delacour" <xavier.delacour@...>
> wrote:
>
>>
>> On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
>> <liuliu.1987+opencv@...> wrote:
>>
>> Cool!
>>
>> > 1. will there be any difference in real world data?
>>
>> I can't say without having a closer look at how spill trees work. Off
>> hand it seems that performance on random data should approximate real
>> world performance pretty well, but of course that may not be true with
>> a particular feature generator.
>>
>> > 2. just notice that cvFindFeatures can find KNN with several points in
>> > one-time, but further comparison didn't show that can do any good to
>> > speed up (as for two points knn search, it takes 1400 ms, for one it
>> > takes 700ms), is it just a convenience way or there should be any
>> > speed-up?
>>
>> If you're talking about input points, then yes it is just for
>> convenience and time will grow linearly with their number. If output
>> points, then no, time should not vary with what you specify for k--
>> only with emax and the number of items in the tree.
>>
>> >
>> > following is my console output, I wrote cvSpillTree myself, the
>> > best-bin-first search is the cvFeatureTree which implemented in OpenCV
>> > 1.1 pre.
>>
>> It would be great if you could contribute your implementation, and/or
>> an LSH implementation if you have one.
>>
>> Can you explain a bit more the numbers below, and what input data
>> you're using/generating? (what is test drive, sim, user data? are the
>> result times an aggregation of sim?)
>>
>> Xavier
>>
>> >
>> > --------console output--------
>> >
>> > data generated
>> > spill tree created in 28761 ms.
>> > test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
>> > 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
>> > -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
>> > 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
>> > 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
>> > -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
>> > -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
>> > 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
>> > -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
>> > 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
>> > 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
>> > -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
>> > spill tree approximation result in 672 ms:
>> > ---sim: 677.365960---user data: 152510---
>> > ---sim: 689.908410---user data: 66521---
>> > ---sim: 690.705895---user data: 479280---
>> > ---sim: 699.213164---user data: 488635---
>> > ---sim: 701.736654---user data: 58338---
>> > ---sim: 703.449575---user data: 390538---
>> > ---sim: 705.601652---user data: 185668---
>> > ---sim: 705.801436---user data: 498081---
>> > ---sim: 708.643109---user data: 22241---
>> > ---sim: 710.035490---user data: 74977---
>> > kd tree created in 54792 ms.
>> > best-bin-first approximation result in 706 ms:
>> > ---sim: 710.035490---user data: 74977---
>> > ---sim: 708.643109---user data: 22241---
>> > ---sim: 705.801436---user data: 498081---
>> > ---sim: 701.736654---user data: 58338---
>> > ---sim: 705.601652---user data: 185668---
>> > ---sim: 699.213164---user data: 488635---
>> > ---sim: 690.705895---user data: 479280---
>> > ---sim: 689.908410---user data: 66521---
>> > ---sim: 677.365960---user data: 152510---
>> > ---sim: 703.449575---user data: 390538---
>> >
>> > --------console output end--------
>> >
>> >
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: spill tree beats best-bin-first in toy data

liuliu_0503
try this link to see if it works.

http://tech.groups.yahoo.com/group/OpenCV/files/spilltree.tar.bz2

--- In [hidden email], "Xavier Delacour" <xavier.delacour@...>
wrote:

>
> On Fri, Jan 9, 2009 at 6:16 AM, liuliu_0503
> <liuliu.1987+opencv@...> wrote:
> > "sim" means the distance of the two vector.
> > "test drive" is the vector to search with.
> > "user data" is the index of similar vector to verify the result.
> >
> > p.s. I hope that cvFeatureTree bundle can be a collection of several
> > different tree-based feature search methods, ideally, we can do such:
> >
> > CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
> > CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
> >
> > cvFindFeatures( sp, desc, sp_results, sp_dist, k );
> > cvFindFeatures( bbf, desc, sp_results, sp_dist, k );
> >
> > /* do comparison then */
> >
> > If that implemented in OpenCV would be great.
>
> Yes that was the original idea, and also to provide some extra
> functions/structures to permit external storage of the vectors
> themselves (which becomes relevant with large databases).
>
> I can do the integration (I am the author of these functions) if you
> send me your files or publish them online somewhere. If you want to do
> it, you should just prepare a patch and email it to the
> opencvlibrary-devel list.
>
> >
> > However, if someone come up with something like cvMergeFeatureTree(),
> > how to decide which put into "results" in cvFindFeatures? (another
> > words, how to decide the index which point to the original matrix)
>
> You mean chain the results together? so
>
> CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
> CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
> CvFeatureTree* m = cvMergeFeatureTree( sp, bbf );
> cvFindFeatures( m, desc, sp_results, sp_dist, k );
>
> in that case we could just sort the results by distance. Is that
> really a useful function in practice?
>
> Xavier
>
> >
> > happy to have a discussion with cvFeatureTree original contributor.
> >
> > --- In [hidden email], "Xavier Delacour" <xavier.delacour@>
> > wrote:
> >
> >>
> >> On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
> >> <liuliu.1987+opencv@> wrote:
> >>
> >> Cool!
> >>
> >> > 1. will there be any difference in real world data?
> >>
> >> I can't say without having a closer look at how spill trees work. Off
> >> hand it seems that performance on random data should approximate real
> >> world performance pretty well, but of course that may not be true
with
> >> a particular feature generator.
> >>
> >> > 2. just notice that cvFindFeatures can find KNN with several
points in
> >> > one-time, but further comparison didn't show that can do any
good to
> >> > speed up (as for two points knn search, it takes 1400 ms, for
one it

> >> > takes 700ms), is it just a convenience way or there should be any
> >> > speed-up?
> >>
> >> If you're talking about input points, then yes it is just for
> >> convenience and time will grow linearly with their number. If output
> >> points, then no, time should not vary with what you specify for k--
> >> only with emax and the number of items in the tree.
> >>
> >> >
> >> > following is my console output, I wrote cvSpillTree myself, the
> >> > best-bin-first search is the cvFeatureTree which implemented in
OpenCV

> >> > 1.1 pre.
> >>
> >> It would be great if you could contribute your implementation, and/or
> >> an LSH implementation if you have one.
> >>
> >> Can you explain a bit more the numbers below, and what input data
> >> you're using/generating? (what is test drive, sim, user data? are the
> >> result times an aggregation of sim?)
> >>
> >> Xavier
> >>
> >> >
> >> > --------console output--------
> >> >
> >> > data generated
> >> > spill tree created in 28761 ms.
> >> > test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21
30.16
> >> > 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00
-23.25 83.86
> >> > -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
> >> > 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
> >> > 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15
-64.19
> >> > -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
> >> > -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87
8.10
> >> > 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74
1.25 4.78
> >> > -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
> >> > 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81
14.18
> >> > 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53
98.34
> >> > -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00
91.67

> >> > spill tree approximation result in 672 ms:
> >> > ---sim: 677.365960---user data: 152510---
> >> > ---sim: 689.908410---user data: 66521---
> >> > ---sim: 690.705895---user data: 479280---
> >> > ---sim: 699.213164---user data: 488635---
> >> > ---sim: 701.736654---user data: 58338---
> >> > ---sim: 703.449575---user data: 390538---
> >> > ---sim: 705.601652---user data: 185668---
> >> > ---sim: 705.801436---user data: 498081---
> >> > ---sim: 708.643109---user data: 22241---
> >> > ---sim: 710.035490---user data: 74977---
> >> > kd tree created in 54792 ms.
> >> > best-bin-first approximation result in 706 ms:
> >> > ---sim: 710.035490---user data: 74977---
> >> > ---sim: 708.643109---user data: 22241---
> >> > ---sim: 705.801436---user data: 498081---
> >> > ---sim: 701.736654---user data: 58338---
> >> > ---sim: 705.601652---user data: 185668---
> >> > ---sim: 699.213164---user data: 488635---
> >> > ---sim: 690.705895---user data: 479280---
> >> > ---sim: 689.908410---user data: 66521---
> >> > ---sim: 677.365960---user data: 152510---
> >> > ---sim: 703.449575---user data: 390538---
> >> >
> >> > --------console output end--------
> >> >
> >> >
> >>
> >
> >
>


Reply | Threaded
Open this post in threaded view
|

Re: spill tree beats best-bin-first in toy data

Xavier Delacour
In reply to this post by liuliu_0503
I am mostly done integrating your implementation into opencv.. While
doing that, I noticed that your test is not correct. You are giving
the bbf code emax=n (n=500000 for your results), which effectively
implies an exhaustive search of the kd tree -- and _that_ is of course
slower than spill trees.

In practice values of emax less than 100, even as low as 20, work
quite well.. eg, for matching say two images worth of SIFT/SURF
features.

Here are new results with n=10000 (my machine becomes unhappy with
500000) and emax at various values. As you can see, the quality of the
approximation improves as emax gets larger, as does the search time.
But at values that work well in practice, bbf search is much faster.

Xavier

note that neither implementation is compiled with optimizations here.

spill tree created in 1279 ms.
spill tree approximation result in 73 ms:
---sim: 698.308344---i: 9378---
---sim: 710.789429---i: 3102---
---sim: 733.939280---i: 2541---
---sim: 746.311721---i: 6685---
---sim: 747.415649---i: 6651---
---sim: 749.885740---i: 4522---
---sim: 753.267176---i: 4360---
---sim: 754.028963---i: 1256---
---sim: 755.353420---i: 1196---
---sim: 758.771161---i: 7719---

kd tree created in 3828 ms.
--------------------------------------------
emax = 16
best-bin-first approximation result in 0 ms:
---sim: 841.819824---i: 272---
---sim: 834.904562---i: 2060---
---sim: 820.374227---i: 5135---
---sim: 813.149933---i: 1638---
---sim: 829.187638---i: 2971---
---sim: 791.345031---i: 2866---
---sim: 800.839859---i: 5530---
---sim: 799.235517---i: 5820---
---sim: 767.408904---i: 4606---
---sim: 828.631318---i: 5699---

--------------------------------------------
emax = 128
best-bin-first approximation result in 3 ms:
---sim: 802.672758---i: 6899---
---sim: 802.575934---i: 6228---
---sim: 800.839859---i: 5530---
---sim: 799.235517---i: 5820---
---sim: 789.219760---i: 3013---
---sim: 791.345031---i: 2866---
---sim: 789.498863---i: 9743---
---sim: 797.380549---i: 1310---
---sim: 767.408904---i: 4606---
---sim: 764.757569---i: 7551---

--------------------------------------------
emax = 1024
best-bin-first approximation result in 30 ms:
---sim: 767.408904---i: 4606---
---sim: 766.102348---i: 623---
---sim: 765.401511---i: 1395---
---sim: 764.757569---i: 7551---
---sim: 762.791749---i: 1175---
---sim: 733.939280---i: 2541---
---sim: 758.771161---i: 7719---
---sim: 764.727846---i: 5586---
---sim: 747.415649---i: 6651---
---sim: 762.428061---i: 795---

Here are the changes to the test:

6c6
< int n = 500000;
---
> int n = 10000;
40,47c40,52
< t = cvGetTickCount();
< cvFindFeatures( bbf, vector, res, dist, k, n );
<         t = cvGetTickCount() - t;
< printf( "best-bin-first approximation result in %lld ms:\n", t/1000000 );
< vv = dist->data.db;
< idx = res->data.i;
< for ( int i = 0; i < k; i++ )
< printf( "---sim: %f---i: %d---\n", vv[i], idx[i] );
---

>
> for (int emax=16;emax<2000;emax*=8) {
>  printf( "\n--------------------------------------------\n", emax );
>  printf( "emax = %i\n", emax );
>  t = cvGetTickCount();
>  cvFindFeatures( bbf, vector, res, dist, k, emax );
>  t = cvGetTickCount() - t;
>  printf( "best-bin-first approximation result in %lld ms:\n", t/1000000 );
>  vv = dist->data.db;
>  idx = res->data.i;
>  for ( int i = 0; i < k; i++ )
>    printf( "---sim: %f---i: %d---\n", vv[i], idx[i] );
> }

On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
<[hidden email]> wrote:

> 1. will there be any difference in real world data?
> 2. just notice that cvFindFeatures can find KNN with several points in
> one-time, but further comparison didn't show that can do any good to
> speed up (as for two points knn search, it takes 1400 ms, for one it
> takes 700ms), is it just a convenience way or there should be any
> speed-up?
>
> following is my console output, I wrote cvSpillTree myself, the
> best-bin-first search is the cvFeatureTree which implemented in OpenCV
> 1.1 pre.
>
> --------console output--------
>
> data generated
> spill tree created in 28761 ms.
> test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
> 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
> -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
> 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
> 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
> -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
> -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
> 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
> -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
> 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
> 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
> -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
> spill tree approximation result in 672 ms:
> ---sim: 677.365960---user data: 152510---
> ---sim: 689.908410---user data: 66521---
> ---sim: 690.705895---user data: 479280---
> ---sim: 699.213164---user data: 488635---
> ---sim: 701.736654---user data: 58338---
> ---sim: 703.449575---user data: 390538---
> ---sim: 705.601652---user data: 185668---
> ---sim: 705.801436---user data: 498081---
> ---sim: 708.643109---user data: 22241---
> ---sim: 710.035490---user data: 74977---
> kd tree created in 54792 ms.
> best-bin-first approximation result in 706 ms:
> ---sim: 710.035490---user data: 74977---
> ---sim: 708.643109---user data: 22241---
> ---sim: 705.801436---user data: 498081---
> ---sim: 701.736654---user data: 58338---
> ---sim: 705.601652---user data: 185668---
> ---sim: 699.213164---user data: 488635---
> ---sim: 690.705895---user data: 479280---
> ---sim: 689.908410---user data: 66521---
> ---sim: 677.365960---user data: 152510---
> ---sim: 703.449575---user data: 390538---
>
> --------console output end--------
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Re: spill tree beats best-bin-first in toy data

Xavier Delacour
In reply to this post by liuliu_0503
Alright, this has been added to SVN.. now it's possible to do

CvFeatureTree* tr1 = CvCreateSpillTree(xyz);
CvFeatureTree* tr2 = CvCreateKDTree(xyz);

cvFindFeatures(tr1, ...);
cvFindFeatures(tr2, ...);

However, the (now disabled) test in tests/cv/aspilltree.cpp generates
a segfault when run, even though the implementation is essentially an
unmodified copy of what you sent. The problem seems to be something
nontrivial in your code -- the number of elements in the leaf list (in
icvSpillTreeDFSearch) does not agree with node->cc, and that is
assumed. It may be triggered by compilation switches used in the
opencv tree (??).

Can you have a look at that please? (try enabling and running the
test, or updating your test program to use the opencv tree)

Thanks,
Xavier

On Fri, Jan 9, 2009 at 11:11 AM, liuliu_0503
<[hidden email]> wrote:

> try this link to see if it works.
>
> http://tech.groups.yahoo.com/group/OpenCV/files/spilltree.tar.bz2
>
> --- In [hidden email], "Xavier Delacour" <xavier.delacour@...>
> wrote:
>>
>> On Fri, Jan 9, 2009 at 6:16 AM, liuliu_0503
>> <liuliu.1987+opencv@...> wrote:
>> > "sim" means the distance of the two vector.
>> > "test drive" is the vector to search with.
>> > "user data" is the index of similar vector to verify the result.
>> >
>> > p.s. I hope that cvFeatureTree bundle can be a collection of several
>> > different tree-based feature search methods, ideally, we can do such:
>> >
>> > CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
>> > CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
>> >
>> > cvFindFeatures( sp, desc, sp_results, sp_dist, k );
>> > cvFindFeatures( bbf, desc, sp_results, sp_dist, k );
>> >
>> > /* do comparison then */
>> >
>> > If that implemented in OpenCV would be great.
>>
>> Yes that was the original idea, and also to provide some extra
>> functions/structures to permit external storage of the vectors
>> themselves (which becomes relevant with large databases).
>>
>> I can do the integration (I am the author of these functions) if you
>> send me your files or publish them online somewhere. If you want to do
>> it, you should just prepare a patch and email it to the
>> opencvlibrary-devel list.
>>
>> >
>> > However, if someone come up with something like cvMergeFeatureTree(),
>> > how to decide which put into "results" in cvFindFeatures? (another
>> > words, how to decide the index which point to the original matrix)
>>
>> You mean chain the results together? so
>>
>> CvFeatureTree* sp = cvCreateSpillTree( blabalalal );
>> CvFeatureTree* bbf = cvCreateKDTree( lalalallala );
>> CvFeatureTree* m = cvMergeFeatureTree( sp, bbf );
>> cvFindFeatures( m, desc, sp_results, sp_dist, k );
>>
>> in that case we could just sort the results by distance. Is that
>> really a useful function in practice?
>>
>> Xavier
>>
>> >
>> > happy to have a discussion with cvFeatureTree original contributor.
>> >
>> > --- In [hidden email], "Xavier Delacour" <xavier.delacour@>
>> > wrote:
>> >
>> >>
>> >> On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
>> >> <liuliu.1987+opencv@> wrote:
>> >>
>> >> Cool!
>> >>
>> >> > 1. will there be any difference in real world data?
>> >>
>> >> I can't say without having a closer look at how spill trees work. Off
>> >> hand it seems that performance on random data should approximate real
>> >> world performance pretty well, but of course that may not be true
> with
>> >> a particular feature generator.
>> >>
>> >> > 2. just notice that cvFindFeatures can find KNN with several
> points in
>> >> > one-time, but further comparison didn't show that can do any
> good to
>> >> > speed up (as for two points knn search, it takes 1400 ms, for
> one it
>> >> > takes 700ms), is it just a convenience way or there should be any
>> >> > speed-up?
>> >>
>> >> If you're talking about input points, then yes it is just for
>> >> convenience and time will grow linearly with their number. If output
>> >> points, then no, time should not vary with what you specify for k--
>> >> only with emax and the number of items in the tree.
>> >>
>> >> >
>> >> > following is my console output, I wrote cvSpillTree myself, the
>> >> > best-bin-first search is the cvFeatureTree which implemented in
> OpenCV
>> >> > 1.1 pre.
>> >>
>> >> It would be great if you could contribute your implementation, and/or
>> >> an LSH implementation if you have one.
>> >>
>> >> Can you explain a bit more the numbers below, and what input data
>> >> you're using/generating? (what is test drive, sim, user data? are the
>> >> result times an aggregation of sim?)
>> >>
>> >> Xavier
>> >>
>> >> >
>> >> > --------console output--------
>> >> >
>> >> > data generated
>> >> > spill tree created in 28761 ms.
>> >> > test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21
> 30.16
>> >> > 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00
> -23.25 83.86
>> >> > -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
>> >> > 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
>> >> > 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15
> -64.19
>> >> > -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
>> >> > -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87
> 8.10
>> >> > 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74
> 1.25 4.78
>> >> > -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
>> >> > 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81
> 14.18
>> >> > 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53
> 98.34
>> >> > -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00
> 91.67
>> >> > spill tree approximation result in 672 ms:
>> >> > ---sim: 677.365960---user data: 152510---
>> >> > ---sim: 689.908410---user data: 66521---
>> >> > ---sim: 690.705895---user data: 479280---
>> >> > ---sim: 699.213164---user data: 488635---
>> >> > ---sim: 701.736654---user data: 58338---
>> >> > ---sim: 703.449575---user data: 390538---
>> >> > ---sim: 705.601652---user data: 185668---
>> >> > ---sim: 705.801436---user data: 498081---
>> >> > ---sim: 708.643109---user data: 22241---
>> >> > ---sim: 710.035490---user data: 74977---
>> >> > kd tree created in 54792 ms.
>> >> > best-bin-first approximation result in 706 ms:
>> >> > ---sim: 710.035490---user data: 74977---
>> >> > ---sim: 708.643109---user data: 22241---
>> >> > ---sim: 705.801436---user data: 498081---
>> >> > ---sim: 701.736654---user data: 58338---
>> >> > ---sim: 705.601652---user data: 185668---
>> >> > ---sim: 699.213164---user data: 488635---
>> >> > ---sim: 690.705895---user data: 479280---
>> >> > ---sim: 689.908410---user data: 66521---
>> >> > ---sim: 677.365960---user data: 152510---
>> >> > ---sim: 703.449575---user data: 390538---
>> >> >
>> >> > --------console output end--------
>> >> >
>> >> >
>> >>
>> >
>> >
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: spill tree beats best-bin-first in toy data

liuliu_0503
In reply to this post by Xavier Delacour
just noticed the big flaw in my implementation. thank you for point out.

because it is a high-dimensional euclidean space and the random data
tend to have relatively same distance between each other where the
spill tree basically do exhausted search. In that case, where d=128,
my test case is slower than naive search.

I modified the test case and did some small changes in params. in
lower dimension, such as d=10~30, spill-tree have same speed comparing
to bbf and get better approximation.

http://tech.groups.yahoo.com/group/OpenCV/files/cvspilltree.tar.bz2

the tree generating process is still slow, looking for some speed-up
advice..

--- In [hidden email], "Xavier Delacour" <xavier.delacour@...>
wrote:

>
> I am mostly done integrating your implementation into opencv.. While
> doing that, I noticed that your test is not correct. You are giving
> the bbf code emax=n (n=500000 for your results), which effectively
> implies an exhaustive search of the kd tree -- and _that_ is of course
> slower than spill trees.
>
> In practice values of emax less than 100, even as low as 20, work
> quite well.. eg, for matching say two images worth of SIFT/SURF
> features.
>
> Here are new results with n=10000 (my machine becomes unhappy with
> 500000) and emax at various values. As you can see, the quality of the
> approximation improves as emax gets larger, as does the search time.
> But at values that work well in practice, bbf search is much faster.
>
> Xavier
>
> note that neither implementation is compiled with optimizations here.
>
> spill tree created in 1279 ms.
> spill tree approximation result in 73 ms:
> ---sim: 698.308344---i: 9378---
> ---sim: 710.789429---i: 3102---
> ---sim: 733.939280---i: 2541---
> ---sim: 746.311721---i: 6685---
> ---sim: 747.415649---i: 6651---
> ---sim: 749.885740---i: 4522---
> ---sim: 753.267176---i: 4360---
> ---sim: 754.028963---i: 1256---
> ---sim: 755.353420---i: 1196---
> ---sim: 758.771161---i: 7719---
>
> kd tree created in 3828 ms.
> --------------------------------------------
> emax = 16
> best-bin-first approximation result in 0 ms:
> ---sim: 841.819824---i: 272---
> ---sim: 834.904562---i: 2060---
> ---sim: 820.374227---i: 5135---
> ---sim: 813.149933---i: 1638---
> ---sim: 829.187638---i: 2971---
> ---sim: 791.345031---i: 2866---
> ---sim: 800.839859---i: 5530---
> ---sim: 799.235517---i: 5820---
> ---sim: 767.408904---i: 4606---
> ---sim: 828.631318---i: 5699---
>
> --------------------------------------------
> emax = 128
> best-bin-first approximation result in 3 ms:
> ---sim: 802.672758---i: 6899---
> ---sim: 802.575934---i: 6228---
> ---sim: 800.839859---i: 5530---
> ---sim: 799.235517---i: 5820---
> ---sim: 789.219760---i: 3013---
> ---sim: 791.345031---i: 2866---
> ---sim: 789.498863---i: 9743---
> ---sim: 797.380549---i: 1310---
> ---sim: 767.408904---i: 4606---
> ---sim: 764.757569---i: 7551---
>
> --------------------------------------------
> emax = 1024
> best-bin-first approximation result in 30 ms:
> ---sim: 767.408904---i: 4606---
> ---sim: 766.102348---i: 623---
> ---sim: 765.401511---i: 1395---
> ---sim: 764.757569---i: 7551---
> ---sim: 762.791749---i: 1175---
> ---sim: 733.939280---i: 2541---
> ---sim: 758.771161---i: 7719---
> ---sim: 764.727846---i: 5586---
> ---sim: 747.415649---i: 6651---
> ---sim: 762.428061---i: 795---
>
> Here are the changes to the test:
>
> 6c6
> < int n = 500000;
> ---
> > int n = 10000;
> 40,47c40,52
> < t = cvGetTickCount();
> < cvFindFeatures( bbf, vector, res, dist, k, n );
> <         t = cvGetTickCount() - t;
> < printf( "best-bin-first approximation result in %lld ms:\n",
t/1000000 );

> < vv = dist->data.db;
> < idx = res->data.i;
> < for ( int i = 0; i < k; i++ )
> < printf( "---sim: %f---i: %d---\n", vv[i], idx[i] );
> ---
> >
> > for (int emax=16;emax<2000;emax*=8) {
> >  printf( "\n--------------------------------------------\n", emax );
> >  printf( "emax = %i\n", emax );
> >  t = cvGetTickCount();
> >  cvFindFeatures( bbf, vector, res, dist, k, emax );
> >  t = cvGetTickCount() - t;
> >  printf( "best-bin-first approximation result in %lld ms:\n",
t/1000000 );

> >  vv = dist->data.db;
> >  idx = res->data.i;
> >  for ( int i = 0; i < k; i++ )
> >    printf( "---sim: %f---i: %d---\n", vv[i], idx[i] );
> > }
>
> On Thu, Jan 8, 2009 at 10:07 PM, liuliu_0503
> <liuliu.1987+opencv@...> wrote:
> > 1. will there be any difference in real world data?
> > 2. just notice that cvFindFeatures can find KNN with several points in
> > one-time, but further comparison didn't show that can do any good to
> > speed up (as for two points knn search, it takes 1400 ms, for one it
> > takes 700ms), is it just a convenience way or there should be any
> > speed-up?
> >
> > following is my console output, I wrote cvSpillTree myself, the
> > best-bin-first search is the cvFeatureTree which implemented in OpenCV
> > 1.1 pre.
> >
> > --------console output--------
> >
> > data generated
> > spill tree created in 28761 ms.
> > test drive: 36.95 -50.20 -21.44 60.88 -99.70 57.32 -41.65 13.21 30.16
> > 37.56 -65.32 85.90 38.36 14.80 -11.74 -29.48 17.82 -54.00 -23.25 83.86
> > -80.23 75.74 12.73 -31.26 8.19 1.30 75.46 22.00 33.67 -5.34 31.16
> > 70.05 -7.26 62.56 -9.95 -79.65 -72.97 -51.89 -28.61 -57.30 -44.76
> > 26.83 15.79 -70.19 33.88 2.58 -43.35 -44.49 -60.56 -74.65 95.15 -64.19
> > -69.85 -79.42 54.68 51.46 28.14 -96.81 -33.27 -18.26 -55.45 10.10
> > -22.27 -26.07 -83.79 64.43 25.43 3.18 26.76 23.15 -42.34 -98.87 8.10
> > 39.83 -4.40 -65.59 12.55 52.30 32.67 68.17 59.00 1.31 -14.74 1.25 4.78
> > -7.32 -91.16 12.97 20.03 -58.71 98.49 -16.75 -49.29 -97.95 -21.78
> > 56.08 19.80 74.28 52.48 65.04 90.63 -64.17 -84.10 -62.01 86.81 14.18
> > 25.22 82.14 -68.55 -80.29 77.05 4.64 59.25 96.77 -55.53 -20.53 98.34
> > -68.35 24.97 20.76 13.70 -17.51 74.02 59.67 -68.12 -72.85 78.00 91.67
> > spill tree approximation result in 672 ms:
> > ---sim: 677.365960---user data: 152510---
> > ---sim: 689.908410---user data: 66521---
> > ---sim: 690.705895---user data: 479280---
> > ---sim: 699.213164---user data: 488635---
> > ---sim: 701.736654---user data: 58338---
> > ---sim: 703.449575---user data: 390538---
> > ---sim: 705.601652---user data: 185668---
> > ---sim: 705.801436---user data: 498081---
> > ---sim: 708.643109---user data: 22241---
> > ---sim: 710.035490---user data: 74977---
> > kd tree created in 54792 ms.
> > best-bin-first approximation result in 706 ms:
> > ---sim: 710.035490---user data: 74977---
> > ---sim: 708.643109---user data: 22241---
> > ---sim: 705.801436---user data: 498081---
> > ---sim: 701.736654---user data: 58338---
> > ---sim: 705.601652---user data: 185668---
> > ---sim: 699.213164---user data: 488635---
> > ---sim: 690.705895---user data: 479280---
> > ---sim: 689.908410---user data: 66521---
> > ---sim: 677.365960---user data: 152510---
> > ---sim: 703.449575---user data: 390538---
> >
> > --------console output end--------
> >
> >
>