Step By Step Data for Recognition Tests

Henrik Stewénius

This page is based on the data found on our UK-Bench Recognition Homepage.

When comparing the performance of different recognition and retrieval systems and ideas, it is important to separate the effects that the individual components of each system have on the final outcome of the test.

Our goal with this page is to walk through our recognition pipeline step by step and to provide as much data as possible at every step, so that authors with a different pipeline can replace as little of the pipeline as possible and thereby disturb the comparisons as little as possible. Our current pipeline architecture, which we share with several other systems, is:

  • Detect MSER. From this step we get center-points and ellipses. We could also detect whether a region is dark-to-bright or bright-to-dark.
  • Detect SIFT and extract descriptor vectors. This gives a 128-dimensional vector and a direction.
  • Quantize to visual words. To make it possible to train different quantizers, it is useful to supply a large amount of SIFT-data. Here we try to present data from several different stages and with different options.
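
The quantization step can be illustrated with a minimal sketch. It assumes the quantizer is a tree of cluster centers that is descended greedily, which is how a vocabulary tree is usually described; the Node class, the toy tree and the 2-D descriptors are hypothetical stand-ins, not the code or data used for the results on this page.

```python
from typing import List, Optional, Sequence


class Node:
    """One node of a toy vocabulary tree (hypothetical structure)."""

    def __init__(self, center: Sequence[float],
                 children: Optional[List["Node"]] = None,
                 word_id: Optional[int] = None):
        self.center = center
        self.children = children or []
        self.word_id = word_id  # only set on leaf nodes


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(descriptor: Sequence[float], root: Node) -> List[Node]:
    """Descend from the root, choosing the nearest child at every level.

    Returns the visited nodes; the last one is the leaf (the visual word).
    """
    path = []
    node = root
    while node.children:
        node = min(node.children, key=lambda c: sq_dist(descriptor, c.center))
        path.append(node)
    return path


# Toy tree: branching factor 2, depth 2, 2-D vectors instead of 128-D SIFT.
root = Node((0.0, 0.0), [
    Node((-1.0, 0.0), [Node((-1.0, -1.0), word_id=0), Node((-1.0, 1.0), word_id=1)]),
    Node((1.0, 0.0), [Node((1.0, -1.0), word_id=2), Node((1.0, 1.0), word_id=3)]),
])

path = quantize((0.9, 0.8), root)
word = path[-1].word_id
print(word)
```

Keeping the whole path rather than only the leaf is what would allow inner nodes to take part in a hierarchical scoring scheme; flat scoring would only use the final word.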

    Results

    Running all 10200 images with different quantizers and scoring strategies gives the following results (all quantizers at 1M leaves):

    Quantizer                  Scoring strategy
    (training set)   Flat       10         100        1000
    cd               2.895588   2.574118   3.139706   3.161275
    moving           2.828529   2.161275   3.014216   3.083824
    moving+cd        2.884412   2.551078   3.139902   3.157157
    flip             3.014412   2.534902   3.135098   3.188333
    test             3.166373   3.070098   3.294314   3.286863
    As seen in the last row, the scores improve significantly when the test-set is used during the training phase.
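
For readers who want to reproduce such numbers, here is a minimal sketch of how a score of this kind can be computed. It assumes the usual UK-Bench convention, which is not restated on this page: images come in consecutive groups of four, and a query scores one point for every image of its own group among the top four results, so 4.0 is a perfect score. The function name and the toy rankings are hypothetical.

```python
from typing import Dict, List


def ukbench_score(rankings: Dict[int, List[int]], group_size: int = 4) -> float:
    """Average number of same-group images among the top `group_size` results.

    Assumption: image i belongs to group i // group_size.
    """
    total = 0
    for query, ranked in rankings.items():
        group = query // group_size
        total += sum(1 for r in ranked[:group_size] if r // group_size == group)
    return total / len(rankings)


# Toy example with 8 images, i.e. two groups of four.
rankings = {
    0: [0, 1, 2, 3, 4],  # 4 hits in the top four
    1: [1, 0, 5, 2, 3],  # 3 hits in the top four
    4: [4, 0, 1, 5, 6],  # 2 hits in the top four
    5: [5, 4, 6, 7, 0],  # 4 hits in the top four
}
score = ukbench_score(rankings)
print(score)
```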

    The Training sets

    Download sift-features
    cd A large collection of CD-covers. 31.486.448 features (4GB).
    moving Images from moving vehicles. 21.750.941 features (2.7GB).
    moving+cd Concatenation of the two sets above. 53.237.389 features.
    flip Flipped versions of the test-set. 5.327.083 features (0.68GB).
    test The test-set. 5.328.755 features (0.68GB).

    The Scoring Strategies

    flat Using only the leaf-nodes. No scoring limit.
    10 Using hierarchical scoring with scoring limit 10.
    100 Using hierarchical scoring with scoring limit 100.
    1000 Using hierarchical scoring with scoring limit 1000.

    The Test Set

    Download extracted data
    Extracted features and geometry from UK-bench. WARNING: Only x and y are valid in the geometry data. SIFT and geometry (1.1GB)
    Quantized using cd Visual Words
    Quantized using moving Visual Words
    Quantized using cd+moving Visual Words
    Quantized using flip Visual Words
    Quantized using normal Visual Words

    FAQ:

  • What is the average number of features?
  • There are a total of 7034780 features; the average is 689 features per image. We permit multiple SIFT descriptors per MSER-point, and I do not know the original number of MSER-points (you can find this number by counting the number of distinct points in the geometric part of the data).
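
The counting suggested above can be sketched as follows. The layout of the geometry files is not described here, so the example works on in-memory lists of (x, y) tuples, one list per image, and it assumes that descriptors sharing an MSER-point carry exactly identical coordinates; the function name and the toy data are hypothetical.

```python
from typing import Iterable, Tuple


def feature_stats(points_per_image: Iterable[Iterable[Tuple[float, float]]]):
    """Return (total features, total distinct points, average features per image)."""
    n_images = 0
    n_features = 0
    n_points = 0
    for points in points_per_image:
        points = list(points)
        n_images += 1
        n_features += len(points)
        n_points += len(set(points))  # several descriptors may share one (x, y)
    return n_features, n_points, n_features / n_images


# Toy data: two images, some points carrying more than one descriptor.
images = [
    [(10.0, 12.5), (10.0, 12.5), (40.0, 7.0)],
    [(3.0, 3.0), (8.5, 1.0), (8.5, 1.0), (8.5, 1.0)],
]
total, distinct, average = feature_stats(images)
print(total, distinct, average)

# The figure quoted above: 7034780 features over 10200 images.
print(7034780 // 10200)
```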
     
  • Why is there a lower number of features in the training data compared to the test-set?
  • For the training data, all features with an ellipse partially outside the image have been removed. Corresponding test-data, along with results, can be found here. Table with results and values for plotting a curve:

    Quantizer    Flat       10         100        1000
    Trained on the datasets given on this page
    cd           2.906667   2.651471   3.112647   3.120980
    moving+cd    2.895784   2.626176   3.125686   3.126078
    moving       2.824608   2.244314   3.002549   3.047843
    flip         3.015784   2.615980   3.122647   3.125294
    test         3.180882   3.110392   3.276569   3.241373
     
  • Is the source-code available?
  • The code is not freely available and is the property of the University of Kentucky. Please contact the Center for Visualization and Virtual Environments for licensing. The code exists in both Windows and Linux versions.
     
  • What is your framerate for extraction?
  • Around 12 frames per second. For the 10200 frames we get:
    real    14m30.650s
    user    13m40.619s
    
    Approximate distribution:
    MSER 57%
    Affine Warp 24%
    SIFT 13%
    Quantize 3%
    Sum 97%
    We do not have a very good profiler, but this should give an indication of where the problems lie. The quantizer is almost free compared to the rest of the feature extraction pipeline.
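
The frame rate follows directly from the timing above, as this small sketch shows. It parses the wall-clock value printed by the shell's time command and, as a rough assumption, applies the approximate percentages to the wall-clock time to estimate a per-frame cost for each stage; the helper name parse_time is hypothetical.

```python
import re


def parse_time(text: str) -> float:
    """Convert a shell `time` value such as '14m30.650s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


frames = 10200
real_seconds = parse_time("14m30.650s")
fps = frames / real_seconds

# Approximate shares from the profile above (they add up to 97%).
shares = {"MSER": 0.57, "Affine Warp": 0.24, "SIFT": 0.13, "Quantize": 0.03}
ms_per_frame = {stage: 1000.0 * share * real_seconds / frames
                for stage, share in shares.items()}

print(f"{fps:.1f} frames per second")
for stage, ms in ms_per_frame.items():
    print(f"{stage}: about {ms:.1f} ms per frame")
```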
     
  • What is your query time?
  • Depends on the scoring strategy. There is a trade-off between speed and quality. Time for all 10200 queries:
    Strategy   Time for 10200   Avg time
    Flat       9.3s             9.14e-4s
    10         9.9s             9.72e-4s
    100        27.7s            2.71e-3s
    1000       225s             2.21e-2s
    For 100 times using "flat"
    real    16m7.531s
    user    15m32.642s
    
    For 100 times using "10"
    real    16m59.350s
    user    16m31.338s
    
    For 10 times using "100"
    real    4m46.278s
    user    4m36.717s
    
    For 1 time using "1000"
    real    3m45.801s
    user    3m35.857s
    
    These times include reading all the text-based forward files, building the inverted index and computing the weights once.
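
To make the terms forward file, inverted index and weights concrete, here is a minimal sketch of flat scoring. It is not the system measured above: it assumes plain TF-IDF weights (the logarithm of the number of images divided by the number of images containing the word) and cosine similarity, which are common choices but not necessarily the ones used here, and all function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def build_index(forward: Dict[int, List[int]]):
    """Build an inverted index (word -> {image: count}) plus IDF weights and norms."""
    inverted = defaultdict(dict)
    for image, words in forward.items():
        for word, count in Counter(words).items():
            inverted[word][image] = count
    n_images = len(forward)
    weights = {w: math.log(n_images / len(p)) for w, p in inverted.items()}
    # L2 norm of every weighted database vector, computed once.
    squared = defaultdict(float)
    for word, postings in inverted.items():
        for image, count in postings.items():
            squared[image] += (weights[word] * count) ** 2
    norms = {image: math.sqrt(v) for image, v in squared.items()}
    return inverted, weights, norms


def query(words: List[int], inverted, weights, norms) -> List[int]:
    """Rank images by cosine similarity, touching only posting lists of query words."""
    q_counts = Counter(words)
    scores = defaultdict(float)
    for word, q_count in q_counts.items():
        for image, d_count in inverted.get(word, {}).items():
            scores[image] += weights[word] ** 2 * q_count * d_count
    q_norm = math.sqrt(sum((weights.get(w, 0.0) * c) ** 2
                           for w, c in q_counts.items()))
    ranked = [(s / (q_norm * norms[i]), i)
              for i, s in scores.items() if q_norm and norms[i]]
    return [i for _, i in sorted(ranked, key=lambda t: (-t[0], t[1]))]


# Toy forward files: image id -> visual words found in that image.
forward = {0: [1, 1, 2], 1: [1, 3], 2: [2, 2, 4], 3: [3, 4, 4]}
inverted, weights, norms = build_index(forward)
result = query(forward[0], inverted, weights, norms)
print(result)
```

Because a query only visits the posting lists of its own words, images that share no word with the query are never touched, which is what keeps flat scoring fast on large databases.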
     
  • On what machine was this measured?
  • On an AMD machine bought in spring 2006, running 32-bit Linux (2.6.17-2-k7, Debian).
    # cat /proc/cpuinfo
    processor       : 0
    vendor_id       : AuthenticAMD
    cpu family      : 15
    model           : 39
    model name      : AMD Athlon(tm) 64 Processor 4000+
    stepping        : 1
    cpu MHz         : 2418.239
    cache size      : 1024 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 1
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce 
                      cx8 apic sep mtrr pge mca cmov 
                      pat pse36 clflush mmx fxsr sse 
                      sse2 syscall nx mmxext fxsr_opt 
                      lm 3dnowext 3dnow up pni lahf_lm 
                      ts fid vid ttp tm stc
    bogomips        : 4841.36
    
     
  • How do I refer to this?
  • Please mention the homepage. The only paper we have published on this is:

    D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2161-2168, June 2006.
     
  • What size vocabulary should I use?
  • If you are using hierarchical scoring, there is never a problem with using too large a vocabulary. For flat scoring the picture is different:
    (This plot is on a different dataset. The quantizers were trained on images from a video-camera.)
    In any case, choosing the vocabulary size for flat scoring means balancing several factors:
  • Memory availability.
  • Time requirements.
  • Quality requirements.
     
  • What are the optimal settings of <component name>?
  • The question is ill-posed. Different use scenarios have different optimal settings. We have chosen settings which do not give optimal quality but give a good trade-off between quality and speed. Using a higher number of features per image does improve the quality but slows down queries.
    For optimal performance on a toy-scale example such as this benchmark, I would advise being very aggressive with the number of features; after all, the cost is only linear in memory and bilinear in query-time.
     
  • How does the vocabulary tree compare to <other method>?
  • Ask the other authors to try our set. I have made some trials, but I am not sure enough of my implementation.

    Henrik Stewénius