Name String

    cl_intel_advanced_motion_estimation

Contributors

    Biju George
    James Holland
    RaghuKrishnan Embar
    Adam Herr
    Tomasz Olejniczak
    Scott Pillow
    Ben Ashbaugh

Contact
   
    Biju George (biju.george 'at' intel.com)

IP Status

    TBD

Version

    Version 2, February 15, 2016

Number

    OpenCL Extension TBD

Status

    Final Draft

Extension Type

    OpenCL platform extension

Dependencies

    OpenCL 1.2
    cl_intel_accelerator version 1
    cl_intel_motion_estimation version 1

Overview

    This document presents the advanced motion estimation extension for OpenCL. This 
    extension builds upon the cl_intel_motion_estimation extension by providing block-
    based estimation and greater control over the estimation algorithm. 

    This extension reuses the set of host-callable functions and "motion estimation 
    accelerator objects" defined in the cl_intel_motion_estimation extension version 1.
    This extension depends on the OpenCL 1.2 built-in kernel infrastructure and on the 
    cl_intel_accelerator extension version 1, which provides an abstraction for domain-
    specific acceleration in the OpenCL runtime. 
 
New Procedures and Functions

    None

New Tokens

    Accepted as arguments to clGetDeviceInfo:

    CL_DEVICE_ME_VERSION_INTEL                                      0x407E

    Accepted as flags passed to the kernel:
       
    CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL                        0x1
    CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL                          0x2

    CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL                               0x0
    CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL                                 0x4

    CL_ME_COST_PENALTY_NONE_INTEL                                   0x0
    CL_ME_COST_PENALTY_LOW_INTEL                                    0x1
    CL_ME_COST_PENALTY_NORMAL_INTEL                                 0x2
    CL_ME_COST_PENALTY_HIGH_INTEL                                   0x3

    CL_ME_COST_PRECISION_QPEL_INTEL                                 0x0
    CL_ME_COST_PRECISION_HPEL_INTEL                                 0x1
    CL_ME_COST_PRECISION_PEL_INTEL                                  0x2
    CL_ME_COST_PRECISION_DPEL_INTEL                                 0x3

    Valid intra-search predictor mode constants:
    
    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_INTEL                        0x0
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_INTEL                      0x1
    CL_ME_LUMA_PREDICTOR_MODE_DC_INTEL                              0x2
    CL_ME_LUMA_PREDICTOR_MODE_DIAGONAL_DOWN_LEFT_INTEL              0x3

    CL_ME_LUMA_PREDICTOR_MODE_DIAGONAL_DOWN_RIGHT_INTEL             0x4
    CL_ME_LUMA_PREDICTOR_MODE_PLANE_INTEL                           0x4
    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_RIGHT_INTEL                  0x5
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_DOWN_INTEL                 0x6    
    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_LEFT_INTEL                   0x7
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_UP_INTEL                   0x8

    CL_ME_CHROMA_PREDICTOR_MODE_DC_INTEL                            0x0
    CL_ME_CHROMA_PREDICTOR_MODE_HORIZONTAL_INTEL                    0x1
    CL_ME_CHROMA_PREDICTOR_MODE_VERTICAL_INTEL                      0x2
    CL_ME_CHROMA_PREDICTOR_MODE_PLANE_INTEL                         0x3

    Valid constant values returned by clGetDeviceInfo:

    CL_ME_VERSION_ADVANCED_VER_1_INTEL                              0x1
    CL_ME_VERSION_ADVANCED_VER_2_INTEL                              0x2

    Valid macroblock type constants:

    CL_ME_MB_TYPE_16x16_INTEL                                       0x0
    CL_ME_MB_TYPE_8x8_INTEL                                         0x1
    CL_ME_MB_TYPE_4x4_INTEL                                         0x2

    Valid skip mode constants:

    CL_ME_FORWARD_INPUT_MODE_INTEL                                  0x1
    CL_ME_BACKWARD_INPUT_MODE_INTEL                                 0x2
    CL_ME_BIDIRECTION_INPUT_MODE_INTEL                              0x3

    Valid bidirectional weight constants:

    CL_ME_BIDIR_WEIGHT_QUARTER_INTEL                                0x10
    CL_ME_BIDIR_WEIGHT_THIRD_INTEL                                  0x15
    CL_ME_BIDIR_WEIGHT_HALF_INTEL                                   0x20
    CL_ME_BIDIR_WEIGHT_TWO_THIRD_INTEL                              0x2B
    CL_ME_BIDIR_WEIGHT_THREE_QUARTER_INTEL                          0x30

New Types

    None

Additions to Chapter 4 of the OpenCL Specification:

    Modify the description of function clGetDeviceInfo

    Table 4.3 must be extended to include the following enumeration constants:

    cl_device_info               Return Type    Description
    --------------               -----------    -----------
    CL_DEVICE_ME_VERSION_INTEL   cl_uint        The motion estimation API version number
                                                supported by the device and driver.

    This extension requires a minimum motion estimation device version number of
    CL_ME_VERSION_ADVANCED_VER_1_INTEL.

    The cl_intel_motion_estimation extension defines a motion estimation accelerator
    object. This object is used without modification in this extension.

    The tokens defined in the "New Tokens" section of the cl_intel_motion_estimation
    extension are used by this extension. 

    This extension includes two new built-in kernels for block-based motion estimation. 
    The second built-in kernel is supported only for devices that report a motion estimation
    device version number of CL_ME_VERSION_ADVANCED_VER_2_INTEL. A program object for one or
    both kernels is obtained via clCreateProgramWithBuiltInKernels, passing the kernel name
    as a string to the kernel_names argument. A kernel object is obtained from this program
    by calling the clCreateKernel function, passing the kernel name as a string to the
    kernel_name argument.
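
    The version rule and the kernel_names string can be combined on the host. The
    following is a minimal sketch, not part of this extension: advanced_me_kernel_names
    is a hypothetical helper, and it assumes the value reported for
    CL_DEVICE_ME_VERSION_INTEL has already been queried and that the version tokens
    compare in ascending order. clCreateProgramWithBuiltInKernels accepts a
    semicolon-separated list of names in kernel_names.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Values copied from the "New Tokens" section of this extension. */
#define CL_ME_VERSION_ADVANCED_VER_1_INTEL 0x1
#define CL_ME_VERSION_ADVANCED_VER_2_INTEL 0x2

/* Hypothetical helper: writes the kernel_names string for the built-in kernels
   that a device reporting me_version supports. Returns 0 on success, -1 if the
   version is below the minimum or the output buffer is too small. */
int advanced_me_kernel_names(unsigned me_version, char *out, size_t cap)
{
    const char *first  = "block_advanced_motion_estimate_check_intel";
    const char *second = "block_advanced_motion_estimate_bidirectional_check_intel";
    int n;

    if (me_version < CL_ME_VERSION_ADVANCED_VER_1_INTEL)
        return -1;

    if (me_version >= CL_ME_VERSION_ADVANCED_VER_2_INTEL)
        n = snprintf(out, cap, "%s;%s", first, second);  /* both kernels */
    else
        n = snprintf(out, cap, "%s", first);             /* first kernel only */

    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

    The resulting string would be passed as kernel_names; each individual name is
    then passed to clCreateKernel.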

    Each kernel operates on 16x16 pixel blocks (macroblocks) of the source and
    reference images. The number of macroblocks (MBs) in a given image is determined by
    the number of 16x16 regions that evenly divide the global_work_size[0] (width) and
    global_work_size[1] (height) arguments passed to the clEnqueueNDRangeKernel function.
    If the image dimensions are not evenly divisible by 16, a partial MB is defined for
    the remaining pixels. The kernel references macroblocks sequentially using contiguous
    row-major ordering. For example, a 128x128 source image would have the following 
    macroblock ordering:
  
    -------------------------
    | 0| 1| 2| 3| 4| 5| 6| 7|
    +--+--+--+--+--+--+--+--+
    | 8| 9|10|11|12|13|14|15|
    +--+--+--+--+--+--+--+--+
    |16|17|18|19|20|21|22|23|
    +--+--+--+--+--+--+--+--+
    |24|25|26|27|28|29|30|31|
    +--+--+--+--+--+--+--+--+
    |32|33|34|35|36|37|38|39|
    +--+--+--+--+--+--+--+--+
    |40|41|42|43|44|45|46|47|
    +--+--+--+--+--+--+--+--+
    |48|49|50|51|52|53|54|55|
    +--+--+--+--+--+--+--+--+
    |56|57|58|59|60|61|62|63|
    -------------------------

    The layout of the kernel's input and output arrays is based on this ordering, and
    each array requires a specific data layout per macroblock, as described below.
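
    The macroblock count and row-major index can be computed as follows. This is a
    minimal sketch; mb_count_1d and mb_index are hypothetical helper names, and a
    partial MB is counted as one full MB, as described above.

```c
#include <stddef.h>

#define MB_SIZE 16  /* macroblocks are 16x16 pixels */

/* Number of macroblocks along one dimension, counting a trailing partial MB. */
size_t mb_count_1d(size_t pixels)
{
    return (pixels + MB_SIZE - 1) / MB_SIZE;
}

/* Row-major index of the macroblock that contains pixel (x, y) in an image
   that is `width` pixels wide. */
size_t mb_index(size_t x, size_t y, size_t width)
{
    return (y / MB_SIZE) * mb_count_1d(width) + (x / MB_SIZE);
}
```

    For the 128x128 example, mb_count_1d(128) is 8 in each dimension, which gives
    the 64 macroblocks numbered 0 through 63 in the diagram.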

    Inter-prediction is the process of determining the best inter-frame motion vectors 
    that describe the transform from a 2D reference image to another 2D source image. This
    is done by searching for temporal patterns, usually in adjacent frames in a video
    sequence. The estimation algorithm operates on 16x16 macroblocks, with either 4x4, 
    8x8 or 16x16 sub-block sizes. Each of these sub-block sizes has a corresponding 
    number of motion vectors within a given macroblock: 

    Sub-block Size     MVs per MB
    --------------     --------------
    4x4                16
    8x8                 4
    16x16               1
 
    The algorithm searches for the best match of each pixel block in the source image by
    searching an image region in the reference image, centered on the coordinates of that
    pixel block in the source image. This center coordinate can be offset by a set of 
    prediction motion vectors (MVs). The predictor_motion_vector_buffer argument is used 
    to define up to eight prediction MVs per macroblock. The count_motion_vector_buffer 
    argument is used to configure the number of actual prediction motion vectors used 
    within each macroblock. 

    A cost function scheme can be specified for motion search. Distortion for a MV is 
    computed as a sum of the SAD and the MV cost penalty. Cost penalty is computed based 
    on the distance between the computed MV and a specific cost-center. This cost-center
    is specified as the first predictor motion vector configured for a given MB. The 
    search_cost_penalty argument specifies the cost penalty function and can be configured
    for low, normal or high penalty. The search_cost_precision argument is used to 
    configure the range of the cost function by specifying the precision of control 
    points at which the cost penalties are applied to quarter, half, full, or double pixel
    precision. The cost penalties at in-between control points are linearly interpolated.
    Generally, a low penalty can be used when using low quantization parameter values
    during encoding and a high penalty can be used when using high quantization parameter
    values during encoding.

    Search results are populated in the search_motion_vector_buffer array. This array
    contains a set of best-search motion vectors per MB; the number of MVs per MB is 
    determined by the sub-block size. The SAD-adjusted residual values corresponding to
    the best search MVs can also be obtained through the array passed as the
    search_residuals argument.

    The kernel can perform skip-checks to produce distortion values based on the skip-
    check MVs specified for each macroblock. Skip-checks may be configured with either 8x8
    or 16x16 sub-block sizes, via the skip_block_type argument. The 
    skip_motion_vector_buffer is used to configure multiple sets of skip-check MVs per
    MB. The number of vectors in each set is determined by the sub-block size:

    Sub-block Size     MVs per MB
    --------------     --------------
    8x8                4
    16x16              1

    Results are obtained via the skip_residuals argument as SAD-adjusted distortion values
    corresponding to each skip-check MV defined for each macroblock. 

    Intra-prediction describes the transform from previous adjacent macroblocks to
    subsequent macroblocks within the same 2D source frame. It searches for spatial
    patterns and produces the predictor modes relative to those previous adjacent
    macroblocks. The kernel may be configured to report the intra-prediction modes via
    the intra_search_prediction_modes_buffer argument. This array contains, for each
    macroblock, a record containing the predictor mode constants for one 16x16, four 8x8
    and sixteen 4x4 luma blocks. There is also a predictor mode entry reserved for an 8x8
    chroma block. Residual values derived during the intra-prediction process are
    accessible via the intra_search_residuals argument.
 
    The first kernel

    block_advanced_motion_estimate_check_intel(
        accelerator_intel_t    accelerator,
        __read_only image2d_t  src_image,
        __read_only image2d_t  ref_image,
        uint                   flags,
        uint                   skip_block_type,
        uint                   search_cost_penalty,
        uchar                  search_cost_precision,
        __global short2       *count_motion_vector_buffer,
        __global short2       *predictor_motion_vector_buffer,
        __global short2       *skip_motion_vector_buffer,
        __global short2       *search_motion_vector_buffer,
        __global char         *intra_search_predictor_modes,
        __global ushort       *search_residuals,
        __global ushort       *skip_residuals,
        __global ushort       *intra_residuals );

    defines a kernel that provides various block-based motion estimation computations.
    There are three basic use cases for this kernel:

    1.) Perform inter-prediction motion estimation on the source and reference images to
    obtain the best search motion vectors and their associated distortion values. 

    2.) Perform skip-checks on the source and reference images by providing a set of
    motion vectors, then obtain the corresponding distortion values. 
 
    3.) Perform intra-prediction computations to obtain the best-search prediction modes 
    between adjacent macroblocks and associated residual values. 

    This kernel can be set up to do some or all of these operations in a single enqueue.     

    block_advanced_motion_estimate_check_intel arguments:

    accelerator is a valid accelerator object created by clCreateAcceleratorINTEL, where
    the type of the accelerator must be CL_ACCELERATOR_TYPE_MOTION_ESTIMATION_INTEL. 
    Refer to the cl_intel_motion_estimation extension for a detailed description of
    configuring the accelerator object with the cl_motion_estimation_desc_intel structure.

    src_image is the input source image, typically representing 8-bit luminance
    information. Currently, the image_channel_order and the image_data_type of src_image
    are restricted as follows:

    Channel Order      Src Channel Data Type
    --------------     ---------------------
    CL_R               CL_UNORM_INT8

    Additional formats will be supported by future extensions. The host program is
    responsible for populating the tiled image using the clEnqueueWriteImage or other
    appropriate API function.

    ref_image is the input reference image, representing 8-bit luminance information. 
    image_channel_order and the image_data_type must match src_image, as follows:
    
    Channel Order      Ref Channel Data Type
    --------------     ---------------------
    CL_R               CL_UNORM_INT8

    Additional formats will be supported by future extensions. The host program is
    responsible for populating the tiled image using the clEnqueueWriteImage or other
    appropriate API function.

    flags defines any optional modes or behaviors used in computing motion estimation,
    skip check and/or intra-prediction algorithms. Currently supported are: 

    Type                                      Description
    --------------------------------------    -----------
    CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL    Enables luma-based intra-prediction.

    The following additional token is reserved for future support:

    Type                                      Description
    --------------------------------------    -----------
    CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL  Enables chroma-based intra-prediction.

    skip_block_type specifies the sub-block size used in evaluating skip checks. The
    specified sub-block size determines the data layout of the
    skip_motion_vector_buffer array:

    Type                       Sub-block size   MVs per MB entry
    ---------------------      ---------------  ----------------
    CL_ME_MB_TYPE_16x16_INTEL  16x16            1
    CL_ME_MB_TYPE_8x8_INTEL    8x8              4 

    search_cost_penalty defines the cost function scheme used in computing cost penalties.

    Type                                      Description
    --------------------------------------    -----------
    CL_ME_COST_PENALTY_NONE_INTEL             penalty is zero
    CL_ME_COST_PENALTY_LOW_INTEL              penalty for low motion
    CL_ME_COST_PENALTY_NORMAL_INTEL           penalty for normal motion
    CL_ME_COST_PENALTY_HIGH_INTEL             penalty for high motion
    
    search_cost_precision defines the pixel precision of the cost penalty calculations. If
    the search_cost_penalty flag is set to CL_ME_COST_PENALTY_NONE_INTEL, this argument
    is ignored. Possible values are:

    Type                                      Description
    --------------------------------------    -----------
    CL_ME_COST_PRECISION_QPEL_INTEL           quarter pixel
    CL_ME_COST_PRECISION_HPEL_INTEL           half pixel
    CL_ME_COST_PRECISION_PEL_INTEL            full pixel
    CL_ME_COST_PRECISION_DPEL_INTEL           double pixel

    count_motion_vector_buffer defines the number of predictor motion vectors and skip-
    check motion vectors defined for each macroblock. The buffer contains an array of
    short integer pairs, one pair per MB. The indices of the array correspond to the
    contiguous row-major block layout of the input frame. The first value in each pair
    defines the number of predictor motion vectors for a given MB; this value defines the
    range of valid entries for the MB contained within the predictor_motion_vector_buffer
    array. The second value in each pair defines the number of skip-check motion vectors
    for the MB; this value defines the range of valid entries in the 
    skip_motion_vector_buffer array. All size values must be between 0 and 8 inclusive; 
    size values greater than 8 result in undefined behavior. 
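
    A host program might populate this buffer as sketched below. This is an
    illustration only: short2_host is a hypothetical stand-in for the host-side
    short2 type (cl_short2 in real host code), fill_counts is a hypothetical helper,
    and the same counts are applied to every MB for brevity. The helper also rejects
    negative counts, which fall outside the 0 to 8 range stated above.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_MVS_PER_MB 8  /* upper bound stated for both counts */

/* Hypothetical host-side stand-in for one short2 element. */
typedef struct { int16_t s0, s1; } short2_host;

/* Writes the same (predictor count, skip-check count) pair for every MB.
   Returns -1 without writing anything if either count is outside 0..8. */
int fill_counts(short2_host *counts, size_t num_mbs, int predictors, int skips)
{
    size_t i;

    if (predictors < 0 || predictors > MAX_MVS_PER_MB ||
        skips < 0 || skips > MAX_MVS_PER_MB)
        return -1;

    for (i = 0; i < num_mbs; ++i) {
        counts[i].s0 = (int16_t)predictors;  /* first value: predictor MVs */
        counts[i].s1 = (int16_t)skips;       /* second value: skip-check MVs */
    }
    return 0;
}
```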

    predictor_motion_vector_buffer defines an input array of signed short integer 
    predictor MVs with quarter-pixel resolution. The array is partitioned into clusters of
    8 motion vectors per MB in contiguous row-major ordering. The buffer layout assumes
    the maximum size of 8 predictor MVs per MB even if the count_motion_vector_buffer
    array specifies a smaller predictor count. If the value of the search_cost_penalty
    argument does not equal CL_ME_COST_PENALTY_NONE_INTEL, the first predictor MV for
    each MB is used as the cost center for cost penalty calculations. If the array passed
    to the count_motion_vector_buffer argument specifies a predictor size of zero for all
    macroblocks, this argument can be NULL.
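
    The fixed stride of 8 slots per MB and the quarter-pixel units can be sketched as
    follows. The names mv_qpel, predictor_index and mv_from_pixels are hypothetical,
    and treating the first component as the horizontal displacement is an assumption
    that this text does not state.

```c
#include <stddef.h>
#include <stdint.h>

#define PREDICTOR_SLOTS_PER_MB 8  /* layout always reserves 8 slots per MB */

/* Hypothetical stand-in for one short2 motion vector in quarter-pixel units.
   Assumption: x is horizontal, y is vertical. */
typedef struct { int16_t x, y; } mv_qpel;

/* Element index of predictor slot `slot` (0..7) for macroblock `mb`. */
size_t predictor_index(size_t mb, size_t slot)
{
    return mb * PREDICTOR_SLOTS_PER_MB + slot;
}

/* Converts a whole-pixel displacement to quarter-pixel units. */
mv_qpel mv_from_pixels(int dx, int dy)
{
    mv_qpel mv;
    mv.x = (int16_t)(dx * 4);
    mv.y = (int16_t)(dy * 4);
    return mv;
}
```

    Slot 0 of each MB, at predictor_index(mb, 0), is the entry used as the cost
    center when a cost penalty is enabled.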

    skip_motion_vector_buffer defines an input array of signed short integer skip-check
    MVs. The array is partitioned into clusters of 8 sets of motion vectors per MB, in
    contiguous row-major ordering. The value of skip_block_type determines the number of
    MVs for each of the 8 entries: 

    Value of skip_block_type    Number of MVs in each entry
    -------------------------   ---------------------------
    CL_ME_MB_TYPE_16x16_INTEL   1 MV per entry
    CL_ME_MB_TYPE_8x8_INTEL     4 MVs per entry

    The buffer layout assumes the maximum size of 8 MV entries per MB, even if the 
    count_motion_vector_buffer array specifies a smaller skip-check count. If the array
    passed to count_motion_vector_buffer specifies a skip-check size of zero for all
    macroblocks, no skip check computation is performed and this argument can be NULL.
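
    The resulting buffer size and per-entry offsets can be sketched as follows. The
    helper names are hypothetical; the token values are those listed under "Valid
    macroblock type constants". Offsets are counted in short2 elements.

```c
#include <stddef.h>

/* Values copied from the "New Tokens" section of this extension. */
#define CL_ME_MB_TYPE_16x16_INTEL 0x0
#define CL_ME_MB_TYPE_8x8_INTEL   0x1

#define SKIP_ENTRIES_PER_MB 8  /* layout always reserves 8 entries per MB */

/* MVs stored in each skip-check entry for the given skip_block_type. */
size_t skip_mvs_per_entry(unsigned skip_block_type)
{
    return skip_block_type == CL_ME_MB_TYPE_8x8_INTEL ? 4 : 1;
}

/* Total number of short2 elements in skip_motion_vector_buffer. */
size_t skip_mv_buffer_elems(size_t num_mbs, unsigned skip_block_type)
{
    return num_mbs * SKIP_ENTRIES_PER_MB * skip_mvs_per_entry(skip_block_type);
}

/* Element offset of the first MV of entry `entry` (0..7) for macroblock `mb`. */
size_t skip_entry_offset(size_t mb, size_t entry, unsigned skip_block_type)
{
    return (mb * SKIP_ENTRIES_PER_MB + entry) * skip_mvs_per_entry(skip_block_type);
}
```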

    search_motion_vector_buffer defines an output array of signed short integer pairs
    defining the best search motion vectors per macroblock. The array contains 1, 4 or 16
    motion vectors per MB in contiguous row-major ordering. The number of vectors per MB
    is determined by the value of mb_block_type specified during the creation of the
    accelerator object: 

    Value of mb_block_type       Number of MVs    
    ----------------------       --------------
    CL_ME_MB_TYPE_16x16_INTEL     1 MV per MB
    CL_ME_MB_TYPE_8x8_INTEL       4 MVs per MB
    CL_ME_MB_TYPE_4x4_INTEL      16 MVs per MB
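
    A host program needs the byte size of this output buffer before creating it. The
    following minimal sketch derives it from the table above; mvs_per_mb and
    search_mv_buffer_bytes are hypothetical helper names, and each MV is taken to be
    one short2, that is, two 16-bit components.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the "New Tokens" section of this extension. */
#define CL_ME_MB_TYPE_16x16_INTEL 0x0
#define CL_ME_MB_TYPE_8x8_INTEL   0x1
#define CL_ME_MB_TYPE_4x4_INTEL   0x2

/* Best-search MVs written per macroblock; 0 for an unrecognized type. */
size_t mvs_per_mb(unsigned mb_block_type)
{
    switch (mb_block_type) {
    case CL_ME_MB_TYPE_16x16_INTEL: return 1;
    case CL_ME_MB_TYPE_8x8_INTEL:   return 4;
    case CL_ME_MB_TYPE_4x4_INTEL:   return 16;
    default:                        return 0;
    }
}

/* Bytes needed for search_motion_vector_buffer (one short2 per MV). */
size_t search_mv_buffer_bytes(size_t num_mbs, unsigned mb_block_type)
{
    return num_mbs * mvs_per_mb(mb_block_type) * 2 * sizeof(int16_t);
}
```

    For a 128x128 image (64 MBs) with 4x4 sub-blocks, this gives 64 * 16 * 4 = 4096
    bytes.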

    intra_search_prediction_modes_buffer (declared as intra_search_predictor_modes in the
    kernel signature above) specifies an output buffer containing a sequence of signed
    chars describing the predictor modes used during motion estimation. The array is
    divided into a sequence of 22 bytes per MB in contiguous row-major ordering.
    Each entry in the array has the following form:

    struct search_predictor_modes
    {
        char luma_16x16_block;
        char luma_8x8_block[4];
        char luma_4x4_block[16];
        char chroma_8x8_block; 
    };

    The luma_16x16_block, luma_8x8_block and luma_4x4_block fields contain valid values
    only when the CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL flag is set.

    Each value in the luma_8x8_block and luma_4x4_block arrays contains one of the
    following constants:

    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_INTEL            
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_INTEL        
    CL_ME_LUMA_PREDICTOR_MODE_DC_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_DIAGONAL_DOWN_LEFT_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_DIAGONAL_DOWN_RIGHT_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_RIGHT_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_DOWN_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_LEFT_INTEL
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_UP_INTEL

    The value of luma_16x16_block contains one of the following constants:

    CL_ME_LUMA_PREDICTOR_MODE_VERTICAL_INTEL             
    CL_ME_LUMA_PREDICTOR_MODE_HORIZONTAL_INTEL            
    CL_ME_LUMA_PREDICTOR_MODE_DC_INTEL        
    CL_ME_LUMA_PREDICTOR_MODE_PLANE_INTEL    

    The chroma_8x8_block field contains a valid value only if the
    CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL flag is set. If enabled, chroma_8x8_block
    contains one of the following constants:

    CL_ME_CHROMA_PREDICTOR_MODE_VERTICAL_INTEL             
    CL_ME_CHROMA_PREDICTOR_MODE_HORIZONTAL_INTEL            
    CL_ME_CHROMA_PREDICTOR_MODE_DC_INTEL        
    CL_ME_CHROMA_PREDICTOR_MODE_PLANE_INTEL    

    This argument can be NULL.
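
    Reading a mode back on the host can be sketched with byte offsets into the
    22-byte record. The accessor luma_4x4_mode is a hypothetical helper; the struct
    mirrors the layout given above, and because every member is a char the record is
    expected to occupy exactly 22 bytes with no padding.

```c
#include <stddef.h>

/* Mirror of the per-MB record described above. */
struct search_predictor_modes
{
    char luma_16x16_block;
    char luma_8x8_block[4];
    char luma_4x4_block[16];
    char chroma_8x8_block;
};

#define MODES_BYTES_PER_MB 22  /* 1 + 4 + 16 + 1 */

/* Predictor mode of 4x4 luma block `block` (0..15) in macroblock `mb`,
   read from the raw output buffer. */
char luma_4x4_mode(const char *buffer, size_t mb, size_t block)
{
    return buffer[mb * MODES_BYTES_PER_MB
                  + offsetof(struct search_predictor_modes, luma_4x4_block)
                  + block];
}
```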

    search_residuals defines an output buffer containing vectors of unsigned short SAD-
    adjusted values corresponding to the best search motion vectors populated in the 
    search_motion_vector_buffer array. The array is divided into one vector per MB in 
    contiguous row-major block ordering. Each vector contains 1, 4, or 16 components
    depending on the value of mb_block_type specified during the creation of the 
    accelerator object: 

    Value of mb_block_type     Vector Size    
    -------------------------  -------------
    CL_ME_MB_TYPE_16x16_INTEL    1
    CL_ME_MB_TYPE_8x8_INTEL      4
    CL_ME_MB_TYPE_4x4_INTEL     16

    This argument can be NULL.

    skip_residuals defines an output buffer containing vectors of unsigned short SAD-
    adjusted values corresponding to the skip-check MVs defined by 
    skip_motion_vector_buffer. The array is partitioned into clusters of 8 sets of 
    residual values per MB, in contiguous row-major ordering. The value of skip_block_type
    determines the number of values in each of the 8 entries: 

    Value of skip_block_type    Number of residuals in each entry
    ------------------------    ---------------------------------
    CL_ME_MB_TYPE_16x16_INTEL   1 residual per entry
    CL_ME_MB_TYPE_8x8_INTEL     4 residuals per entry

    The buffer layout assumes the maximum size of 8 residual entries per MB; however, the
    number of valid residual entries corresponds to the skip-check MV count specified in
    count_motion_vector_buffer for each MB. This argument can be NULL.

    intra_search_residuals (declared as intra_residuals in the kernel signature above)
    defines an output buffer of unsigned short SAD-adjusted values that correspond to the
    residual values used during intra-prediction. The buffer contains 4 values per MB in
    contiguous row-major ordering using the following layout:

    struct intra_search_residuals
    {
        ushort luma_16x16_block_residual;
        ushort luma_8x8_block_residual;
        ushort luma_4x4_block_residual;
        ushort chroma_8x8_block_residual;
    };

    The chroma_8x8_block_residual value is only valid if the
    CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL flag is set. This argument can be NULL.

    The second built-in kernel extends the functionality provided by the first
    kernel by additionally supporting bidirectional skip checks and minor additional
    control over the estimation algorithm. The additional control includes the
    ability to perform skip checks on a different set of source and reference frames
    from those used for motion search, and the option to specify a pair of scalar counts
    for input search predictors and skip motion vectors globally for all macroblocks in
    the frame instead of having to specify such counts on a per-macroblock basis. This
    kernel is only supported for devices that report a motion estimation device version
    number of CL_ME_VERSION_ADVANCED_VER_2_INTEL.

    The second kernel

    block_advanced_motion_estimate_bidirectional_check_intel(
        accelerator_intel_t    accelerator,
        __read_only image2d_t  src_search_image,
        __read_only image2d_t  ref_search_image,
        __read_only image2d_t  src_check_image,
        __read_only image2d_t  ref0_check_image,
        __read_only image2d_t  ref1_check_image,
        uint                   flags,
        uint                   search_cost_penalty,
        uint                   search_cost_precision,
        short2                 count_global,
        uchar                  bidir_weight,
        __global short2       *count_motion_vector_buffer,
        __global short2       *prediction_motion_vector_buffer,
        __global char         *skip_input_mode_buffer,
        __global short2       *skip_motion_vector_buffer,
        __global short2       *search_motion_vector_buffer,
        __global char         *intra_search_predictor_modes,
        __global ushort       *search_residuals,
        __global ushort       *skip_residuals,
        __global ushort       *intra_residuals );
    
    defines a kernel that provides various block-based motion estimation computations.
    There are three basic use cases for this kernel:
    
    1.) Perform unidirectional inter-prediction motion estimation on the source and
    reference images to obtain the best search motion vectors and their associated 
    distortion values. 
    
    2.) Perform unidirectional or bidirectional skip-checks on the source and 
    reference images by providing a set of motion vectors, then obtain the 
    corresponding distortion values.
 
    3.) Perform intra-prediction computations to obtain the best-search prediction
    modes between adjacent macroblocks and associated residual values. 

    This kernel can be set up to do some or all of these operations in a single 
    enqueue.         
    
    The kernel can perform unidirectional or bidirectional skip-checks to produce
    distortion values based on the input skip-check motion vectors specified for
    each sub-block in the macroblock. Skip-checks for each motion vector for a
    sub-block may be independently configured for either unidirectional or
    bidirectional skip-checks by means of the skip_input_mode_buffer argument.
    A sub-block configured for bidirectional skip-check will have two component
    input motion vectors: one specifying a rectangular region in the forward
    reference frame and the other in the backward reference frame. A
    bidir_weight argument is additionally specified for sub-blocks configured
    for bidirectional skip checks. The effective reference region is a
    weighted blend of the forward and backward reference regions as specified by
    the bidirectional skip motion vector and bidir_weight arguments. If a motion
    vector is configured for bidirectional skip check, then the ref0_check_image
    argument is taken as the forward reference image and ref1_check_image is
    taken as the backward reference image; otherwise only ref0_check_image is
    taken as the single reference image. Skip-checks may be configured with either
    8x8 or 16x16 sub-block sizes, via the flags argument. The
    skip_motion_vector_buffer is used to configure up to four sets of bidirectional
    skip-check MV pairs per MB. The number of vectors in each set is determined
    by the sub-block size:
    
    Sub-block Size    Bidirectional MV pairs per MB
    --------------    -----------------------------
         8x8                    4
        16x16                   1

    Results are obtained via the skip_residuals argument as SAD-adjusted distortion
    values corresponding to each skip-check MV defined for each macroblock. 
    
    block_advanced_motion_estimate_bidirectional_check_intel arguments:    
    
    accelerator is a valid accelerator object created by clCreateAcceleratorINTEL,
    where the type of the accelerator must be 
    CL_ACCELERATOR_TYPE_MOTION_ESTIMATION_INTEL.
    Refer to the cl_intel_motion_estimation extension for a detailed description 
    of configuring accelerator object with the cl_motion_estimation_desc_intel 
    structure.
    
    src_search_image is the input source image for motion search operations, 
    typically representing 8-bit luminance information. Currently, the 
    image_channel_order and the image_data_type of src_search_image are 
    restricted as follows:
        
    Channel Order     Src Channel Data Type
    -------------     ---------------------
    CL_R              CL_UNORM_INT8
    
    Additional formats will be supported by future extensions. The host program
    is responsible for populating the tiled image using the clEnqueueWriteImage 
    function or other appropriate API function.
    
    ref_search_image is the input reference image for motion search operations, 
    representing 8-bit luminance information. The image_channel_order and the 
    image_data_type must match src_search_image. The host program is responsible 
    for populating the tiled image using the clEnqueueWriteImage function or other
    appropriate API function.
    
    src_check_image is the input source image for skip check operations,
    typically representing 8-bit luminance information. It has the same 
    restrictions as src_search_image. The host program is responsible for 
    populating the tiled image using the clEnqueueWriteImage function or other
    appropriate API function.
    
    ref0_check_image is the input forward reference image for unidirectional 
    and bidirectional skip check operations, representing 8-bit luminance
    information. The image_channel_order and the image_data_type must match 
    src_check_image. The host program is responsible for populating the tiled
    image using the clEnqueueWriteImage function or other
    appropriate API function.
    
    ref1_check_image is the input backward reference image for bidirectional
    skip check operations, representing 8-bit luminance information. The 
    image_channel_order and the image_data_type must match src_check_image. 
    If bidirectional skip checks are not used, then this must be set to
    the same image as ref0_check_image. The host program is responsible for
    populating the tiled image using the clEnqueueWriteImage function or other
    appropriate API function.
    
    flags defines any optional modes or behaviors used in computing motion 
    estimation, skip check and/or intra-prediction algorithms. 
    The currently supported flags are: 
    
    Type                                   Description
    ----                                   -----------
    CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL      Specifies a 16x16
                                           skip check sub-block
                                           type.
      
    CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL        Specifies an 8x8
                                           skip check sub-block
                                           type.
    
    CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL Enables luma-based
                                           intra-prediction.
    
    The following additional token is reserved for future support:  
    
    Type                                      Description
    ----                                      -----------
    CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL  Enables chroma-based intra-
                                              prediction.
    
    The CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL flag cannot be set along with 
    CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL as skip checks can either be configured
    for a 16x16 sub-block size or an 8x8 sub-block size. The behavior is
    undefined if both flags are set.
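
    The flag rules above can be sketched in host code as follows. This is
    an illustrative sketch only, not part of the extension: the helper name
    make_me_flags is hypothetical, the flags argument is assumed to be an
    unsigned integer bit field, and the token values are copied from the
    New Tokens section for the case where the extension header is not
    available.

```c
#include <assert.h>

/* Token values as listed under "New Tokens"; normally these come from
   the extension header rather than being redefined here. */
#ifndef CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL
#define CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL 0x1
#define CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL   0x2
#define CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL        0x0
#define CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL          0x4
#endif

/* Hypothetical helper: builds the flags value from two choices.
   Exactly one skip block type is selected, so the two skip block
   type flags are never combined. */
static unsigned int make_me_flags(int use_8x8_skip, int enable_luma_intra)
{
    unsigned int flags = use_8x8_skip ? CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL
                                      : CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL;
    if (enable_luma_intra)
        flags |= CL_ME_LUMA_INTRA_PREDICT_ENABLED_INTEL;
    /* The chroma token is reserved for future support; never set it. */
    assert((flags & CL_ME_CHROMA_INTRA_PREDICT_ENABLED_INTEL) == 0);
    return flags;
}
```

    For example, make_me_flags(1, 1) selects 8x8 skip checks with luma
    intra-prediction enabled.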
    
    search_cost_penalty defines the cost function scheme used in computing 
    cost penalties. 
    Refer to the description of argument search_cost_penalty in the first 
    built-in kernel block_advanced_motion_estimate_check_intel for details.
    
    search_cost_precision defines the pixel precision of the cost penalty 
    calculations. 
    Refer to the description of argument search_cost_precision in the first 
    built-in kernel block_advanced_motion_estimate_check_intel for details.
    
    count_global can be used to specify the scalar counts of predictor 
    motion vectors and skip-check motion vectors globally for all 
    macroblocks. This can be used in lieu of setting the 
    count_motion_vector_buffer if the numbers of predictor motion vectors
    and skip-check motion vectors are uniform for all macroblocks. 
    It is specified as a pair of short integers. The first value in the 
    pair is the count of predictor MVs and the second value is the count
    of skip-check MVs. If either value is -1, then the corresponding count
    is taken from the per-macroblock entry for each macroblock as 
    specified in count_motion_vector_buffer.
    
    count_motion_vector_buffer defines the number of predictor motion 
    vectors and skip-check motion vectors defined for each macroblock. 
    The buffer contains an array of short integer pairs, one pair per 
    MB. The indices of the array correspond to the contiguous row-major
    block layout of the input frame. The first value in each pair 
    defines the number of predictor motion vectors for a given MB; this
    value defines the range of valid entries for the MB contained within
    the predictor_motion_vector_buffer array. This value is used only if
    the predictor count in count_global is -1. The second value in each 
    pair defines the number of skip-check motion vectors for the MB; 
    this value defines the range of valid entries in the 
    skip_motion_vector_buffer array. This value is used only if the 
    skip-check MV count in count_global is -1. All size values must be
    between 0 and 4 inclusive; size values greater than 4 result in 
    undefined behavior. If neither value in the count_global pair is -1,
    then this argument can be NULL.
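
    The interaction between count_global and count_motion_vector_buffer
    can be sketched as below. This is a host-side illustration under
    stated assumptions, not the kernel's implementation: the mv_counts
    type and the resolve_counts helper are hypothetical, and the 0 to 4
    range is assumed to apply to values given in count_global as well.

```c
#include <assert.h>
#include <stddef.h>

/* One pair of counts: predictor MVs first, skip-check MVs second. */
typedef struct { short predictor; short skip; } mv_counts;

/* Hypothetical helper: resolves the counts that apply to one macroblock.
   A count_global value of -1 means "use the per-macroblock entry".
   mb_index is the row-major macroblock index,
   i.e. mb_y * macroblocks_per_row + mb_x. */
static mv_counts resolve_counts(mv_counts count_global,
                                const mv_counts *count_buffer,
                                size_t mb_index)
{
    mv_counts out = count_global;
    if (count_global.predictor == -1) {
        assert(count_buffer != NULL);   /* buffer is needed in this case */
        out.predictor = count_buffer[mb_index].predictor;
    }
    if (count_global.skip == -1) {
        assert(count_buffer != NULL);
        out.skip = count_buffer[mb_index].skip;
    }
    /* Counts outside 0..4 are rejected here rather than passed on. */
    assert(out.predictor >= 0 && out.predictor <= 4);
    assert(out.skip >= 0 && out.skip <= 4);
    return out;
}
```

    With count_global set to { -1, 2 }, the predictor count is read per
    macroblock while every macroblock uses two skip-check MVs.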
    
    bidir_weight defines the implicit bidirectional weight to be used
    when performing bidirectional skip checks. This is used to obtain 
    the weighted reference pixels from the forward and backward blocks. 
    
    There are 5 possible weights: 
    
    Weight                                    Description
    ------                                    -----------
    CL_ME_BIDIR_WEIGHT_QUARTER_INTEL          one-quarter distance from
                                              forward and three-quarters
                                              from backward

    CL_ME_BIDIR_WEIGHT_THIRD_INTEL            one-third distance from
                                              forward and two-thirds
                                              from backward
    
    CL_ME_BIDIR_WEIGHT_HALF_INTEL             half distance from 
                                              forward and backward
    
    CL_ME_BIDIR_WEIGHT_TWO_THIRD_INTEL        two-thirds distance from
                                              forward and one-third
                                              from backward
    
    CL_ME_BIDIR_WEIGHT_THREE_QUARTER_INTEL    three-quarters distance from
                                              forward and one-quarter
                                              from backward
    
    predictor_motion_vector_buffer defines an input array of signed 
    short integer predictor MVs with quarter-pixel resolution. The array
    is partitioned into clusters of 4 motion vectors per MB in 
    contiguous row-major ordering. The buffer layout assumes the maximum
    size of 4 predictor MVs per MB even if the count_motion_vector_buffer 
    array specifies a smaller predictor count. If the value of the 
    search_cost_penalty argument does not equal 
    CL_ME_COST_PENALTY_NONE_INTEL, the first predictor MV for each MB is
    used as the cost center for cost penalty calculations. If the array
    passed to the count_motion_vector_buffer argument specifies a predictor 
    size of zero for all macroblocks, this argument can be NULL.
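
    The fixed four-slot layout described above can be sketched as an index
    computation. This is an illustrative sketch: the mv_qpel type and both
    helper names are hypothetical, and each MV is assumed to be stored as
    an (x, y) pair of signed shorts in quarter-pixel units.

```c
#include <assert.h>
#include <stddef.h>

#define PREDICTORS_PER_MB 4  /* fixed cluster size, regardless of count */

/* Assumed MV representation: x and y in quarter-pixel units. */
typedef struct { short x; short y; } mv_qpel;

/* Index (in MVs) of predictor k for the macroblock at (mb_x, mb_y),
   with macroblocks in row-major order and 4 slots per macroblock. */
static size_t predictor_index(size_t mb_x, size_t mb_y,
                              size_t mbs_per_row, size_t k)
{
    assert(mb_x < mbs_per_row);
    assert(k < PREDICTORS_PER_MB);
    return (mb_y * mbs_per_row + mb_x) * PREDICTORS_PER_MB + k;
}

/* Converts a whole-pixel displacement to quarter-pixel units. */
static mv_qpel mv_from_pixels(short px_x, short px_y)
{
    mv_qpel mv = { (short)(px_x * 4), (short)(px_y * 4) };
    return mv;
}
```

    Slot 0 of each cluster holds the predictor that serves as the cost
    center when a cost penalty is in use, so a host program would
    typically write its preferred predictor at predictor_index(x, y, n, 0).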
    
    skip_input_mode_buffer defines an input array of unsigned char 
    integers defining the skip modes for each macroblock. The indices of
    the array correspond to the contiguous row-major block layout
    of the input frame. There is one unsigned char integer per macroblock.
    The following 2-bit skip mode enumeration values are defined for
    each macroblock sub-block:
    
    Skip Mode Enumeration                 Description
    ---------------------                 -----------
    CL_ME_FORWARD_INPUT_MODE_INTEL        Unidirectional skip MV 
                                          from forward frame

    CL_ME_BACKWARD_INPUT_MODE_INTEL       Unidirectional skip MV 
                                          from backward frame

    CL_ME_BIRECTIONAL_INPUT_MODE_INTEL    Bidirectional skip MV
                                          from forward and 
                                          backward frames
    
    The format of each macroblock unsigned char entry depends on the
    skip block type specified in the flags parameter.
    
    Skip Block Type                       Format
    ---------------                       ------
    CL_ME_SKIP_BLOCK_TYPE_16x16_INTEL     One 2-bit skip mode 
                                          enumeration value for
                                          one sub-block

    CL_ME_SKIP_BLOCK_TYPE_8x8_INTEL       Four 2-bit skip mode
                                          enumeration values (one
                                          for each sub-block 
                                          component MV) packed
                                          into successive 2-bit
                                          fields of the char
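
    The 8x8 packing described above can be sketched as follows. This is a
    sketch under an assumption that the text does not state: sub-block 0
    is taken to occupy the two least significant bits, with later
    sub-blocks in successively higher bit pairs. The helper names are
    hypothetical, and the mode arguments are the 2-bit enumeration values
    from the extension header.

```c
#include <assert.h>

/* Packs four 2-bit skip mode values into one byte for the 8x8 skip
   block type. Assumption: sub-block 0 uses the lowest two bits. */
static unsigned char pack_skip_modes(unsigned m0, unsigned m1,
                                     unsigned m2, unsigned m3)
{
    assert(m0 < 4 && m1 < 4 && m2 < 4 && m3 < 4);
    return (unsigned char)(m0 | (m1 << 2) | (m2 << 4) | (m3 << 6));
}

/* Extracts the 2-bit skip mode of sub-block i (0..3) from a packed byte. */
static unsigned unpack_skip_mode(unsigned char packed, unsigned i)
{
    assert(i < 4);
    return (packed >> (2 * i)) & 0x3u;
}
```

    For the 16x16 skip block type only a single 2-bit value is stored, so
    no packing step is needed.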
    
    skip_motion_vector_buffer defines an input array of pairs of signed
    short integer skip-check MVs: one component MV for the forward 
    reference image and the other component MV for the backward reference
    image. The buffer layout assumes the MVs for both the forward and 
    backward reference images are specified. If a unidirectional forward 
    or backward check is specified for a skip-check sub-block, then the
    corresponding backward or forward component of the MV pair will 
    be ignored. The array is partitioned into clusters of 4 sets of 
    bidirectional motion vector pairs per MB, in contiguous row-major 
    ordering. The value of skip_block_type determines the number of pairs
    of MVs for each of the 4 entries: 
    
    Value of skip_block_type         Number of MV pairs in each entry
    ------------------------         --------------------------------
    CL_ME_MB_TYPE_16x16_INTEL        1 MV forward/backward pair per entry
    CL_ME_MB_TYPE_8x8_INTEL          4 MV forward/backward pairs per MB
    
    The buffer layout assumes the maximum size of 4 MV pair entries per MB, 
    even if the count_motion_vector_buffer array specifies a smaller skip-
    check count. If the array passed to count_motion_vector_buffer specifies
    a skip-check size of zero for all macroblocks, no skip check 
    computation is performed and this argument can be NULL. 
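
    As a rough guide to sizing this buffer, the sketch below computes the
    number of MV pairs and bytes for a given macroblock count. It rests on
    assumptions that should be checked against the implementation: each
    MV is taken to be an (x, y) pair of signed shorts, and the 8x8 case
    is read as four forward/backward pairs in each of the 4 entries. The
    type and helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKIP_ENTRIES_PER_MB 4  /* fixed cluster size, regardless of count */

/* Assumed layout of one skip-check MV pair: forward (x, y) followed by
   backward (x, y), all signed shorts. */
typedef struct { short fwd_x, fwd_y, bwd_x, bwd_y; } skip_mv_pair;

/* Number of MV pairs in the buffer for mb_count macroblocks.
   pairs_per_entry is 1 for the 16x16 type and 4 for the 8x8 type. */
static size_t skip_mv_pair_count(size_t mb_count, size_t pairs_per_entry)
{
    assert(pairs_per_entry == 1 || pairs_per_entry == 4);
    return mb_count * SKIP_ENTRIES_PER_MB * pairs_per_entry;
}

/* Buffer size in bytes under the assumed pair layout. */
static size_t skip_mv_buffer_bytes(size_t mb_count, size_t pairs_per_entry)
{
    return skip_mv_pair_count(mb_count, pairs_per_entry)
           * sizeof(skip_mv_pair);
}
```

    The full four-entry cluster is always allocated per macroblock, even
    when the skip-check count for that macroblock is smaller.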
        
    search_motion_vector_buffer defines an output array of signed short 
    integer pairs defining the best search motion vectors per macroblock.
    Refer to the description of argument search_motion_vector_buffer in 
    the first built-in kernel block_advanced_motion_estimate_check_intel
    for details.
 
    
    intra_search_prediction_modes_buffer specifies an output buffer 
    containing a sequence of signed chars describing the predictor modes 
    used during motion estimation.
    Refer to the description of argument intra_search_prediction_modes in 
    the first built-in kernel block_advanced_motion_estimate_check_intel
    for details.

    
    search_residuals defines an output buffer containing vectors of 
    unsigned short SAD-adjusted values corresponding to the best 
    search motion vectors populated in the search_motion_vector_buffer 
    array.
    Refer to the description of argument search_residuals in the first 
    built-in kernel block_advanced_motion_estimate_check_intel for 
    details.
    
    skip_residuals defines an output buffer containing vectors of 
    unsigned short SAD-adjusted values corresponding to the skip-check
    MVs defined by skip_motion_vector_buffer. The array is partitioned
    into clusters of 4 sets of residual values per MB, in contiguous
    row-major ordering. The value of skip_block_type determines
    the number of values in each of the 4 entries: 
    
    Value of skip_block_type         Number of residuals in each entry
    ------------------------         ---------------------------------
    CL_ME_MB_TYPE_16x16_INTEL        1 residual per entry
    CL_ME_MB_TYPE_8x8_INTEL          4 residuals per entry
    
    The buffer layout assumes the maximum size of 4 residual values 
    per MB; however, the number of valid residual entries corresponds
    to the skip-check MV count specified in count_motion_vector_buffer 
    for each MB. This argument can be NULL.
    
    intra_search_residuals defines an output buffer of unsigned short 
    SAD-adjusted vectors that correspond to the residual values used 
    during intra-prediction.
    Refer to the description of argument intra_search_residuals in the
    first built-in kernel block_advanced_motion_estimate_check_intel
    for details.

    
    This kernel is queued for execution using clEnqueueNDRangeKernel(). Several arguments 
    passed to this function are specific to this kernel:

    For both built-in kernels:

    work_dim must be 2.

    global_work_size represents the width and height of the area of interest to be
    processed. 

    global_work_offset specifies the top-left point of the area of interest. 

    local_work_size must be NULL.

    The count and layout of macroblocks processed in the frame are based on the values 
    passed to the global_work_size and global_work_offset arguments and not on the 
    dimensions of the input and reference images. The client must ensure that the data 
    layout of all arrays passed as arguments defines the correct number of macroblocks. 
    The width and height of the area of interest specified by global_work_size
    and global_work_offset must be less than or equal to the width and height of the 
    source image. 
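
    A host program may want to validate the area of interest before
    enqueueing the kernel. The sketch below is one possible pre-check, not
    a required step: the helper name area_fits_image is hypothetical, and
    it assumes the constraint means that offset plus size must not exceed
    the image extent in each dimension, with index 0 as the horizontal
    dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when the area of interest lies inside an image of
   img_width x img_height, 0 otherwise. Index 0 is treated as the
   horizontal dimension and index 1 as the vertical one. */
static int area_fits_image(const size_t global_work_offset[2],
                           const size_t global_work_size[2],
                           size_t img_width, size_t img_height)
{
    assert(global_work_offset != NULL && global_work_size != NULL);
    if (global_work_offset[0] > img_width ||
        global_work_offset[1] > img_height)
        return 0;
    /* Compare against the remaining extent to avoid size_t overflow. */
    return global_work_size[0] <= img_width - global_work_offset[0] &&
           global_work_size[1] <= img_height - global_work_offset[1];
}
```

    The same two arrays can then be passed as global_work_offset and
    global_work_size to clEnqueueNDRangeKernel, with work_dim set to 2 and
    local_work_size set to NULL.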

    The clEnqueueNDRangeKernel function returns the usual error codes, augmented with the
    following specific error codes for this kernel:
    
    - CL_INVALID_WORK_DIMENSION if work_dim is not 2. This built-in kernel requires a 2D 
      ND-range.

    - CL_INVALID_WORK_GROUP_SIZE if local_work_size is not NULL. 
    
    - CL_INVALID_WORK_GROUP_SIZE if the respective values of global_work_size[0] and 
      global_work_size[1] exceed the width and/or height of input images.

    - CL_INVALID_IMAGE_FORMAT_DESCRIPTOR if an image object passed as an argument does not
      have a supported format, as listed above.    

    - CL_INVALID_IMAGE_FORMAT if the image objects passed to src_image and ref_image
      arguments do not contain matching formats and sizes. 

    - CL_INVALID_GLOBAL_OFFSET if the respective values of global_work_offset[0] and 
      global_work_offset[1] exceed the width and/or height of input images.

    - CL_INVALID_KERNEL_ARGS if predictor_motion_vector_buffer is NULL and one or more 
      predictor MV sizes passed to count_motion_vector_buffer are greater than 0.   

    - CL_INVALID_KERNEL_ARGS if skip_motion_vector_buffer is NULL and one or more
      skip-check MV sizes passed to count_motion_vector_buffer are greater than 0.

    - CL_INVALID_BUFFER_SIZE if any of the cl_mem objects passed as arguments has a size
      less than the expected size. 

Interactions with Other Extensions

    The advanced motion estimation extension is based on the cl_intel_accelerator and 
    cl_intel_motion_estimation extensions, and is defined in terms of additions to the
    base accelerator and motion estimation extension documents.