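1. NVIDIA provides a C++ template library, Thrust, to simplify GPU programming. It ships with the CUDA Toolkit, so no separate installation is needed. The program below compares std::sort on the CPU with thrust::sort on the GPU.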

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

#include <windows.h>

// CPU baseline: std::sort over an iterator range
template <class T>
void cpu_sort(T begin, T end)
{
    std::sort(begin, end);
}

// GPU version: the measured time includes both host<->device transfers
void gpu_sort(thrust::host_vector<int> &h_vec)
{
  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device (846M keys per second on GeForce GTX 480)
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
}

// run statement x and print its elapsed time in milliseconds (Windows only)
#define CHK_TIME(x)    {DWORD t1=GetTickCount();x;DWORD t2=GetTickCount();printf(#x ": %lu\n", (unsigned long)(t2-t1));}

int main(void)
{
  // generate 32M random numbers serially
  thrust::host_vector<int> h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // sort one copy on the CPU
  thrust::host_vector<int> h_vec_1(h_vec);
  CHK_TIME(cpu_sort(h_vec_1.begin(), h_vec_1.end()));

  // sort another copy of the same data on the GPU
  thrust::host_vector<int> h_vec_2(h_vec);
  CHK_TIME(gpu_sort(h_vec_2));

  return 0;
}
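
Measured times, in milliseconds (GetTickCount ticks):
(debug version)
cpu_sort(h_vec_1.begin(), h_vec_1.end()): 94609
gpu_sort(h_vec_2): 3312
(release version)
cpu_sort(h_vec_1.begin(), h_vec_1.end()): 2828
gpu_sort(h_vec_2): 594
In the release build the GPU path, including both memory transfers, is about 4.8x faster than std::sort (2828 ms vs 594 ms).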
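
The CHK_TIME macro above depends on GetTickCount from windows.h. As a minimal sketch of a portable alternative, assuming C++11 or later, the same measurement can be taken with std::chrono. The names time_ms and time_cpu_sort are hypothetical helpers introduced here for illustration; they are not part of the original program or of Thrust.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not in the original listing): calls `work` once and
// returns the elapsed wall-clock time in milliseconds.
template <class F>
long long time_ms(F&& work)
{
    const auto t1 = std::chrono::steady_clock::now();
    work();
    const auto t2 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
}

// Hypothetical helper: sorts `v` in place on the CPU and returns the
// elapsed time in milliseconds.
long long time_cpu_sort(std::vector<int>& v)
{
    const long long ms = time_ms([&] { std::sort(v.begin(), v.end()); });
    assert(std::is_sorted(v.begin(), v.end()));  // sanity check on the result
    return ms;
}
```

The same time_ms wrapper could be given a lambda that calls gpu_sort. steady_clock is used instead of system_clock because it is monotonic, so the measured interval cannot be distorted by clock adjustments.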
2. The sort algorithm used by CUDA (Thrust) is a radix sort:
Many GPU sorting implementations are variants of the bitonic sort, which is pretty well known and described in most reasonable texts on algorithms published in the last 25 or 30 years.

The "reference" sorting implementation for CUDA done by Nadathur Satish from Berkeley and Mark Harris and Michael Garland from NVIDIA (paper here) is a radix sort, and forms the basis of what is in NPP and Thrust.
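
To make the idea concrete, here is a sequential CPU sketch of a least-significant-digit radix sort over 32-bit unsigned keys, using 8-bit digits. It only illustrates the histogram, prefix-sum, and scatter steps that a radix sort is built from. It is not the GPU implementation from the paper mentioned above, nor the code inside Thrust, and the function name radix_sort_u32 and the digit width are assumptions chosen for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sequential LSD radix sort (hypothetical helper, not Thrust's code).
// Processes the key 8 bits at a time, from the lowest digit to the highest.
void radix_sort_u32(std::vector<std::uint32_t>& keys)
{
    constexpr int kBits = 8;                       // digit width (an assumption)
    constexpr std::size_t kBuckets = 1u << kBits;  // 256 buckets per pass
    std::vector<std::uint32_t> tmp(keys.size());

    for (int shift = 0; shift < 32; shift += kBits) {
        // 1. Histogram: count how many keys fall into each bucket.
        std::array<std::size_t, kBuckets> count{};
        for (std::uint32_t k : keys)
            ++count[(k >> shift) & (kBuckets - 1)];

        // 2. Exclusive prefix sum: turn counts into starting offsets.
        std::size_t sum = 0;
        for (std::size_t b = 0; b < kBuckets; ++b) {
            const std::size_t c = count[b];
            count[b] = sum;
            sum += c;
        }
        assert(sum == keys.size());

        // 3. Stable scatter: keys with equal digits keep their relative order,
        //    which is what makes the digit-by-digit passes compose correctly.
        for (std::uint32_t k : keys)
            tmp[count[(k >> shift) & (kBuckets - 1)]++] = k;

        keys.swap(tmp);
    }
}
```

The sketch handles unsigned keys only. Signed integers such as the int keys in the benchmark above would sort correctly after flipping the sign bit of each key before the first pass and flipping it back afterwards.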
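3. NPP (NVIDIA Performance Primitives) is NVIDIA's signal processing function library, similar to Intel IPP. It contains many basic processing algorithms: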

    Eliminates unnecessary copying of data to/from CPU memory
        Process data that is already in GPU memory
        Leave results in GPU memory so they are ready for subsequent processing
    Data Exchange and Initialization
        Set, Convert, Copy, CopyConstBorder, Transpose, SwapChannels
    Arithmetic and Logical Operations
        Add, Sub, Mul, Div, AbsDiff, Threshold, Compare
    Color Conversion
        RGBToYCbCr, YCbCrToRGB, YCbCrToYCbCr, ColorTwist, LUT_Linear
    Filter Functions
        FilterBox, Filter, FilterRow, FilterColumn, FilterMax, FilterMin, Dilate, Erode, SumWindowColumn, SumWindowRow
        DCTQuantInv, DCTQuantFwd, QuantizationTableJPEG
    Geometry Transforms
        Mirror, WarpAffine, WarpAffineBack, WarpAffineQuad, WarpPerspective, WarpPerspectiveBack, WarpPerspectiveQuad, Resize
    Statistics Functions
        Mean_StdDev, NormDiff, Sum, MinMax, HistogramEven, RectStdDev
4. In addition, there are other libraries such as NVIDIA cuFFT, NVIDIA cuBLAS (advertised as 6x to 17x faster than the latest MKL BLAS), EM Photonics CULA Tools (a linear algebra library), NVIDIA cuSPARSE, and the NVIDIA CUDA Math Library.

