Saturday, February 18, 2012

Performance of GPU vs CPU matrix palette skinning (WAS: Tutorial #2: Part 1: Animated character walking around. Walk cycle and animated clouds)

For a while I have planned to do another GLES 2.0 demo for AnimKit with multiple characters in the scene. I found some time yesterday, and this is the start (video below). I will probably cover this in multiple posts, posting the results here as I find time to make progress.




Note that the scene is just a work in progress. The clouds in the background are big panes that cost some performance, but I think it is a good illustration of what can be done in a few hours, and I'll use it as input for later performance optimization work on "AnimKit on OpenGL ES 2.0". I'll post code and more text soon; my one-year-old is about to wake up.


The first work in progress version of code is here: commit 3e7cfe2b27 WIP: Tutorial #2: Part 1: Animated character walking around.


The code could be used as an illustration and for benchmarking. I guess that for a simple scene with fewer than 4000 matrix-skinned vertices it would also be usable, but for this example it is not OK: I downloaded the SDK for iOS 5.0, tried it on an iPhone 3GS, and got only 7 frames per second (~8000 skinned vertices).
I tried a more complex model with 5 animated instances and got 1-2 frames per second (40000 skinned vertices). The model from the post "Skeleton and pose character animation using AnimKit" runs at 14 frames per second.


As assumed in previous posts, CPU matrix skinning (akGeometryDeformer::LBSkinning()), which iterates over all vertices and normals, is the bottleneck: most of the time is spent there.

On the other hand, I tried the PowerVR example of GPU matrix skinning, the POWERVR Insider MBX Chameleon Man Demo, and it runs quite smoothly... but it has only ~1000 skinned vertices in the mesh.


I guess poly-reducing the mesh and moving the calculation to another thread would help, though not significantly (based on the results above). So I plan to do that (reduce the complexity of the scene) but also to check the Chameleon Man source code. The code is available as part of the PowerVR Insider SDK; you just need to register to get it.


Update on March 2nd, after implementing GPU skinning: I didn't spend time on reducing the scene. Right after implementing GPU GLSL matrix skinning, the results already look promising (on iPhone 3GS):


Scene                                | Skinned triangles | FPS with CPU skinning | FPS with GPU skinning
gles2farm - 1 animated character     | 5620              | 8                     | 59
AppAnimKitGL - 5 animated characters | 33700             | 1                     | 22

Code is here.
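
Frame rates are easier to compare as time per frame. The small sketch below converts the FPS numbers from the table into milliseconds; the function names are my own and are not part of AnimKit. Keep in mind that 59 FPS is close to the 60 Hz display refresh, so the GPU path may be capped by the display rather than by skinning.

```cpp
#include <cassert>

// Frame time in milliseconds for a given frame rate.
double frameMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Milliseconds per frame saved when the frame rate rises
// from fpsBefore to fpsAfter.
double savedMs(double fpsBefore, double fpsAfter) {
    return frameMs(fpsBefore) - frameMs(fpsAfter);
}
```

For gles2farm, frameMs(8) is 125 ms and frameMs(59) is about 16.9 ms, so moving skinning to the GPU saved roughly 108 ms per frame. For the five-character scene the frame time dropped from 1000 ms to about 45.5 ms.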


The scene with five animated characters looks like this:






In the following post, I will try to explain how to implement matrix palette skinning (character skeleton animation using the GPU).



CPU skinning is not the best choice for OpenGL ES 2.0 devices.

CPU skinning is implemented like this: on every repaint, iterate over all vertices, then calculate and apply the bone transformation. In more detail, the code below is an example of CPU skinning from AnimKit's akGeometryDeformer::LBSkinningUniformScale. Apparently, on iPhone these matrix operations affect performance significantly and are better suited to a vertex shader.

const btAlignedObjectArray<akMatrix4>& matrices = *mpalette;
for(unsigned int i=0; i<vtxCount; i++)
{
    akMatrix4 mat(matrices[indices[0]] * weights[0]);
    if (weights[1]) mat += matrices[indices[1]] * weights[1];
    if (weights[2]) mat += matrices[indices[2]] * weights[2];
    if (weights[3]) mat += matrices[indices[3]] * weights[3];
    // position
    *vtxDst = (mat * akVector4(*vtxSrc, 1.f)).getXYZ();
    // normal
    *normDst = (mat * akVector4(*normSrc, 0.0f)).getXYZ();
    *normDst = normalize(*normDst);
    akAdvancePointer(normSrc, normSrcStride);
    akAdvancePointer(normDst, normDstStride);
    akAdvancePointer(weights, weightsStride);
    akAdvancePointer(indices, indicesStride);
    akAdvancePointer(vtxSrc, vtxSrcStride);
    akAdvancePointer(vtxDst, vtxDstStride);
}
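
To make the per-vertex cost concrete, here is a minimal, self-contained sketch of the same linear blend skinning step. Vec3, Mat4 and all function names are hypothetical stand-ins rather than AnimKit's own math types, and a column-major matrix layout is assumed. Reusing the blended matrix for the normal is only correct when bones apply uniform scale, which is what the name LBSkinningUniformScale suggests.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical minimal types, not AnimKit's.
struct Vec3 { float x, y, z; };

// Column-major 4x4 matrix: element (row r, column c) is m[c * 4 + r].
struct Mat4 { std::array<float, 16> m; };

Mat4 identity() {
    Mat4 r{};  // all elements zero
    r.m[0] = r.m[5] = r.m[10] = r.m[15] = 1.0f;
    return r;
}

Mat4 translation(float tx, float ty, float tz) {
    Mat4 r = identity();
    r.m[12] = tx; r.m[13] = ty; r.m[14] = tz;
    return r;
}

// Weighted sum of four palette matrices: 64 multiply-adds per vertex.
Mat4 blend(const Mat4* palette, const unsigned char idx[4], const float w[4]) {
    assert(palette != nullptr);
    Mat4 r{};
    for (int b = 0; b < 4; ++b)
        for (std::size_t k = 0; k < 16; ++k)
            r.m[k] += palette[idx[b]].m[k] * w[b];
    return r;
}

// Transform a position (implicit w = 1, so translation applies).
Vec3 skinPosition(const Mat4& a, const Vec3& p) {
    return { a.m[0] * p.x + a.m[4] * p.y + a.m[8]  * p.z + a.m[12],
             a.m[1] * p.x + a.m[5] * p.y + a.m[9]  * p.z + a.m[13],
             a.m[2] * p.x + a.m[6] * p.y + a.m[10] * p.z + a.m[14] };
}

// Transform a normal (implicit w = 0, so translation is ignored), then normalize.
Vec3 skinNormal(const Mat4& a, const Vec3& n) {
    Vec3 r = { a.m[0] * n.x + a.m[4] * n.y + a.m[8]  * n.z,
               a.m[1] * n.x + a.m[5] * n.y + a.m[9]  * n.z,
               a.m[2] * n.x + a.m[6] * n.y + a.m[10] * n.z };
    float len = std::sqrt(r.x * r.x + r.y * r.y + r.z * r.z);
    return len > 0.0f ? Vec3{ r.x / len, r.y / len, r.z / len } : r;
}
```

As a quick check: a vertex at (1, 2, 3) weighted 50/50 between an identity bone and a bone translated by (2, 0, 0) ends up at (2, 2, 3), halfway between the two bone results. All of this runs once per vertex per frame, which is why the cost grows so quickly with vertex count.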

Anyway, once the bone transformation is applied and the mesh vertices and normals are updated, repaint is called after updating the vertex buffer:

akSubMesh* sub = m_mesh->getSubMesh(i);
UTuint32 nv = sub->getVertexCount();
void *codata = sub->getSecondPosNoDataPtr();
UTuint32 datas = sub->getPosNoDataStride();
glBindBuffer(GL_ARRAY_BUFFER_ARB, m_posnoVertexVboIds[i]);
glBufferData(GL_ARRAY_BUFFER_ARB, nv*datas, NULL, GL_STREAM_DRAW);
glBufferSubData(GL_ARRAY_BUFFER_ARB, 0, nv*datas, codata);

This part would also be fixed by GPU skinning, since vertices only need to be "uploaded" to the vertex buffer once, in init(), instead of on every redraw:

glBufferData(GL_ARRAY_BUFFER_ARB, nv*posnodatas, posnodata, GL_STATIC_DRAW);
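
With the mesh uploaded once, the only per-frame data is the bone matrix palette, and the blending moves into the vertex shader. The following is a rough sketch of that idea, not the shader from the linked code: all uniform and attribute names, the palette size of 24 and the flattenPalette helper are assumptions for illustration. OpenGL ES 2.0 only guarantees 128 vertex uniform vectors and each mat4 uses 4 of them, so the palette size has to stay well below 32.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical palette cap; must match the array size in the shader below.
constexpr std::size_t kMaxBones = 24;

// GLSL ES 1.00 vertex shader doing the same blend as the CPU loop.
// ES 2.0 attributes must be floats, so bone indices arrive as a vec4.
static const char* const kSkinningVertexShader = R"(
uniform mat4 uViewProj;
uniform mat4 uBones[24];
attribute vec3 aPosition;
attribute vec3 aNormal;
attribute vec4 aBoneIndices;
attribute vec4 aBoneWeights;
varying vec3 vNormal;
void main() {
    mat4 skin = uBones[int(aBoneIndices.x)] * aBoneWeights.x
              + uBones[int(aBoneIndices.y)] * aBoneWeights.y
              + uBones[int(aBoneIndices.z)] * aBoneWeights.z
              + uBones[int(aBoneIndices.w)] * aBoneWeights.w;
    vNormal = normalize((skin * vec4(aNormal, 0.0)).xyz);
    gl_Position = uViewProj * skin * vec4(aPosition, 1.0);
}
)";

using Mat4 = std::array<float, 16>;  // column-major, as OpenGL expects

// Pack the palette into one contiguous float array for a single upload.
std::vector<float> flattenPalette(const std::vector<Mat4>& bones) {
    assert(bones.size() <= kMaxBones);
    std::vector<float> flat;
    flat.reserve(bones.size() * 16);
    for (const Mat4& b : bones)
        flat.insert(flat.end(), b.begin(), b.end());
    return flat;
}

// Per frame, with the program in use and bonesLocation obtained from
// glGetUniformLocation(program, "uBones"), the upload would look like:
//   glUniformMatrix4fv(bonesLocation, (GLsizei)bones.size(), GL_FALSE, flat.data());
// (OpenGL ES 2.0 requires the transpose argument to be GL_FALSE.)
```

The per-frame CPU work then shrinks from touching every vertex to uploading a few dozen matrices per character, while the vertex and weight buffers stay GL_STATIC_DRAW.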

2 comments:

  1. Hi,

    I am the author of the AnimKit lib. I am still trying to find time to work on this project but seeing your work is encouraging.

    Seeing this post reminds me that I should do some profiling/optimisation on this. The first thing I would try (you should try it too) would be to remove the "if(weights)" checks. They are not necessary and are only there to early-out some calculations, which should execute faster without conditionals on a modern CPU with instruction pipelining and branch prediction.

    Change:
    akMatrix4 mat(matrices[indices[0]] * weights[0]);
    if (weights[1]) mat += matrices[indices[1]] * weights[1];
    if (weights[2]) mat += matrices[indices[2]] * weights[2];
    if (weights[3]) mat += matrices[indices[3]] * weights[3];

    To:
    akMatrix4 mat(matrices[indices[0]] * weights[0]);
    mat += matrices[indices[1]] * weights[1];
    mat += matrices[indices[2]] * weights[2];
    mat += matrices[indices[3]] * weights[3];

    Xavier

    1. Hello Xavier. Thanks for the comment. The bottleneck here is the matrix and matrix-vector multiplication; it is expensive to do for all the vertices.

      The code below the part you pasted also contributes:
      https://github.com/astojilj/astojilj_animkit/blob/ios/Source/akGeometryDeformer.cpp#L286

      1) akMatrix4 mat(matrices[indices[0]] * weights[0]);

      2) if (weights[1]) mat += matrices[indices[1]] * weights[1];
      if (weights[2]) mat += matrices[indices[2]] * weights[2];
      if (weights[3]) mat += matrices[indices[3]] * weights[3];

      3) // position
      *vtxDst = (mat * akVector4(*vtxSrc, 1.f)).getXYZ();
      // normal
      *normDst = (mat * akVector4(*normSrc, 0.0f)).getXYZ();
      *normDst = normalize(*normDst);


      Measured fps (without GPU skinning):

      1,2,3 commented out: 60 fps
      1 on, 2 and 3 commented out: 33 fps
      1 and 3 on, 2 commented out: 11 fps
      1,2,3 on: 7 fps

      One could try to use http://code.google.com/p/vfpmathlibrary/ or a NEON math library (both used in oolongengine for older versions of the iPhone: https://github.com/astojilj/astoj_oolongengine/blob/master/Math/neonmath/neon_matrix_impl.cpp) to get faster matrix computation, but I doubt it would help in this particular project, as the GPU is apparently idle most of the time. I guess for some other setup CPU-based skinning might be sufficient or even better (if the GPU has more work to do).
