Archive for March, 2010

MocoFaceTracker: Facial tracking via OpenCV

Friday, March 26th, 2010

While I was in Bat-teK last semester someone had mentioned that there was a package called OpenCV that could do facial tracking. So on a whim I downloaded it and looked through the API and example code. In particular there was one example of using the Python binding with facial tracking.

Side Note: Installing OpenCV was a pain in the butt since Ubuntu already came with the 1.0.* package in the repository and I couldn’t for the life of me get OpenCV 2.0’s Python modules working correctly. So I just used the old module interface since OpenCV 2.0 still supports it anyway (for now). I have a thing or ten to say about the OpenCV API but I’ll bite my tongue since children may be reading this blog.

Anyways, MocoFacialTracking is a separate standalone program that connects to the MocoServer to do passive facial tracking and actuator movements. I ripped out the important bits from the facial tracking example and wrote a viewfinder Subscriber that pulls in the live video stream from the camera. Then it applies the OpenCV facial tracking algorithm (which probabilistically returns a bounding square for each of the human faces it finds in the image, using the training set it comes with). I then pick the biggest of these bounding squares, find its midpoint, and compute the vector from the middle of the frame to that midpoint (this determines the speed at which to move to bring the face to the center of the frame). After that I publish new actuator movements to the server, which in turn moves the pan/tilt head to bring the face midpoint to the frame midpoint.

There isn’t a surefire way to know how far to move the servos to center the face (you’d have to figure out the distance from the face to the camera and the field of view at that distance to work out the frame edges and interpolate the pan/tilt absolutely)… so instead I just move the pan/tilt servos a very tiny amount (multiplied by the distance-to-center vector I calculated earlier, for an extra oomph and to slow down when it’s close). But remember that the image is being streamed continuously and the facial tracking is being done on every frame it can get. This means the facial tracking is continuously running and adjusting the servos incrementally. In the long run it comes off as smooth (albeit slow) movement to center the face in the frame.

The biggest issue was the facial tracking bottleneck, which can back up the image stream (and also the server on the other end)… to solve this I made the viewfinder subscriber a thread that rips through the incoming stream as fast as possible, updating the latest image internally, while another thread does the facial tracking and actuator publications. So whenever the algorithm finishes an iteration it’ll have the latest image to process for the next iteration. Quick, somebody implement this in GLSL 😛
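
To make the centering step concrete, here is a minimal Python sketch of the "pick the biggest square, nudge toward the middle" idea. It is not the actual MocoFacialTracking code: the function names, the `(x, y, w, h)` box layout, the `gain` value and the sign convention for pan/tilt are all assumptions made for illustration.

```python
def largest_face(faces):
    """Return the (x, y, w, h) box with the largest area, or None if empty."""
    if not faces:
        return None
    return max(faces, key=lambda box: box[2] * box[3])


def pan_tilt_step(face, frame_size, gain=0.05):
    """Return a small (pan, tilt) nudge that moves the face toward center.

    The offset from the frame midpoint is normalized to [-1, 1] per axis,
    so the step automatically shrinks as the face gets closer to center.
    `gain` is a made-up tuning constant, not a value from the project.
    """
    x, y, w, h = face
    frame_w, frame_h = frame_size
    face_cx = x + w / 2.0
    face_cy = y + h / 2.0
    dx = (face_cx - frame_w / 2.0) / (frame_w / 2.0)
    dy = (face_cy - frame_h / 2.0) / (frame_h / 2.0)
    return gain * dx, gain * dy
```

Calling `pan_tilt_step` once per processed frame gives the incremental behavior described above: each call only moves a little, and a face that is already centered produces a zero step.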

So now you ask, “Great, you’ve got primitive facial tracking which moves the pan/tilt to keep a person’s face in the center of the frame… WHY?” Well I’m glad you asked… just imagine, if you will, a use case where a news reporter is in the field without a camera guy. Another good example would be if you want to track a certain feature in the world over a very long timeframe but you don’t know for certain which direction it will travel in (e.g. a plant growing). The last thing you want is to come back a few days later and find out the plant grew out of the frame before it bloomed, ruining your whole shoot. Even though we’re doing facial tracking, the OpenCV algorithm is really “feature tracking”, so given a good enough training set it should theoretically be able to track other features within a frame and thus realign the camera to follow them.

TODO: I’ll never get to it this semester but if we were to flesh out the TV news reporter scenario I’d love to add some heuristics to the MocoFacialTracker such that it uses some simple composition rules (like rule of thirds, horizon line etc). Also if a second face shows up in the frame maybe it can adjust to doing a two-shot automatically (especially if the faces are pointed at each other as in conversation (e.g. a news reporter interviewing a witness)… I think PittPat might have the facial tracking we’d need for face direction). I’d also love to track the handheld microphone the news reporter uses, as a cue to the algorithm as to where it should be focusing its attention. For example if the news reporter says something into the mic and then shifts it to the witness to speak, the algorithm would know to follow and focus on the witness till the mic is brought back. Just little things like that which a human camera guy would instinctively do and aren’t too computationally heavy or AI-y would be great things to add.

MocoCompositor: First Real Test

Wednesday, March 24th, 2010

So I finally got a hold of a green screen cloth from Krishna (Thanks Krishna!) and draped it over our office door. I set up the camera in front of it, found a cool sci-fi desktop wallpaper to use as my background plate, used the live camera viewfinder as the foreground plate and ran the MocoCompositor with the greenscreen/chroma-key filter GLSL shader. The output from the compositor streams to a new viewfinder called compositor so any browser should be able to view the final composite also.

The computers were set up such that the netbook Dell Mini9 was running the server, background plate stream and the MocoBot (the live camera feed) programs. Meanwhile my desktop (which is a bit more heavy duty with an nVidia GPU card) was running the MocoCompositor, which connected to the server on the Mini9 to pull the camera viewfinder feed as well as the background plate feed (oh yeah, I didn’t mention that I also wrote a quick static image streamer publisher that will later become a video/frame player in the UI). At each pyglet on_draw event the MocoCompositor applies/blits the current images to the appropriate plates/textures and composites them using the shader. The resulting texture is then pulled from the GPU framebuffer, compressed to jpeg (via pyglet via PIL) and streamed out to the server via a publisher.

It works surprisingly well and fast given the amount of network traffic it produces. And these are the results (I print-screen captured these while viewing the compositor viewfinder in a browser on yet another computer :). Note that the background plate is using a sci-fi wallpaper of the Earth I found on Google images… (I claim no copyright to it, but do claim fair-use for educational purposes):

Here’s me smiling cheesily (Note: my left shoulder isn’t abnormally low, I was reaching for the printscreen button on the keyboard below 🙂 )

And here’s me looking into the Universe contemplating existence or looking for Dr. Who.

Obviously the lighting was crappy and uneven, hence the green pixels still present from the foreground plate. I need to come up with a way of passing the shader parameters to the server and to the UI so the user can fine-tune them, rather than hardcoding them into the shader like I do now. Playing with green screening is getting to be more fun than I’d expected… need to get back to work.

Edit: I wonder if the JPEG compression/decompression could be done as a shader… Looks like NVIDIA’s site has an example of the DCT algorithm as a Cg shader (we use GLSL)… but I’m not sure of what the rest of the JPEG compression algorithm/format would need. But I think a future extension to this software could be a JPEG compression shader to be able to get rid of PIL doing the jpeg compression and speed things up even more… just a thought, but outside of the scope of this project obviously.

MocoCompositor: GPU Accelerated Compositing

Wednesday, March 24th, 2010

I’m a little giddy and excited about this new feature. Last semester I took a class at the main campus titled “Technical Animation” where we learned about all sorts of Computer Graphics and Animation techniques/algorithms used in the game and movie industries. It was a pretty cool class that focused on projects. My final project (teamed up with Federico Perazzi and Grace Lin (both from ETC)) was to create a target-driven smoke-simulation accelerated on the GPU. I knew absolutely nothing about smoke simulations or GPUs for that matter. Long story short, we taught ourselves how GPU accelerated computation worked and how to write shaders in GLSL… and eventually wrote a regular smoke simulation in GLSL (we ran out of time for the target-driven part). Turns out it doesn’t matter if you do the software in Python as far as speed is concerned since you’re passing all the heavy computation over to the GPU on the video card to do anyway. So we ended up using pyglet (the OpenGL interface in Python) and a tiny shader class to string together several custom shaders to do our smoke simulation… it worked in real-time pretty well.

Skip to the present: We at Mocotila talked a lot about compositing image plates together because that was one of the biggest uses for motion-controlled cameras: compositing live action or a model with a matte painting or 3D model. But we also realized that compositing is usually done in post-production and takes a lot of time to do, render and see the output. Then if any shots are screwed up in framing/etc. you’d either have to reshoot or try to fudge the effects till they were acceptable.

But here we are with an awesome camera with its viewfinder being streamed to anything capable of reading http-streamed images… and not just the expensive camera either, our bioloid has a little webcam that is also being streamed in the same way and when we get Maya/Blender integration we might have live 3D renderings being streamed as well. Wouldn’t it be grand if the cameraman/director could see a live end-result composite preview so he could direct actors or reframe things appropriately? And what if we could just kinda composite several of these streams into a new viewfinder? This is where the MocoCompositor comes in… it runs on any computer on the network with a good video card capable of shaders. It pulls in (subscribes to) images from multiple image streams coming from the server and it publishes a new composited image stream to the server that anyone else can subscribe to.

The actual compositing is accomplished using the GPU via GLSL (the OpenGL Shading Language). This is where my Technical Animation story comes in… I went back and looked through my GPU smoke simulator and implemented it again taking out all the simulation stuff and adding layers of images to be processed like Photoshop Layers. Your view is from the top of the layer stack looking down (the very bottom being the background image plate). The algorithm runs from the bottom plate up applying an associated shader to each plate and saving the result into an output plate. The output plate is what is packaged as the jpeg image and published out to the server again.
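
The bottom-up pass over the plate stack can be sketched on the CPU in a few lines of Python. This is only an illustration of the control flow, not the MocoCompositor itself: the real thing runs GLSL shaders on textures, whereas here a "shader" is just a hypothetical Python function taking the plate's image and the result so far, and images are plain lists of pixels.

```python
def composite(plates):
    """Run the plate stack from the bottom up and return the output plate.

    `plates` is a list of (image, shader) pairs, bottom plate first.
    Each shader is called as shader(image, below), where `below` is the
    accumulated output of all plates underneath (None for the bottom one).
    """
    output = None
    for image, shader in plates:
        output = shader(image, output)
    return output


def passthrough(image, below):
    """Bottom-plate shader: just copy the image into the output plate."""
    return list(image)


def fill_holes(image, below):
    """Toy layer shader: keep this plate's pixel unless it is None."""
    return [b if p is None else p for p, b in zip(image, below)]
```

For example, `composite([(["a", "b", "c"], passthrough), ([None, "X", None], fill_holes)])` lets the bottom plate show through wherever the upper plate has a hole.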

So far we’ve implemented a green-screen shader that replaces all the green in the foreground plate with the pixels from the background plate… this means live green-screen replacement compositing. We experimented with background subtraction (taking a reference shot of the background and subtracting it from the live shot so we wouldn’t need a green-screen), but it just wasn’t reliable or clean. We’re hoping to add some more effects to this system, all implemented as GLSL shaders… especially some gradient blur filters (if you know what I’m getting at :).
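
For readers who haven't written a chroma-key filter before, here is a CPU reference of the per-pixel decision in Python. The real filter is a GLSL shader whose exact test isn't given in this post, so the "green dominates red and blue by some ratio" rule and the `threshold` value below are assumptions, chosen only because they are a common simple approach.

```python
def is_green(pixel, threshold=1.3):
    """Assumed keying rule: green must exceed both red and blue by a ratio."""
    r, g, b = pixel
    return g > threshold * max(r, b, 1)


def chroma_key(foreground, background, threshold=1.3):
    """Replace green-ish foreground pixels with the matching background pixel.

    Both plates are equal-length lists of (r, g, b) tuples in 0..255.
    """
    return [
        bg if is_green(fg, threshold) else fg
        for fg, bg in zip(foreground, background)
    ]
```

A shader does the same thing, except the loop disappears: the GPU runs the per-pixel test for every fragment in parallel, which is why uneven lighting (pixels that fall just short of the threshold) shows up as leftover green.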

Under the Hood

The main thread is a pyglet app running its event loop. Word of advice: NEVER mix OpenGL calls (or anything that touches hardware directly without locks and state preservation) across threads… bad things happen (one of these days I’ll get around to putting locks on the camera class too). Anyways, in this main thread we have the on_draw event from pyglet where we run through each image plate and execute the appropriate shader on the input image (and working/output image). After we’ve gone through all the plates we package the output image texture back into a jpeg and send it out to the server via an HTTP stream publisher.

Now since pyglet controls our main thread and we want to be able to pull images from at least 2 streams concurrently from the server, we need to do it in threads. So for each image stream coming in (subscribed to) we have a thread which grabs the image and updates a mutex-protected data structure (our image plates) with the raw data. The next time through the main pyglet loop it’ll reload the image from the raw data… we can’t do it out in the thread because lord knows what pyglet is doing behind the scenes (or if the libraries it uses are thread-safe) to load these images.
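
The hand-off between a subscriber thread and the main loop can be sketched like this in Python. The class and method names are hypothetical (they are not taken from the Moco code); the point is just that the subscriber only ever stores raw bytes under a lock, and the main thread decides when to turn them into an image.

```python
import threading


class LatestFrame:
    """Mutex-protected slot holding only the most recent raw frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = None
        self._version = 0

    def put(self, data):
        """Called from the subscriber thread; overwrites any older frame."""
        with self._lock:
            self._data = data
            self._version += 1

    def get_if_newer(self, last_version):
        """Called from the main loop; returns (data, version).

        data is None when nothing new has arrived since last_version,
        so the main loop can skip reloading the texture.
        """
        with self._lock:
            if self._version == last_version:
                return None, last_version
            return self._data, self._version
```

Because older frames are simply overwritten, a slow consumer never backs up the stream: it always picks up whatever arrived most recently.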

Speaking of the libraries used by pyglet… we were having this nasty SegFault in the MocoCompositor after a few seconds of working perfectly. Just out of nowhere it would SegFault (and never after the same amount of time). After digging through pyglet and stepping through the execution using a Python debugger I tracked down the problem to a codec being used for the JPEG decompression. Pyglet uses 3rd party libraries to decompress jpeg images (not sure if it has to do with patents or whatnot), but on the Ubuntu system I’d been using, it was defaulting to using gdkpixbuf to do the decompression… I assume it’s the fastest implementation available on the system (probably uses the C libjpeg or something). But I noticed in the pyglet documentation that you can specify which decoder to use, and I noticed that the Python Imaging Library (PIL) was installed… so I forced pyglet to use that decoder instead (it seems a teensy bit slower) but it worked without any SegFaults [so far]… huzzah!

Also note that textures used in OpenGL really have to be made in powers of 2… otherwise crazy things start happening (things start striping and staggering diagonally). This was an ongoing struggle for quite a while, until on a whim I remembered that old rule and tried it as a power of 2 and it worked. So now I have the code looking at the input image size and rounding it up to the next power of 2 to allocate the texture. This leaves a big black unused area but it works… when we’re done calculating the output image, we simply blit the original size of the image from the big texture to send to the output stream as the jpeg image.
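
The rounding itself is a one-liner in Python. This is a small sketch of the idea rather than the project's code, and the function names are made up for illustration.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("size must be at least 1")
    return 1 << (n - 1).bit_length()


def texture_size(width, height):
    """Texture dimensions needed to hold a width x height image."""
    return next_power_of_two(width), next_power_of_two(height)
```

So a 640x480 camera frame would get a 1024x512 texture, with only the original 640x480 region being read back for the output jpeg.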

Software: Moco MotionBuilder GUI

Monday, March 15th, 2010

Before we began developing our GUI we did a lot of research into existing professional-end software packages used by our target demographic: semi-professional to professional videographers/cinematographers. We learned/played/used non-linear video editing software (Final Cut, Adobe Premiere), compositing software (Shake, Motion, AfterEffects) and 3D animation packages (Blender, Maya). Additionally we looked into the FLAIR robot-control software package that’s used by the Milo big-rig motion-controlled robot. FLAIR turned out to be little more than a spreadsheet with a ton of buttons (designed by an engineer obviously).

We took a lot of the ideas in these packages and implemented them in our own GUI. Our GUI is a web-based interface connecting to the multithreaded Python-based custom web server on the backend. The GUI makes use of HTML5 technologies (namely <canvas> tags for 2d rendering), jQuery UI for the theming/sliders-widgets, AJAX (via jQuery) and JSON (Javascript Object Notation) for transferring data between the GUI/Server/Robot/etc.

MotionBuilder GUI


Our MotionBuilder GUI is split up into 2 halves. The top half has all the outputs from the robot (the viewfinder, the skeletal model of the side profile of the robot and the skeletal model of the top profile of the robot).

The bottom half of the GUI represents the curves editor. The editor consists of a canvas and vertical-slider for each actuator/servo on our robot as well as actuators for the camera (I’ve disabled them for the time being). The canvas for each actuator is where the motion curves are drawn (remember in a previous post I mentioned that we had implemented several curves including discrete, step, linear and catmull-rom). The vertical-slider allows the user to fine-tune the placement of the keypoint at wherever the play-head currently sits.
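
As a rough illustration of what those curve types compute between two keypoints, here is a Python sketch of step, linear and (uniform) Catmull-Rom interpolation for a single actuator value. It is not the GUI's code, the function names are invented, and the "discrete" curve is left out because its exact meaning isn't spelled out in this post.

```python
def step(p0, p1, t):
    """Hold the starting keypoint value for the whole segment."""
    return p0


def linear(p0, p1, t):
    """Straight line from p0 (t=0) to p1 (t=1)."""
    return p0 + (p1 - p0) * t


def catmull_rom(p_prev, p0, p1, p_next, t):
    """Uniform Catmull-Rom spline between p0 (t=0) and p1 (t=1).

    p_prev and p_next are the neighboring keypoints; they shape the
    tangents so the curve passes smoothly through every keypoint.
    """
    t2 = t * t
    t3 = t2 * t
    return 0.5 * (
        2.0 * p0
        + (p1 - p_prev) * t
        + (2.0 * p_prev - 5.0 * p0 + 4.0 * p1 - p_next) * t2
        + (3.0 * p0 - p_prev - 3.0 * p1 + p_next) * t3
    )
```

Here `t` is the play head's fractional position between the two surrounding keyframes, e.g. `(frame - frame0) / (frame1 - frame0)`.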

The play head is controlled by the frame-scrubber horizontal-slider that runs the width of the GUI and lies above the actuator curves. Scrubbing across the slider moves the play head to the specific frame and publishes the actuator motions for that frame back to the server. This in turn causes the subscribers of the actuator publication (i.e. the skeletal models in the top half of the GUI as well as the robot itself) to move to the new actuator position.

Above the frame-scrubber is another horizontal-slider with the ability to set a frame-range by dragging the minimum and maximum handles. This allows the user to select a particular sub-range of the animation to playback.

The Playback controls are above the scrubber sliders and include “go-to-beginning-of-range”, “fast-rewind-play”, “rewind-play”, “stop”, “forward-play”, “fast-forward-play” and “go-to-end-of-range” buttons. We use a Javascript setInterval timer to step through the playback and it is NOT representative of the final time/delay of the animation. It’s mainly used for previewing the shot.

To the left of the playback controls we have the curve-set buttons that allow the user to clear the editor and start with a new curve set, load an existing curve set (from a file on the server) or save the current curve set (to a file on the server).

Finally we have some keyframe buttons above each actuator curve canvas. These include buttons to move the playhead to the previous keyframe on that curve, to move it to the next keyframe on that curve, and to toggle the existence of the keyframe at the playhead (this is how you delete keyframes).

Under the Hood

Our Web-server is a custom-written (raw socket based) multi-threaded Python server. We created a paradigm of “subscriber/publisher” streaming where any given web client can connect to the server and offer to “publish” data (via a URI) and any number of clients can connect to the server and “subscribe” to the data (via the URI). Subscribers are added to a queue for the URI and a thread is fired up for each publisher of a URI. Whenever a publisher pushes data it is broadcast to each client on their queue as fast as possible (very little buffering). This technique allows us to broadcast streams of data/images to any number of clients regardless of their intentions of the data and it affords us a great amount of flexibility and scalability for adding on more subsystems to our software. You’ll see an example of this with our compositor later.
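
Stripped of all the socket handling, the publisher/subscriber bookkeeping might look something like the Python sketch below. This is a simplified stand-in, not the actual server: the `Hub` class, its method names, the tiny queue size and the drop-the-oldest policy when a subscriber falls behind are all assumptions, picked to match the "very little buffering" behavior described above.

```python
import queue
import threading


class Hub:
    """Maps a URI to the queues of everyone subscribed to it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = {}  # uri -> list of queue.Queue

    def subscribe(self, uri, maxsize=2):
        """Register a new subscriber and return its private queue."""
        q = queue.Queue(maxsize=maxsize)
        with self._lock:
            self._subscribers.setdefault(uri, []).append(q)
        return q

    def publish(self, uri, data):
        """Broadcast data to every subscriber of uri without blocking."""
        with self._lock:
            targets = list(self._subscribers.get(uri, []))
        for q in targets:
            try:
                q.put_nowait(data)
            except queue.Full:
                # Slow subscriber: discard its oldest item to make room.
                try:
                    q.get_nowait()
                except queue.Empty:
                    pass
                try:
                    q.put_nowait(data)
                except queue.Full:
                    pass  # lost a race with another publisher; skip
```

In a real server each subscriber's connection thread would sit in a loop calling `get()` on its queue and writing the result to its socket, while each publisher's thread calls `publish()` as data arrives.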

Image streaming from the server is accomplished via the old-school Netscape “multipart/x-mixed-replace” content/mime type… it’s what those web-enabled streaming spycam/monitoring-cameras use. Most modern browsers support this mime-type (and we don’t bother to support/test IE at all).
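
For the curious, the wire format is simple enough to sketch in Python. The helper names and the boundary string `frame` are made up for this example (any boundary works as long as the header and the parts agree), and this is not the project's server code.

```python
BOUNDARY = b"frame"  # hypothetical boundary token


def stream_headers(boundary=BOUNDARY):
    """HTTP response head, sent once when a subscriber connects."""
    return b"".join([
        b"HTTP/1.0 200 OK\r\n",
        b"Content-Type: multipart/x-mixed-replace; boundary=", boundary, b"\r\n",
        b"\r\n",
    ])


def frame_part(jpeg_bytes, boundary=BOUNDARY):
    """One part of the stream; each new part replaces the previous image."""
    return b"".join([
        b"--", boundary, b"\r\n",
        b"Content-Type: image/jpeg\r\n",
        b"Content-Length: ", str(len(jpeg_bytes)).encode("ascii"), b"\r\n",
        b"\r\n",
        jpeg_bytes,
        b"\r\n",
    ])
```

The server writes `stream_headers()` once and then a `frame_part(...)` for every new image, keeping the socket open; the browser swaps the displayed image each time a complete part arrives.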

Data streaming from the server uses the old long-polling script tag block technique (I think it’s referred to as Comet nowadays)… basically the server keeps the connection open and sends the data as JSON strings, each wrapped in a javascript callback function call and enclosed in <script> tags. Most browsers execute the javascript as soon as the ending script tag is found (it’s a throwback to old-day compatibility). So as long as the server keeps the socket open it can send all the data it likes and the client will process it one after the other.
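
On the server side, producing one of those chunks might look like the Python sketch below. The function name and the `parent.onData` callback are hypothetical (the actual callback name used by the GUI isn't given here); the escaping of `</` is a precaution I'd add so a string inside the JSON can't close the script tag early.

```python
import json


def script_chunk(payload, callback="parent.onData"):
    """Wrap a JSON-serializable payload in a <script> callback envelope."""
    body = json.dumps(payload)
    # "\/" is a valid JSON escape for "/", so this keeps the data intact
    # while preventing a literal "</script>" inside the payload.
    body = body.replace("</", "<\\/")
    return "<script>%s(%s);</script>\n" % (callback, body)
```

Each call yields one self-contained chunk, e.g. `script_chunk({"pan": 10}, callback="cb")` gives `<script>cb({"pan": 10});</script>` followed by a newline, which the server writes to the open socket whenever new data is published.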

Communication to the server (for stream-like functionality like scrubbing the sliders) is just AJAX sending JSON via HTTP GET requests repeatedly. Surprisingly the TCP handshake and HTTP request overhead aren’t too bad when hammering away at the server… especially since the biggest bottleneck would be the real world servos (the user subconsciously moves those sliders slowly when they notice the robot sluggishly getting into position).

I’m sure there’s more I’m forgetting to mention… I’ll save those for a later post.