Linux Video I/O

This Site
I work on this site in my off hours. Please help me to push aside my day job and work on it more by supporting the site in one of these ways:
donate now   Donate Now
Use your credit card or PayPal to donate in support of the site.

get anything at all from
Use this link to Amazon—you pay the same, this site gets 4% from Amazon.
get the best thai-english phrasebook app
Experience Thailand richly with my Talking Thai-English-Thai Phrasebook app.
get the best thai-english dictionary app
Learn Thai with my Talking Thai-English-Thai Dictionary app for iOS, Android, Windows.
get a cool thai-english paper dictionary
Don't leave home without the Thai-English English-Thai Compact Dictionary I co-authored.
get thailand fever
I co-authored this bilingual cultural guidebook to Thai-Western romantic relationships.
get the best chinese phrasebook app
Visit China easily with my Talking Chinese-English-Chinese Phrasebook app.


Video I/O on Linux: Lessons Learned from SGI

By Chris Pirazzi
Originally written for
Picked up by,, and

If the linux community is to learn anything from Silicon Graphics (SGI), it's to not design a video API like the SGI Video Library. As described in many pages of the Lurker's Guide to Video at
the Video Library makes many well-intentioned attempts to "help" the application by providing high-level primitives and encapsulation, but ends up making the task of writing a usable video app almost impossible.


It's exciting to see linux mature to the extent that there are so many audio and video developers. Rather than just being an alternative for "VGA" or "graphics screen," more and more developers are using the term "video" in the sense intended by this article -- television signals (NTSC, PAL, HDTV, etc.) which carry moving pictures. In addition to video capture and playback tools, we're starting to see video editors, video effects packages, movie file players, video converters, animation tools, and more.

Linux seems to have reached the point where developers want a video I/O API which lets applications be source- and binary-compatible with multiple video cards.

SGI made this same transition from board-specific APIs to common APIs between 1991 and 1997. During that period, I worked on some of those APIs and also wrote audio/video capture and playback applications with them. By the time I left SGI, the company had a rich, useful set of APIs, but the path was steep and full of potholes for both SGI and its developers. By sharing my experiences, I hope that I can help linux leap ahead in the process and avoid the mistakes SGI made.

Some linux developers are also working on other useful applications, such as DVD playback directly to the graphics screen, which have entirely different demands from the class of applications I mention above. Some of those are better solved by different APIs than the type I describe in this article.

Been There, Done That (Wrong)

The natural tendency of software developers is to try to do too much at once.

One can build fancy mechanisms which have network transparency, compression/decompression, format conversion, graph-based dataflow management, etc. on top of a well-designed video I/O API, and such mechanisms might be useful for some applications. But SGI's big mistake--one which hampered development of useful audio/video applications for years--was to try to build and offer those fancier mechanisms to developers instead of offering a simple API that worked on multiple video devices.

Network Transparency

We initially assumed--incorrectly--that what was good for the X Window System (network transparency, central control daemon, display-format independence) would be good for video too. X is a window system, and it was both practical and useful to transport its windowing and drawing commands over today's networks (and even 1980s networks). Video is a continuous stream of moving pictures, in the range of 20 megabytes per second, which must be displayed perfectly in sequence with no dropped frames in order to meet the user's expectations. Its very nature is completely different from windows, buttons, and mouse cursors. Light levels of compression suitable for editing applications may bring the datarate down to 5 megabytes a second, but the complex process of compressing and decompressing (many formats, inter-frame dependencies) is not something you can or should hide inside a video I/O API.
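As a rough check on the "20 megabytes per second" figure, the arithmetic can be sketched as follows. The frame dimensions, the 2 bytes per pixel (8-bit 4:2:2 Y'CbCr), and the frame rates are my assumptions, chosen to represent common standard-definition signals; they are not stated in the text above.

```python
# Back-of-the-envelope data rate for uncompressed standard-definition video.
# Assumed parameters (not from the article): 8-bit 4:2:2 Y'CbCr at 2 bytes
# per pixel, 720x486 active pixels at 30000/1001 frames/s for 525-line
# video, and 720x576 active pixels at 25 frames/s for 625-line video.
from fractions import Fraction


def bytes_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Return the raw data rate of an uncompressed video stream."""
    return width * height * bytes_per_pixel * frames_per_second


ntsc_rate = bytes_per_second(720, 486, 2, Fraction(30000, 1001))
pal_rate = bytes_per_second(720, 576, 2, 25)

print(f"525-line: {float(ntsc_rate) / 1e6:.1f} MB/s")
print(f"625-line: {pal_rate / 1e6:.1f} MB/s")
```

Under these assumptions both signals land close to 21 million bytes per second, every second, with no pauses. That is a very different load from a burst of drawing commands.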

Even given current network technologies, network transparency is probably useless, or at the very least not interesting enough to make the centerpiece of your video I/O API. Network-transparent video data transfer layers can be built on top of a well-designed, simple video I/O API without burdening the vast majority of audio/video applications which do not care about network transparency.

Plug-in Dataflow Graphs

We also incorrectly assumed that most application writers would want "help" creating complex graphs of interconnecting components (video/file sources, processing plugins, video/file sinks) and pushing data through those graphs. Our design became complex: writing the API itself, and writing applications with that API, became difficult. Eventually we were spending most of our time focusing on this complexity problem we had created ourselves. We weren't helping developers with the high-level mechanisms: we were putting obstacles between them and their data.

Just Give Me the Data!

It turns out that 99% of video applications which record, process, edit, and play video just want a simple set of calls that will just hand them the raw video data in memory or take it from memory. And they want the API to work without modification or recompilation on lots of video boards. That's it!

SGI spent so much energy worrying about the high-level dataflow issues that it lost sight of the whole point of the exercise, and as a result each video board had to be programmed in its own device-specific way using the supposedly device-independent API.

And we're not just talking about a little setting here or there--the various video board designs disagreed on their basic design philosophy about how an app manipulates video. Some boards could only DMA video into and out of memory, and some boards could only send video direct to/from graphics using dedicated hardware paths. Those boards which did make video available in memory did so in a variety of often incompatible pixel formats. A video capture or playback application that worked on more than one board basically had to be written several times, once for each board!

Yet somehow all of this insanity happened within the framework of a single video I/O API. How could that happen?

It Could Happen...

Imagine that one day, the world gets fed up that they can't plug in their toasters and blenders in neighboring countries because the plugs don't fit and the voltage is wrong. So they appoint a panel of experts to solve the problem.

After deliberating for years and speaking with the power agencies in each country, the panel announces a new, world-wide standard set of labels, colors, and international symbols that will now appear on every plug, so that if you've got a round-pointy-flat 3 prong 240-volter, by God you'll know exactly what it is, how to plug it in, and you can bet your last Euro that that plug will be the same color as the flat-round-pointy-pointy 4 prong 240-volter they use next door.

The panel also enthusiastically introduces a plan to "simplify" your life by offering:

It all seems so great, except you still can't plug in your toaster or your blender.

Healthier Now, but Still Scarred

Nowadays, all the useless parts of the SGI Video Library (daemon, network transparency, dataflow system) have been removed, and where possible, misguided attempts to "help" the user (buffer management APIs which hid the data from the user) have been replaced by simple calls that just give you the data.

But, that hard-learned lesson still burdens SGI video programmers today (as many pages of the Lurker's Guide explain).

Right now we have an opportunity to skip over all that hardship on linux.

There's a Better Way!

So, linux should get it right. Linux should offer a simple Video I/O API that just:

This Video I/O API should work in a binary- and source-compatible way with any video I/O card for which one writes a linux video driver.

The most critical thing to notice here, and the thing SGI ignored at its peril, is that you can't do this without bringing in the hardware folks. A video I/O API is ultimately limited to doing what the video I/O device can do, and you need to make sure the devices you support all have a useful, common subset of functionality on which all applications can depend.

Otherwise, you've failed to offer device independence, and like the panel of experts above, you're only getting in the way.

Main Memory is Good

Resist the urge to wrap buffers of data in any kind of "helper" object: let the application provide its own pointers to malloc()ed memory to receive the video data, or from which to read video data. Think of video I/O as a slightly augmented version of read() and write() rather than some high-level encapsulated object-oriented thing. Believe me, we've been there, and it's a waste of time.

Constraining the alignment of those buffers is ok if necessary (for good DMA, etc.) as long as this is clearly documented and works for every device supported by the API. Constraining the way the application can touch those buffers while they are being used for video I/O is ok as long as this is clearly documented and works for every device supported by the API. Yanking control of the buffer address itself from the application is not ok. Remember that the video I/O library is just that, it's not a memory manager. The application uses your library in concert with other libraries and none of the libraries should be selfish and claim control of buffer allocation for themselves.
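To make the "augmented read()" idea concrete, here is a minimal sketch in which the application owns all of the memory and the library only fills it. FakeCaptureDevice and capture_into are hypothetical names invented for this illustration, the field size is an assumed value, and the "device" merely writes a byte pattern.

```python
# Sketch of a read()-style capture call that fills memory owned by the
# application. FakeCaptureDevice and capture_into are hypothetical names;
# the "device" here just produces a repeating byte pattern.

class FakeCaptureDevice:
    def __init__(self, field_size):
        self.field_size = field_size
        self.fields_delivered = 0

    def capture_into(self, buffer):
        """Fill the caller's buffer with one field; return bytes written."""
        view = memoryview(buffer)
        if len(view) < self.field_size:
            raise ValueError("buffer smaller than one field")
        fill = self.fields_delivered % 256
        view[:self.field_size] = bytes([fill]) * self.field_size
        self.fields_delivered += 1
        return self.field_size


FIELD_SIZE = 720 * 243 * 2  # assumed: one 8-bit 4:2:2 field

device = FakeCaptureDevice(FIELD_SIZE)

# The application allocates one arena however it likes and hands the
# library plain slices of it. The library never allocates or wraps memory.
arena = bytearray(4 * FIELD_SIZE)
slots = [memoryview(arena)[i * FIELD_SIZE:(i + 1) * FIELD_SIZE]
         for i in range(4)]

for slot in slots:
    device.capture_into(slot)
```

Because the buffers are ordinary application memory, the same arena could be handed to a file writer, a compressor, or a graphics library without copying and without asking the video library for permission.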

Detect Dropped Data

Make it so the application can detect whether any frames have been dropped on input or output, but do not attempt to "help" the application by copying fields/frames on input. On output, it's nice if the video board can repeat fields/frames if it starves, but remember this is a nasty error case, is not that important, and should occupy only a tiny fraction of your development time.

Do not ever attempt to "help" the application by cutting down the input rate or output rate of video to something less than the video signal's full field/frame rate within your library or your device. Just give the application every field/frame and require the application to provide every field/frame. Back when PCs had 286 processors and 10 MB/sec memory throughput, it was necessary for hardware to "help" us by dropping frames on input or duplicating them on output, but nowadays doing so is just an academic exercise that is a complete waste of time and leads to many unnecessary ratholes in the area of dropped frame detection and synchronization (and tends to turn simple APIs into incomprehensible ones).

UNIX Needs a Cushion

Make it so the application removes from (input) or adds to (output) a queue of memory buffers which:

This queue is an absolute necessity for video applications because on linux (as on IRIX), an application can't be guaranteed that it will run 50/60 times a second and definitely be able to read/provide a new field/frame. The scheduler in the kernel simply cannot provide these guarantees, because it is at the mercy of hundreds of third party device driver writers and kernel developers who are not aware that running without a field/frame-time latency guarantee could even be a problem. The application needs this queue as a cushion so that it can provide reliable, drop-free video on input and output even if linux occasionally holds off the application for a frame or two.

To be useful, it must be possible to queue up at least 1/2 second of video in the queue. Ideally you should offer the ability to queue more if the application so chooses. The queue should be designed so that if the app so chooses (for example, if it's running in an environment with qualified hardware, kernel, and device drivers), it can choose to queue less than 1/2 second. There are many applications where low latency is a requirement (computer animation of human actors wired up with sensors, for example). In your API, there should be no need for any configuration parameters relating to queue depth: queue depth should just be a function of how far ahead of time the application chooses to queue buffers and how the application gets scheduled by the linux process scheduler.

The application must be able to queue the same buffer multiple times for output, for example if it is trying to display a still field/frame or display an alternating pair of fields that form a still image.
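The cushioning effect of the queue can be modeled in a few lines. This is a toy simulation under simplifying assumptions of my own (the device consumes exactly one buffer per field time, and the application tops up the queue whenever the scheduler lets it run); simulate_output and its parameters are hypothetical names.

```python
from collections import deque


def simulate_output(app_runs, fields_ahead):
    """Count output underruns in a toy model of a video output queue.

    app_runs[i] is True if the scheduler lets the application run during
    field time i. Whenever it runs, the application tops the queue up to
    fields_ahead buffers. The device removes one buffer per field time.
    """
    queue = deque()
    next_field = 0
    underruns = 0
    for runs in app_runs:
        if runs:
            while len(queue) < fields_ahead:
                queue.append(next_field)
                next_field += 1
        if queue:
            queue.popleft()   # the device transmits this field
        else:
            underruns += 1    # nothing queued: the output starves
    return underruns


# The application is held off for two consecutive field times (3 and 4).
schedule = [True, True, True, False, False, True, True, True]

for depth in (1, 2, 3):
    print(depth, simulate_output(schedule, depth))
```

Note that there is no "queue depth" setting anywhere in the model. The depth is simply how far ahead the application chooses to queue, which is the property the text above asks for.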

Tests for Completion

Make it so the application can determine when any given queued field/frame has been filled in (input) or transmitted (output), so that the app can react to incoming data and so that it knows when it can re-use buffers it has handed to your library. Make it so the application can do this in either of two ways:

  1. it should be able to test using some efficient, non-blocking call that checks whether a specified field/frame is done or not.

  2. it should be able to block on a file descriptor using select() until a new field/frame is available on input, or a new field/frame has been transmitted on output. Remember that the application is using your library in concert with other libraries (audio, X, etc.) and no library should ever block the application or claim control of the main select() loop.

These two mechanisms meet specific goals of real-world video applications. Mechanism #1 is an absolute requirement--real-world video applications sometimes need to poll the status of their buffers rather than relying on a blocking mechanism. Mechanism #2 could be built upon #1 by adding a timeout in the select() loop, but it would be nice to have, to avoid the guesswork of polling in cases where polling is not necessary.
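The two mechanisms can be sketched side by side. FakeVideoInput and all of its methods are hypothetical, a pipe stands in for the device's file descriptor, and the sketch assumes a POSIX system where select() works on pipes. Only os.pipe, os.read, os.write, os.close, and select.select are real library calls.

```python
import os
import select


class FakeVideoInput:
    """Hypothetical stand-in for a capture device."""

    def __init__(self):
        # A pipe stands in for the device's file descriptor: it becomes
        # readable whenever a field has completed.
        self._read_fd, self._write_fd = os.pipe()
        self._last_done = -1

    def fileno(self):
        return self._read_fd

    def hardware_completes_field(self):
        """Simulate the hardware finishing the next field."""
        self._last_done += 1
        os.write(self._write_fd, b"\0")

    def is_done(self, field_number):
        """Mechanism 1: non-blocking test of one specific field."""
        return field_number <= self._last_done

    def acknowledge(self):
        """Consume one completion notification after select() returns."""
        os.read(self._read_fd, 1)

    def close(self):
        os.close(self._read_fd)
        os.close(self._write_fd)


video = FakeVideoInput()
other_read_fd, other_write_fd = os.pipe()  # e.g. an audio or X connection

polled_before = video.is_done(0)  # nothing has been captured yet
video.hardware_completes_field()

# Mechanism 2: the application owns the select() loop and waits on the
# video descriptor alongside every other descriptor it cares about.
readable, _, _ = select.select([video, other_read_fd], [], [], 1.0)
if video in readable:
    video.acknowledge()
polled_after = video.is_done(0)

video.close()
os.close(other_read_fd)
os.close(other_write_fd)
```

The library exposes a descriptor and a cheap status test, and nothing more. The application decides when to wait, for how long, and on what else.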

Number Fields/Frames of the Signal

Pretend that each field/frame of a video signal is numbered sequentially (and the numbers keep marching on whether or not the computer successfully captures/outputs the field/frame).

As each input frame completes, the application should be able to determine the frame/field number of each chunk of data it has input. This should be part of the same efficient, non-blocking call the application uses to check completion (#1 above). The application can use this number to detect dropped input fields/frames and also do the synchronization trick described below.
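With signal-relative numbering, dropped-input detection reduces to looking for gaps. A minimal sketch, where find_dropped is a hypothetical helper and the field numbers are made-up sample data:

```python
def find_dropped(field_numbers):
    """Return the signal field numbers missing from a captured sequence.

    field_numbers holds the number reported for each captured buffer, in
    capture order. The signal's numbering advances whether or not a field
    was captured, so any gap marks dropped fields.
    """
    dropped = []
    for previous, current in zip(field_numbers, field_numbers[1:]):
        dropped.extend(range(previous + 1, current))
    return dropped


captured = [100, 101, 102, 105, 106]
print(find_dropped(captured))  # prints [103, 104]
```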

As each output frame completes, the application should be able to determine the frame/field number on which the video device output the application's chunk of data. This should be part of the same efficient, non-blocking call the application uses to check completion (#1 above). That way, the application can pre-queue some data (perhaps black) and take some measurements so that it can begin queuing data knowing exactly the field/frame number when that data will go out. This is useful for the synchronization trick described below. In addition, this mechanism lets the application detect the serious error condition where the queue drains to zero, starving the video device and causing missed frames on output.
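The "pre-queue some black and measure" technique comes down to simple arithmetic on the reported numbers. In this sketch both function names are hypothetical, the numbers are invented, and the prediction assumes the queue does not drain before the new buffer reaches the head.

```python
def predict_output_number(last_completed_number, buffers_still_queued):
    """Field number on which the next buffer queued will be transmitted,
    assuming the queue never drains before then."""
    return last_completed_number + buffers_still_queued + 1


def output_lateness(planned_numbers, reported_numbers):
    """For each buffer, how many fields later than planned it went out."""
    return [actual - planned
            for planned, actual in zip(planned_numbers, reported_numbers)]


# The application queued four black fields. The device reports that the
# first one went out on field 5000, so three are still waiting.
first_real_field = predict_output_number(5000, 3)

# Later, the device reports when each real buffer actually went out.
planned = [5004, 5005, 5006]
reported = [5004, 5005, 5008]
lateness = output_lateness(planned, reported)
```

A nonzero lateness tells the application exactly which output fields were missed, which is the feedback needed to recover from an underrun.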

As an optional extension to "help" the application on video output, you might also want to let the application provide a field/frame number when they queue their buffer for output, such that the hardware will refrain from outputting that field/frame until the specified field/frame number arrives. This saves the app the trouble of queuing some black to get into a sane initial condition. But it turns out queuing black is not that hard, and so don't spend a lot of effort doing this. Furthermore, this is no substitute for the above: the application might not choose a timely frame number, or a serious problem on the system might delay the application, so it's absolutely required to still provide the output frame/field number feedback described above so that the application can detect exactly which frames were missed on output.

Synchronization is Easier Than You Think

To help with audio/video synchronization and frame-accurate machine control, make it so the application can measure, with a guaranteed worst-case accuracy, the relationship between the field or frame boundaries of the incoming or outgoing video signal and some local, highly precise, unadjusted clock which the application can read. The unadjusted clock could be something like:

but NOT gettimeofday()--being adjusted by network time daemons, this clock is basically useless for a/v sync.

Keep this simple: all that is needed is a non-blocking call which returns a pair consisting of a field/frame number, and a value of the unadjusted clock, which coincide. This must be the same field/frame number mentioned above, so that the application can map the number back to an exact memory buffer which they have submitted.

The guaranteed worst-case accuracy of the relationship between the two clocks needs to be at worst ±15 ms for good audio/video lip sync, and it needs to be at worst ±2 ms if you expect to do frame-accurate machine control (used for frame-accurate video editing with an external VCR/VTR).

Non-blocking is important: do NOT tie this to the linux scheduler. Because linux delivers no real-time guarantees, having a call which "unblocks when ___ happens" is useless for audio/video synchronization, because by the time your process runs, an unpredictable amount of time could have passed (easily in the 10-50ms range, sometimes more). Instead, each library should provide the ability to map data buffers it inputs/outputs to precise values along a common unadjusted clock like the ones above and the ability to queue input/output data buffers ahead of time.
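Given one (number, clock) pair from the video library and a similar pair from an audio library, the application can line the two streams up with plain arithmetic. The sketch below assumes a nanosecond clock, a 50 fields/s signal, and 48 kHz audio. The function names and all sample values are hypothetical.

```python
from fractions import Fraction

NANOS = 1_000_000_000  # assumed clock resolution: nanoseconds


def field_time(pair, field_rate, field_number):
    """Clock value (ns) at which field_number starts.

    pair is a (field_number, clock_ns) measurement from the video library.
    """
    ref_number, ref_clock = pair
    return ref_clock + Fraction(field_number - ref_number) * NANOS / field_rate


def sample_at(pair, sample_rate, clock_ns):
    """Audio sample number that coincides with the given clock value.

    pair is a (sample_number, clock_ns) measurement from the audio library.
    """
    ref_number, ref_clock = pair
    return ref_number + Fraction(clock_ns - ref_clock) * sample_rate / NANOS


video_pair = (1000, 5_000_000_000)    # field 1000 began at 5.000 s
audio_pair = (480000, 5_010_000_000)  # sample 480000 played at 5.010 s

when = field_time(video_pair, 50, 1050)
matching_sample = sample_at(audio_pair, 48000, when)

# A 15 ms lip-sync budget expressed in 48 kHz samples.
budget_in_samples = 15 * 48000 // 1000
```

Nothing here blocks or depends on when the process happens to be scheduled. The application can do this calculation whenever it gets to run and then queue its audio and video buffers ahead of time to match.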

For more insights on this whole subject, see the "Time and Synchronization" section of the Lurker's Guide at:

Other Faux Pas

Do not bother trying to make your video I/O API have a feature that makes it synchronize with some other audio (or other) device unless your video device has some special hardware connection with the audio device which the application cannot control by other means. Instead, make your video I/O API provide the basic primitive described above. In concert with a similar primitive from an audio I/O API, the application can do any synchronization task it needs all by itself with no help from you.

Do not, I repeat do not, bother with any kind of video daemon; this is pointless. See the notes above about network transparency. Talk to programmers who write real video editing applications first before designing an API.

Keep it simple!

Image Scale (Zoom), Pixel Format, Pixel Aspect Ratio, Cropping

Common Set Must Work on All Devices

You cannot satisfy the basic goal of a video I/O API--device independence--unless there is a common set of scales, pixel formats, aspect ratios, and crops that are supported by all devices. If two devices have the same API but one can only do square-pixel R'G'B'A and one can only do half-sized non-square Y'CbCr, you're not doing any favors for the application writer, and in fact your API is probably just making the app writer's life more difficult by presenting a false illusion of device-independence.

You could instead design your API to support a wide variety of parameters, and have some kind of query mechanism so an app can see what works on what devices. SGI did that.

The only problem is, that's a totally useless and stupid thing to do.

If a video I/O API does not offer a common set of scales, formats, aspects, and crops which are guaranteed to work on all devices, you can have all the query mechanisms in the world but it does an app no good. Think about it -- what is the app supposed to do when you plant it on a machine which supports only some new format? Spontaneously sprout new code to handle the format?
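One way to see the problem is to compute what an application can actually rely on: the intersection of what every supported device offers. The board names and format labels below are placeholders, not real products or official format names.

```python
def guaranteed_formats(devices):
    """Formats an application may rely on without device-specific code."""
    supported_sets = [set(formats) for formats in devices.values()]
    return set.intersection(*supported_sets) if supported_sets else set()


# Two boards behind the same API with nothing in common: a query call
# reports the difference faithfully, but no portable application exists.
disjoint = {
    "board_a": {"square-pixel R'G'B'A"},
    "board_b": {"half-size non-square Y'CbCr"},
}

# The same boards once the API requires one baseline format of everyone.
with_baseline = {
    "board_a": {"square-pixel R'G'B'A", "baseline Y'CbCr"},
    "board_b": {"half-size non-square Y'CbCr", "baseline Y'CbCr"},
}

print(guaranteed_formats(disjoint))
print(guaranteed_formats(with_baseline))
```

If the first result is empty, no amount of querying helps. The fix has to happen in the specification the devices are held to, not in the application.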

It is truly amazing how many otherwise respectable programmers have gone all the way down the "support everything, make it queryable" road without realizing that they are not solving any problem.

Once you provide a set of formats etc. that work on all devices, then it may be useful to provide software-queryable extensions which the app writer can test for when writing their code. Again, if an app finds that one of these extensions is available, the extension must work the same way on every device that supports it.

But spending time building an extension mechanism might be a waste of time also. The vast majority of application writers will stick with what they know works on all devices: application writers get no joy out of writing custom code for different boards.

Video Image Scaling (Zoom)

Video scaling performed by video I/O devices is now largely irrelevant to audio/video applications:

So all video devices supported by your video I/O API must support at least unscaled video.

If you think it's important, you could also specify maybe one other zoom level like 1/2 in each direction, but it's only likely to get used if it works on all devices.

A related issue to scaling is pixel aspect ratio, which will be described in the next section.

You Must Understand Videosyncrasies

Before you can succeed at this effort, you need to realize that the pixel format, pixel aspect ratio, and cropping issues of video are very complex and you absolutely must understand them in detail in order to even know whether a set of video boards behave the same way, or to be able to communicate with the video geeks at those companies in order to pull off the feat of device-independence.

I can tell you from personal experience that if you take a casual, sloppy approach to specifying these critical aspects of video--the same approach that Apple, Microsoft, and other vendors have taken for years--your video API will be just as device-dependent and broken as theirs.

If you go into video making the assumptions of computer graphics, you'll have the same problems that SGI did. Read this article and just skip past all that pain!

For example, in the computer graphics world we assume that 100 pixels in X is the same length on the screen as 100 pixels in Y. But in some ubiquitous video cases, pixels are not square. For more info, read:
Some video boards will produce square pixels in memory, some non-square. You'll need to document the common subset that all video boards can do.

Another example of video's complexity is fields. Read this article:
to understand why your video I/O API must specify fields even if each data buffer contains a "frame" (at the very least, your users will need to know which field of a frame is temporally first and which is second).
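To illustrate why the temporal order of fields has to be part of the API's contract, here is a small sketch that pulls the two fields out of an interleaved frame. The function name and the top_field_first flag are hypothetical. The point is only that the application cannot compute the order from the pixel data and must be told.

```python
def fields_in_temporal_order(frame_lines, top_field_first):
    """Split an interleaved frame into (first, second) fields in time order.

    frame_lines is a list of scan lines, topmost line first.
    """
    top = frame_lines[0::2]     # lines 0, 2, 4, ... of the frame in memory
    bottom = frame_lines[1::2]  # lines 1, 3, 5, ...
    return (top, bottom) if top_field_first else (bottom, top)


frame = ["t0", "b0", "t1", "b1"]
first, second = fields_in_temporal_order(frame, top_field_first=False)
```

An application that guesses the flag wrong will play the two fields in reversed time order, which shows up as juddering motion on an interlaced display.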

As a final example, in the computer graphics world we have the simple, healthy idea that an image has certain dimensions like 640x480 and that's it. So we tend to think that video must also have a "full size" and that if you grab a "full frame" of the same video picture in the same pixel format from two video boards, you'll get the same data in the computer. Not so. It may surprise you to know that there is no standard concept of "full frame," not even in the video industry itself -- video board vendors often disagree on the exact line at which "the frame" starts and ends, and so two "full frame" images might be off by a few lines from each other.

For a video editing application which intends to take video material captured from different computers and edit it together--a very, very common operation in computer effects and animation--this is fatal. There's no way app writers can detect the problem, so this forces app users to make constant manual adjustments for no good reason.

There is an unambiguous way to say what "part" of the video your board is capturing, but to say it you must speak the language of video specifications.

This is a problem you can and must solve in order to have a device-independent video I/O API. This is a prime example of why creating the video I/O API must involve the hardware vendors. If you just write one without talking to them, you are basically wasting your time, because you're not going to help application writers.

Fortunately, none of the problems above are very hard to solve once you can clearly state the behavior you want from each board vendor.

Even more fortunately, the vast majority of the documentation work you'd need to do has already been done in the context of the QuickTime file format extensions for Y'CbCr data, which you can read about here:
This document was based on an extensive survey of existing practice in high-end video, Macs, and PCs.

I recommend you steal from this document. Ignore the few parts of it that relate to QuickTime and pay attention to the various parameters which are being specified.

I would further recommend that the set of globally supported parameters include the 8-bit '2vuy' pixel format at the "production level of conformance," as defined in the document above. This combination is likely to work with most, if not all, current PC video boards (you'll have to check with the vendors to make sure), provides a reasonable baseline for video applications, and provides good compatibility with high-end video gear and software.
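For readers who have not met '2vuy' before, here is a minimal unpacking sketch. It assumes the byte order Cb, Y'0, Cr, Y'1 for each pair of pixels, which is my understanding of the QuickTime definition; check the specification before relying on it. unpack_2vuy is a hypothetical helper, and giving both pixels of a pair the same chroma is a simplification that ignores chroma siting.

```python
def unpack_2vuy(data):
    """Split packed 8-bit 4:2:2 bytes into per-pixel (Y', Cb, Cr) tuples.

    Assumed layout: each 4-byte group holds Cb, Y'0, Cr, Y'1, that is,
    two horizontally adjacent pixels sharing one chroma pair.
    """
    if len(data) % 4:
        raise ValueError("2vuy data must be a multiple of 4 bytes")
    pixels = []
    for i in range(0, len(data), 4):
        cb, y0, cr, y1 = data[i:i + 4]
        pixels.append((y0, cb, cr))
        pixels.append((y1, cb, cr))
    return pixels


# One black pixel followed by one white pixel, in 8-bit video range.
pair = bytes([128, 16, 128, 235])
print(unpack_2vuy(pair))  # prints [(16, 128, 128), (235, 128, 128)]
```

At two bytes per pixel, this layout is also where the "about 20 megabytes per second" scale of uncompressed standard-definition video comes from.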

Even if your video API were to only support that combination at first, the document above will get you a clear, video-centric description of behavior which you can show to video board vendors and which they will understand (but you need to understand it as well!). And already you'd have a video I/O API that's better than anything else available on any operating system today.

Then, when you get your confidence up, you can see if you can also add some R'G'B'A format at the production level of conformance to the list of combinations supported by all devices.

In Summary

By following the advice in this article, you should be able to build a video I/O API which rivals anything produced on a PC or Mac today.

The main piece of advice is to always think of your API design from an app developer's perspective, and always bounce your ideas off of app developers. It wasn't until SGI got around to writing their own apps that we learned most of the lessons described here.

Copyright
All text and images copyright 1999-2017 Chris Pirazzi unless otherwise indicated.