Support This Site: I work on this site in my off hours. Please help me to push aside my day job and work on it more by supporting the site in one of these ways:
Use your credit card or PayPal to donate in support of the site.
Use this link to Amazon—you pay the same, I get 4%.
Learn Thai with my Talking Thai-English-Thai Dictionary app: iOS, Android, Windows.
Experience Thailand richly with my Talking Thai-English-Thai Phrasebook app.
Visit China easily with my Talking Chinese-English-Chinese Phrasebook app.
I co-authored this bilingual cultural guide to Thai-Western romantic relationships.
If the linux community is to learn anything from Silicon Graphics (SGI), it's not to design a video API like the SGI Video Library. As described in many pages of the Lurker's Guide to Video at
http://lurkertech.com/lg/
the Video Library makes many well-intentioned attempts to "help" the application by providing high-level primitives and encapsulation, but ends up making the task of writing a usable video app almost impossible.
Linux seems to have reached the point where developers want a video I/O API which lets applications be source- and binary-compatible with multiple video cards.
SGI made this same transition from board-specific APIs to common APIs from 1991-1997. During that period, I worked on some of those APIs and also wrote audio/video capture and playback applications with them. By the time I left SGI, the company had a rich, useful set of APIs, but the path was steep and full of potholes for both SGI and its developers. By sharing my experiences, I hope that I can help linux leap ahead in the process and avoid the mistakes SGI made.
Some linux developers are also working on other useful applications, such as DVD playback directly to the graphics screen, which have entirely different demands from the class of applications I mention above. Some of those are better solved by different APIs than the type I describe in this article.
One can build fancy mechanisms which have network transparency, compression/decompression, format conversion, graph-based dataflow management, etc. on top of a well-designed video I/O API, and such mechanisms might be useful for some applications. But SGI's big mistake--one which hampered development of useful audio/video applications for years--was to try to build and offer those fancier mechanisms to developers instead of offering a simple API that worked on multiple video devices.
Even given current network technologies, network transparency is probably useless, or at the very least not interesting enough to make the centerpiece of your video I/O API. Network-transparent video data transfer layers can be built on top of a well-designed, simple video I/O API without burdening the vast majority of audio/video applications which do not care about network transparency.
SGI spent so much energy worrying about the high-level dataflow issues that it lost sight of the whole point of the exercise, and as a result each video board had to be programmed in its own device-specific way using the supposedly device-independent API.
And we're not just talking about a little setting here or there--the various video board designs disagreed on their basic design philosophy about how an app manipulates video. Some boards could only DMA video into and out of memory, and some boards could only send video direct to/from graphics using dedicated hardware paths. Those boards which did make video available in memory did so in a variety of often incompatible pixel formats. A video capture or playback application that worked on more than one board basically had to be written several times, once for each board!
Yet somehow all of this insanity happened within the framework of a single video I/O API. How could that happen?
After deliberating for years and speaking with the power agencies in each country, the panel announces a new, world-wide standard set of labels, colors, and international symbols that will now appear on every plug, so that if you've got a round-pointy-flat 3 prong 240-volter, by God you'll know exactly what it is, how to plug it in, and you can bet your last Euro that that plug will be the same color as the flat-round-pointy-pointy 4 prong 240-volter they use next door.
The panel also enthusiastically introduces a plan to "simplify" your life by offering:
But that hard-learned lesson still burdens SGI video programmers today (as many pages of the Lurker's Guide explain).
Right now we have an opportunity to skip over all that hardship on linux.
The most critical thing to notice here, and the thing SGI ignored at its peril, is that you can't do this without bringing in the hardware folks. A video I/O API is ultimately limited to doing what the video I/O device can do, and you need to make sure the devices you support all have a useful, common subset of functionality on which all applications can depend.
Otherwise, you've failed to offer device independence, and like the panel of experts above, you're only getting in the way.
Constraining the alignment of those buffers is ok if necessary (for good DMA, etc.) as long as this is clearly documented and works for every device supported by the API. Constraining the way the application can touch those buffers while they are being used for video I/O is ok as long as this is clearly documented and works for every device supported by the API. Yanking control of the buffer address itself from the application is not ok. Remember that the video I/O library is just that, it's not a memory manager. The application uses your library in concert with other libraries and none of the libraries should be selfish and claim control of buffer allocation for themselves.
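A minimal sketch of what that contract might look like in C. The vio_queue() call and its signature are hypothetical (and stubbed here so the sketch is self-contained); the point is that the application allocates and frees its own buffers, and the library documents only an alignment constraint:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical library entry point: the app, not the library, owns
 * the memory.  The library may document alignment constraints, but it
 * never allocates, frees, or relocates the buffer itself. */
int vio_queue(void *buf, size_t len);

/* Stub so the sketch compiles standalone. */
int vio_queue(void *buf, size_t len) { return (buf && len) ? 0 : -1; }

/* The app allocates with whatever allocator it likes, honoring the
 * documented constraint (page alignment is a typical DMA requirement). */
void *alloc_video_buffer(size_t len)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0)
        return NULL;
    return buf;
}
```

Because no library has claimed ownership of allocation, the app remains free to hand the same memory to its other libraries (audio, compression, graphics).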
Do not ever attempt to "help" the application by cutting down the input rate or output rate of video to something less than the video signal's full field/frame rate within your library or your device. Just give the application every field/frame and require the application to provide every field/frame. Back when PCs had 286 processors and 10 MB/sec memory throughput, it was necessary for hardware to "help" us by dropping frames on input or duplicating them on output, but nowadays doing so is just an academic exercise that is a complete waste of time and leads to many unnecessary ratholes in the area of dropped frame detection and synchronization (and tends to turn simple APIs into incomprehensible ones).
To be useful, it must be possible to queue up at least 1/2 second of video in the queue. Ideally you should offer the ability to queue more if the application so chooses. The queue should be designed so that if the app so chooses (for example, if it's running in an environment with qualified hardware, kernel, and device drivers), it can choose to queue less than 1/2 second. There are many applications where low latency is a requirement (computer animation of human actors wired up with sensors, for example). In your API, there should be no need for any configuration parameters relating to queue depth: queue depth should just be a function of how far ahead of time the application chooses to queue buffers and how the application gets scheduled by the linux process scheduler.
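To make the "no queue-depth parameter" point concrete, here is a sketch (the function name is illustrative): depth falls out of how many buffers the app decides to keep in flight, e.g. roughly 30 fields for half a second of NTSC-rate video.

```c
/* How many field/frame buffers to keep queued for a desired safety
 * margin.  Nothing here configures the device: the app just queues
 * this many buffers ahead and re-queues each one as it completes. */
int buffers_for_latency(double field_rate_hz, double latency_sec)
{
    int n = (int)(field_rate_hz * latency_sec + 0.5);
    return (n > 0) ? n : 1;   /* a low-latency app may keep just one */
}
```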
The application must be able to queue the same buffer multiple times for output, for example if it is trying to display a still field/frame or display an alternating pair of fields that form a still image.
As each input frame completes, the application should be able to determine the frame/field number of each chunk of data it has input. This should be part of the same efficient, non-blocking call the application uses to check completion (#1 above). The application can use this number to detect dropped input fields/frames and also do the synchronization trick described below.
As each output frame completes, the application should be able to determine the frame/field number on which the video device output the application's chunk of data. This should be part of the same efficient, non-blocking call the application uses to check completion (#1 above). That way, the application can pre-queue some data (perhaps black) and take some measurements so that it can begin queuing data knowing exactly the field/frame number when that data will go out. This is useful for the synchronization trick described below. In addition, this mechanism lets the application detect the serious error condition where the queue drains to zero, starving the application and causing missed frames on output.
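Given that completion feedback, dropped-field detection is only a few lines of application code. A sketch, assuming the device numbers fields consecutively (the numbering scheme itself is exactly the kind of thing the API would have to pin down):

```c
/* completed_fields[] holds the field number reported for each buffer
 * as it completed, in completion order.  Any gap between consecutive
 * entries means the hardware advanced without us: dropped fields on
 * input, or starvation/missed fields on output. */
int count_dropped(const long *completed_fields, int n)
{
    int dropped = 0;
    for (int i = 1; i < n; i++)
        dropped += (int)(completed_fields[i] - completed_fields[i - 1] - 1);
    return dropped;
}
```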
As an optional extension to "help" the application on video output, you might also want to let the application provide a field/frame number when they queue their buffer for output, such that the hardware will refrain from outputting that field/frame until the specified field/frame number arrives. This saves the app the trouble of queuing some black to get into a sane initial condition. But it turns out queuing black is not that hard, and so don't spend a lot of effort doing this. Furthermore, this is no substitute for the above: the application might not choose a timely frame number, or a serious problem on the system might delay the application, so it's absolutely required to still provide the output frame/field number feedback described above so that the application can detect exactly which frames were missed on output.
but NOT gettimeofday()--being adjusted by network time daemons, this clock is basically useless for a/v sync.
Keep this simple: all that is needed is a non-blocking call which returns a pair consisting of a field/frame number, and a value of the unadjusted clock, which coincide. This must be the same field/frame number mentioned above, so that the application can map the number back to an exact memory buffer which they have submitted.
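A sketch of that pairing in C. The vio_sync_t struct and its source are hypothetical; on Linux, clock_gettime(CLOCK_MONOTONIC, ...) is one plausible stand-in for the unadjusted clock, since unlike gettimeofday() it is not stepped by time daemons:

```c
#include <stdint.h>

/* Hypothetical result of the non-blocking sync query: a field number
 * and an unadjusted-clock value that coincide.  The field number uses
 * the same numbering as the I/O completion calls, so the app can map
 * it back to a specific buffer it submitted.  On Linux the unadjusted
 * clock could come from clock_gettime(CLOCK_MONOTONIC, ...). */
typedef struct {
    int64_t field;    /* hardware field/frame number */
    int64_t ust_ns;   /* unadjusted system time, nanoseconds */
} vio_sync_t;

/* With one (field, time) pair and the field period, the app can
 * predict exactly when any queued field hits the wire -- the basis of
 * a/v sync.  NTSC: 60000/1001 fields/sec, i.e. ~16683333 ns/field. */
int64_t field_to_ust_ns(vio_sync_t s, int64_t field, int64_t period_ns)
{
    return s.ust_ns + (field - s.field) * period_ns;
}
```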
The guaranteed worst-case accuracy of the relationship between the two clocks needs to be at worst ±15ms for good audio/video lip sync, and it needs to be at worst ±2ms if you expect to do frame-accurate machine control (used for frame-accurate video editing with an external VCR/VTR).
Non-blocking is important: do NOT tie this to the linux scheduler. Because linux delivers no real-time guarantees, having a call which "unblocks when ___ happens" is useless for audio/video synchronization, because by the time your process runs, an unpredictable amount of time could have passed (easily in the 10-50ms range, sometimes more). Instead, each library should provide the ability to map data buffers it inputs/outputs to precise values along a common unadjusted clock like the ones above and the ability to queue input/output data buffers ahead of time.
For more insights on this whole subject, see the "Time and Synchronization" section of the Lurker's Guide at:
http://lurkertech.com/lg/
Do not, I repeat do not, bother with any kind of video daemon; this is pointless. See the notes above about network transparency. Talk to programmers who write real video editing applications first before designing an API.
Keep it simple!
You could instead design your API to support a wide variety of parameters, and have some kind of query mechanism so an app can see what works on what devices. SGI did that.
The only problem is, that's a totally useless and stupid thing to do.
If a video I/O API does not offer a common set of scales, formats, aspects, and crops which are guaranteed to work on all devices, you can have all the query mechanisms in the world but it does an app no good. Think about it -- what is the app supposed to do when you plant it on a machine which supports only some new format? Spontaneously sprout new code to handle the format?
It is truly amazing how many otherwise respectable programmers have gone all the way down the "support everything, make it queryable" road without realizing that they are not solving any problem.
Once you provide a set of formats etc. that work on all devices, then it may be useful to provide software-queryable extensions which the app writer can test for when writing their code. Again, if they found one of these extensions was available, it must be the case that the extension works the same way on every device that supports it.
But spending time building an extension mechanism might be a waste of time also. The vast majority of application writers will stick with what they know works on all devices: application writers get no joy out of writing custom code for different boards.
If you think it's important, you could also specify maybe one other zoom level like 1/2 in each direction, but it's only likely to get used if it works on all devices.
A related issue to scaling is pixel aspect ratio, which will be described in the next section.
I can tell you from personal experience that if you take a casual, sloppy approach at specifying these critical aspects of video--the same approach that Apple, Microsoft, and other vendors have taken for years--your video API will be just as device-dependent and broken as theirs.
If you go into video making the assumptions of computer graphics, you'll have the same problems that SGI did. Read this article and just skip past all that pain!
For example, in the computer graphics world we assume that 100 pixels in X is the same length on the screen as 100 pixels in Y. But in some ubiquitous video cases, pixels are not square. For more info, read:
http://lurkertech.com/lg/pixelaspect/
Some video boards will produce square pixels in memory, some non-square. You'll need to document the common subset that all video boards can do.
Another example of video's complexity is fields. Read this article:
http://lurkertech.com/lg/fields/
to understand why your video I/O API must specify fields even if each data buffer contains a "frame" (at the very least, your users will need to know which field of a frame is temporally first and which is second).
As a final example, in the computer graphics world we have the simple, healthy idea that an image has certain dimensions like 640x480 and that's it. So we tend to think that video must also have a "full size" and that if you grab a "full frame" of the same video picture in the same pixel format from two video boards, you'll get the same data in the computer. Not so. It may surprise you to know that there is no standard concept of "full frame," not even in the video industry itself -- video board vendors often disagree on the exact line at which "the frame" starts and ends, and so two "full frame" images might be off by a few lines from each other.
For a video editing application which intends to take video material captured from different computers and edit it together--a very, very common operation in computer effects and animation--this is fatal. There's no way app writers can detect the problem, so this forces app users to make constant manual adjustments for no good reason.
There is an unambiguous way to say what "part" of the video your board is capturing, but to say it you must speak the language of video specifications.
This is a problem you can and must solve in order to have a device-independent video I/O API. This is a prime example of why creating the video I/O API must involve the hardware vendors. If you just write one without talking to them, you are basically wasting your time, because you're not going to help application writers.
Fortunately, none of the problems above are very hard to solve once you can clearly state the behavior you want from each board vendor.
Even more fortunately, the vast majority of the documentation work you'd need to do has already been done in the context of the QuickTime file format extensions for Y'CbCr data, which you can read about here:
Apple Technical Note 2162. This document was based on an extensive survey of existing practice in high-end video, Macs, and PCs.
I recommend you steal from this document. Ignore the few parts of it that relate to QuickTime and pay attention to the various parameters which are being specified.
I would further recommend that the set of globally supported parameters include the 8-bit '2vuy' pixel format at the "production level of conformance," as defined in the document above. This combination is likely to work with most, if not all, current PC video boards (you'll have to check with the vendors to make sure), provides a reasonable baseline for video applications, and provides good compatibility with high-end video gear and software.
Even if your video API were to only support that combination at first, the document above will get you a clear, video-centric description of behavior which you can show to video board vendors and which they will understand (but you need to understand it as well!). And already you'd have a video I/O API that's better than anything else available on any operating system today.
Then, when you get your confidence up, you can see if you can also add some R'G'B'A format at the production level of conformance to the list of combinations supported by all devices.
The main piece of advice is to always think of your API design from an app developer's perspective, and always bounce your ideas off of app developers. It wasn't until SGI got around to writing their own apps that we learned most of the lessons described here.
Copyright | All text and images copyright 1999-2023 Chris Pirazzi unless otherwise indicated. |