The CynMor Video Digital Video Primer

What is Digital Video?

Introduction: Why Digital?  We've all heard the term "digital." We have been familiar with digital telephones and digital CDs for some time. The cellular telephone industry has been bombarding us with visions of "crystal clear" transmissions for months now. The consumer has been introduced to the digital camera, which, in conjunction with a computer, allows us to shoot and view a still image without film. "Digital video" is, not surprisingly, the same technology applied to a stream of still images designed to produce a "motion picture" effect. Digital video.

Learn more!

Want to learn more? Check out the PC Technology Guide

* But what is so great about digital video? Why a new technology? After all, traditional analog video formats, such as the familiar VHS format, seem to work just fine. The secret of digital lies in the way the image is recorded. Instead of recording a complex analog waveform directly (more about that technical stuff later), digital technology records a long series of "binary digits" ["bits"], electromagnetic "states" that are either "on" or "off," then processes or codes them in a special way to turn them into a video signal.

But wait! That sounds exactly like how a computer works. Bits and bytes. We all remember those things. Well, that's exactly why digital video technology is so exciting. It brings the world of video production to the ordinary computer owner like you and me. We call this "desktop video" [DTV]. It is now possible for us to shoot high-quality videos, download the images to our computers, edit and manipulate the images in extraordinary ways, add fascinating and engaging special effects and sound, and output the finished product to a "master" medium such as VHS tape or CD-ROM, or prepare it for Internet viewing. We can even play back the videos on our computer monitors as full-screen, full-motion, dazzling multimedia events. And we can do this at PC prices, not at the price of the "broadcast" suites affordable only by the pros, which often cost $100,000 or more.

The Video Image: Are there Dots in Front of Your Eyes?  To really understand the power of digital video, we need to get technical for a moment or two and look "under the hood" at the video image itself and understand how that image is processed.

* Oh, oh, you mean it isn't enough for me to know that my TV screen has a display "resolution" defined by 720 "dots" per line in the horizontal direction and 485 in the vertical direction? That my computer display is considered high resolution at 640 x 480 or "super" high resolution at 800 x 600? Or that the Digital Video [DV] format specification is 720 x 480? Well, bear with me for a moment. I'll tell you why. Consider this scenario:

A high-quality color image "dot" on your screen requires a digital "code" that takes up 8 bits of space for each of the three screen colors - "red," "green," and "blue." Eight bits make up a computer "byte" (you remember that, don't you?). A typical video image changes about 30 times a second in order to give the illusion of motion. At a resolution of 720 x 480, that means a single second of video requires 30 x 720 x 480 x 3 bytes to "represent" those images - more than 30 million bytes!! That's 30 megabytes [MB]. Therefore, a two-hour feature video would require 30 MB x 7200 (because there are 7200 seconds in two hours), or more than 200 billion bytes!!! That's 200 gigabytes [GB]. We're talking HUGE! Today's computers may come with a 1 or 2 GB hard disk. You can buy a 10 GB hard disk for maybe $1000. But forget about 200 GB!! We're not quite there yet in the ordinary consumer market. Moreover, moving data of those magnitudes at sufficient speeds through a computer is quite a daunting task. Consider your lowly modem, which, at best today, moves data into your computer at rates of 33 to 56 thousand bits per second [56 Kbps]. How do we move 30 million pieces of data through the computer in one second? The answer is compression.
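
To make those numbers concrete, here is a quick back-of-the-envelope sketch in Python of the arithmetic above (the figures are the ones used in this paragraph, nothing product-specific):

    # Back-of-the-envelope arithmetic for raw (uncompressed) video.
    width, height = 720, 480          # DV-format image resolution
    bytes_per_pixel = 3               # 8 bits each for red, green, and blue
    frames_per_second = 30

    bytes_per_second = width * height * bytes_per_pixel * frames_per_second
    print(bytes_per_second)                    # 31,104,000 -- more than 30 MB every second

    seconds_in_two_hours = 2 * 60 * 60         # 7,200 seconds
    print(bytes_per_second * seconds_in_two_hours)   # ~224,000,000,000 -- over 200 GB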

Understanding Video Compression: First, the Spatio-Temporal Signal  The DV standard calls for us to move about 3.6 MB of data a second. So, if we can compress our data 5 to 10 times without losing any visible quality, we will have a digital-quality image. Remember, one second of raw, uncompressed DV data requires about 30 MB.
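
A one-line check of that claim, again using just the figures from above:

    raw_rate = 31_104_000      # ~30 MB per second of raw 720 x 480 x 24-bit video (from above)
    dv_rate  = 3_600_000       # the DV format's roughly 3.6 MB per second
    print(raw_rate / dv_rate)  # about 8.6 -- squarely in the 5-to-10x range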

HELP!  Like trying to roll a huge golf ball up a hill, even compression can seem an overwhelming task. Let's get an idea of how it's done.

*  Oh, no! Not more technical stuff! I thought I did pretty well slogging through that video bits and bytes stuff. Now what!!! Well, OK, just a little about the art of video compression.

The idea with video compression, in the ideal, is to take a video image data file and somehow make it smaller without losing any of the information. This is called lossless compression. Thus, in theory, the compressed file represents the "parent" file exactly, but using less data. If, however, some information is lost in the process, the technique is called lossy compression. The best DV hardware/software setups today can achieve virtually lossless compression or, at least, so-called "broadcast quality" video.
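
As a toy illustration of the lossless idea - this is a simple run-length scheme for the example's sake, not one of the actual DV or MPEG algorithms - notice that decoding gives back exactly what went in:

    # Toy run-length coder: lossless, because decode(encode(x)) always equals x.
    def rle_encode(values):
        out, i = [], 0
        while i < len(values):
            run = 1
            while i + run < len(values) and values[i + run] == values[i]:
                run += 1
            out.append((values[i], run))
            i += run
        return out

    def rle_decode(pairs):
        return [value for value, run in pairs for _ in range(run)]

    scan_line = [255] * 700 + [0] * 20            # a very "flat" line of pixels
    packed = rle_encode(scan_line)                # just two (value, run-length) pairs
    assert rle_decode(packed) == scan_line        # exact reconstruction -- nothing lost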

To understand how this compression is achieved, and to set the stage for understanding the power and versatility of DTV, consider, for a moment - I promise, just a moment - the video signal itself. A "motion picture," as we have seen, is merely a stream or sequence of images, displayed over time. Each image, as we have also seen, is a collection of "dots" [the familiar "pixels" we hear about on our computer screens, for example]. In black and white, each dot in the image would have to be described in a two-dimensional space - the screen - so we would need an x and a y coordinate (remember that algebra?) to do the job. The DV format has an image array of 720 dots per line x 480 lines. Because the picture also changes over time, each dot in the image requires an x or horizontal coordinate, a y or vertical coordinate, AND a t or time coordinate to represent it in the data file. This is the Spatio-Temporal Signal. Now, one second of video images would require about 10 million of these dots [720 x 480 x 30 "images" or "frames"]. A color pixel needs three pieces of data to accurately represent it, so we would need 10 million x 3, or about 30 million, pieces of data. This is the 30 MB figure we saw before.

* There is another way to represent a color video signal besides the Red, Green, Blue [R,G,B] method we discussed earlier. Each color dot could also be represented as a data element made up of a "light intensity" or "luminance" value [commonly called the "Y" value] and two color, or "chrominance," components, commonly called "U" and "V." It is possible to represent colors this way because the human eye is much more responsive to changes in light intensity than to subtle changes in chrominance.
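
For the curious, the conversion looks roughly like this (these are the standard BT.601-style weightings; exact scale factors vary between YUV flavors, so treat it as a sketch):

    # Approximate RGB -> YUV conversion (BT.601 luminance weights).
    def rgb_to_yuv(r, g, b):
        y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance: weighted toward green
        u = 0.492 * (b - y)                     # blue-difference chrominance
        v = 0.877 * (r - y)                     # red-difference chrominance
        return y, u, v

    print(rgb_to_yuv(255, 0, 0))   # pure red: modest Y, negative U, large positive V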

So, a video signal can be broken down into a mathematical array of light intensities - either monochromatic [B & W] intensities or component [RGB or YUV] color intensities - sampled over time at each horizontal and vertical position in the "frame" [the single image captured by our camera; there are about 30 frames per second of TV video].

In order to create this video signal, the typical device "scans" the image pixel intensities over time in the horizontal direction, line by line. [Well, actually the computer industry does this; the TV industry scans every other line, yielding two fields per frame, a process called "interlacing." The computer display signal is thus a "non-interlaced" display.] The typical complete video signal, then, is a synchronized coupling of the horizontal and vertical (i.e., line by line) scans. Moreover, if you think about it, this results in an analog signal. That is, the data stream represents a "picture" of the varying intensities of the dots, line by line, and is not yet a digital - that is, binary - representation.
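
A minimal sketch of the difference: an interlaced signal sends the odd-numbered lines as one field and the even-numbered lines as the next, while a progressive signal sends every line in order (the line numbering here is simplified; a real signal also carries sync and blanking information):

    # Split one frame (a list of scan lines) into the two fields of an interlaced signal.
    frame = ["line %d" % n for n in range(480)]

    field_one = frame[0::2]    # lines 0, 2, 4, ... -- sent first
    field_two = frame[1::2]    # lines 1, 3, 5, ... -- sent about 1/60th of a second later

    # A non-interlaced (progressive) display simply draws frame[0], frame[1], ... in order.
    print(len(field_one), len(field_two))    # 240 240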

So, our analog video signal is really a sequence of "dots" of varying intensities and chrominance [color], laid down line by line, very rapidly [30 frames per second, or 60 "fields" of odd/even lines in the case of the TV signal]. And, of course, each horizontal line contains 720 dots [in the "TV" and the DV formats], so the scanner revisits each line about 30 times a second, laying down 720 x 30, or nearly 22,000, dots per line every second - on the order of 10 million dots per second over the whole frame. That's pretty darn fast. No wonder our eyes perceive this as continuous "motion."

* There are three main analog video signal formats, "Composite," "Component," and "S-Video." The Composite signal combines the Y, U, and V components into a single signal. This is used today in the US in the most common TV signal standard [the NTSC standard]. (In the component format, the three primary components are carried as three separate signals; and, in the S-Video type, there are two separate signals - a Y and a C. More about this later when we look at the hardware options for DV.) So, how do we get from that analog signal to a digital signal?

Enter the Digital Signal   The analog video signal may be fine for ordinary broadcast purposes, but it suffers from several problems that are very relevant to the modern-day video world. If we are really trying to meld the realms of computer, video and telecommunications on a single multimedia platform, it is vital that the video signal be scalable, allow interactivity, be platform independent, and stand up to the numerous requirements of these operations.

Consider the following simple but eminently reasonable scenario:

You shoot that once-in-a-lifetime video of your cousin's wedding on your Canon Hi8 ES4000 Camcorder. You then plan to edit it on your computer, throw in some transitions, add some fancy effects, mix in some awesome music, then master the final product on a VHS-format tape for duplication and distribution to your relatives and friends. And you would like to do it in your lifetime.

DTV can make this process fun, easy, and relatively inexpensive. And that 10th copy of the master will look just as good as the "first generation." But if we only had an analog signal to work with, the process would be horrendous and expensive, or would require a professional "linear editing" suite like the studios use. Enter the digital video signal.

*  To digitize that analog signal - which, you remember, started life as a spatio-temporal signal, a single waveform representing three coordinates (horizontal, vertical, and time) - the analog signal is measured (sampled) in all three "directions." We call each sample point a "pixel." The digital signal, then, will contain a set of numbers that defines the number of pixels per line, the number of lines per frame, and the frame/field rate, among other things. [It also needs to specify the aspect ratio and interlacing method.] The first two parameters define the image resolution. The standard for the TV industry, known as CCIR 601 NTSC, defines a 720 x 485 image, with a 4:3 aspect ratio, a temporal rate of 60 fields [30 frames] per second, and 2:1 interlacing, as we saw above. A VESA-standard SVGA computer display, on the other hand, is an 800 x 600 image, at a temporal rate of 72 frames per second, with 1:1 (non-interlaced) scanning. But as we saw above, the number of data elements involved in representing this digital signal is more than 30 MB per second of raw, uncompressed digital video.
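
Those parameters are easy to picture as a small record. Here is a hypothetical sketch (the field names are mine, not taken from CCIR 601 or VESA) that also shows how the parameters determine the raw data rate:

    from dataclasses import dataclass

    @dataclass
    class VideoFormat:                # hypothetical record -- field names are illustrative
        pixels_per_line: int
        lines_per_frame: int
        frames_per_second: int
        interlacing: str              # "2:1" (two fields per frame) or "1:1" (progressive)
        aspect_ratio: str             # e.g. "4:3"

        def raw_bytes_per_second(self, bytes_per_pixel=3):
            return (self.pixels_per_line * self.lines_per_frame
                    * self.frames_per_second * bytes_per_pixel)

    ccir_601_ntsc = VideoFormat(720, 485, 30, "2:1", "4:3")
    svga_display  = VideoFormat(800, 600, 72, "1:1", "4:3")
    print(ccir_601_ntsc.raw_bytes_per_second())    # ~31 million bytes per second, uncompressed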

The Compression CODEC:  The obvious fix for this bandwidth hog is video compression. The idea is simple - reduce the number of data elements necessary to represent the image pixels, without losing perceptible quality. The process results in a binary code which the computer can now use. That binary code is, of course, a series of 1s and 0s, and is now subject to "perfect" copying without degradation.

*  Oh, oh! I sense we are about to get pretty technical again! Well, yes, but it won't be so bad this time. To get a hold on this compression business, we need to think about redundancy. If you think about a spatio-temporal video stream, with 60 fields/30 frames "captured" each second, unless we are dealing with a moving image of Superman, there will be a lot of similarity, and hence redundancy, between, for example, the frame at 01:16:34:25 and the one at 01:16:34:26. [That's the frame "address," just like your VCR's - in this example, one hour, 16 minutes, and 34 seconds into the tape, consecutive frames 25 and 26. We will come back to these "addresses" later when we look at Edit Decision Lists (EDL) and "deck control" issues.]

For example, the same pixel in consecutive frames may well have the same set of parameters, or the two may be closely correlated. In addition to spatial and temporal redundancies, video compression and decompression (CODEC) algorithms take advantage of some biological realities. We discussed briefly above that the human eye is much more sensitive to changes in luminance (Y) than to changes in chrominance (U, V). The eye is also less sensitive to fine detail - the high-frequency components of the image. Finally, CODEC algorithms take advantage of another element of the video signal representation; viz., not all values occur with the same frequency, so there is no need to use the same number of bits to code each of them.
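
Temporal redundancy is easy to see if you difference two consecutive frames: in a mostly static scene, almost every entry is zero, which is exactly what a CODEC exploits. A sketch with made-up luminance frames:

    import numpy as np

    # Two consecutive "frames" of a nearly static scene (luminance values only).
    frame_25 = np.full((480, 720), 128, dtype=np.int16)
    frame_26 = frame_25.copy()
    frame_26[100:116, 200:216] += 40      # only one small 16 x 16 patch actually changed

    difference = frame_26 - frame_25
    print(np.count_nonzero(difference), "of", difference.size, "pixels differ")
    # -> 256 of 345600 pixels differ: the rest is pure temporal redundancy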

There are several well-known video CODECs out there, such as the CCITT Recommendation H.261 (used in video conferencing applications) and the familiar Moving Picture Experts Group (MPEG) CODECs, of which there are MPEG-1, MPEG-2, and MPEG-4 (under development). The MPEG CODECs contain parts which deal with "administrative" details such as timing and multiplexing (the "systems" part), as well as audio and video parts. The newer standards include a data storage and retrieval part.

CODEC Mechanics: An MPEG Example  You may want to skip this part, since it is very technical. But if you want a large dose of Discrete Cosine Transformation mechanics and the like, here goes:

One way of looking at the video stream is that it consists of individual pictures, which may be a frame or two fields, and larger groups of pictures. Mathematically, a picture is a set of three pixel matrices, one for luminance and two for chrominance. We saw above that coding algorithms take advantage of spatial, temporal, biological, and coding redundancies. One example is used in MPEG compression. The MPEG CODECs exploit the human eye's idiosyncrasies by fractionating the number of chrominance pixels by a factor of one-half: in the horizontal direction only (the 4:2:2 format) or in both the horizontal and vertical directions (the 4:2:0 format). If there is no reduction, this is known as the 4:4:4 format.
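
A sketch of that fractionation (plain averaging stands in here for whatever filtering a real encoder would use):

    import numpy as np

    Y = np.random.rand(480, 720)    # full-resolution luminance
    U = np.random.rand(480, 720)    # full-resolution chrominance (blue difference)
    V = np.random.rand(480, 720)    # full-resolution chrominance (red difference)

    # 4:2:2 -- keep every other chrominance sample in the horizontal direction only.
    U_422, V_422 = U[:, ::2], V[:, ::2]

    # 4:2:0 -- halve the chrominance both horizontally and vertically
    # (here by averaging each 2 x 2 group of samples).
    U_420 = (U[0::2, 0::2] + U[0::2, 1::2] + U[1::2, 0::2] + U[1::2, 1::2]) / 4

    print(U.shape, U_422.shape, U_420.shape)    # (480, 720) (480, 360) (240, 360)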

In the MPEG schemes, individual pictures are categorized into three types, depending on the schemes used to compress them. These are I, P, and B pictures. Intra (I) pictures are coded by a technique known as transform coding, which takes advantage of spatial redundancy. The picture is divided into 8 x 8 blocks, and the pixel values in each block are transformed into a set of coefficients - one set for each of the Y, U, and V matrices - by the Discrete Cosine Transform (DCT). The transform takes advantage of the fact that most of the image energy lands in a few low-frequency coefficients, while the many small, high-frequency coefficients can be coded with very few bits. Remember that the luminance matrix, under the MPEG standard, is already sampled differently than the chrominance matrices.
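
Here is the 8 x 8 DCT written out directly (a real encoder uses a fast, integer-friendly version and follows the transform with quantization, which is where the actual bit savings come from):

    import numpy as np

    def dct_8x8(block):
        """Orthonormal type-II 2-D DCT of an 8 x 8 block of pixel values."""
        n = np.arange(8)
        # 1-D DCT basis matrix C; the 2-D transform is then C @ block @ C.T
        C = np.sqrt(2.0 / 8) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / 16)
        C[0, :] = np.sqrt(1.0 / 8)
        return C @ block @ C.T

    flat_block = np.full((8, 8), 100.0)    # a perfectly flat patch of the picture
    coeffs = dct_8x8(flat_block)
    print(coeffs[0, 0])                    # ~800.0: all the energy sits in one "DC" coefficient
    others = coeffs.copy()
    others[0, 0] = 0.0
    print(np.abs(others).max() < 1e-9)     # True: every other coefficient is essentially zero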

The DCT coefficients are then coded using a redundancy-removing scheme built on a set of rules called "run-length and Huffman" coding. This coding stage is itself lossless, but since spatial redundancy within a single picture is somewhat limited, only moderate compression can be achieved this way. In a typical analog-to-digital "migration," about two frames per second will be compressed by this technique, assuming 30 frames per second of video.
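
The coefficients are read out of the block in a zig-zag order so that the long tail of near-zero, high-frequency terms clumps together at the end, where run-length coding shrinks it dramatically. A sketch of just the zig-zag and the zero counting (no real Huffman tables here):

    import numpy as np

    def zigzag(block):
        """Read an 8 x 8 coefficient block in zig-zag (low-frequency-first) order."""
        order = sorted(((r, c) for r in range(8) for c in range(8)),
                       key=lambda rc: (rc[0] + rc[1],
                                       rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
        return [int(block[r, c]) for r, c in order]

    quantized = np.zeros((8, 8), dtype=int)                        # a typical block after quantization:
    quantized[0, 0], quantized[0, 1], quantized[1, 0] = 52, 3, -2  # a few low-frequency survivors

    scan = zigzag(quantized)
    print(scan[:6])                   # [52, 3, -2, 0, 0, 0]
    print(scan.count(0), "zeros out of 64 -- a long run, ideal for run-length coding")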

*  MPEG gets its best compression from the techniques used for the B (Bidirectional) and P (Predicted) pictures. The technique takes advantage of temporal redundancy by recognizing that consecutive frames are very closely related. For P pictures, the frame is divided into 16 x 16 macroblocks, and a previous I (or P) picture is used to "predict" each macroblock, using temporal and more complex differential coding techniques [which involve "motion vectors"]. For B pictures, similar methods are used, but the prediction is based on previous as well as future I and P pictures.
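
Motion-compensated prediction boils down to finding, for each 16 x 16 macroblock, the best-matching block in a reference picture; the offset is the "motion vector," and only the (hopefully small) difference then needs to be coded. A brute-force sketch, far simpler than any real encoder's search:

    import numpy as np

    def best_motion_vector(reference, block, top, left, search=8):
        """Exhaustively search +/- `search` pixels around (top, left) in `reference`
        for the 16 x 16 area that best matches `block` (smallest sum of absolute
        differences). Returns the winning offset (dy, dx) -- the motion vector."""
        best_sad, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                r, c = top + dy, left + dx
                if r < 0 or c < 0 or r + 16 > reference.shape[0] or c + 16 > reference.shape[1]:
                    continue
                sad = np.abs(reference[r:r + 16, c:c + 16].astype(int) - block.astype(int)).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        return best_mv

    # Hypothetical frames: the whole scene simply shifts 4 pixels to the right.
    reference = np.random.randint(0, 256, (480, 720), dtype=np.uint8)
    current = np.roll(reference, 4, axis=1)
    print(best_motion_vector(reference, current[64:80, 64:80], 64, 64))    # (0, -4)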

The result of these various MPEG compression schemes, using different combinations of spatial, temporal, and coding redundancy techniques, is an analog video stream transformed into a digitally-coded video clip which is far less of a bandwidth hog, and which suffers very little in terms of quality degradation. MPEG-1 schemes are usually used for computer-based applications such as video mail, CD-ROM film clips, and games. They employ data rates of between 1.5 and 3.0 Mbps (million bits per second), and the video is normally formatted at 352 x 240 resolution. This is considered comparable to VHS quality. The MPEG-2 standard is intended for "broadcast quality" video, employs much higher data rates, from 4 to 100 Mbps, and is formatted at 720 x 480 resolution.

Editing Considerations:  If all you want to do is take raw analog video footage and digitize it, there are numerous, quite inexpensive hardware/software products on the market which, utilizing suitable PC or Mac platforms, can get the job done for you. Specific products and their specifications and features will be discussed in the Hardware and Software portions of the Primer. However, to unleash the full power and versatility of DTV/DV, you will want to edit your videos. Of course, the motion picture and television broadcast industries have reached the pinnacle in terms of production and post-production editing of their analog projects. Many of them actually use digital techniques in some of their work. But for the "home desktop video aficionado" like you or me, digital editing is very much an evolving art.

Editing a digital video stream involves the ability to view and manipulate individual images. We could, theoretically, access every field or frame of the video by first encoding the analog source, as described above, then decoding the digitized format in order to access and view an image for editing. However, since the CODECs are not lossless, there will be some degradation of the product each time we compress/decompress the stream. Fortunately, the I pictures in the digitized format are directly accessible in the video stream. The more I pictures per second, the better the editing opportunity and quality. In practice, as we saw above, there are usually two I pictures per second available for editing. Even higher-quality MPEG-2 profiles and levels consist of video streams that are either all I pictures or in IBIBIB format. The MPEG-2 Studio Profile, utilizing a 4:2:2 format for the chrominance matrices at data rates up to 50 Mbps, results in a very high quality and eminently editable digital image. Other digital editing considerations involve the ability to perform splicing, fast forward and rewind, and closed-captioning manipulations. All of these operations are affected by the MPEG CODEC specifications - data rates, number of I pictures per second, chrominance subsampling formats, and so on.
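
For instance, a clean cut can only start at an I picture. Here is a hypothetical sketch of snapping a requested edit point back to the nearest preceding I picture (the group-of-pictures pattern and frame numbers are made up for illustration):

    # A made-up 30-frame-per-second stream with two I pictures per second:
    # the 15-picture group-of-pictures pattern "IBBPBBPBBPBBPBB", repeated.
    gop_pattern = "IBBPBBPBBPBBPBB"
    picture_types = [gop_pattern[n % len(gop_pattern)] for n in range(3600)]   # two minutes

    def snap_cut_to_i_picture(frame_number):
        """Back up from the requested frame to the nearest preceding I picture."""
        while picture_types[frame_number] != "I":
            frame_number -= 1
        return frame_number

    print(snap_cut_to_i_picture(1234))    # 1230 -- the closest clean cut point before frame 1234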

*   Editing in the digital video realm involves other aspects, such as the ability to "scale" the image as well as to embed audio in the digitized stream. Some MPEG formats allow for scalability, permitting the derivation of different resolutions from the same stream. Audio encoding techniques, like video methods, reduce bandwidth requirements by eliminating redundancies. At sample rates ranging from 32 to 48 kHz and data rates between 32 and 384 Kbps, MPEG audio compression normally follows the Masking pattern adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM) approach, although Dolby Digital (Surround AC-3) sound, with six discrete sound channels, is another option used alongside the MPEG video standards. This combination of MPEG video compression and Dolby sound is used, for example, in Digital Video Disk (DVD) applications.
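
The audio arithmetic is the same kind of back-of-the-envelope exercise as the video was (16-bit stereo is assumed here for the uncompressed side):

    sample_rate = 48_000                 # samples per second (the top of the 32-48 kHz range)
    channels, bits_per_sample = 2, 16    # assumed: 16-bit stereo on the uncompressed side

    raw_bits_per_second = sample_rate * channels * bits_per_sample    # 1,536,000 bits per second
    compressed_bits_per_second = 384_000                              # top of the 32-384 Kbps range
    print(raw_bits_per_second / compressed_bits_per_second)           # 4.0 -- at least a 4x reduction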

With digital video (and audio), the future is here. Literally.

© copyright 1998-99 Morton H. Levitt. All rights reserved.


Email: levitm@ix.netcom.com