An overview of video conferencing technology, standards and applications
Igor Tešija, B.Sc.
Abstract - Video conferencing provides real-time video and audio communication, and collaborative computing provides application sharing among desktop users in distributed locations. Together, video conferencing and collaborative computing are the key components of multimedia conferencing applications. This document provides information on the technologies necessary for multimedia conferencing applications. It focuses on audio and video compression methods, data transmission over various types of networks, and interoperability standards.
Advances in computer technology, such as faster processors and better data compression algorithms, have enabled the integration of audio and video data into the computing environment. Today videoconferencing can be achieved by adding software and relatively inexpensive hardware to standard desktop computers. Such systems can also easily incorporate data from other desktop computer applications into the conference.
The following sections discuss the enabling technology for videoconferencing. First, audio and video must be captured from their analog form and stored digitally to be manipulated by the computer. Uncompressed, this data would require massive amounts of bandwidth to transmit, so it is compressed before it is sent over communication channels. All this must happen in real-time to enable communication and interaction.
The frequency of sound waves is measured in Hertz (Hz), meaning cycles per second. The human ear can typically perceive frequencies between 20 Hz and 20 kHz. The human voice can typically produce frequencies between 40 Hz and 4 kHz. It is important to keep these limitations in mind when discussing digital audio encoding. Video conferencing systems are designed to handle speech-quality audio, which covers a much smaller range of frequencies than the range perceptible to humans.
2.1 Audio sampling
An analog audio signal has amplitude values that continuously vary with time. To digitally encode this signal, the amplitude value of the signal is measured at regular intervals. This is called sampling. To faithfully represent a signal, the sampling rate must be at least twice the highest frequency present in the signal. Sampling is a lossless operation in the sense that a band-limited signal can be reconstructed from its samples when this condition is met.
According to the Nyquist sampling theorem, 8 kHz is a sufficient sampling rate to capture the range of the human voice (40 Hz to 4 kHz) and 40 kHz is a sufficient sampling rate to capture the range of human hearing (20 Hz to 20 kHz). In practice, typical sampling rates range from 8 kHz to 48 kHz.
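The sampling-rate figures above follow from simple arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the aliasing helper assumes an ideal sampler with no anti-aliasing filter in front of it.

```python
def min_sampling_rate(max_freq_hz):
    """Smallest sampling rate (Hz) that satisfies the Nyquist criterion."""
    return 2 * max_freq_hz


def alias_frequency(freq_hz, sampling_rate_hz):
    """Apparent frequency of a pure tone after sampling.

    Equals freq_hz when the tone lies below half the sampling rate,
    and a folded (aliased) frequency otherwise.
    """
    folded = freq_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


print(min_sampling_rate(4_000))        # upper limit of the voice band
print(min_sampling_rate(20_000))       # upper limit of human hearing
print(alias_frequency(5_000, 8_000))   # a 5 kHz tone sampled at only 8 kHz
```

A tone above half the sampling rate is not lost but folded back into the representable band, which is why the signal is normally band-limited before sampling.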
2.2 Audio quantizing
Sampled values representing the amplitude of the signal are quantized into a discrete number of levels. The number of levels depends on how many bits are used to store the sample value. For digital audio, this precision usually ranges from 8 bits per sample (256 levels) to 16 bits per sample (65536 levels).
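As a minimal sketch of uniform quantization, the following maps a normalized sample to one of 2^bits levels and computes the resulting uncompressed bit rate. The normalization to the range [-1, 1] and the function names are assumptions made for illustration.

```python
def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to an integer level index in [0, 2**bits - 1]."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    # Scale to the level range and round to the nearest level.
    return int((clipped + 1.0) / 2.0 * (levels - 1) + 0.5)


def pcm_bit_rate(sampling_rate_hz, bits_per_sample, channels=1):
    """Uncompressed PCM bit rate in bits per second."""
    return sampling_rate_hz * bits_per_sample * channels


print(quantize(-1.0, 8), quantize(1.0, 8))   # lowest and highest 8-bit levels
print(pcm_bit_rate(8_000, 8))                # speech: 8 kHz, 8 bits
print(pcm_bit_rate(48_000, 16, channels=2))  # 48 kHz, 16 bits, stereo
```

Speech sampled at 8 kHz with 8 bits per sample already needs 64 kbps, which is the rate of one ISDN B-channel.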
2.3 Digital audio compression
Uncompressed digital audio requires a large amount of bandwidth to transmit. There are many techniques used to compress digital audio. Some of the techniques commonly used in desktop videoconferencing systems are described below. Typically these are techniques that can achieve real-time compression and decompression in software or inexpensive hardware. Some techniques apply to general audio signals and some are designed specifically for speech signals.
2.3.1. mu-law and A-law PCM:
With uniform PCM (Pulse Code Modulation) encoding, each sample is represented by a code word. Uniform PCM uses uniform quantizer step spacing. By performing a transformation, the quantizer step spacing can be changed to be logarithmic, allowing a larger range of values to be covered with the same number of bits. mu-law and A-law are the two most commonly used transformations. These transformations allow 8 bits per sample to represent the same range of values that would need 14 bits per sample with uniform PCM. This way a compression ratio of 1.75:1 is achieved.
The mu-law and A-law PCM encoding methods are formally specified in the ITU-T Recommendation G.711, “Pulse Code Modulation (PCM) of voice frequencies”. The mu-law PCM encoding format is common in the USA, Canada and Japan for digital telephony with ISDN (Integrated Services Digital Network). The A-law PCM is used with ISDN in other countries.
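The logarithmic transformation can be illustrated with the continuous mu-law curve. This is a sketch under stated assumptions: G.711 itself specifies a piecewise-linear approximation with table-defined code words, which is not reproduced here, and the function names are hypothetical.

```python
import math

MU = 255  # value commonly used for 8-bit mu-law


def mu_law_compress(x):
    """Compress a sample x in [-1, 1] using the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


for x in (0.001, 0.01, 0.1, 1.0):
    y = mu_law_compress(x)
    print(x, round(y, 4), round(mu_law_expand(y), 6))
```

Small amplitudes are spread over a large part of the output range, so a uniform quantizer applied after compression gives quiet passages finer effective steps than loud ones.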
2.3.2. ADPCM:
PCM encoding methods encode each audio sample independently from adjacent samples. However, adjacent samples are usually similar to each other, and the value of a sample can be predicted with some accuracy using the values of adjacent samples. For example, one simple prediction method is to assume that the next sample will be the same as the current sample. The ADPCM (Adaptive Differential Pulse Code Modulation) encoding method computes the difference between each sample and its predicted value and encodes the difference (hence the term "differential"). Fewer bits (typically 4) are needed to encode the difference than the complete sample value. Encoders can adapt to signal characteristics by changing quantizing or prediction parameters (hence the term "adaptive"). ADPCM typically achieves compression ratios of 2:1 when compared to mu-law or A-law PCM.
Differences among different flavors of ADPCM encoders include the way the predicted value is calculated and how the predictor or quantizer adapts to signal characteristics.
Many videoconferencing systems use ADPCM encoding methods. The ITU-T has several recommendations defining different ADPCM methods such as G.721, G.722, G.723, G.726 and G.727.
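The differential and adaptive ideas can be sketched with a toy coder. It uses the simple "next sample equals current sample" predictor mentioned above and an ad hoc step-size rule; the initial step of 16, the doubling and halving rule and all names are assumptions for illustration and do not correspond to any of the ITU-T recommendations listed.

```python
def _update(predicted, step, code, max_code):
    """State update shared by encoder and decoder so that both stay in sync."""
    predicted += code * step
    if abs(code) >= max_code - 1:
        step = min(step * 2, 2048)   # large differences: widen the steps
    elif abs(code) <= 1:
        step = max(step // 2, 1)     # small differences: narrow the steps
    return predicted, step


def adpcm_encode(samples, bits=4):
    """Encode integer samples as small signed difference codes."""
    max_code = 2 ** (bits - 1) - 1   # codes lie in [-7, 7] for 4 bits
    predicted, step = 0, 16
    codes = []
    for s in samples:
        code = round((s - predicted) / step)
        code = max(-max_code, min(max_code, code))
        codes.append(code)
        # Track what the decoder will reconstruct, not the true sample.
        predicted, step = _update(predicted, step, code, max_code)
    return codes


def adpcm_decode(codes, bits=4):
    """Rebuild samples by accumulating the scaled difference codes."""
    max_code = 2 ** (bits - 1) - 1
    predicted, step = 0, 16
    out = []
    for code in codes:
        predicted, step = _update(predicted, step, code, max_code)
        out.append(predicted)
    return out


codes = adpcm_encode([100] * 10)
print(codes)
print(adpcm_decode(codes))
```

The encoder predicts from the reconstructed value rather than from the original sample, so quantization errors do not accumulate between encoder and decoder.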
2.3.3. LPC and CELP:
There are some encoding methods designed specifically for speech. By using models of the characteristics of speech signals, these encoding methods can achieve good results for speech data, but they do not work well for non-speech audio signals.
An LPC (Linear Predictive Coding) encoder fits speech signals to a simple analytic model of the vocal tract. The best-fit parameters are transmitted and used by the decoder to generate synthetic speech that is similar to the original.
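One common way to obtain such best-fit parameters is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is not taken from any particular codec: it recovers the coefficients of a known two-pole filter from its impulse response, which stands in for a frame of speech. The filter coefficients 1.3 and -0.4 and the function names are assumptions chosen for the demonstration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (coefficients, residual energy)."""
    a = [0.0] * (order + 1)   # a[j] weights x[n - j]; a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


# Stand-in for a speech frame: impulse response of x[n] = 1.3 x[n-1] - 0.4 x[n-2].
x = [1.0, 1.3]
for n in range(2, 200):
    x.append(1.3 * x[n - 1] - 0.4 * x[n - 2])

coeffs, residual = levinson_durbin(autocorrelation(x, 2), 2)
print(coeffs)   # close to [1.3, -0.4]
```

Only the few predictor coefficients (plus gain and excitation information in a real coder) need to be transmitted per frame, rather than the samples themselves.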
A CELP (Code Excited Linear Prediction) encoder does the same vocal tract modeling as an LPC encoder. In addition, it computes the error between the input speech data and the model and transmits the model parameters and a representation of the errors. The errors are represented as indices into a common code book shared between encoders and decoders. This is where the name "Code Excited" comes from. The extra data and computations produce a higher quality encoding than simple LPC encoding.
ITU-T Recommendation G.728, which is one of the audio encoding formats specified by H.32x, uses a variation of CELP. G.728 requires a bandwidth of 16 kbps and is quite computationally complex, requiring fast processors or special hardware.
Video is a sequence of still images. When presented at a high enough rate, the sequence of images (frames) gives the illusion of fluid motion. Videoconferencing uses an analog video signal as input. This signal must be digitally encoded so that it can be manipulated by a computer. To better understand digital encoding, it helps to understand some background information about analog video, including basic color theory and analog encoding formats.
The human eye has three types of color photoreceptor cells called cones. Because of this, three numerical components are necessary and sufficient to represent a color. Color spaces are three dimensional coordinate systems whose axes correspond to three color components. Different color spaces are useful for different purposes and data can be translated from one color space to another. The color encoding systems used for video are derived from the RGB color space. RGB is an additive space that uses combinations of Red, Green, and Blue primaries.
Brightness and color information are treated differently by the human visual system. Humans are more sensitive to changes in brightness than changes in color. Because of this, a special component is used to represent brightness information. This component is called luminance and is denoted by the symbol Y. Two remaining components are used to represent color and are called chrominance. These chrominance components are color differences - the blue and red components with luminance removed. A YUV notation is usually used to refer to a color space represented by luminance and two color differences.
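The relationship between RGB and a luminance plus color-difference representation can be written out directly. The weights below are the widely used ITU-R BT.601 luminance weights and the usual analog YUV scale factors; the paper does not state which coefficients it has in mind, so they are an assumption here.

```python
def rgb_to_yuv(r, g, b):
    """Convert normalized RGB (0.0 to 1.0) to luminance and two color differences."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance: weighted sum of the primaries
    u = 0.492 * (b - y)                     # blue with luminance removed, scaled
    v = 0.877 * (r - y)                     # red with luminance removed, scaled
    return y, u, v


print(rgb_to_yuv(1.0, 1.0, 1.0))   # white: full luminance, essentially no chrominance
print(rgb_to_yuv(1.0, 0.0, 0.0))   # pure red
```

Any gray value (equal R, G and B) yields color differences of zero, so a black-and-white picture is carried entirely by the Y component.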
There are three formats for analog video. The NTSC (National Television System Committee) format is used in the Americas and Japan. NTSC has 525 lines per frame and 60 interlaced fields per second. With interlacing, two fields make a complete frame, resulting in 30 frames per second. Only about 480 of the 525 lines contain video information. The PAL (Phase Alternating Line) format is used in Western Europe and Australia. PAL has 625 lines per frame and 50 interlaced fields per second, resulting in 25 frames per second. In France, Russia and Eastern Europe a third format called SECAM is used. SECAM is rarely used in video conferencing systems.
3.1 Digital video compression
Analog video is digitized so that each frame of video becomes a two dimensional array of pixels. A complete color image is composed of three image frames, one for each color component.
Video compression is typically lossy, meaning some of the information is lost during the compression. This is acceptable though, because encoding algorithms are designed to discard information that is not perceptible to humans or information that is redundant. There are some basic techniques common to most video compression algorithms, including color space sampling and redundancy reduction.
Color space sampling is an effective technique used to reduce the amount of data that needs to be encoded. If an image is encoded in YUV space, the U and V components can be subsampled because the human eye is less sensitive to chrominance information.
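A minimal sketch of chrominance subsampling follows. It averages each 2x2 block of a chrominance plane, which is one common scheme; the averaging filter, the 2x2 factor and the helper names are assumptions rather than something prescribed by the text.

```python
def subsample_2x2(plane):
    """Average each 2x2 block of a plane (even width and height assumed)."""
    return [
        [
            (plane[r][c] + plane[r][c + 1]
             + plane[r + 1][c] + plane[r + 1][c + 1]) / 4
            for c in range(0, len(plane[0]), 2)
        ]
        for r in range(0, len(plane), 2)
    ]


def samples_per_frame(width, height, chroma_subsampled):
    """Total number of Y, U and V samples in one frame."""
    luma = width * height
    chroma = luma // 4 if chroma_subsampled else luma
    return luma + 2 * chroma


print(subsample_2x2([[1, 3], [5, 7]]))
print(samples_per_frame(352, 288, chroma_subsampled=False))
print(samples_per_frame(352, 288, chroma_subsampled=True))
```

With both chrominance planes reduced to a quarter of their size, a frame needs half as many samples as the full-resolution version before any further compression is applied.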
Redundancy reduction is another technique used to decrease the amount of encoded information. Intraframe encoding achieves compression by reducing the spatial redundancy within a picture. This technique works because neighboring pixels in an image are usually similar. Interframe encoding achieves compression by reducing the temporal redundancy between pictures. This technique works because neighboring frames in a sequence of images are usually similar.
JPEG is an encoding standard for still images developed by the Joint Photographic Experts Group. Although designed for still images, with special hardware it is possible to encode and decode a series of JPEG images in real-time to achieve motion video. This use of JPEG encoding is typically referred to as Motion JPEG or MJPEG. JPEG encoding typically achieves compression ratio between 10:1 and 20:1. Higher compression results in poorer image quality. A user configurable quality parameter is usually available that allows a compression vs. quality tradeoff.
No official MJPEG standard exists.
3.2. ITU-T Recommendation H.261
ITU-T Recommendation H.261 is a video compression standard designed for communication bandwidths between 64 kbps and 2 Mbps, in multiples of 64 kbps. This technique is also referred to as "px64" where "p" ranges from 1 to 30. H.261 was designed primarily for videoconferencing over ISDN and is specified by H.320.
H.261 utilizes both intraframe spatial and interframe temporal encoding. Two picture formats are defined: CIF (Common Intermediate Format), 352x288 pixels, and QCIF (Quarter CIF), 176x144 pixels. QCIF operation is mandatory, while CIF operation is optional. QCIF is usually used for low bit rates, such as p<3. Images are composed of three color components: Y and two color differences. Together, the two color difference components contain half as much information as the luminance component (for every 4 blocks of luminance information encoded, only 2 blocks of chrominance information are encoded, one for each color difference).
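The "px64" naming maps directly to the available bit rates, as the short calculation below shows. The function name is hypothetical.

```python
def h261_bit_rate_kbps(p):
    """Channel bit rate in kbps for a given multiplier p (1 to 30)."""
    if not 1 <= p <= 30:
        raise ValueError("p must be between 1 and 30")
    return p * 64


print(h261_bit_rate_kbps(1))    # a single 64 kbps channel
print(h261_bit_rate_kbps(2))    # two channels
print(h261_bit_rate_kbps(30))   # the upper end, just under 2 Mbps
```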
Intraframe encoding works essentially as with JPEG. 8x8 blocks are DCT transformed, quantized and run-length/entropy encoded. In interframe encoding mode, a prediction for blocks in the current frame is made based on the previous frame. If the difference between the current block and the predicted block is below a certain threshold then no data is sent. Otherwise the difference is calculated and DCT transformed, quantized and run-length/entropy encoded.
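The skip decision in interframe mode can be sketched as follows. This is a simplification under stated assumptions: the prediction is taken to be the co-located block of the previous frame, the difference measure is a sum of absolute differences, and the DCT, quantization and entropy coding stages are omitted. None of these choices, nor the function names, are mandated by the description above.

```python
def block_difference(current, previous):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(c - p)
               for row_c, row_p in zip(current, previous)
               for c, p in zip(row_c, row_p))


def encode_block(current, previous, threshold):
    """Return None when the block can be skipped, else the residual block."""
    if block_difference(current, previous) < threshold:
        return None   # below threshold: nothing is sent for this block
    # Otherwise the residual would go on to the DCT and quantizer.
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(current, previous)]


unchanged = encode_block([[10, 10], [10, 10]], [[10, 11], [10, 10]], threshold=4)
changed = encode_block([[10, 10], [10, 10]], [[0, 0], [0, 0]], threshold=4)
print(unchanged)
print(changed)
```

In a mostly static scene, such as a seated speaker in front of a fixed background, most blocks fall below the threshold, which is where the bulk of the interframe savings comes from.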
The quantizing step determines the amount of information that is sent - more information meaning better image quality. H.261 encoders adjust the quantizer value to achieve a constant bit rate. If the transmission buffer is close to full, the quantizer step size will be increased, causing less information to be encoded and poorer image quality. Similarly, when the buffer is not full, the quantizer step size is decreased, causing more information to be encoded and better image quality. Because of this quantizer adjustment, rapidly changing scenes will have poorer quality than static scenes.
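The feedback loop between buffer occupancy and quantizer step size can be sketched as a simple controller. The thresholds, the step increment of one and the step range are illustrative assumptions; actual encoders are free to choose their own rate control strategy.

```python
def next_quantizer_step(step, buffer_fill, low=0.3, high=0.7,
                        min_step=1, max_step=31):
    """Pick the next quantizer step from the buffer occupancy (0.0 to 1.0)."""
    if buffer_fill > high:
        return min(step + 1, max_step)   # buffer filling up: coarser quantization
    if buffer_fill < low:
        return max(step - 1, min_step)   # buffer draining: finer quantization
    return step


step = 8
for fill in (0.5, 0.8, 0.9, 0.95, 0.4, 0.1):
    step = next_quantizer_step(step, fill)
    print(fill, step)
```

A burst of motion fills the buffer, the step size rises and the picture gets coarser; once the scene settles, the buffer drains and quality recovers.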
IV. CIRCUIT-SWITCHED VS. PACKET-SWITCHED COMMUNICATIONS
Different types of data have different service requirements. Some data types are sensitive to delay, while others are sensitive to reliability. Generic data is not sensitive to delay but is sensitive to reliability. An example is a data file that is sent over a network: it does not matter how long the file takes to get to its destination, but the information in the file is expected to be correct. Voice data is sensitive to delay but is not sensitive to reliability. Voice data must arrive at a constant rate with little variance for it to be intelligible, but some loss of information is acceptable. Still image data is not sensitive to delay but is sensitive to reliability. Incorrect image data may be noticeable in the form of visual artifacts, but delivery time is not crucial. Video data is sensitive to delay, and large delays show up as a jerky picture. Uncompressed video data is not sensitive to reliability, since a lost frame is immediately replaced by the next frame. However, compressed video, which uses intraframe and interframe encoding techniques, is sensitive to reliability, since redundancy has been removed and the effects of data loss may propagate. This is important to consider when sending video data across unreliable communication channels. Some video compression techniques compensate for this sensitivity to data loss by periodically sending complete information about a frame.
Videoconferencing can involve all the data types discussed above. Audio and video data is sent among participants. Other types of data that may be sent are whiteboard data or shared application data. Some of this data requires reliable transmission while some requires timely transmission.
4.1 Circuit-switched communications
Circuit-switched communication is a method of data transfer where a path of communication is established and kept open for the duration of the session. A dedicated amount of bandwidth is allocated for the exclusive use by the session. When the session is completed, the bandwidth is freed and becomes available for other sessions.
Advantages of circuit-switched communication for videoconferencing are that dedicated bandwidth is available and the timing of the data delivery is predictable. A disadvantage of circuit-switched communication for videoconferencing is that sessions are primarily point-to-point and require expensive multipoint control units (MCUs) to accommodate multipoint conferences. Also, dedicated bandwidth is wasted during periods of limited activity in a conference session.
4.2 Packet-switched communications
Packet-switched communication is a method of data transfer where the information is divided into packets, each of which has an identification and destination address. Packets are sent individually through a network and, depending on network conditions, may take different routes and arrive at their destination at different times and out-of-order. No dedicated bandwidth circuit is set up as with circuit-switched communication. Bandwidth must be shared with all other users of the network.
An advantage of packet-switched communication for videoconferencing is the capability to more easily accommodate multipoint conferences. A disadvantage is the unpredictable timing of data delivery, which can cause problems for delay sensitive data types such as voice and video. Video packets received out-of-order may have to be discarded. Audio packets can be buffered at the receiver, re-ordered, and played out at a constant rate. However, this introduces a delay, which can be a big problem when trying to achieve interactive communication.
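Receiver-side buffering and re-ordering can be sketched with a small priority queue keyed on a sequence number. The class name, the fixed depth and the use of plain integer sequence numbers (with no wraparound handling or playout timing) are simplifying assumptions.

```python
import heapq


class JitterBuffer:
    """Hold back a few packets so that late arrivals can be re-ordered."""

    def __init__(self, depth):
        self.depth = depth   # packets held back; more depth means more delay
        self._heap = []

    def push(self, seq, payload):
        """Store a packet as it arrives, in whatever order the network delivers it."""
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Release packets in sequence order once more than `depth` are queued."""
        out = []
        while len(self._heap) > self.depth:
            out.append(heapq.heappop(self._heap))
        return out


jb = JitterBuffer(depth=2)
for seq, payload in [(2, "b"), (1, "a"), (3, "c"), (5, "e"), (4, "d")]:
    jb.push(seq, payload)
    print(seq, jb.pop_ready())
```

The depth sets the tradeoff described above: a deeper buffer tolerates more re-ordering and jitter but adds a proportional playout delay.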
4.3 Broadband ISDN
Broadband ISDN (BISDN) can solve the problems encountered with circuit-switched and packet-switched communication. Asynchronous Transfer Mode (ATM) is the data link layer protocol that is commonly associated with BISDN. ATM combines the best qualities of circuit-switched and packet-switched communication. ATM can support different data transmission speeds, multiplex signals of different data types, and provide different classes of service. These capabilities will satisfy the service requirements of the different types of data possible with videoconferencing.
In CARNet, various experiments with room-based videoconferencing over ATM were conducted. As a result, interactive distance lectures over ATM are a regular practice in Croatian universities today.
4.4 Modes of conferencing
4.4.1. POTS conferencing
POTS (Plain Old Telephone Service) is the basic telephone service that provides access to the public switched telephone network (PSTN). This service is widely available but has very low bandwidth (typical modem speeds are 14.4 kbps, 28.8 kbps or 33.6 kbps). Work is in progress on H.324, an interoperability standard for POTS with V.34 modems (up to 28.8/33.6 kbps). H.324 is designed for the best performance possible on low bitrate networks.
4.4.2. ISDN conferencing
ISDN (Integrated Services Digital Network) is a digital service. There are two access rates defined for ISDN, Basic Rate Interface (BRI) and Primary Rate Interface (PRI). Basic Rate Interface provides 2 data channels of 64 kbps (B-channels) and one signaling channel of 16 kbps (D-channel). There are many desktop videoconferencing products on the market that utilize ISDN BRI. However, problems exist with access to ISDN because it is not available in all areas. Primary Rate Interface provides 23 or 30 B channels of 64 kbps and one D channel of 64 kbps. ISDN PRI is expensive and therefore not really applicable for desktop videoconferencing.
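The channel structures above translate into the following aggregate rates. The function name is hypothetical; the 23-channel and 30-channel cases are the two PRI variants mentioned in the text.

```python
B_CHANNEL_KBPS = 64


def isdn_capacity_kbps(b_channels, d_channel_kbps):
    """Total of the B-channels plus the D-channel, in kbps."""
    return b_channels * B_CHANNEL_KBPS + d_channel_kbps


print(isdn_capacity_kbps(2, 16))    # BRI: 2B + D
print(isdn_capacity_kbps(23, 64))   # PRI with 23 B-channels
print(isdn_capacity_kbps(30, 64))   # PRI with 30 B-channels
```

Only the B-channels carry conference data, so a BRI connection offers 128 kbps for audio and video combined.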
Because ISDN channels offer 64 kbps of bandwidth, standards and compression algorithms have been designed around that number. 64 kbps has become somewhat of a magic number for videoconferencing.
4.4.3. LAN and Internet conferencing
LANs provide connectivity among a local community. The Internet connects LANs to other LANs. The protocol developed to interconnect various networks is called the Internet Protocol (IP). Two transport layer protocols were developed with IP: TCP and UDP. TCP (Transmission Control Protocol) provides a reliable end-to-end service by using error recovery and reordering. UDP (User Datagram Protocol) is an unreliable service that makes no attempt at error recovery.
Videoconferencing applications that operate over the Internet primarily use UDP for video and audio data transmission. TCP is not practical because of its error recovery mechanism. If lost packets were retransmitted, they would arrive too late to be of any use. TCP is used by some videoconferencing applications for other data that is not time sensitive such as whiteboard data and shared application data.
4.4.4. Internet MBone conferencing
The Multicast Backbone, or MBone, has been called a virtual network because it is layered on parts of the Internet. To understand how the MBone works, it is important to understand the difference between unicast and multicast. Unicast is a point-to-point transmission of data. To achieve a one-to-many transmission, separate copies of the data must be sent by the source to each destination. Multicast enables a more efficient way to deliver the same data to multiple destinations.
The challenges of transmitting audio and video over the Internet have led to the development of a new transport protocol proposed by the Audio/Video Transport working group of the IETF (Internet Engineering Task Force). RTP (Real-time Transport Protocol) provides support for sequencing, timing, and quality of service reporting for point-to-point or multipoint communication. Most of the commonly used MBone tools implement some version of RTP, as do some commercially available tools such as InPerson (Silicon Graphics) and ShowMe (Sun Microsystems).
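Sequencing and timing are carried in a small fixed header in front of each media packet. The sketch below packs and unpacks the 12-byte fixed header layout of RTP version 2; it assumes no padding, no header extension and no contributing sources, and the function names and field values in the usage lines are illustrative.

```python
import struct


def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Build a 12-byte RTP version 2 fixed header."""
    byte0 = 2 << 6   # version 2; padding, extension and CSRC count all zero
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(data):
    """Parse the fixed header fields from the start of a packet."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,     # lets the receiver detect loss and re-order
        "timestamp": timestamp,   # lets the receiver reconstruct playout timing
        "ssrc": ssrc,             # identifies the sending source
    }


header = pack_rtp_header(payload_type=31, sequence=1000,
                         timestamp=90_000, ssrc=0x1234ABCD)
print(len(header), unpack_rtp_header(header))
```

Because RTP typically runs over UDP, the sequence number and timestamp are what allow a receiver to cope with loss, re-ordering and jitter without retransmission.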
V. INTEROPERABILITY STANDARDS
Interoperability means that products from different vendors can communicate. To accomplish this goal, standards are required. There are several standards groups working towards producing and promoting standards for desktop videoconferencing.
The ITU (International Telecommunication Union) is an agency of the United Nations. It is a worldwide organization within which governments and private companies coordinate the establishment and operation of telecommunication networks and services. The ITU-T is the Telecommunication Standardization Sector of the ITU and has developed standards for audio, video and data conferencing primarily over ISDN. The ITU-T is working cooperatively with the IETF to extend its videoconferencing standards to include packet-switched networks.
The IMTC (International Multimedia Teleconferencing Consortium) is a non-profit corporation founded to promote the creation and adoption of international standards for multipoint videoconferencing and document conferencing. IMTC’s emphasis is on multimedia teleconferencing, including still-image graphics, full motion video and data teleconferencing. The IMTC promotes the standards adopted by the ITU including H.320, H.323, H.324 and T.120, conducts interoperability trials and defines API specifications as extensions to standards.
5.1 ITU-T Recommendation H.323
The H.323 standard provides a foundation for audio, video, and data communications across IP-based networks, including the Internet. By complying with H.323, multimedia products and applications from multiple vendors can interoperate, allowing users to communicate without concern for compatibility.
H.323 is an umbrella recommendation from the ITU that sets standards for multimedia communications over LANs that do not provide a guaranteed QoS (Quality of Service). The H.323 specification was approved in 1996. The standard is broad in scope and includes both stand-alone devices and embedded personal computer technology as well as point-to-point and multipoint conferences.
The standard addresses call control, multimedia management, and bandwidth management for point-to-point and multipoint conferences. H.323 also addresses interfaces between LANs and other networks.
TABLE I - THE ITU H.320 AND H.323 UMBRELLA RECOMMENDATIONS
H.323 is part of a larger series of communications standards that enable videoconferencing across a range of networks. Known as H.32X, this series includes H.320 and H.324, which address ISDN and PSTN communications, respectively. H.323 is in many ways a derivative of H.320, a 1990 umbrella recommendation for video telephony over switched digital telephone networks. H.323 borrows heavily from H.320’s structure, modularity, and audio/video codec recommendations.
H.323 references the T.120 specification for data conferencing.
5.2 ITU-T Recommendation T.120
The T.120 standard contains a series of communication and application protocols and services that provide support for real-time, multipoint data communications.
Broad in scope, T.120 is a comprehensive specification that solves several problems that have historically slowed market growth for applications of this nature. Perhaps most importantly, T.120 resolves complex technological issues in a manner that is acceptable to both the computing and telecommunications industries. Over 100 key international vendors, including Apple, Microsoft, AT&T, BT, Cisco Systems, Intel, MCI, and PictureTel, have committed to implementing T.120 based products and services.
This paper discussed important technical aspects of videoconferencing. Data compression is important since audio and video data require a large amount of bandwidth for transmission. Data compression techniques vary in their quality, amount of bandwidth required, and computational complexity. There are two major types of communication channels available to transmit the data: circuit-switched and packet-switched. Circuit-switched channels such as ISDN offer dedicated bandwidth and predictable timing of data delivery but do not easily support multipoint communication. Packet-switched channels, either local (LAN) or wide area (Internet), more easily support multipoint communication but do not provide predictable timing of data delivery. B-ISDN and ATM show promise for solving some of the problems encountered with both circuit-switched and packet-switched networks. Interoperability standards that allow systems from various vendors to communicate with each other are among the most important factors for the future of videoconferencing.
D. Pan, Digital Audio Compression, Digital Technical Journal, Vol. 5 No. 2,
C. Poynton, Frequently Asked Questions about Colour,
R. Frederick, Experiences with real-time software
S. Casner, Frequently Asked Questions (FAQ) on the Multicast Backbone (MBONE),
A. Spencer, Video Communications - Phillips OmniCom Training Course Book