Inspironlabs | 31 July, 2020

What is OTT Platform

Written by Sachin Shetty

Decades ago, television was the primary source of information as well as entertainment. We have all heard our elders tell the story of how the entire world was glued to TV sets to watch the first human step on the Moon. We also witnessed events like the Beijing Olympics, which drew a total viewership of 3.6 billion, primarily through television.

The drawback of watching on a television is that the viewer is bound to one place. Thanks to the ongoing internet revolution, this has changed. The arrival of HD-capable smartphones and tablets equipped with powerful video decoders has enabled people to watch their favourite programs on the go.

OTT stands for Over The Top. It is a platform that uses the Internet as its key distribution medium to discover, share, and consume TV content anywhere, anytime, and on any device. Some of the most popular OTT content providers are Netflix, Hulu, Disney-Hotstar, Voot, Amazon Prime, YouTube, HBO, and Sony.

Video Streaming Protocols for OTT Platform

Compared with stored media such as DVDs and VHS tapes, streaming video content on an OTT platform is considered more efficient, convenient, and reliable. Streaming also works with Digital Rights Management (DRM), which protects the content from piracy. This makes it more acceptable to stream a video directly to a personal device (such as a PC, phone, or tablet) rather than via a proprietary set-top box (STB), which employs hardware controls and encryption to prevent someone from reusing the video content.

Compression converts the video information to a lower bitrate, which reduces the cost of storing and transmitting a video. Let us discuss the main building blocks of video coding. We also look at some practical tips suggested by experts to improve the compression efficiency and video quality of different types of content in OTT.

1. Sampling formats of Raw videos

A raw 60-minute SD video may require roughly 100 GB of storage, equivalent to about 25 DVDs. A raw 1-minute UHD-1 video may take up as much as 50 GB. A raw UHD-2 video may take as much as 100 GB, which is double that storage requirement.
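The order of magnitude of these figures can be checked with simple arithmetic. The sketch below is only an illustration: the resolutions (720 × 480 for SD, 3840 × 2160 for UHD-1), the 30 frames per second, and the three 8-bit samples per pixel are assumptions and are not stated in the article.

```python
def raw_video_bytes(width, height, fps, seconds,
                    samples_per_pixel=3, bits_per_sample=8):
    """Size in bytes of uncompressed video (no subsampling, no container overhead)."""
    bits_per_frame = width * height * samples_per_pixel * bits_per_sample
    return bits_per_frame * fps * seconds // 8

GB = 10**9  # decimal gigabyte

# Assumed parameters: SD = 720x480, UHD-1 = 3840x2160, both at 30 fps.
sd_hour_gb = raw_video_bytes(720, 480, 30, 60 * 60) / GB
uhd1_minute_gb = raw_video_bytes(3840, 2160, 30, 60) / GB

print(f"SD, 60 min:   {sd_hour_gb:.1f} GB")
print(f"UHD-1, 1 min: {uhd1_minute_gb:.1f} GB")
```

Under these assumptions the results come out at roughly 112 GB and 45 GB, in line with the approximate figures quoted above. Other frame rates or sample formats would shift the numbers.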

Each digital video frame, which may also be known as a picture or an image, is sampled and represented by a rectangular matrix or array of picture elements called pixels. The term pixel is used to describe a still-frame image.

Each raw pixel in a video frame can be divided into three samples that correspond to three different colour components. They are a luminance (Y) sample and two colour samples, namely red chrominance (Cr) and blue chrominance (Cb). The YCrCb components are also collectively known as the YUV sampling format or colour space.

The luminance (luma) sample represents brightness whereas the chrominance (chroma) samples represent the extent to which the colour deviates from gray toward red or blue. Each sample of the colour component is represented by a fixed integer value, typically ranging from 0 to 255 for 8 bits of precision (for luma, 0 is black and 255 is white).

Thus, the luma and chroma components of a video frame can be represented by three rectangular matrices (planes) of integers. By defining a colour space, samples can be identified numerically by their coordinates. The term sample (rather than a pixel) is more commonly used in video coding standards since the luma and chroma samples may require a different set of coding parameters.
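As an illustration of the three planes, the sketch below converts a small RGB frame into Y, Cb, and Cr matrices. The article does not say which conversion is used; the full-range BT.601 coefficients (the ones used by JPEG) are an assumption, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit (Y, Cb, Cr) samples.

    Assumes full-range BT.601 coefficients, as used by JPEG.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        # Round to the nearest integer and keep it in the 8-bit range.
        return max(0, min(255, round(value)))

    return clamp(y), clamp(cb), clamp(cr)

def split_planes(frame):
    """Split a frame (rows of (R, G, B) tuples) into Y, Cb and Cr matrices."""
    converted = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in frame]
    y_plane = [[sample[0] for sample in row] for row in converted]
    cb_plane = [[sample[1] for sample in row] for row in converted]
    cr_plane = [[sample[2] for sample in row] for row in converted]
    return y_plane, cb_plane, cr_plane

# A 1x3 frame: black, white and pure red pixels.
frame = [[(0, 0, 0), (255, 255, 255), (255, 0, 0)]]
y_plane, cb_plane, cr_plane = split_planes(frame)
print(y_plane, cb_plane, cr_plane)
```

Black and white pixels land on the neutral chroma value 128 and differ only in luma, while the red pixel pushes Cr toward its maximum.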

2. Impact of Video Compression

The primary aim of video compression is to remove spatial and temporal redundancy, so as to encode a video at the minimum bit rate for a given level of video quality or to improve the video quality for a given bit rate. The redundancy is inherent in the video content because, on average, only a small number of samples change from one frame to the next. Hence, if only the changes are encoded, a significant amount of storage space or network bandwidth can be conserved. The video quality of a compressed video is largely dictated by the encoding process. Lossy video coding improves the compression efficiency (i.e., smaller compressed file size) compared to lossless compression, at the cost of discarding some information.

3. General video codec operations

The encoding process is typically more time consuming than the decoding process. While many videos can be encoded in advance, the decoding latency should be reasonable for streaming applications and channel switching in broadcast operations. Each video frame is typically partitioned into a grid of blocks, which are analysed by the encoder to determine the blocks that must be transmitted (i.e., blocks that contain significant changes from frame to frame).

To do this, each block is spatially or temporally predicted. Spatial or intraframe prediction employs a block identifier for a group of samples that contain the same characteristics (e.g., colour, intensity) for each frame. The identifier is sent to the decoder. Temporal prediction predicts interframe motion and compensates for any inaccuracy in the prediction. The difference in the frame information, which also corresponds to the prediction error, is called a residual.
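A minimal sketch of this idea follows. It assumes the simplest possible temporal prediction, where each block is predicted by the co-located block of the previous frame (no motion search), and it uses a hypothetical threshold on the sum of absolute residuals to decide which blocks are worth transmitting. Real encoders are far more elaborate.

```python
def blocks_to_transmit(prev, curr, block=2, threshold=0):
    """Return {(row, col): residual_block} for blocks with a significant residual.

    prev and curr are equally sized matrices of luma samples. Each block of
    curr is predicted by the co-located block of prev (zero-motion assumption).
    """
    height, width = len(curr), len(curr[0])
    changed = {}
    for r in range(0, height, block):
        for c in range(0, width, block):
            # Residual = actual samples minus predicted samples.
            residual = [[curr[i][j] - prev[i][j]
                         for j in range(c, min(c + block, width))]
                        for i in range(r, min(r + block, height))]
            if sum(abs(v) for row in residual for v in row) > threshold:
                changed[(r, c)] = residual
    return changed

prev_frame = [[10] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
curr_frame[2][3] = 12  # two samples change in the bottom-right block
curr_frame[3][2] = 15

print(blocks_to_transmit(prev_frame, curr_frame))
```

Only one of the four 2 × 2 blocks carries a non-zero residual here, so only that block would need to be coded.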

4. Transform Coding

Transform coding is a key component of video compression. A good transform can decorrelate the input video samples and concentrate most of the video information (or energy) using a small number of transform coefficients. In this way, many coefficients can be discarded, thus leading to compression gains.

The transform should be invertible and computationally efficient, and it should be able to support subframe video coding using variable block sizes. Also, the basis functions of a good transform should produce smoother and perceptually pleasant reconstructed samples. Many video coding standards employ a block transform of the residuals. Intracoded blocks and residuals are commonly processed in N × N blocks (N is usually 4, 8, 16, or 32) by a two-dimensional (2D) discrete transform.
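The energy compaction and invertibility described above can be seen with a small example. The article does not name a specific transform; the sketch below assumes an orthonormal 2D discrete cosine transform (DCT-II), written in plain floating-point Python for readability. Practical codecs use fast integer approximations.

```python
import math

def _alpha(k, n):
    # Scaling factors that make the transform orthonormal (hence invertible).
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def _basis(i, k, n):
    # Cosine basis function of frequency k evaluated at sample position i.
    return math.cos((2 * i + 1) * k * math.pi / (2 * n))

def dct_2d(block):
    """Forward 2D DCT-II of an N x N block."""
    n = len(block)
    return [[_alpha(u, n) * _alpha(v, n) *
             sum(block[x][y] * _basis(x, u, n) * _basis(y, v, n)
                 for x in range(n) for y in range(n))
             for v in range(n)] for u in range(n)]

def idct_2d(coeffs):
    """Inverse 2D DCT, reconstructing the N x N block of samples."""
    n = len(coeffs)
    return [[sum(_alpha(u, n) * _alpha(v, n) * coeffs[u][v] *
                 _basis(x, u, n) * _basis(y, v, n)
                 for u in range(n) for v in range(n))
             for y in range(n)] for x in range(n)]

# A smooth 4x4 block: a horizontal ramp repeated on every row.
ramp = [[10, 20, 30, 40] for _ in range(4)]
coeffs = dct_2d(ramp)

# Count coefficients that are not (numerically) zero.
significant = sum(1 for row in coeffs for c in row if abs(c) > 1e-6)
print(f"{significant} of 16 coefficients carry the block's energy")

restored = idct_2d(coeffs)
```

For this smooth block only a few coefficients are non-zero, and the inverse transform recovers the original samples up to floating-point error. Compression comes from quantizing or discarding the small coefficients, which this sketch does not do.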

5. Entropy Coding

Entropy coding is a lossless or reversible process that achieves additional compression by coding the syntax elements (e.g., transform coefficients, prediction modes, motion vectors (MVs)) into the final output file. As such, it does not modify the quantization level. Variable-length codes (VLCs) such as Huffman, Golomb, or arithmetic codes are statistical codes that have been widely used.
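To make the idea of a variable-length code concrete, here is a minimal sketch of the order-0 Exp-Golomb code, a member of the Golomb family that H.264/AVC uses for many syntax elements. Choosing this particular code is an assumption made for illustration; the article does not single it out. Small values, which are assumed to be the most frequent, get the shortest codewords.

```python
def exp_golomb_encode(n):
    """Order-0 Exp-Golomb codeword for a non-negative integer, as a bit string."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    binary = bin(n + 1)[2:]                    # binary form of n + 1
    return "0" * (len(binary) - 1) + binary    # prefix of len - 1 zeros

def exp_golomb_decode(bits):
    """Decode a concatenation of order-0 Exp-Golomb codewords."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                  # count the zero prefix
            zeros += 1
            i += 1
        # The next zeros + 1 bits hold n + 1 in binary.
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values

symbols = [0, 1, 0, 3, 0, 2]
bitstream = "".join(exp_golomb_encode(s) for s in symbols)
print(bitstream, exp_golomb_decode(bitstream))
```

The round trip is exact, which illustrates the lossless nature of entropy coding: the decoder recovers the symbols without any side information about codeword boundaries.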


6. MPEG Video Coding

The ISO/IEC Moving Picture Experts Group (MPEG) family of video coding standards is based on the same general principles: spatial intracoding using block transformation and motion-compensated temporal intercoding. The hierarchical structure supports interoperability between different services and allows decoders to operate with different capabilities (e.g., devices with different display resolutions). Popular software-based MPEG codec frameworks such as FFmpeg [3] are readily available. Powerful H.264/AVC and H.265/HEVC encoders such as x264 and x265 can be used through FFmpeg.

I frames are key frames that provide checkpoints for resynchronization or re-entry to support trick modes (e.g., pause, fast forward, rewind) and error recovery. They are commonly inserted at a rate that is a function of the video frame rate.

P frames are temporally encoded using motion estimation and compensation techniques. P frames are first partitioned into blocks before motion-compensated prediction is applied. The prediction is based on a reference to the I or P frame that was most recently encoded and decoded before the P frame (i.e., a past frame that becomes a forward reference frame). Thus, P frames are forward predicted or extrapolated, and the prediction is unidirectional.

B frames are temporally encoded using bidirectional motion-compensated predictions from a forward reference frame and a backward reference frame. The I and P frames usually serve as references for the B frames (i.e., they are referenced frames). The interpolation of two reference frames typically leads to more accurate prediction (i.e., smaller residuals) than is possible with P frames.

7. Group of pictures

MPEG video sequences are made up of groups of pictures (GOPs), each comprising a pre-set number of coded frames, including one I frame and one or more P and B frames. Pictures are equivalent to video frames or images. The I frame provides the initial reference to start the encoding process.

The interleaving of I, P, and B frames in a video sequence is content dependent. For example, video conferencing applications may employ more B frames since there is little motion in the video. On the other hand, sports content with rapid or frequent motion may require more I frames to maintain good video quality. This implies that there may be little difference in the compression efficiencies of new and legacy video coding standards for sports content. Coupled with high frame rates, sports content typically requires higher bit rates than any other content.
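The structure of a GOP can be sketched as a string of frame types. The helper below is hypothetical and assumes a fixed, regular pattern in display order: one I frame at the start, then P frames separated by a constant number of B frames. Real encoders may adapt the pattern to the content.

```python
def gop_pattern(gop_size, b_frames):
    """Frame types, in display order, for one GOP with a regular structure.

    gop_size: number of frames in the GOP (assumed fixed).
    b_frames: number of B frames between consecutive reference frames.
    """
    types = []
    for i in range(gop_size):
        if i == 0:
            types.append("I")          # every GOP starts with an I frame
        elif i % (b_frames + 1) == 0:
            types.append("P")          # reference frame at regular intervals
        else:
            types.append("B")
    return "".join(types)

print(gop_pattern(12, 2))  # IBBPBBPBBPBB
print(gop_pattern(6, 0))   # IPPPPP
```

A shorter GOP (more frequent I frames) or a smaller B-frame count trades compression efficiency for easier random access and lower latency. Note that the trailing B frames in the first pattern would need a later reference frame, typically the I frame of the next GOP.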

8. Motion Estimation and Compensation

Motion estimation is the key compression engine of many video coding standards. It exploits the similarity of successive frames (i.e., temporal redundancy) in video sequences. Many standards employ block-based motion estimation with adjustable block size and shape to search for temporal redundancy across frames in a video.

When sufficient temporal correlation exists, motion vectors may be accurately predicted and only a small residual is transformed and quantized, thereby reducing the data needed to code the motion of each frame. Because objects tend to move between neighbouring frames, detecting and compensating for motion errors is essential for accurate prediction.
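Block-based motion estimation can be sketched as an exhaustive search. The example below assumes a fixed square block, a small search window, and the sum of absolute differences (SAD) as the matching cost. These are common choices but are assumptions here, and the function names are hypothetical.

```python
def sad(ref, cur, ry, rx, cy, cx, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))

def best_motion_vector(ref, cur, cy, cx, n, search):
    """Full search for the block at (cy, cx) in cur within +/- search samples.

    Returns ((dy, dx), cost), where (dy, dx) points from the current block
    to its best match in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidates that fall outside the reference frame.
            if 0 <= ry <= height - n and 0 <= rx <= width - n:
                cost = sad(ref, cur, ry, rx, cy, cx, n)
                if best is None or cost < best[1]:
                    best = ((dy, dx), cost)
    return best

def make_frame(top, left):
    """6x6 black frame containing a 2x2 textured patch at (top, left)."""
    frame = [[0] * 6 for _ in range(6)]
    patch = [[50, 80], [120, 200]]
    for i in range(2):
        for j in range(2):
            frame[top + i][left + j] = patch[i][j]
    return frame

ref_frame = make_frame(1, 1)   # patch position in the previous frame
cur_frame = make_frame(2, 3)   # patch has moved down 1 and right 2
mv, cost = best_motion_vector(ref_frame, cur_frame, 2, 3, n=2, search=2)
print(mv, cost)
```

The search finds a perfect match (zero cost) one row up and two columns to the left in the reference frame, so only the motion vector needs to be coded for this block and the residual is empty.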

Such techniques help partition and scale the bitstream with priority given to data that is more globally applicable. Thus, they not only improve the coding efficiency but also enhance error resilience.

9. Non-MPEG Video Coding

a) Motion JPEG

Motion JPEG (M-JPEG) intracodes each frame of a timed sequence of still images as a separate JPEG image, and the sequence may be combined with audio. M-JPEG is widely used in Web browsers, media players, game consoles, digital cameras, streaming, and nonlinear (i.e., nondestructive) video production editing. The related Motion JPEG 2000 format supports lossless intracoding and is based on a discrete wavelet transform that works on the entire image (as opposed to blocks). This achieves good compression efficiency without exploiting temporal redundancy.

b) Dirac

Dirac is an open-source, royalty-free video coding technology that employs a 2D 4 × 4 discrete wavelet transform to remove spatial redundancies [6]. The transform allows Dirac to operate on a wide range of video resolutions because entire frames, as opposed to smaller blocks, are used.

c) WebM Project

VP8 is an open video format originally developed by On2 Technologies, which Google acquired. Google then released it as part of the WebM project, which was launched in May 2010. VP8’s data format and decoding process are described in RFC 6386. VP8 was designed to be more compact, easy to decode, and royalty-free. By adopting a freely available Web platform, VP8 targets faster innovation and better user experience.

VP9 is the successor to VP8 and became available on June 17, 2013. An overview of the VP9 bitstream is described in [9]. VP9 is supported by Web browsers that understand HTML5, including Chrome and Firefox. Android does not natively support WebM well. Google announced on January 11, 2011 that future versions of its Chrome browser would no longer support H.264/AVC. Both VP8 and VP9 employ nonadaptive entropy coding, which leads to faster encoding and more consistent coding gains.
