If you watched Super Bowl 50 on your Windows 10 PC, Surface Tablet, or Xbox One, stop for a second to marvel at the technology that made that possible. At the heart of a very complex series of events are a pair of hardworking, talented, and unknown “players” who make streaming live video over the Internet possible.
It starts with the digital HD video cameras capturing the live action. These HDTV cameras are, in effect, taking 2-megapixel JPEG photos at a rate of 24 photos per second. These photos have a resolution of 1920 x 1080 pixels (2.07 megapixels). Not coincidentally, this is the same resolution as 1080p (HD) televisions. There’s a lot that happens to these 24 photos before they produce 1 second of exciting football action in your living room, but that’s another story.
Again, the HDTV camera is outputting video as a stream of JPEG photos. Each photo has over 2 million pixels and each pixel’s color is defined by 3 bytes. That’s 6,220,800 bytes per photo (called a frame in TV-speak), which is nearly 150 million bytes per second or about 1.2 billion bits per second. That 1.2 Gbps is just the video – when you add in the digitized audio (and other data to keep the frames in order and the sound in sync with the video) the data stream is larger still. Not only is there an awful lot of data here, but the stream is rather fragile.
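The back-of-the-envelope arithmetic for uncompressed video is easy to reproduce. Here is a minimal sketch using the figures from the text (1920 x 1080 pixels, 3 bytes per pixel, 24 frames per second); the function and variable names are made up for illustration.

```python
# Figures taken from the text above.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3
FRAMES_PER_SECOND = 24


def raw_video_rate(width, height, bytes_per_pixel, fps):
    """Return (pixels, bytes per frame, bytes per second, bits per second)."""
    pixels = width * height
    bytes_per_frame = pixels * bytes_per_pixel
    bytes_per_second = bytes_per_frame * fps
    bits_per_second = bytes_per_second * 8
    return pixels, bytes_per_frame, bytes_per_second, bits_per_second


pixels, frame_bytes, byte_rate, bit_rate = raw_video_rate(
    WIDTH, HEIGHT, BYTES_PER_PIXEL, FRAMES_PER_SECOND
)

print(f"pixels per frame: {pixels:,}")
print(f"bytes per frame:  {frame_bytes:,}")
print(f"bytes per second: {byte_rate:,}")
print(f"bits per second:  {bit_rate:,}")
```

Changing the frame rate or the bytes per pixel shows how quickly the raw stream grows, which is why compression is not optional.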
The Internet is definitely not a safe place for a lot of data that needs to be streamed in a set order. The Internet uses packets of data that are individually addressed to specific destinations – sort of like boxes of bits being mailed back and forth at the speed of light. These network packets don’t always take the same path and some arrive at their destination before others. A stream of data put on the Internet gets divided up into a series of packets and it’s almost guaranteed to arrive at its destination with the packets out of order. And occasionally a packet goes astray.
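To make the out-of-order problem concrete, here is a small toy sketch. It chops a message into numbered packets, shuffles them to imitate packets taking different paths, and then puts them back in order. The sequence numbers, the tiny payload size, and the function names are assumptions of this sketch, not how any particular network protocol lays out its packets.

```python
import random


def packetize(data: bytes, payload_size: int):
    """Split data into (sequence_number, payload) pairs."""
    return [
        (seq, data[start:start + payload_size])
        for seq, start in enumerate(range(0, len(data), payload_size))
    ]


def reassemble(packets):
    """Sort packets by sequence number and join their payloads."""
    return b"".join(payload for _, payload in sorted(packets, key=lambda p: p[0]))


message = b"TOUCHDOWN! The crowd goes wild."
packets = packetize(message, payload_size=4)

# Imitate packets arriving in a different order than they were sent.
rng = random.Random(50)
arrived = packets[:]
rng.shuffle(arrived)

restored = reassemble(arrived)
print(restored == message)
```

A lost packet would leave a hole that sorting alone cannot fix, so a real receiver also has to decide whether to wait, ask again, or carry on without the missing piece.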
Getting this video stream across the Internet to your Xbox is rather like asking a quarterback to throw a perfect 90-yard pass to his receiver in spite of the opposing team – and do it 24 times every second. For every minute of the game. Plus the commercials and the halftime show too. And then the receiver has to line up the footballs he caught in the order thrown – not in the order he caught them.
This is where our unsung superstars – the CODECs – come in. Two CODECs working together perform the passing magic required to bring the Super Bowl to your Xbox and TV.
The broadcasting CODEC prepares the data in the raw video stream by encoding and compressing it, and that data is then stuffed into network packets to be thrown across the Internet. At the other end, your Xbox receives and opens the network packets and puts their contents back in the proper order, and its CODEC decompresses and decodes the data back into a digital video stream that your Xbox can render on your TV.
“CODEC” is a contraction of COder/DECoder (it is also often expanded as COmpressor/DECompressor). There are many different CODECs and they do essentially the same job in slightly different ways. The CODECs on the Xbox “team” use the H.264 video standard, which is also used by Blu-ray and YouTube. So, how do they do it?
The raw JPEG frames in the stream are compressed into three kinds of frames: intra-coded frames (I-frames), predictive-coded frames (P-frames), and bidirectionally-predictive-coded frames (B-frames).
To make an I-frame, the CODEC picks one JPEG frame in the stream, divides it up into 16 x 16 pixel blocks (called macroblocks), and compresses the data. This compression algorithm takes advantage of spatial redundancy (for example, all of this patch of sky is blue) and of the inability of the eye to detect certain changes in the image. I-frames can be decompressed by the receiving CODEC back into a full video frame without reference to any other frame.
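The idea of carving a frame into blocks and spotting spatial redundancy can be sketched in a few lines. This is only a toy: a real CODEC applies a mathematical transform and quantization to each block rather than simply checking whether it is one solid color. The block size is a parameter, the 8 x 8 "frame" of brightness values is invented, and the function names are hypothetical.

```python
def split_into_blocks(frame, block_size):
    """Yield (top, left, block) for each block_size x block_size tile of a 2-D list."""
    height, width = len(frame), len(frame[0])
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [row[left:left + block_size] for row in frame[top:top + block_size]]
            yield top, left, block


def is_flat(block):
    """True if every pixel in the block has the same value (e.g. a patch of sky)."""
    first = block[0][0]
    return all(value == first for row in block for value in row)


# Toy 8 x 8 frame: the top half is uniform "sky", the bottom half has detail.
sky_rows = [[200] * 8 for _ in range(4)]
detail_rows = [[x * 10 + y for x in range(8)] for y in range(4)]
frame = sky_rows + detail_rows

blocks = list(split_into_blocks(frame, block_size=4))
flat_positions = [(top, left) for top, left, block in blocks if is_flat(block)]

print(f"{len(flat_positions)} of {len(blocks)} blocks are a single value")
```

A flat block can be described by one number instead of sixteen, which is the simplest possible case of the spatial redundancy described above.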
To make a P-frame, the CODEC typically takes the third JPEG frame that follows the I-frame, divides it up into macroblocks, and then compares those macroblocks with the macroblocks of the I-frame to see if any have moved. If a macroblock has moved, the CODEC describes the direction and distance of the movement instead of describing the macroblock itself. To reconstruct a P-frame, the receiving CODEC takes the information in the P-frame and decodes it in reference to the related I-frame.
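The movement detection described above is commonly done by block matching: slide the block around a small search window in the reference frame and keep the offset with the smallest difference. The sketch below assumes an exhaustive search and a sum-of-absolute-differences cost; real encoders use much faster search strategies. The frames, the "ball", and all function names are invented for illustration.

```python
def get_block(frame, top, left, size):
    """Copy a size x size block out of a 2-D list of pixel values."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def find_motion_vector(reference, current, top, left, size, search):
    """Return (dy, dx, cost): the offset into `reference` that best matches
    the block of `current` located at (top, left)."""
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(target, get_block(reference, r, c, size))
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


def make_frame(ball_top, ball_left, size=8):
    """Build a blank frame with a 2 x 2 'ball' at the given position."""
    frame = [[0] * size for _ in range(size)]
    ball = [[9, 8], [7, 6]]
    for r in range(2):
        for c in range(2):
            frame[ball_top + r][ball_left + c] = ball[r][c]
    return frame


reference_frame = make_frame(2, 2)   # where the ball was in the I-frame
current_frame = make_frame(3, 5)     # where it is three frames later

vector = find_motion_vector(reference_frame, current_frame, 3, 5, size=2, search=3)
print(vector)
```

Here the vector points from the block in the current frame back to where the same pixels sit in the reference frame, so the encoder can send two small numbers instead of the block itself.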
To make a B-frame, the CODEC looks at both the nearest I-frame and the nearest P-frame and applies a similar macroblock movement-detection process. To reconstruct a B-frame, the receiving CODEC takes the information in the B-frame and decodes it in reference to both the related I-frame and the related P-frame.
Typically, every 15th frame or so is made into an I-frame. P-frames and B-frames might follow an I-frame like this: IBBPBBPBBPBBPBBI. The video stream is therefore changed from a series of raw JPEG frames (JJJJJJJJJJJJJJJ) into a highly compressed series of I-, P-, and B-frames, along with information about the proper order of the frames. The job of the receiving CODEC is to convert the I-, P-, and B-frames back into a series of JPEG frames that closely matches the original series of JPEG frames produced by the HDTV camera.
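The frame pattern above can be generated mechanically, and it also hints at why frame order matters. Because a B-frame leans on a later I- or P-frame, the decoder needs that later frame first, so frames are generally sent in a different order than they are displayed. The sketch below assumes the 15-frame group with an anchor every third frame shown above; the reordering rule is a simplified illustration, and the function names are hypothetical.

```python
def gop_pattern(gop_length=15, anchor_spacing=3):
    """Frame types in display order: one I-frame, then P-frames at every
    `anchor_spacing`-th position, with B-frames in between."""
    types = []
    for index in range(gop_length):
        if index == 0:
            types.append("I")
        elif index % anchor_spacing == 0:
            types.append("P")
        else:
            types.append("B")
    return "".join(types)


def decode_order(display_types):
    """Return display indices in an order where every B-frame comes after
    the I- or P-frame that follows it on screen."""
    order, waiting_b_frames = [], []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            waiting_b_frames.append(index)
        else:
            order.append(index)               # anchor frame goes first
            order.extend(waiting_b_frames)    # then the B-frames that needed it
            waiting_b_frames = []
    order.extend(waiting_b_frames)  # trailing B-frames with no later anchor in this input
    return order


pattern = gop_pattern() + "I"   # one group plus the I-frame that starts the next
order = decode_order(pattern)

print(pattern)
print(order)
```

This is one reason the stream has to carry ordering information alongside the frames themselves.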
All of this encoding and decoding takes place for one reason – to get a lot of data across the chaos and restricted bandwidth of the Internet. Before being placed into network packets on the Internet, streaming media is placed into transport packets. The MPEG-2 Transport packet was designed specifically for this purpose. The transport packets include time-synchronized video and audio data and they are sequentially numbered.
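For the curious, here is a rough sketch of what looking inside an MPEG-2 transport packet might involve. It relies on a few details not spelled out above: each packet is 188 bytes long, begins with the sync byte 0x47, carries a 13-bit packet identifier (PID), and has a 4-bit continuity counter that wraps around after 15. The helper that builds a test packet is a simplification made up for this example (real streams pad short packets differently), and only a few header fields are decoded.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def parse_ts_header(packet: bytes) -> dict:
    """Decode a few fields from the 4-byte header of a transport packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("transport packets are exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing sync byte 0x47")
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "continuity_counter": packet[3] & 0x0F,
    }


def make_ts_packet(pid: int, counter: int, payload: bytes = b"") -> bytes:
    """Hypothetical helper: build a payload-only packet for testing.
    Padding the payload with 0xFF is a shortcut taken for this sketch."""
    header = bytes([
        SYNC_BYTE,
        (pid >> 8) & 0x1F,        # top 5 bits of the PID, flag bits left at 0
        pid & 0xFF,               # low 8 bits of the PID
        0x10 | (counter & 0x0F),  # "payload only" + 4-bit continuity counter
    ])
    body = payload[:TS_PACKET_SIZE - 4]
    return header + body.ljust(TS_PACKET_SIZE - 4, b"\xff")


packet = make_ts_packet(pid=0x100, counter=7, payload=b"video data")
header = parse_ts_header(packet)
print(header)
```

A receiver can watch the continuity counter for each PID to notice when a packet has gone missing.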
So, we now have a bunch of I-, P-, and B-frames, along with digital audio, wrapped up in transport packets, which are in turn wrapped up in network packets being sent all over the Internet. Some of those network packets have your Xbox’s address on them. Your Xbox receives its network packets, unwraps them, sorts the transport packets back into the proper order, opens them, and gives the encoded video and audio data to the receiving CODEC. The CODEC then goes to work decoding the video (while a similar audio codec decompresses the audio) and reassembles the digital video stream. Your Xbox can now output an HDTV stream to your TV that’s very similar to the output of the HDTV camera at the 50-yard line. A lot’s been happening along the way, but the delay from the Super Bowl to your living room is typically only a matter of seconds. Your TV then does some more magic to turn 24 frames per second into a perceived 60 frames per second on the screen, but as I said earlier, that’s another story.
The preceding is a greatly simplified explanation of a highly complex process. I hope that this gives you an appreciation of the amazing journey that digital video data takes in order to bring you streaming video over the Internet. Hopefully, you also found this story somewhat entertaining.