Using a Machine Learning Model to Bring Black-and-White TV Shows to Life

UC San Diego Extension student Katie He lives just a few miles away from Big Sur in Carmel, California, which is considered one of the most beautiful coastlines in the world. Little wonder, then, that she’s a photography buff—and nurtures a soft spot for black-and-white movies. One of her favorite TV shows is The Twilight Zone, an anthology that first aired on CBS in 1959 and is still considered one of the most legendary sci-fi TV shows of all time.

When it was time to choose a topic for her capstone project while enrolled in UCSD Extension’s Machine Learning Engineering Bootcamp, He knew she wanted to combine her love of photography and old movies to create a machine learning model that would benefit the artistic community instead of tackling a more run-of-the-mill classification or regression problem, such as classifying emails as “spam” or “not spam” or predicting house prices over time. Image colorization was the perfect way to combine these two areas of interest.

Colorization is the process of adding color to black-and-white, sepia, or monochrome images and producing output images that represent the semantic colors and tones of the input. For example, an ocean on a clear, sunny day must be plausibly blue. When it comes to archival images, professional colorists must rely on image annotations or research to ensure that the colorization is historically accurate.

“I wanted to see how machine learning can be applied to problems that are a bit more ambiguous, where the answer is not so straightforward, but still have a practical application,” He said.

She decided to build a machine learning model that could colorize footage from The Twilight Zone, thereby automating a process that is usually time-consuming, expensive, and performed by professional colorists.

What is image colorization and how does it work?

Image colorization is a topic of ongoing research in computer vision, but its commercial application still entails a largely manual process. A film colorist is a production technician responsible for designing a film’s color scheme to achieve a specific style or mood. These professionals use computer software to associate a range of gray levels with each object in the frame and indicate to the computer any movement of the objects within a shot. The software is capable of sensing variations in the light level from frame to frame and correcting it if necessary. However, the process is still a lot like editing photos in Photoshop.

Some companies specializing in image colorization have produced automatic region tracking algorithms that perform functions like pattern recognition and background compositing. However, machine learning is rarely used to automate the entire process, except when it comes to still images. Smartphone apps like Chromatix and Colorize use machine learning models to automatically recolor grayscale photos, while some traditional photo editors support image colorization with built-in shortcuts like image segmentation and natural recoloring of recognizable objects.

Recently, there has been a surge of interest in colorizing archival photographs from World Wars I and II, using images culled from public archives and records to offer a glimpse of the wars from the perspective of those on different sides of the conflict. Faithful recreation requires research, particularly when it comes to coloring flags, insignias, logos, or decals for wartime aircraft.

“All of these projects are very labor-intensive, and I don't claim that we can use machine learning to solve every problem,” He said. “But it would be cool if we could save at least a little bit of time, so you don't have to go into every single frame and pick out the objects, then colorize them.”

One of the most ambitious feats of modern-day image colorization was the 2018 documentary They Shall Not Grow Old by Lord of the Rings director Peter Jackson, which featured over 100 hours of archival footage from WWI. Production technicians used frame-rate retiming, digital image restoration, and colorization to transform the videos from the stilted, flat aesthetic of archival footage to a more fluid, modern-day presentation.

“The humanity of the people jumps out at you, especially the faces,” Jackson wrote in the film’s production notes. “They’re no longer buried in a fog of film grain and scratches and stuttering and sped-up footage.”

For her part, He wanted to achieve a similar feat by using machine learning to automate part of the image colorization process.

Turning a black-and-white movie into an image dataset

To train her machine learning model, He generated her own image dataset of 15,000 screenshots from a 1970s color TV show called Ghost Story, another horror anthology, which she used to recolor test images from The Twilight Zone. To generate the dataset, she used a Python library called pytube to download video files of each film from YouTube, then processed the videos with OpenCV, a library of programming functions for computer vision, capturing screenshots from the footage at one-second intervals.
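The sampling step can be sketched roughly as follows. This is a minimal illustration rather than He's actual script: the function names, output file pattern, and seeking strategy are assumptions, and the download step with pytube is left out.

```python
from pathlib import Path


def sample_frame_indices(fps: float, frame_count: int, every_seconds: float = 1.0) -> list[int]:
    """Return the frame numbers that fall on each `every_seconds` boundary."""
    if fps <= 0 or frame_count <= 0:
        return []
    step = max(1, round(fps * every_seconds))
    return list(range(0, frame_count, step))


def save_screenshots(video_path: str, out_dir: str, every_seconds: float = 1.0) -> int:
    """Write one JPEG per sampled frame and return how many files were written."""
    import cv2  # imported here so the index helper works without OpenCV installed

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    written = 0
    for idx in sample_frame_indices(fps, frame_count, every_seconds):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the sampled frame
        ok, frame = cap.read()
        if not ok:
            break
        # Hypothetical naming scheme: zero-padded frame number.
        cv2.imwrite(str(out / f"frame_{idx:07d}.jpg"), frame)
        written += 1
    cap.release()
    return written
```

Setting `every_seconds` to one frame's duration (or simply reading every frame in order) would give the frame-by-frame extraction needed for footage that has to be reassembled into a video later.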

For the sake of training efficiency, He downsampled the images (reduced their resolution) so that her model would consume less time and computing power during the training phase. In addition, the images were transformed from RGB to the Lab color space. RGB, short for ‘Red, Green, Blue,’ encodes the colors of digital assets by telling the computer how much of each color is needed. Lab color uses three values (L, a, and b) to specify colors using a three-axis system and is designed to match human color perception more closely than RGB or CMYK. It also separates the grayscale channel (L) from the two color channels (‘a’ and ‘b’) so that the first channel can be fed directly into the model as input.
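To make the color-space step concrete, here is a small NumPy sketch of the standard sRGB-to-Lab conversion (D65 white point), followed by the split into a grayscale input and two color targets. It is an illustration under those assumptions, not He's preprocessing code; in practice a library routine such as OpenCV's `cvtColor` would normally do this, and the helper names below are made up for the example.

```python
import numpy as np

# sRGB (D65) to CIE XYZ matrix and the D65 reference white.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])


def rgb_to_lab(rgb_uint8: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) 8-bit sRGB image to float Lab (L runs from 0 to 100)."""
    rgb = rgb_uint8.astype(np.float64) / 255.0
    # Undo the sRGB gamma curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, normalized by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE_D65
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)


def split_lab(lab: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Separate the grayscale input (L) from the color targets (a, b)."""
    return lab[..., :1], lab[..., 1:]
```

A neutral gray pixel comes out with `a` and `b` close to zero, which is why the L channel alone behaves like a black-and-white image.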

For the test set on which her model would be used, He generated screenshots from The Twilight Zone using a similar method, but this time she took screenshots for every frame so she could reassemble the series of images into a video after the colorization process.

“Once the model is trained, it can predict the color frame by frame and then concatenate all those frames together and make it back into a video again,” explained He.

Given that the image dataset was generated from a TV show, the majority of the subjects of the photographs were human faces.

Training a generative adversarial network to colorize images

To build her model, He decided to use a generative adversarial network (GAN), a popular machine learning model for image-to-image translation, which is the process of generating an output image in the style—or in this case, the color scheme—of the input image. The GAN consists of two machine learning models: a generator and a discriminator.

  • The generator captures the data distribution and attempts to generate new examples or derivations of the input data that can pass for real data;

  • The discriminator classifies these outputs as real or fake.

In other words, the generator attempts to fool the discriminator by outputting high-fidelity derivations of the original input image. Here, the generator tries to produce colors as close to the ground truth as possible, while the discriminator distinguishes the generator outputs from the original color version of the inputs. Each time the discriminator flags an image as fake, the generator gets better at producing a more faithful example of the training images, while the discriminator becomes more astute at distinguishing inauthentic examples.

The central architecture of He’s model is inspired by Unsupervised Diverse Colorization via Generative Adversarial Networks, a model proposed by a group of researchers at Shanghai Jiao Tong University in Shanghai, China, with slight modifications.

He’s model is a conditional GAN, which is slightly different from a regular GAN. Traditional GAN models generate new samples from a randomized input. cGAN, on the other hand, leverages other information along with the randomized input. In this particular case, the model uses grayscale images to extract additional information so that the generated color distribution is conditional on the features of the black-and-white input. The architecture of He’s model also differs from many others in that the generator uses only convolution layers with no downsampling, such that the output of each layer has the same dimension as the grayscale image in order to retain the maximum amount of information. The additional information helps reduce the sepia overtones often seen in image colorization models.
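The "no downsampling" idea can be illustrated with a toy stride-1 convolution that zero-pads its input, so every layer's output keeps the height and width of the grayscale frame. This is only a NumPy sketch of that one property; the layer sizes are arbitrary, and He's actual generator (its depth, filters, and framework) is not described in enough detail here to reproduce.

```python
import numpy as np


def conv2d_same(image: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Stride-1 convolution (cross-correlation, as in most deep learning
    libraries) with zero padding.

    image:   (H, W, C_in)
    kernels: (k, k, C_in, C_out) with odd k
    returns: (H, W, C_out), the same height and width as the input
    """
    k = kernels.shape[0]
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = image.shape
    out = np.zeros((h, w, kernels.shape[-1]))
    for i in range(k):
        for j in range(k):
            # Each kernel tap contributes a shifted copy of the input.
            window = padded[i:i + h, j:j + w, :]   # (H, W, C_in)
            out += window @ kernels[i, j]          # (H, W, C_out)
    return out


# Toy two-layer "generator": grayscale in, two color channels out.
rng = np.random.default_rng(0)
gray = rng.random((16, 20, 1))                      # stand-in for an L channel
hidden = np.maximum(conv2d_same(gray, rng.normal(size=(3, 3, 1, 8))), 0.0)
ab = conv2d_same(hidden, rng.normal(size=(3, 3, 8, 2)))
print(gray.shape, hidden.shape, ab.shape)
```

Because nothing is pooled or strided, every pixel of the input keeps a one-to-one counterpart in the output, at the cost of more computation per layer than an encoder-decoder design would need.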

“If you’ve ever played with watercolor paints, you know that when we mix a whole bunch of colors, you always get brown,” He said. “Mathematically, it’s quite similar. Brown is just this middle color in terms of the color value. So when the generator is not quite sure—because I’m using mean squared loss—it’s just much safer to predict brown than any other extreme color.”
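He's point about hedging can be checked with a few lines of arithmetic. The sketch below assumes a made-up scenario: an object whose true color is red in half of the training frames and blue in the other half, with illustrative (a, b) values that are not taken from her data.

```python
import numpy as np

# Hypothetical (a, b) chroma values; the numbers are illustrative only.
red = np.array([60.0, 40.0])
blue = np.array([20.0, -60.0])
targets = np.stack([red, blue])


def expected_mse(prediction: np.ndarray) -> float:
    """Average squared error of one fixed prediction over both possible truths."""
    return float(np.mean((targets - prediction) ** 2))


average = targets.mean(axis=0)     # the muted in-between color
print(expected_mse(red))           # 2900.0 -> commit to red
print(expected_mse(blue))          # 2900.0 -> commit to blue
print(expected_mse(average))       # 1450.0 -> hedge with the average
```

Under a squared-error loss, the in-between guess costs half as much as committing to either saturated color, which is the pull toward muted tones that He describes.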

Essentially, the generator predicts the color value of an image pixel by pixel. Digital images are composed of a matrix of pixels. Each pixel has a pixel value that describes how bright that pixel is and/or what color it should be. For grayscale images, the pixel value is a single number that represents the brightness of the pixel. The most common pixel format is the byte image, where the number is stored as an 8-bit integer, giving a range of possible values from 0 to 255. However, He had scaled her images such that each pixel value ranged from negative 1 to 1, in order to make the training of the model numerically stable. Rescaling is done by dividing all pixel values by 127.5 (half of 255, the largest pixel value) and subtracting 1.
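The rescaling described above amounts to one line of arithmetic. A minimal sketch, with function names chosen for this example, along with the inverse needed to turn model output back into displayable 8-bit values:

```python
import numpy as np


def to_model_range(pixels: np.ndarray) -> np.ndarray:
    """Map 8-bit values in [0, 255] to floats in [-1, 1]."""
    return pixels.astype(np.float32) / 127.5 - 1.0


def to_byte_range(scaled: np.ndarray) -> np.ndarray:
    """Invert the scaling and round back to 8-bit values."""
    return np.clip(np.rint((scaled + 1.0) * 127.5), 0, 255).astype(np.uint8)


pixels = np.array([0, 51, 255], dtype=np.uint8)
print(to_model_range(pixels))      # black maps to -1, white maps to 1
```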

Fine-tuning the model to produce vivid images that are true to life

To measure the accuracy of her generator and discriminator, He calculated the average mean squared error and binary cross-entropy for each output image. The mean squared error compares the pixel values of the output from the generator with those of the original color version of the input image. Mean squared error is typically used to measure the accuracy of a regression model, such as one predicting house prices over time, where MSE indicates how close the regression line (a set of predicted outputs) is to a set of points. It does this by taking the distances from the points to the regression line, known as “errors,” squaring them (the squaring removes any negative values), and averaging the results.

He used the same logic to calculate the average squared difference in pixel value between the output colors from the generator and the true colors from the training image.

“It tells you how wrong you are at each pixel—and, on average, how wrong you are for the entire image,” explained He. “Say the color is supposed to be deep red, but your model predicts light red or pink, then you’re slightly off. But if the model predicts blue, then you’re even farther off.”
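He's red, pink, and blue example can be made concrete with a small calculation. The RGB triples below are illustrative stand-ins, not values from her project, and the comparison is done in plain RGB for simplicity rather than in the Lab space her model uses.

```python
import numpy as np


def mean_squared_error(predicted: np.ndarray, target: np.ndarray) -> float:
    """Average squared difference over every pixel and channel."""
    # Cast first so 8-bit inputs cannot wrap around when subtracted.
    diff = predicted.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))


# Illustrative colors only.
deep_red = np.array([139, 0, 0])
light_red = np.array([205, 92, 92])
blue = np.array([0, 0, 255])

print(mean_squared_error(light_red, deep_red))  # smaller: right hue, wrong shade
print(mean_squared_error(blue, deep_red))       # larger: wrong hue entirely
```

The same function works unchanged on whole images, since `np.mean` averages over every pixel and channel at once.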

In other words, the model’s success is predicated on achieving an MSE as low as possible for each output image. Due to constraints in computing power, He was only able to use a relatively small image dataset, so she did not set a specific MSE threshold that would be considered successful. In general, larger datasets result in more accurate predictions, and therefore a lower MSE.

“I have 15,000 images in the training set, and I did over 500 iterations, which took three whole days using a GPU on the cloud,” said He. “Since it's really hard for the model to converge, I just stopped at the point where the training image looks somewhat believable to the human eye.”

Dipanjan Sarkar, He’s mentor during her Bootcamp course, is working on an image colorization model of his own as he prepares to write a book on the subject. “After a thousand iterations, the loss is around 0.0001,” he offered by way of a benchmark. “You want that number to be as low as possible.”

While the generator learns from an MSE score indicating how close the pixel value of the output image is to the training data, binary cross-entropy is used as the loss function for the discriminator. Binary cross-entropy is a loss function used in binary classification tasks—in other words, classification problems with only two choices, such as yes or no, or real or fake. In this case, binary cross-entropy is a natural choice given that the discriminator outputs a single number for each input.

“This value tells you how real the discriminator thinks the input image is, whether it considers it to be an original image from the training dataset, or the output from the generator,” said He.

That value is then passed into a sigmoid function so that it can be interpreted as a probability, and binary cross-entropy penalizes each wrong classification. A sigmoid function is an S-shaped curve that serves as an activation function in machine learning and adds non-linearity to a model. In other words, it squashes any real number, however large or small, into a value strictly between 0 and 1.

For example, in the case of a single image, if the input to the discriminator is an output from the generator, the true label for this input is zero (indicating fake colors). If the discriminator outputs a very negative value such that when passed into the sigmoid function it becomes very close to zero, the loss will be small. However, if the value is a large positive number, when passed into the sigmoid function, it will be closer to one, indicating a larger loss. Binary cross-entropy compares the probability distribution between the true labels and the outputs of the discriminator.
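The single-image example above can be worked through numerically. The sketch below uses only the standard library; the function names are chosen for illustration, and the loss is written in a numerically stable form rather than copied from He's code.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a raw discriminator score into a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)


def binary_cross_entropy(z: float, label: int) -> float:
    """Loss for one raw score z and a true label of 1 (real) or 0 (generated).

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z),
    written in a form that stays finite for large |z|.
    """
    return max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))


# A generated image has true label 0.
print(sigmoid(-6.0), binary_cross_entropy(-6.0, 0))  # confident "fake": small loss
print(sigmoid(6.0), binary_cross_entropy(6.0, 0))    # confident "real": large loss
```

A score of zero corresponds to a probability of 0.5, where the discriminator is undecided and the loss is log 2 for either label.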

In addition to these statistical values, she also used a visual benchmark to evaluate the results of her model. DeOldify is a state-of-the-art deep learning model developed by Jason Antic for image colorization. He used the DeOldify API to run her test images through Antic’s model and perform a side-by-side comparison of the output alongside the predictions generated by her own model. These were the results.

DeOldify v. Model Comparison: Example Image 1
DeOldify v. Model Comparison: Example Image 2

The images from He’s model are noticeably more vibrant, but she notes that there are fewer random patches of color in the background in the output images from DeOldify.

“The difference in color is likely due to the characteristics of the input,” she said. “I think DeOldify does better with higher resolution images since it's trained on larger images, and a higher resolution provides more information. For instance, at a lower resolution, an apple might look like an orange, but we know they have different colors.”

Given that He’s image dataset was generated from a TV show, human faces were the primary subject of the images. Consequently, the neural network showed an aptitude for detecting and colorizing human faces. However, for objects that don’t have a distinctive color, such as background elements or clothing, the GAN opted for a safe brownish color to minimize errors.

Sample Video

This is a short clip from the original (black and white) Twilight Zone.
And here is the network output.

Sample Images

A gallery of images from the training set (Examples 1 to 4).

Aside from the averaging behavior of the GAN, in which blended color values produced a sepia tone, the color palette of the training dataset may have also been a factor in causing the output images to exhibit a brownish tint. “Because I used a TV show from the seventies, the colors are not quite as bright as, say, TV shows from more recent years,” He said. “So I think that’s one of the reasons why you get this brownish overtone.”

However, rather than using screenshots from a modern-day show to train her model, He purposely chose an old TV series in order to remain faithful to the aesthetic of The Twilight Zone, which was shot before color TVs were fully mainstream.

“I didn’t think the bright colors would fit the original as well if I were to use something more recent,” she said.

For an image generation task of this nature, selecting a training and test set of similar composition was also important for teaching the model how to map specific objects to their corresponding colors. For example, if one were to use the model to colorize a wildlife documentary, the training dataset should comprise images of animals; otherwise, the model would not be able to determine coloring for specific objects.

Improving on the model

He says she would have liked to use a more diverse set of training data on her model to improve the accuracy of general object colorization. Given the nature of her training set, the model was overwhelmingly taught to recognize human faces, while receiving less input regarding other objects. If she were to retrain her model, He says she would also select a TV show with a more diverse cast.

“I think the skin tones for the actors and actresses are quite homogeneous in the training set because it is a show from the seventies after all,” she said. “So I'm a little concerned that if the model were to see footage from a newer TV show, everybody's face will have the same skin color, which might not be appropriate.”

Sarkar, who worked closely with He on her capstone project and helped her hone the topic, praised her for finding an innovative way to use machine learning to tackle a real-world problem and going above and beyond linear prediction problems.

“Generative deep learning is pretty advanced,” said Sarkar. “It is not easy to do because you are not just predicting, say, a classification label like ‘spam’ or ‘not spam,’ or you're not just predicting a house price, but you're generating an entire image.”

Another way He would like to improve her model, if she had the requisite computing power, is to train it on high-resolution images without needing to downsize the photographs.

“I definitely want to make the model work with higher resolution images, but at the same time I want to make it more efficient and scalable as well,” she explained. “The model currently has over 23 million parameters, some of which might not even be activated or used. So there's definitely newer architecture out there that can perhaps reduce the number of parameters and achieve similar results and process bigger images better.”

While He has not experimented with her model outside of her capstone project, as a photographer she has several ideas for how she can put her model to use.

“I haven't tried it yet, but I would like to try to colorize some of the images that my grandparents have,” she said. “I think it would be lovely to [recreate], say, their wedding photos in color and see what they think of it.”


Katie He is a member of the Machine Learning Strategy and Risk Platform team at Welton Investment Partners, where she researches and develops new systematic trading models and maintains Welton’s in-house risk platform. She’s the architect of the team's machine learning computing infrastructure and bespoke computer cluster.

Before joining Welton in 2018, Katie was a Teaching Assistant for undergraduate macroeconomics while earning an MS in Applied Economics and Finance at the University of California, Santa Cruz. She graduated Magna Cum Laude from UCSC with a BA in Economics/Mathematics.
