Generative AI as an Extreme Compression Method

Are text files the future of image and, potentially, video compression? Using text prompts to generate images is well known. But what about using generative AI in the reverse direction, to create a text prompt for a given image? This technique could yield a more than 100:1 reduction in file size, from hundreds of kilobytes down to mere kilobytes.

It is clear from the limited test results shown below that off-the-shelf generative AI products could achieve this sort of compression improvement. The fundamental stumbling block, at this point, is the accuracy of the recreated image.

[Note: Added after publication. This 2023 Tweet from John Carmack pointed out that “Generative AI has the potential to improve video compression by an order of magnitude, but at an extreme computational cost. It is an interesting question whether that will ever be worth it, given that network and storage cost and capacity are still improving as fast as computation.”

Additionally, this November 2022 Baidu Research (USA) paper proposed “a generative image compression method that demonstrates the potential of saving an image as a short text embedding which in turn can be used to generate high-fidelity images which is equivalent to the original one perceptually.”]
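
Before getting to the tests, here is a minimal sketch of what such a round trip could look like in code. It is purely illustrative: describe_image() and generate_image() are hypothetical stand-ins for whatever vision-language and text-to-image models are used (Gemini, Grok 3, or anything else), and the "compression ratio" is simply the byte count of the original image divided by that of the stored prompt.

```python
from pathlib import Path

def describe_image(image_path: str) -> str:
    """Placeholder: ask a vision-language model for a detailed description of the image."""
    raise NotImplementedError("call your image-captioning model here")

def generate_image(prompt: str, out_path: str) -> None:
    """Placeholder: ask a text-to-image model to regenerate the scene from the prompt."""
    raise NotImplementedError("call your image-generation model here")

def compress_as_text(image_path: str, prompt_path: str) -> float:
    """'Compress' an image by storing only a text description of it; return the size ratio."""
    prompt = describe_image(image_path)
    Path(prompt_path).write_text(prompt, encoding="utf-8")
    return Path(image_path).stat().st_size / Path(prompt_path).stat().st_size

# "Decompression" is regeneration -- lossy in accuracy rather than in pixels:
# generate_image(Path("frame.txt").read_text(encoding="utf-8"), "frame_reconstructed.png")
```

The tests below are exactly this loop, carried out by hand with Gemini and Grok 3.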

Reverse Generative AI Test with Gemini #

As a test to see how close we are to answering the above question, the following prompts were given to Gemini:

1st prompt: “Could you create an image that shows a reference frame with an associated text prompt that can be used for this article?” The first result was a black-and-white chart, so a second prompt followed: “Well, I was hoping for an image of, say a person, with a description of the action the person is doing (say running) with a description of the background.”

Based on that minimal prompt, Gemini returned the following image:

A Gemini-generated reference image and description

Gemini was then asked to describe the above image, and it came back with the following description:

“A young woman with fair skin and long, brown hair tied in a ponytail is running along a paved path in a sunny park. She’s wearing a bright orange t-shirt, dark blue athletic shorts, and white running shoes. Her face shows a joyful expression, with a slight smile and her eyes looking ahead. The path curves gently to the left, bordered by lush green grass and a variety of colorful flower beds. Tall trees with full foliage line the path in the background, casting dappled shadows. The sun is shining brightly, creating a warm, inviting atmosphere. The overall scene has a vibrant, energetic feel, with a sense of motion and vitality.”

Yikes, the most significant mistake is that it identified the person in the image as female instead of male. This error was borne out when the prompt was input into another instance of Gemini (a different account), resulting in the following image.

Gemini-generated image based on Gemini-generated prompt from the Gemini-generated reference image

From a compression standpoint, it is a winner: the text prompt is about 1 KByte, compared to 623 KBytes (2048×2048 pixels) for the original image. Even if the prompt were ten times more detailed, the text file would still be approximately 7 KBytes, meaning roughly 90:1 compression. The quality of the regenerated image is as good as the original; the issue is accuracy. The inaccuracy starts with the person (female instead of male) and continues in the details (e.g., both images appear to be in a park, but one runner is on a path and the other is running on a lawn).
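
As a quick sanity check on those figures, the back-of-the-envelope arithmetic below reproduces the ratios quoted above; the file sizes are the ones reported in this test, not general benchmarks.

```python
# Compression ratios from the figures quoted above.
original_kb = 623        # 2048x2048 reference image, as reported
prompt_kb = 1            # the ~1 KByte text description
detailed_prompt_kb = 7   # a hypothetical prompt roughly ten times more detailed

print(f"Prompt as stored:        {original_kb / prompt_kb:.0f}:1")            # 623:1
print(f"Ten-times-richer prompt: {original_kb / detailed_prompt_kb:.0f}:1")   # ~89:1, about 90:1
```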

Reverse Generative AI Test with Grok 3 #

The Gemini-created prompt was input to Grok 3 to create a reference image, as seen in the image on the right, below. In many ways, this is close to the Gemini-generated image.¹

Grok 3 was then asked to describe the image it created, and it came up with this description:

“A young woman with fair skin and long, brown hair tied in a ponytail is running along a paved path in a sunny park. She’s wearing a bright orange t-shirt, dark blue athletic shorts, and white running shoes. Her face shows a joyful expression, with a slight smile and her eyes looking ahead. The path stretches straight ahead, bordered by lush green grass and vibrant flower beds with red, yellow, and pink flowers. Tall trees with full, green foliage line the path in the background, casting dappled shadows. The sun is shining brightly, creating a warm, inviting atmosphere with a clear blue sky. The overall scene has a vibrant, energetic feel, with a sense of motion and vitality.”

The description is very close to the Gemini-generated prompt, except for a few differences and added details, such as:

  • 3rd line – The path stretches straight ahead instead of curving gently.
  • 3rd line – The flowers are described as “vibrant flower beds with red, yellow, and pink flowers”, instead of just “colorful”.
  • 4th line – The foliage is described as green.

The Grok 3 description was then fed back as a prompt, resulting in the following images.²

Again, quality is not an issue, but accuracy is. Although Grok 3 was more accurate than Gemini, in the sense that it did not mix up the gender, the output images shared similarities with, but would not be confused with, the original. Still, given the relatively nascent state and continuing rapid development of generative AI for image creation, it does not seem out of the realm of possibility that accuracy could improve to the point where the human eye cannot discern a difference between the original and an image regenerated from its text description.
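
How might that accuracy gap be measured? One crude, illustrative option is a structural similarity score between the original and the regenerated image. The sketch below assumes a recent scikit-image (for the channel_axis parameter) and two hypothetical local files, original.png and regenerated.png, standing in for the reference and reconstructed frames. A structural metric like SSIM penalizes layout differences even when each image looks good on its own, which mirrors the quality-versus-accuracy distinction described above; a perceptual or semantic metric would be needed to judge anything stronger.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

# Load both images and force them to a common size and color space.
original = np.asarray(Image.open("original.png").convert("RGB").resize((512, 512)))
regenerated = np.asarray(Image.open("regenerated.png").convert("RGB").resize((512, 512)))

# SSIM ranges up to 1.0 for identical images. Two scenes that merely share a
# description ("a runner in a sunny park") will typically score much lower.
score = ssim(original, regenerated, channel_axis=-1)
print(f"SSIM between original and regenerated image: {score:.3f}")
```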

A Journey Led by a Pied Piper? #

The inspiration for this article was a 25-minute conversation with Gemini during an afternoon stroll. The focus of that conversation was using the above techniques for video compression. At the end of the conversation, Gemini wrote a paper summarizing what was learned, addressing the commercial possibilities, technical challenges, and opportunities of this new approach. Ultimately, we may have re-invented something similar to Pied Piper from the fictional show Silicon Valley.

Gemini’s shortcoming was the lack of external references; it simply did not have links to support its claims. That meant fact-checking, rewriting, and injecting new points into the paper, and in the end that article was shelved. With that said, there were a couple of interesting takeaways from that effort:

  • Deep Render seems to be leading the way with its generative AI-based compression work.
  • NPUs (Neural Processing Units), specialized processors for neural network acceleration that are crucial for real-time AI video, are projected to be in 2.2 billion devices by 2028 (according to data presented by Deep Render in this interview with streaming guru Jan Ozer).

Whether the approach described above has a future or the industry follows the path of Deep Render, AI is sure to play some role in the future of image and video compression.


Footnotes: #

  1. Grok 3 generated two images; this one was selected for reference purposes. ↩︎
  2. Unfortunately, unlike the Gemini test case, the same X account was used to generate the image from the prompt and then recreate the prompt based on the Grok 3-generated image. The question is whether the prompt that Grok 3 created also referenced the earlier text prompt or if it was only analyzing the image. ↩︎

Author Ken Pyle, Managing Editor



One response to “Generative AI as an Extreme Compression Method”

  1. Ken Pyle, Managing Editor

    This paper regarding Large Vision-Language Models suggests a way to reduce hallucinations by steering the latent space: “In particular, we pre-compute the “direction” of more stable features and then apply them consistently to all query examples during inference to reduce hallucination, without introducing additional training or inference cost. As sometimes hallucination rise from the text decoder, i.e. the LLM, we further obtain a textual direction and apply it to the text decoder to maximize the performance.” https://arxiv.org/pdf/2410.15778
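
    For readers curious what “steering the latent space” means in practice, here is a rough, illustrative sketch of the general idea only, not the paper’s actual procedure; all tensors below are random placeholders. A single steering direction is pre-computed from features of faithful versus hallucinated outputs and then added consistently to hidden states at inference time.

```python
import torch

# Illustrative sketch of latent-space steering; placeholders, not the paper's code.
stable_feats = torch.randn(100, 768)   # features collected from faithful outputs
halluc_feats = torch.randn(100, 768)   # features collected from hallucinated outputs

# Pre-compute one steering direction: the normalized mean difference of the two sets.
direction = stable_feats.mean(dim=0) - halluc_feats.mean(dim=0)
direction = direction / direction.norm()

def steer(hidden: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Nudge hidden states toward the 'stable' direction at inference time."""
    return hidden + alpha * direction

# Applied consistently to every query, e.g., to a chosen decoder layer's output:
example_hidden = torch.randn(1, 16, 768)   # (batch, seq_len, dim) placeholder
steered = steer(example_hidden)
```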
