Generative AI as an Extreme Compression Method

Are text files the future of image and, potentially, video compression? Using text prompts to generate images is well known. But what about using generative AI in the reverse direction, to create a text prompt for a given image? This technique could yield a more than 100:1 reduction in file size, from hundreds of kilobytes down to mere kilobytes.

It is clear from the limited test results shown below that off-the-shelf generative AI products could achieve this sort of compression improvement. The fundamental stumbling block, at this point, is the accuracy of the recreated image.

[Note: Added after publication. This 2023 Tweet from John Carmack pointed out that “Generative AI has the potential to improve video compression by an order of magnitude, but at an extreme computational cost. It is an interesting question whether that will ever be worth it, given that network and storage cost and capacity are still improving as fast as computation.”

Additionally, this November 2022 Baidu Research (USA) paper proposed “a generative image compression method that demonstrates the potential of saving an image as a short text embedding which in turn can be used to generate high-fidelity images which is equivalent to the original one perceptually.”]
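
Before getting to the tests, here is a minimal sketch of what such a round trip could look like in code. It is purely illustrative: describe_image() and generate_image() are hypothetical stand-ins for whatever vision-language and text-to-image models are used (Gemini, Grok 3, or anything else), and the "compression ratio" is simply the byte count of the original image divided by that of the stored prompt.

```python
from pathlib import Path

def describe_image(image_path: str) -> str:
    """Placeholder: ask a vision-language model for a detailed description of the image."""
    raise NotImplementedError("call your image-captioning model here")

def generate_image(prompt: str, out_path: str) -> None:
    """Placeholder: ask a text-to-image model to regenerate the scene from the prompt."""
    raise NotImplementedError("call your image-generation model here")

def compress_as_text(image_path: str, prompt_path: str) -> float:
    """'Compress' an image by storing only a text description of it; return the size ratio."""
    prompt = describe_image(image_path)
    Path(prompt_path).write_text(prompt, encoding="utf-8")
    return Path(image_path).stat().st_size / Path(prompt_path).stat().st_size

# "Decompression" is regeneration -- lossy in accuracy rather than in pixels:
# generate_image(Path("frame.txt").read_text(encoding="utf-8"), "frame_reconstructed.png")
```

The tests below are exactly this loop, carried out by hand with Gemini and Grok 3.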

Reverse Generative AI Test with Gemini #

As a test to see how close we are to answering the above question, the following prompts were given to Gemini:

1st prompt: “Could you create an image that shows a reference frame with an associated text prompt that can be used for this article?” The first result was a black-and-white chart, so a second prompt followed: “Well, I was hoping for an image of, say a person, with a description of the action the person is doing (say running) with a description of the background.”

Based on that minimal prompt, Gemini returned the following image:

A Gemini-generated reference image and description

Gemini was then asked to describe the above image, and it came back with the following description:

“A young woman with fair skin and long, brown hair tied in a ponytail is running along a paved path in a sunny park. She’s wearing a bright orange t-shirt, dark blue athletic shorts, and white running shoes. Her face shows a joyful expression, with a slight smile and her eyes looking ahead. The path curves gently to the left, bordered by lush green grass and a variety of colorful flower beds. Tall trees with full foliage line the path in the background, casting dappled shadows. The sun is shining brightly, creating a warm, inviting atmosphere. The overall scene has a vibrant, energetic feel, with a sense of motion and vitality.”

Yikes, the most significant mistake is that it identified the person in the image as female instead of male. This error was borne out when the prompt was input into another instance of Gemini (a different account), resulting in the following image.

Gemini-generated image based on Gemini-generated prompt from the Gemini-generated reference image

From a compression standpoint, it is a winner: the text prompt is about 1 KByte, compared to 623 KBytes (2048×2048 pixels) for the original image. Even if the prompt were ten times more detailed, the text file would still be approximately 7 KBytes, meaning roughly 90:1 compression. The quality of the regenerated image is as good as the original; the issue is accuracy. The inaccuracy starts with the person (female instead of male) and continues in the details (e.g., both images appear to be in a park, but one runner is on a path and the other is running on a lawn).
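
As a quick sanity check on those figures, the back-of-the-envelope arithmetic below reproduces the ratios quoted above; the file sizes are the ones reported in this test, not general benchmarks.

```python
# Compression ratios from the figures quoted above.
original_kb = 623        # 2048x2048 reference image, as reported
prompt_kb = 1            # the ~1 KByte text description
detailed_prompt_kb = 7   # a hypothetical prompt roughly ten times more detailed

print(f"Prompt as stored:        {original_kb / prompt_kb:.0f}:1")            # 623:1
print(f"Ten-times-richer prompt: {original_kb / detailed_prompt_kb:.0f}:1")   # ~89:1, about 90:1
```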

Reverse Generative AI Test with Grok 3 #

The Gemini-created prompt was input to Grok 3 to create a reference image, as seen in the image on the right, below. In many ways, this is close to the Gemini-generated image.¹

Grok 3 was then asked to describe the image it created, and it came up with this description:

“A young woman with fair skin and long, brown hair tied in a ponytail is running along a paved path in a sunny park. She’s wearing a bright orange t-shirt, dark blue athletic shorts, and white running shoes. Her face shows a joyful expression, with a slight smile and her eyes looking ahead. The path stretches straight ahead, bordered by lush green grass and vibrant flower beds with red, yellow, and pink flowers. Tall trees with full, green foliage line the path in the background, casting dappled shadows. The sun is shining brightly, creating a warm, inviting atmosphere with a clear blue sky. The overall scene has a vibrant, energetic feel, with a sense of motion and vitality.”

The description is very close to the Gemini-generated prompt, except for a few differences and added details, such as:

  • 3rd line – The path stretches straight ahead instead of curving gently.
  • 3rd line – The flowers are described as “vibrant flower beds with red, yellow, and pink flowers”, instead of just “colorful”.
  • 4th line – The foliage is described as green.

The Grok 3 description was then fed back as a prompt, resulting in the following images.²

Again, quality is not an issue, but accuracy is. Although Grok 3 was more accurate than Gemini, in the sense that it did not mix up the gender, the output images shared similarities with, but would not be confused with, the original. Still, given the relatively nascent state and continuing rapid development of generative AI for image creation, it does not seem out of the realm of possibility that accuracy could improve to the point where the human eye cannot discern a difference between the original and an image regenerated from its text description.
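
How might that accuracy gap be measured? One crude, illustrative option is a structural similarity score between the original and the regenerated image. The sketch below assumes a recent scikit-image (for the channel_axis parameter) and two hypothetical local files, original.png and regenerated.png, standing in for the reference and reconstructed frames. A structural metric like SSIM penalizes layout differences even when each image looks good on its own, which mirrors the quality-versus-accuracy distinction described above; a perceptual or semantic metric would be needed to judge anything stronger.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

# Load both images and force them to a common size and color space.
original = np.asarray(Image.open("original.png").convert("RGB").resize((512, 512)))
regenerated = np.asarray(Image.open("regenerated.png").convert("RGB").resize((512, 512)))

# SSIM ranges up to 1.0 for identical images. Two scenes that merely share a
# description ("a runner in a sunny park") will typically score much lower.
score = ssim(original, regenerated, channel_axis=-1)
print(f"SSIM between original and regenerated image: {score:.3f}")
```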

A Journey Led by a Pied Piper? #

The inspiration for this article was a 25-minute conversation with Gemini during an afternoon stroll. The focus of that conversation was using the above techniques for video compression. At the end of the conversation, Gemini wrote a paper summarizing what was learned, addressing the commercial possibilities, technical challenges, and opportunities of this new approach. Ultimately, we may have re-invented something similar to Pied Piper from the fictional show Silicon Valley.

Gemini’s shortcoming was the lack of external references; it simply did not have links to support its claims. That meant fact-checking, rewriting, and injecting new points into the paper, and in the end that article was shelved. With that said, there were a couple of interesting takeaways from that effort:

  • Deep Render seems to be leading the way with its generative AI-based compression work.
  • NPUs (Neural Processing Units), specialized processors for neural network acceleration that are crucial for real-time AI video, are projected to be in 2.2 billion devices by 2028 (according to data presented by Deep Render in this interview with streaming guru Jan Ozer).

Whether the approach described above has a future or the industry follows the path of Deep Render, AI is sure to play some role in the future of image and video compression.


Footnotes: #

  1. Grok 3 generated two images; this one was selected for reference purposes. ↩︎
  2. Unfortunately, unlike the Gemini test case, the same X account was used to generate the image from the prompt and then recreate the prompt based on the Grok 3-generated image. The question is whether the prompt that Grok 3 created also referenced the earlier text prompt or if it was only analyzing the image. ↩︎

Author Ken Pyle, Managing Editor



One response to “Generative AI as an Extreme Compression Method”

  1. Ken Pyle, Managing Editor

    This paper regarding Large Vision-Language Models suggests a way to reduce hallucinations by steering the latent space: “In particular, we pre-compute the “direction” of more stable features and then apply them consistently to all query examples during inference to reduce hallucination, without introducing additional training or inference cost. As sometimes hallucination rise from the text decoder, i.e. the LLM, we further obtain a textual direction and apply it to the text decoder to maximize the performance.” https://arxiv.org/pdf/2410.15778
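
    For readers curious what “steering the latent space” means in practice, here is a rough, illustrative sketch of the general idea only, not the paper’s actual procedure; all tensors below are random placeholders. A single steering direction is pre-computed from features of faithful versus hallucinated outputs and then added consistently to hidden states at inference time.

```python
import torch

# Illustrative sketch of latent-space steering; placeholders, not the paper's code.
stable_feats = torch.randn(100, 768)   # features collected from faithful outputs
halluc_feats = torch.randn(100, 768)   # features collected from hallucinated outputs

# Pre-compute one steering direction: the normalized mean difference of the two sets.
direction = stable_feats.mean(dim=0) - halluc_feats.mean(dim=0)
direction = direction / direction.norm()

def steer(hidden: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Nudge hidden states toward the 'stable' direction at inference time."""
    return hidden + alpha * direction

# Applied consistently to every query, e.g., to a chosen decoder layer's output:
example_hidden = torch.randn(1, 16, 768)   # (batch, seq_len, dim) placeholder
steered = steer(example_hidden)
```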
