How we generated a basic cartoon from text input

Storybird uses a combination of voice cloning and large language models to create custom audio stories for kids. But recently we started experimenting with creating custom cartoons, too, and we want to describe how we did it and what we think is coming next. The only input for the cartoon was a request from a kid named Asher, who asked for a cartoon about a 6-year-old boy, 1,000 soldiers, cereal, and an army museum.

This is the output:

The cartoon we created suffers from several limitations: the narration is somewhat robotic, the illustrations are not animated, and it lacks some continuity. But we believe that if the earlier part of 2023 is dominated by large language models, the latter part of this year will be dominated by movie generation models. Google is working on projects called Phenaki and Imagen Video, Meta is releasing Make-A-Video, and Anthropic and OpenAI are also working on similar technology. While the cartoon you see above is quite primitive, we believe that we aren’t far from generating realistic, animated cartoons that kids can actually interact with and “choose their own adventure.”

Story generation

First, we sent Asher’s story request to OpenAI’s newly released GPT-3.5 API and asked it to create a narrative. We noticed that if you ask ChatGPT for a kids’ story about particular elements, the stories appear superficially OK, but on closer examination they all follow a prescribed format: a character is introduced, there is some challenge, and then there is falling action in the last three paragraphs. In the second-to-last paragraph, ChatGPT likes to say “As the sun set…”.

We got around this through prompt engineering. Instead of asking ChatGPT to create a children’s story based on the chosen elements in one shot, we asked ChatGPT to transform the narrative in a number of steps. First, we asked ChatGPT to assume the role of a children’s author; then we asked it to create a theme for a story based on certain inputs; then to create a bullet-point outline of the story in chapters; then to write out the chapters; then to put together the whole story; then to rewrite the story in a more visual manner; then to make it more kid-friendly; and so on. With each successive transformation, the story gets better. The end result is a cohesive children’s story containing Asher’s desired elements.
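The chain of transformations above can be sketched as an ordered list of prompts, each sent to the chat API along with the conversation so far, with each reply feeding the next step. The exact wording of each step here is an illustrative assumption, not our production prompts.

```python
# A minimal sketch of the prompt chain. In practice, each prompt is appended
# to the running conversation and sent to the chat completions API, and the
# model's reply is kept in context for the next transformation.
def build_prompt_chain(elements: str) -> list[str]:
    """Return the ordered transformation prompts for one story request."""
    return [
        "Assume the role of a children's author. Confirm by responding Yes.",
        f"Create a theme for a children's story based on: {elements}.",
        "Create a bullet-point outline of the story in chapters.",
        "Write out each chapter of the outline in full.",
        "Put the chapters together into one whole story.",
        "Rewrite the story in a more visual manner.",
        "Rewrite the story to make it more kid-friendly.",
    ]

prompts = build_prompt_chain("a 6-year-old boy, 1,000 soldiers, cereal, an army museum")
```

Because each step rewrites the previous output rather than generating from scratch, the story drifts away from ChatGPT's default "As the sun set…" template a little more with every pass.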

Audio generation

Once we have the story, we create an audio narration of it. Right now this uses my voice and limited training data, but soon we’ll let people create stories in a range of voices, including their own. We do this using a voice-cloning API, but we believe Eleven Labs is a viable option as well.

Image generation

Now that we have the story and audio, our task is to create good illustrations.

There are a few challenges we had to explore using each of the existing text-to-image generators, like Stable Diffusion, DALL-E 2, and Midjourney. One was consistency across images: these tools have trouble retaining context between images, which is necessary for an extended narrative. You don’t want the main character to change midway through a story. The same applies to the style of the visuals: we want a consistent look from beginning to end to avoid making the story feel disjointed.

Second was ensuring there were no deformed characters, especially where limbs, eyes, and fingers are incorrectly depicted. Third was ensuring the visual style of the images suited a children’s animation. To overcome these, we tested different AI image generators and used prompt engineering.

Prompt generation for images using ChatGPT

At first, we fed ChatGPT’s story back into ChatGPT. We asked ChatGPT to extract details about the main characters, main locations, and main story props, specifically asking for details that describe the subjects visually. Below is our conversation with ChatGPT:

Us: Pretend you are the children’s fiction writer of the following story written in inverted commas. Confirm by responding Yes. “Place story content here”.
Us: I want you to provide a visual description of the character Asher in bullet points.
ChatGPT: 6 years old boy, Curious, adventurous, imaginative, kind-hearted, brave, determined, leader, open-minded, eager to learn, values friendship.

So the final output was: 6 years old boy, Curious, adventurous, imaginative, kind-hearted, brave, determined, leader, open-minded, eager to learn, values friendship.

The same method was used to create the initial prompts for each of the key elements of the story. The sections below detail how we created the character of Asher for the cartoon using AI.
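Turning the extracted description into an image prompt is then just string assembly: the descriptors from ChatGPT, joined with a style suffix. The helper below is a sketch; the style suffix shown is one of the keywords we experimented with.

```python
# Sketch: assemble an image-generation prompt from ChatGPT-extracted descriptors.
def image_prompt(descriptors: list[str],
                 style: str = "children's book illustration style") -> str:
    """Join character descriptors with a trailing style keyword."""
    return ", ".join(descriptors) + ", " + style

asher = ["6 years old boy", "curious", "adventurous", "brave"]
print(image_prompt(asher))
# → 6 years old boy, curious, adventurous, brave, children's book illustration style
```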

Image generation using Stable Diffusion

We took the keywords into Stable Diffusion Version 1.5, and it produced the images below. You can see the three problems mentioned earlier clearly persisting here: consistency, body deformation, and unsuitability of style. Note that each of these images was created with the exact same input.

We then added other prompts to achieve a more consistent result, as well as a style suited to a children’s cartoon. These included “concept art, children’s book illustration style”. Again, each of these images was created with the exact same input.
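We generated these images through Stable Diffusion's interface; for readers who want to reproduce the setup in code, here is a sketch using Hugging Face's diffusers library. The negative prompt and fixed seed are additions on our part (common tactics against deformity and run-to-run drift), not something the interface forced on us.

```python
# Sketch: Stable Diffusion 1.5 with style keywords appended to the prompt.
BASE = "6 years old boy, curious, adventurous, brave, kind-hearted"
STYLE = "concept art, children's book illustration style"
NEGATIVE = "deformed, extra limbs, extra fingers, bad anatomy"  # our assumption
PROMPT = f"{BASE}, {STYLE}"

def generate(prompt: str, negative: str, seed: int):
    """Run Stable Diffusion 1.5 locally; requires a GPU plus torch and diffusers."""
    import torch
    from diffusers import StableDiffusionPipeline
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    generator = torch.Generator("cuda").manual_seed(seed)  # fix seed for repeatability
    return pipe(prompt, negative_prompt=negative, generator=generator).images[0]
```

Fixing the seed makes reruns reproducible, but as we found, it does not make *different* prompts produce the same character.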

As you can see, there are fewer problems with deformity. And the style, although far from perfect, is now closer to what we want for a children’s cartoon. To improve further, we experimented with different Stable Diffusion models.

Stable Diffusion now has many different models, and Version 1.5 is one of them. These models can be downloaded online. We experimented with the same prompts across different model versions, and the Comic Diffusion V2 model gave us much better results for the style, shown in the images below, using the exact same prompts as Version 1.5 for better comparison.

We are now much happier with the lack of deformity, as well as the style of the images. However, the one problem this does not address is consistency. With every new image generation, the characters, color scheme, and general style change significantly. Some images, as you can see, draw two different characters in the same image.

This is because Stable Diffusion generally requires more detailed descriptions to give you a more accurate image. So we added more prompts to control this, such as describing the character’s clothing and its colors, his hairstyle and hair color, and his general features. Below are the results.

We are getting closer to a consistent character design, but despite the more accurate descriptions, the characters still differ in every new image generation.

To create consistency, we added the keyword “character design sheets”, so different character poses and expressions are created within the same image. The following images are the outcome.

While there is more consistency here for each character, they are still slightly different in every pose. Stable Diffusion also confuses the colours: the clothes keep changing colour on each character. It is a very common downfall of Stable Diffusion to confuse colours and assign them to different objects in each iteration of image generation. This was not helping us.

Stable Diffusion ControlNet and Img2Img

Instead of text to image (txt2img), we tried image to image generation (img2img) to tackle the consistency problems. But the same problems with colour and character consistency persisted. In some cases it also didn’t help with deformed limbs, as you can see in the middle image below. The left image below is the original 3D character used as the img2img input, and the other two are different results generated with the exact same prompts.

In Stable Diffusion, we reduced the Denoising Strength from 0.75 to 0.4, and the images became better and clearer.
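For reference, the img2img setup with a reduced denoising strength looks like this in Hugging Face's diffusers library; we used Stable Diffusion's interface rather than this code, so treat it as a sketch of the same settings.

```python
# Sketch: img2img with denoising strength lowered from the 0.75 default to 0.4.
def img2img(init_image, prompt: str, strength: float = 0.4, steps: int = 50):
    """Requires a GPU plus torch and diffusers; init_image is a PIL image."""
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=init_image, strength=strength,
                num_inference_steps=steps).images[0]

# img2img only runs roughly strength * steps of the denoising schedule over
# the source image, which is why lowering strength from 0.75 to 0.4 preserves
# more of the original character.
effective_steps = int(0.4 * 50)
```

At strength 0.4 only about 20 of the 50 denoising steps actually run, so the output stays close to the source pose and proportions.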

The images look a lot better; however, there are a few problems with this approach. One is that Stable Diffusion cannot redesign the character, or change the clothing colours and character poses, without significantly impacting the consistency of the character, as in the earlier example.

The other problem with this approach is that the character has to be designed, each pose redrawn, and all the colours configured manually before putting them through Stable Diffusion. This would take a lot of man-hours and would defeat the purpose of using AI. As a result, I did not pursue this route any further.

Image generation using Midjourney

At the same time, experimenting with Midjourney was giving us quite different results. Running on Discord, Midjourney tends to produce cleaner images with fewer descriptions and prompts.

Using the very initial prompt of “6 years old boy, Curious, adventurous, imaginative, kind-hearted, brave, determined, leader, open-minded, eager to learn, values friendship.”, the images below were produced. Notice the clarity of the images and the lack of deformity. Also notice how each of the words was translated into a symbolic image; e.g., brave was translated into a lion.

Hence, we modified the keywords to remove any excess visuals. Learning from the prompt engineering in the Stable Diffusion experiments, we also added “Children’s book illustration style, character design sheet, multiple poses and expressions,” as well as keywords describing the character’s clothing, colours, hairstyle, and other features.

In the images below you can see the changes as we progressively added more descriptions of clothes and colors to the prompts:

The text-to-image results, using the same prompts, are much more acceptable from Midjourney than from Stable Diffusion, in terms of consistency, control of the style, and lack of deformed body parts.

The image below shows the final result with the prompts used for it. The buttons at the bottom of the image allow you to upscale (U1, U2, U3, U4) any of the four images, as well as create a variation (V1, V2, V3, V4) of any of them.

Below is an upscale of the top-left image (U1) for illustration purposes. Notice the consistency of the character and style, as well as the higher quality of the image compared to the Stable Diffusion images, especially the consistency in colour, where Stable Diffusion was suffering. There is slight deformity around the mouth, but that can be fixed with a quick edit.

Once we found a method that worked, the rest of the characters were created using the same methodology in Midjourney.

To keep everything consistent, the location images were also created in Midjourney using the same methodology. With locations, it is also more acceptable for each image to look slightly deformed or different. Finally, we overlaid the images with the audio narration to create the video.
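The final assembly of stills plus narration can be done with ffmpeg; the sketch below builds the command in Python (our actual pipeline may differ). It assumes the illustrations are saved as frame_000.png, frame_001.png, and so on, each shown for five seconds.

```python
# Sketch: combine an image sequence and the narration audio into one video.
import subprocess

def ffmpeg_args(image_pattern: str, audio: str, out: str,
                secs_per_image: int = 5) -> list[str]:
    """Build the ffmpeg command; run it with subprocess.run(args, check=True)."""
    return [
        "ffmpeg",
        "-framerate", f"1/{secs_per_image}",  # one input image every N seconds
        "-i", image_pattern,                  # e.g. "frame_%03d.png"
        "-i", audio,                          # the narration track
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",                # widest player compatibility
        "-shortest",                          # stop when the shorter stream ends
        out,
    ]

args = ffmpeg_args("frame_%03d.png", "narration.mp3", "cartoon.mp4")
# subprocess.run(args, check=True)  # requires ffmpeg on PATH
```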