The state of AI for hand-drawn animation inbetweening (yosefk.com)
262 points by luu 14 days ago | 38 comments



Great article!

This is one of the most overlooked problems in generative AI. It seems so trivial, but in fact, it is quite difficult. The difficulty arises because of the non-linearity that is expected in any natural motion.

The author has laid out the difficulties of this problem far better than I could.

I started with a simple implementation that tried to move segments around the image using a segmentation mask + ROI. That strategy didn't work out, probably because of either a mathematical bug or insufficient data. I suspect the latter.

The whole idea was to draw a segmentation mask on the target image, then draw lines representing the motion, with the option to insert keyframes along those lines.

Imagine you are drawing a curve from A to B. You divide the curve into A, A_1, A_2, ..., B.

Now, given the segmentation mask, the motion curve, and the whole image as input, we train a model to move only the ROI according to the motion curve and keyframes.

The problem with this approach is in sampling the keyframes and maintaining consistency (making sure the ROI represents the same object) across subsequent keyframes.

If we can enforce some form of consistency, this method might provide enough constraints to generate viable results.
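For what it's worth, the core of what I was trying looked roughly like the sketch below - a minimal numpy version with made-up names, where the real attempt had a learned model in place of the naive cut-and-paste:

    import numpy as np

    def sample_curve(points, n_frames):
        # points: (x, y) control points along the motion curve (A, A_1, A_2, ..., B)
        # returns n_frames positions spaced evenly along the polyline
        pts = np.asarray(points, dtype=float)
        seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
        t = np.concatenate([[0.0], np.cumsum(seg)]) / seg.sum()
        s = np.linspace(0.0, 1.0, n_frames)
        return np.stack([np.interp(s, t, pts[:, 0]), np.interp(s, t, pts[:, 1])], axis=1)

    def move_roi(image, mask, curve_points, n_frames):
        # naive baseline: cut the masked ROI out and paste it along the curve,
        # one frame per sampled position (no rotation, no deformation, no inpainting)
        ys, xs = np.nonzero(mask)
        cy, cx = ys.mean(), xs.mean()
        frames = []
        for px, py in sample_curve(curve_points, n_frames):
            dy, dx = int(round(py - cy)), int(round(px - cx))
            frame = image.copy()
            frame[ys, xs] = image[mask].mean(axis=0)      # crudely "erase" the source region
            ty = np.clip(ys + dy, 0, image.shape[0] - 1)
            tx = np.clip(xs + dx, 0, image.shape[1] - 1)
            frame[ty, tx] = image[ys, xs]                 # paste the ROI at its new position
            frames.append(frame)
        return frames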


I've currently got 3K more words shelved on why it's hard if you're targeting real animators. One point is that human inbetweeners get "spacing charts" showing how much each part should move, even though they understand motion very well, because the key animator wants to control the acting.


Great read - I learned quite a bit about this. The only quibble I had is at the end, and it's a very general layperson's view:

> But what’s even more impressive - extremely impressive - is that the system decided that the body would go up before going back down between these two poses! (Which is why it’s having trouble with the right arm in the first place! A feature matching system wouldn’t have this problem, because it wouldn’t realize that in the middle position, the body would go up, and the right arm would have to be somewhere. Struggling with things not visible in either input keyframe is a good problem to have - it’s evidence of knowing these things exist, which demonstrates quite the capabilities!).... This system clearly learned a lot about three-dimensional real-world movement behind the 2D images it’s asked to interpolate between.

I think that's an awfully strong conclusion to draw from this paper - the authors certainly don't make that claim. The "null hypothesis" should be that most generative video models were trained on a ton of yoga instruction videos shot very similarly to the example shown, and here the AI is simply repeating similar frames from similar videos. Since this most likely wouldn't generalize to yoga videos shot at a skew angle, it's hard to conclude that the system learned anything about 3D real-world movement. Maybe it did! But the study authors didn't come to that conclusion, and since their technique is actually model-independent, they wouldn't be in a good position to check for data contamination / overfitting / etc. The authors seem to think the value of their work is that generative video AI is by default past->future, but you can do future->past without changing the underlying model, and use that to smooth out issues in interpolation. I just don't think there's any rational basis for generalizing this to understanding of 3D space itself.
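(Their pitch, as I understand it, reduces to roughly the following - a rough sketch with a made-up model.generate() interface, since their code isn't released:)

    import numpy as np

    def interpolate_between_keys(model, key_a, key_b, n_frames):
        # forward pass: a normal past->future generation conditioned on the first keyframe
        fwd = model.generate(start_frame=key_a, num_frames=n_frames)
        # "backward" pass: condition the same past->future model on the last keyframe,
        # then reverse the clip, so it effectively plays future->past
        bwd = model.generate(start_frame=key_b, num_frames=n_frames)[::-1]
        # fuse the two passes, trusting each one more near its own keyframe
        out = []
        for i in range(n_frames):
            w = 1.0 - i / max(n_frames - 1, 1)
            out.append(w * np.asarray(fwd[i]) + (1.0 - w) * np.asarray(bwd[i]))
        return out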

This isn't a criticism of the paper - the work seems clever but the paper is not very detailed and they haven't released the code yet. And my complaint is only a minor editorial comment on an otherwise excellent writeup. But I am wondering if the author might have been bedazzled by a few impressive results.


You're technically correct: there's no basis to argue that a "3D representation" was learned, as opposed to "enough 2D projections to handle the inputs in question." That said, the hand which did not exist in either of the original 2D frames makes an appearance. I think calling wherever it was pulled out of "the 3rd dimension" is not wrong; it was occluded in both 2D inputs, and functionally you had to know about its existence in the 3rd dimension to show it, even if you technically did it by learning how pixels look across frames.

You can also see much more 3D-ish things in the paper, with two angles of a room and a video created by moving the camera between them. Of course, in some sense it adds to my point without detracting from yours...


My problem is that this behavior can be attained by a shallower and non-generalizable understanding. Instead of realizing that the hand is blocked, perhaps the system's model is equivalent to the hand disappearing and reappearing with a "swipe." This understanding would come not from 3D modeling of human anatomy, but rather from a hyperfocused study of yoga videos where the camera angle is just like the one shown in the paper (it is mostly the cliched camera angle that raises my suspicions). An understanding like this would not always generalize properly: e.g. in a skew video, instead of the hand being partially occluded, it visibly pops in and out, or the left hand strangely blends into the right.

There's a general issue with generative AI drawing "a horse riding an astronaut" - art generators still struggle to do this because they just can't generalize to odd scenarios. I strongly suspect this method has a similar issue with "interpolate the frames of this yoga video with a moving handheld camera." AFAIK these systems are not capable of learning how 3D people move when they do yoga: they learn what 2D yoga instructional videos look like, and only incidentally pick up detailed (but non-generalizable) facts about 3D motion.


The results are actually shockingly bad, considering that I think this should be _easier_ than producing a realistic image from scratch, which AI does quite well.

I don't have more than a fuzzy idea of how to implement this, but it seems to me that key frames _should_ be interchangeable with in-between frames. So you'd want to train it such that if you start with key frames and generate in-between frames, and then run the in-between frames back through the AI, it regenerates the original keyframes.
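Something like a cycle-consistency loss, I guess - a rough sketch, assuming a hypothetical model(a, b) that predicts the single frame halfway between frames a and b, and five consecutive training frames f0..f4:

    import torch.nn.functional as F

    def cycle_consistency_loss(model, f0, f1, f2, f3, f4):
        # treat f0, f2, f4 as "keys" and f1, f3 as the true in-betweens
        pred_f1 = model(f0, f2)                # keys -> in-between
        pred_f3 = model(f2, f4)                # keys -> in-between
        pred_f2 = model(pred_f1, pred_f3)      # in-betweens -> should regenerate the key
        interp_loss = F.l1_loss(pred_f1, f1) + F.l1_loss(pred_f3, f3)
        cycle_loss = F.l1_loss(pred_f2, f2)
        return interp_loss + cycle_loss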


> I think this should be _easier_ than producing a realistic image from scratch

Think of this in terms of constraints. An image from scratch has self-consistency constraints (this part of the image has to be consistent with that part), and it may have semantic constraints (if it has to match a prompt). An animation has the same self-consistency constraints, but also has to be consistent with entire other images! The fact that the images are close in some semantic space helps, but all the tiny details become important to get precisely correct in a new way.

Like, if a model has some weird gap where it knows how to make an arm at 45 degrees and 60 degrees, but not 47, then that's fine for from-scratch generation. It'll just make one like it knows how (or more precisely, like it models as naturally likely). Same with any other weird quirks of what it thinks is good (naturally likely): It can just adjust to something that still matches the semantics but fits into the model's quirks. No such luck when now you need to get details like "47 degrees" correct. It's just a little harder without some training or modeling insight into how an arm at 45 degrees and 47 degrees are really "basically the same" (or just that much more data, so that you lose the weird bumps in the likelihood).

I wouldn't be surprised if "just that much more data" ends up being the answer, given the volume of video data on the internet, and the wide applicability of video generation (and hence intense research in the area).


It's counterintuitive, but less so considering that for a human, too, it's way easier to draw something from scratch than to inbetween two key frames!

(I guess we're used to machines and people struggling at opposite things, so this is counter-counterintuitive, or something...)

Animation key frames are not interchangeable with inbetween frames, since the former try to show the most body parts in "extreme" positions (though it's not always possible for all parts, due to so-called overlapping action). This is not to say you can't generate plausible "extremes" from inbetweens; acting-wise, key frames definitely carry the most weight.

AI being good at stills is true, though it takes a lot of prompting and cherry picking quite often; most results I get out of naively prompting the most famous models are outright terrifying.


Animation is much lower framerate than live video, motion can be extremely exaggerated, and the underlying shape can depend on the view, i.e. be non-Euclidean. Additionally, there are fewer high-frequency features (think leopard spots) that can serve as cues for how the global shape moves (the leopard's outline). And of course things are drawn by humans, not captured by cameras, which means animation errors will be pervasive throughout the training data.

These things combined mean less information to learn a more difficult world model.


I only scrolled through the article, reading snippets and looking at pictures, but the pictures of yoga moves were what made me think "this is hard". Specifically, interpolating between a leg that's visible and extended and a leg that is obscured behind other limbs... it will be impressive/magical when the AI correctly distinguishes between possibilities like "this thing should fade/vanish" and "this thing should fold and move behind/be obscured by other parts of the image".


Same; I would have thought that edge detection would have been among the first problems to get solved!


A blog post from the same guy that used to maintain the C++ FQA!

https://yosefk.com/c++fqa/


One of my favourite bits of content written about C++. Highly recommended.


The state of the art in 3D pose estimation and pose transfer from video seems to be pretty accurate. I wonder if another approach might be to infer a wireframe model for the character, then tween that instead of the character itself. It would be like the vector approach described in the article, but with far fewer vertices; then, once you have the tween, use something similar to pose transfer to map the character's most recent frame depiction onto the pose.

Training on a wireframe model seems like it would be easier, since there are plenty of wireframe animations out there (at least for humans) that you could use, removing in-between frames and trying to infer them.
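Roughly the pipeline I'm imagining - all names here are hypothetical, and the pose estimator and pose-transfer model are the hard parts you'd have to plug in:

    import numpy as np

    def tween_via_skeleton(estimate_pose, pose_transfer, key_a, key_b, n_inbetweens):
        # estimate_pose(img)           -> (J, 2) array of joint positions (hypothetical)
        # pose_transfer(ref_img, pose) -> image of the character re-posed (hypothetical)
        pose_a = estimate_pose(key_a)
        pose_b = estimate_pose(key_b)
        frames = []
        for t in np.linspace(0.0, 1.0, n_inbetweens + 2)[1:-1]:   # skip endpoints; those are the keys
            pose_t = (1.0 - t) * pose_a + t * pose_b              # naive linear tween of joints; a real
                                                                  # version would interpolate bone angles
            frames.append(pose_transfer(key_a, pose_t))
        return frames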


Now that's a great use for AI! Inbetweening has always been a thankless job, usually outsourced to sweatshop-like studios in Vietnam and, even recently, North Korea (Guy Delisle's "Pyongyang" comic is about his experience as a studio liaison there).

And AI has less room to hallucinate here, since it is more a kind of interpolation, even if in the short example shown, the AI still "morphs" instead of cleanly transitioning.

The real animation work and talent is in keyframes, not in the inbetweening.


As an animator for 40-plus years, I can tell you that in-betweening is a very difficult job. The fact that it's often cheaply outsourced reflects the fact that the people paying for the animation simply don't care about the quality. The results are seldom good.

How much poor-quality in-betweening hurts the performance for the audience is a complicated discussion. Animation that is _very_ bad can often be well accepted if other factors compensate (voice acting, design, direction, etc.).

A good in-betweener is not simply interpolating between the keys. For hand drawn animation at least, there's a lot more going on than that.

We'll leave out any discussion of breakdowns here. For one, it's a difficult concept, much more difficult to explain than 'tweening. The other reason is that different animators will give different opinions on what a breakdown is or does.

I will say, though, that I think properly tagged breakdown drawings could significantly improve the performance of AI-generated in-betweens.

Anyone who is seriously interested in the process should read the late, great Richard Williams' book, _The Animator's Survival Kit_. This is especially true for those who want to "augment" the process with machine learning. The book is very readable, even for non-artists. And he gets into the nitty gritty of what makes a good performance, and the mechanics behind it.

Edit: Another good resource, and relevant to 3D animation as well, is Raf Anzovin's _Just To Do Something Bad_ blog. He has many posts on what he calls "ephemeral rigging" that are absolutely fascinating. Be aware that the information is diffused throughout the blog and not presented in a form for teaching. His opinions are fairly controversial in the field, but I think he is onto something. (https://www.justtodosomethingbad.com/)


Post author here - it would be very interesting to hear more of your thoughts on this! It's not easy to find a pro animator willing to consider the question, given the current level reached by AI methods.


I would suggest reading the Williams book as a place to start. Thomas and Johnston's _The Illusion of Life_ is also a must; Thomas and Johnston were two of Disney's Nine Old Men. The Preston Blair book, simply called _Animation_, is good.

The thing about animation is that it is not about interpolation. It's about the spacing between drawings. The methods developed by animators were not at all mathematical, but something that they felt out by experimentation (trial and error).

The math that does enter into it is directly related to the frame rates. If animation had started in modern times, with frame rates of 30 fps or 60 fps, it would have been a very different animal. And much harder!

At 12 fps or 24 fps you have a very limited range of "eases" that can be done. So while eases do figure into it, it's the arcs, the articulation, and the perceived mass of the parts of the character that make it seem alive. Looking only at the contours and the in-betweens misses all the action.
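To put rough numbers on the frame-rate point (a toy calculation - animators feel this out rather than compute it): on a half-second move at 24 fps, the entire "ease" comes down to a dozen spacings.

    import numpy as np

    frames = 12                                   # half a second at 24 fps
    t = np.linspace(0.0, 1.0, frames + 1)
    ease = 3 * t**2 - 2 * t**3                    # a basic ease-in/ease-out (smoothstep)
    positions = 100 * ease                        # a 100-unit move
    print(np.round(np.diff(positions), 1))        # distance covered on each frame
    # the whole "ease" lives in these 12 numbers - that's the entire design space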

An awareness of the graphic nature of the drawings and the stylization of figures and faces is also critical. Cartooning is its own art form, and it is tied directly to the way human brains make sense of what the eye sees. Getting more realistic often takes you further from your destination.

Storytelling is also a core part of good animation. A good animator can make a character seem to think and react, as if it were alive. But you won't get there by imitating the real world directly. Rotoscoping has very limited use in good character animation and storytelling. It's all about abstracting out what the brain feels is important and what it expects. You can get away with murder if you caricature the right details.

When I've worked on training new animators, one of the points I stress is that it is the articulation and the perceived mass of the character that really sell a performance. The best art style in the world is nearly useless if the viewer doesn't buy into the notion that they are watching a thinking person reacting with a physical body to events in an interactive world.

My feeling is that you will get further if you build articulated rigs and teach the AI to make them move, 2D or 3D. There is footage of tiny AI-driven robots in a Google experiment that are learning to play soccer. The AI is learning to make them move and solve problems (running around the soccer field and scoring goals). Very natural-looking behavior (animation!) develops almost automatically from that.

Trying to solve the problem by dealing with lines, contours, and interpolation seems very far away from the important parts of animation.

Just my two cents worth.

Get a copy of the Williams book, it's on Amazon. Read his thoughts, he explains things much better, and more entertainingly, than I do. Sharpen up your pencil and start making some simple walks. Simple stick figures and tube people work just fine. And you may find that you enjoy the art form. Even if you don't become an animator yourself, the exercises will deepen your appreciation and understanding of the art form.


I've read Williams' book, and I've studied animation and done some, though I wouldn't call myself an animator yet.

I hope to avoid building rigs because they're, well, rigs. It's much nicer and more flexible to control things through drawing than through a rig, which has a bunch of limitations, and then there are issues with hair/cloth/water/etc. What can be done without a rig is another question, but the methods I reviewed in this post are certainly not the most that can be done.


Rig can mean a lot of things. Hand-drawn animators often use a rig; it's just that the rig is made of graphite lines on paper, driven by a meat-based neural network.

I'm not just being cute when I say that. The problems the AI in the examples was having have distinct analogs in the problems human animators have. Arcs are a problem, as is the notion that in-betweens are mostly about interpolation.

As I said, timing and articulation are at the heart of most kinds of animation. Even very stylized animation must be aware of this, if not being a slave to it. Imagination and expression are important, but first the audience has to _believe_.


Actually, inbetweening is really hard (and I think requires talent), and it used to be a big way to learn enough to become a key animator. I would worry about AI eliminating this learning route if classical animation weren't struggling to survive at all.


It's hard like translating novels - you have to match someone else's style, which is why it's thankless.

I don't know if that's really a good pathway to become a key animator - how many inbetweeners are there for one key animator?


This was really fun. It captured a lot of thinking on a topic I've been interested in for a while as well.

The discussion about converting to a vector format was an interesting diversion. I've been experimenting with using potrace from Inkscape to convert raster images into SVG and then using animation libraries in the browser to morph them, and this idea seems to share some concepts.

One of my favorite films is A Scanner Darkly, which used a technique called rotoscoping - as I recall, a combination of hand-tracing the animation with computers then augmenting it, or vice versa. It sounded similar. The Wikipedia page talks about the director Richard Linklater and also the MIT professor Bob Sabiston, who pioneered that derivative digital technique. It was fun to read that.

https://en.m.wikipedia.org/wiki/Rotoscoping

https://en.m.wikipedia.org/wiki/Bob_Sabiston


> and that used a technique called rotoscoping

Technically it's "interpolated rotoscoping" using a custom tool called Rotoshop, which takes vector shapes drawn over footage and then smoothly animates between the frames, giving it a distinct dream-like look.

Rotoscoping is where you work at a traditional animation framerate, drawing over live action, but each frame is a new drawing and doesn't have the signature shimmery look of A Scanner Darkly and Waking Life, so I think it's worth pointing out the distinction.

https://en.wikipedia.org/wiki/Rotoshop


Animation has to be the most intriguing hobby I'm never planning on engaging with, so this kind of blog post is great for me.

I know hand-drawn 2D is its own beast, but what's your thought on using 3D datasets for handling the occlusion problem? There's so much motion-capture data out there -- obviously almost none of it has the punchiness and appeal of hand-drawn 2D, but it feels like there could be something there. I haven't done any temporally-consistent image gen, just played around with Stable Diffusion for stills, but the ControlNets that make use of OpenPose are decent.

3D is on my mind here because the Spiderverse movies seemed like the first demonstration of how to really blend the two styles. I know they did some bespoke ML to help their animators out by adding those little crease-lines to a face as someone smiles... pretty sure they were generating 3D splines, however, not raster data.

Anyway, I'm saving the RSS feed, hope to hear more about this in the future!


The 2nd paper actually uses a 3D dataset, though it explicitly doesn't attempt to handle occlusion beyond detecting it.

I sort of hope you can handle occlusion by learning from 2D training data, similarly to the video interpolation paper cited at the end. If 3D is necessary, it's Not Good for 2D animation...

AI for 3D animation is big in its own right; these puppets have 1 billion controllers and are not easy for humans to animate. I didn't look into it deeply because I like 2D more. (I learned 3D modeling and animation a bit, just to learn that I don't really like it...)


Maybe there is also value in 2D datasets that aren't hand drawn. A lot of TV shows are made in Toon Boom or Adobe Animate (formerly Macromedia Flash). Those also do automatic inbetweening, but with a process that's closer to CSS animations: everything you want to move independently is its own vector that can be translated, rotated, and squished, and the software just interpolates the frames in between with your desired easing algorithm. That's a lot of data available in those original project files that's nontrivial to infer from the final product.
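(For reference, the kind of inbetweening those tools do boils down to something like this - a simplified sketch with invented parameter names, applying an easing curve to each layer's transform parameters:)

    def ease_in_out(t):
        return 3 * t**2 - 2 * t**3          # smoothstep; the tools offer a menu of eases like this

    def tween_layer(start, end, n_inbetweens, ease=ease_in_out):
        # start/end: per-layer transform params, e.g. {"x": 0, "y": 0, "rotation": 0, "scale": 1.0}
        frames = []
        for i in range(1, n_inbetweens + 1):
            t = ease(i / (n_inbetweens + 1))
            frames.append({k: start[k] + t * (end[k] - start[k]) for k in start})
        return frames

    # e.g. tween_layer({"x": 0, "rotation": 0}, {"x": 50, "rotation": 90}, n_inbetweens=5)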


I doubt you can learn much from tweened flat cutouts beyond fitting polynomials to data points. The difficulty with full animation is the rotation & deformation you can't do at all with cutouts. (Puppet warp/DUIK cutouts are much less rigid than Flash, but the above still applies.)


Really cool read, I liked seeing all the examples.

I wonder if it would be beneficial to train on lots of static views of the character too - not just the frames - so that permanent features like the face get learned as a chunk of adjacent pixels. Then, when you go to make a walking animation, the relatively small amount of training data on moving legs compared to the high repetition of faces would cause only the legs to blur unpredictably, while the faces would stay more intact; the overall result might be a clearer-looking animation.


Almost certainly a good idea. I'm about to start trying things in this direction.


Not sure who would fund this research? Perhaps the Procreate Dreams folks [0]. I'm sure they'd love to have a functional auto-tween feature.

0: https://procreate.com/dreams


Frankly, I'm surprised this isn't much higher quality. The hard thing in the transformers era of ML is getting enough data that fits into the next-token or masked-language-modeling paradigm; in this case, however, inbetweening is exactly that task, and every hand-drawn animation in history is potential training data.

I'm not surprised that using off the shelf diffusion models or multi-modal transformer models trained primarily on still images would lead to this level of quality, but I am surprised if these results are from models trained specifically for this task on large amounts of animation data.
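The training setup I'd have expected, sketched loosely with a hypothetical model interface - just the masked-frame objective applied to animation clips:

    import torch.nn.functional as F

    def inbetween_training_step(model, clip):
        # clip: (T, C, H, W) tensor of consecutive frames from a hand-drawn animation
        keys = clip[[0, -1]]                          # keep only the first and last drawing
        target = clip[1:-1]                           # the in-betweens the model must reconstruct
        pred = model(keys, num_out=target.shape[0])   # hypothetical signature
        return F.l1_loss(pred, target)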


They're indeed not diffusion models, though they are trained on animation data as well as specifically designed for it (the raster papers, at least). I'm very hopeful wrt diffusion, though I'm looking at it and it's far from straightforward.

One problem with diffusion and video is that diffusion training is data hungry and video data is big. A lot of approaches you see have some way to tackle this at their core.

But also, AI today is like 80s PCs in some sense: both clearly the way of the future and clumsy/goofy, especially when juxtaposed with the triumphalism you tend to hear all around.


Convert frames to SVG, pass SVG as text to ChatGPT, then ask for the SVG of in-between frames. Simple as.


If an AI could ever capture the charm of the original hand-drawn animation, then it's over for us.


If an AI can't make animators more productive, it's as close to over for hand-drawn animation as it has been for the last decade.


I don't get it. We essentially figured this out in the first Matrix by having a bunch of cameras film an actor and then using interpolation to create a 360-degree shot from it.

Why can't this basic idea be applied to simple 2D animation over two decades later?


What was interpolated in the Matrix? I was under the impression they were creating 1 second of 24 fps video by combining images shot on 24 individual cameras.



