Project Write-up · Final-Year Thesis

Detecting deepfakes without ever seeing one.

Rahul Bhushan Vemula · ~7 min read · Python, TensorFlow, OpenCV, Flask

Most deepfake detectors are trained on a library of known fakes. Mine never sees a single one. It learns what real looks like — and flags anything that doesn't fit. It needs zero labelled fakes in training and runs on a plain CPU. Here's why I built it that way.

01 The problem with training on fakes

The obvious way to build a deepfake detector is a classifier: feed it thousands of real videos and thousands of fake ones, and let it learn the boundary between the two. It works — until it meets a fake made by a method it has never seen.

Generation techniques move fast. A model trained on last year's face-swaps doesn't necessarily recognise this year's diffusion-based ones. A classifier is only ever as good as the fakes in its training set, and that set is always behind. I wanted something that wouldn't go stale the moment a new generator appeared.

02 Reframing it as anomaly detection

So I flipped the question. Instead of "does this look like a known fake?" I asked "does this look like a real video at all?"

That's an anomaly-detection problem, and the tool for it is an autoencoder. You train it on real videos only. It learns to compress a frame down to a compact representation and then reconstruct it. Trained on enough real footage, it gets very good at rebuilding real faces — and noticeably bad at rebuilding anything that breaks the patterns it learned. That reconstruction error becomes the signal.

The core idea

A fake is never explicitly defined. It's simply whatever the model struggles to reconstruct. High reconstruction error → likely manipulated. This is what lets it generalise to fakes it was never trained on.
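To make the signal concrete, here is a minimal sketch of a reconstruction-error score. It assumes the score is a plain mean squared error over every pixel of a clip; the function name `reconstruction_error` and the toy arrays are illustrative and not taken from the thesis code.

```python
import numpy as np

def reconstruction_error(original, reconstructed):
    """Mean squared error between a clip and its reconstruction.

    Both arrays are assumed to share the shape
    (frames, height, width, channels) with pixel values scaled to [0, 1].
    """
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if original.shape != reconstructed.shape:
        raise ValueError("clip and reconstruction must have the same shape")
    return float(np.mean((original - reconstructed) ** 2))

# Toy arrays standing in for a real clip and two model outputs (assumption).
clip = np.full((4, 8, 8, 3), 0.5)
good_rebuild = clip + 0.01   # close to the input: low error
poor_rebuild = clip + 0.20   # far from the input: high error

low = reconstruction_error(clip, good_rebuild)
high = reconstruction_error(clip, poor_rebuild)
```

The larger the gap between a clip and its rebuild, the larger the score, which is the only property the detector relies on.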

03 Why Convolutional Autoencoder + LSTM

A deepfake gives itself away in two ways, and I wanted to catch both:

Spatial artifacts — within a single frame: blending seams around the face, inconsistent lighting, warped textures. The convolutional layers handle this, the same way a CNN learns image features.
Temporal artifacts — across frames: a flicker between frames, unnatural blinking, motion that doesn't quite track. A single-frame model misses these entirely. The LSTM layer reads the sequence of frames and catches inconsistencies over time.

Put together: convolutional layers extract per-frame features, the LSTM models how those features should evolve across a real clip, and the decoder reconstructs the sequence. When the temporal flow is off — as it often is in a generated video — the reconstruction breaks down and the error spikes.
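A rough Keras sketch of that layout follows. The layer counts, filter sizes, latent width and input resolution are all assumptions chosen to keep the example small; they are not the thesis configuration. The function name `build_conv_lstm_autoencoder` is hypothetical.

```python
import numpy as np
import tensorflow as tf

def build_conv_lstm_autoencoder(seq_len=8, size=64, channels=3, latent=64):
    """Per-frame conv encoder, LSTM across frames, per-frame conv decoder."""
    if size % 4 != 0:
        raise ValueError("size must be divisible by 4 (two stride-2 convolutions)")
    L = tf.keras.layers
    small = size // 4  # spatial size after two stride-2 convolutions

    inputs = tf.keras.Input(shape=(seq_len, size, size, channels))
    # Spatial encoder, applied to every frame independently.
    x = L.TimeDistributed(
        L.Conv2D(16, 3, strides=2, padding="same", activation="relu"))(inputs)
    x = L.TimeDistributed(
        L.Conv2D(32, 3, strides=2, padding="same", activation="relu"))(x)
    x = L.TimeDistributed(L.Flatten())(x)
    # Temporal model: one feature vector per frame, read in order.
    x = L.LSTM(latent, return_sequences=True)(x)
    # Spatial decoder, again applied per frame.
    x = L.TimeDistributed(L.Dense(small * small * 32, activation="relu"))(x)
    x = L.TimeDistributed(L.Reshape((small, small, 32)))(x)
    x = L.TimeDistributed(
        L.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"))(x)
    outputs = L.TimeDistributed(
        L.Conv2DTranspose(channels, 3, strides=2, padding="same",
                          activation="sigmoid"))(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Tiny configuration so the sketch runs quickly on a CPU.
model = build_conv_lstm_autoencoder(seq_len=4, size=32)
dummy = np.zeros((1, 4, 32, 32, 3), dtype="float32")
rebuilt = model.predict(dummy, verbose=0)
```

Training such a model would use the same real clips as both input and target, for example `model.fit(real_clips, real_clips)`, so the only thing it ever learns is how to rebuild real footage.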

04 The pipeline, end to end

Frame extraction & face cropping

OpenCV pulls frames from the uploaded video and isolates the face region, so the model focuses on where manipulation actually happens.
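The bookkeeping around this step can be sketched without OpenCV itself. The two helpers below are hypothetical: one picks evenly spaced frame indices to read, the other grows a face detector's bounding box by a margin and clips it to the frame. The 20% margin is an assumption, and the actual video reading and face detection calls are left out.

```python
def sample_frame_indices(total_frames, n_samples):
    """Pick n_samples frame indices spread evenly across the clip."""
    if total_frames <= 0 or n_samples <= 0:
        return []
    n = min(n_samples, total_frames)
    step = total_frames / n
    # Take the middle frame of each of the n equal segments.
    return [int(step * i + step / 2) for i in range(n)]

def padded_face_box(x, y, w, h, frame_w, frame_h, margin=0.2):
    """Grow an (x, y, w, h) detection by a margin, clipped to the frame.

    Returns (left, top, right, bottom) suitable for array slicing:
    frame[top:bottom, left:right].
    """
    pad_x, pad_y = int(w * margin), int(h * margin)
    left, top = max(0, x - pad_x), max(0, y - pad_y)
    right = min(frame_w, x + w + pad_x)
    bottom = min(frame_h, y + h + pad_y)
    return left, top, right, bottom

indices = sample_frame_indices(total_frames=100, n_samples=10)
box = padded_face_box(50, 40, 100, 100, frame_w=640, frame_h=480)
```

Padding the box keeps the blending boundary around the face inside the crop, which is where many spatial artifacts tend to sit.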

Sequence assembly

Cropped faces are grouped into short ordered sequences — the unit the LSTM reasons over.
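One simple way to build those sequences is a sliding window over the ordered crops. This is a sketch under assumptions: the helper name `make_sequences`, the window length and the stride are illustrative, and trailing frames that do not fill a whole window are dropped.

```python
def make_sequences(frames, seq_len, stride=None):
    """Group an ordered list of frames into fixed-length windows.

    stride defaults to seq_len (non-overlapping windows). Frames left over
    at the end that cannot fill a full window are discarded.
    """
    if stride is None:
        stride = seq_len
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    last_start = len(frames) - seq_len
    return [frames[i:i + seq_len] for i in range(0, last_start + 1, stride)]

# Integers stand in for cropped face images here.
frames = list(range(10))
overlapping = make_sequences(frames, seq_len=4, stride=2)
non_overlapping = make_sequences(frames, seq_len=4)
```

Overlapping windows give the model more sequences per clip at the cost of extra compute, which matters when everything runs on a CPU.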

Reconstruction

The Convolutional Autoencoder + LSTM, trained only on real footage from FaceForensics++, compresses and rebuilds each sequence.

Scoring & verdict

Each clip's reconstruction error is compared against a threshold. Above it, the clip is flagged as likely fake; at or below it, the clip is treated as real. That real-or-fake verdict is returned to the user.
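A minimal sketch of the scoring step, assuming the clip-level score is the mean of the per-sequence reconstruction errors. That aggregation rule and the name `clip_verdict` are assumptions for illustration; a maximum or a median over sequences would be equally plausible choices.

```python
from statistics import mean

def clip_verdict(sequence_errors, threshold):
    """Turn per-sequence reconstruction errors into one clip-level label."""
    if not sequence_errors:
        raise ValueError("need at least one sequence error")
    clip_error = mean(sequence_errors)
    # Strictly above the threshold counts as an anomaly.
    label = "FAKE" if clip_error > threshold else "REAL"
    return label, clip_error

# Made-up error values, purely to exercise the function.
real_label, real_error = clip_verdict([0.01, 0.02, 0.03], threshold=0.05)
fake_label, fake_error = clip_verdict([0.08, 0.10], threshold=0.05)
```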

05 Does it actually work?

I evaluated it on the FaceForensics++ (C23) dataset — compressed, real-world-quality video, which is exactly where a lot of unsupervised detectors fall apart. The system held up well on both real and manipulated clips, and the part I care about most is why it held up.

It clearly outperformed autoencoder-only baselines — the kind that look at single frames in isolation. The temporal LSTM layer is doing real work: a fake that looks clean frame by frame still betrays itself in how it moves across frames, and that's the signal a spatial-only model misses entirely. Just as importantly, it runs on a plain CPU with no GPU required, which is the whole reason the web app is practical to deploy at all.

Honest about the limits

It isn't perfect — some fakes still slip through, and the decision threshold needs calibration per compression level. Reconstruction error is a spectrum, not a clean line. I'd rather state that plainly than oversell the system.
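One common way to calibrate such a threshold, sketched here as an assumption rather than the method used in the thesis, is to take a high percentile of the reconstruction errors measured on held-out real clips, separately for each compression level. The function name, the 95th-percentile choice and the error values are all illustrative.

```python
import numpy as np

def calibrate_thresholds(real_errors_by_level, percentile=95.0):
    """Pick one threshold per compression level from real-clip errors.

    With percentile=95, roughly 5% of real clips at each level would be
    flagged as fake, which fixes the false-positive rate by construction.
    """
    return {
        level: float(np.percentile(errors, percentile))
        for level, errors in real_errors_by_level.items()
    }

# Made-up errors for two FaceForensics++ compression levels.
real_errors = {
    "c23": [0.010, 0.012, 0.011, 0.015, 0.013],
    "c40": [0.020, 0.026, 0.022, 0.030, 0.024],
}
thresholds = calibrate_thresholds(real_errors, percentile=95.0)
```

Heavier compression usually raises reconstruction error even on real footage, so a single global threshold would either miss fakes at light compression or over-flag real clips at heavy compression.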

06 Wrapping it in something usable

A model in a notebook isn't a project; it's a draft. So I deployed it as a Flask web application: a user logs in, uploads an MP4 or AVI, and gets a REAL/FAKE verdict back with a confidence score. The app typically analyses 8–20 sampled frames per clip. Every prediction is stored per account in a SQLite database, which backs a full history dashboard. The point was to take it the full distance, from a research idea to something a non-technical person could click through.
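The per-account history can be sketched with Python's built-in `sqlite3` module. The table name, column names and helper functions below are hypothetical, not the app's actual schema, and the Flask routes and login handling are left out.

```python
import sqlite3

def init_db(conn):
    """Create the (hypothetical) predictions table if it is missing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS predictions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id INTEGER NOT NULL,
            filename TEXT NOT NULL,
            verdict TEXT NOT NULL CHECK (verdict IN ('REAL', 'FAKE')),
            confidence REAL NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )

def save_prediction(conn, user_id, filename, verdict, confidence):
    """Store one prediction; parameters are bound, never string-formatted."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO predictions (user_id, filename, verdict, confidence) "
            "VALUES (?, ?, ?, ?)",
            (user_id, filename, verdict, confidence),
        )

def history_for(conn, user_id):
    """Return one user's predictions, newest first."""
    return conn.execute(
        "SELECT filename, verdict, confidence FROM predictions "
        "WHERE user_id = ? ORDER BY id DESC",
        (user_id,),
    ).fetchall()

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
init_db(conn)
save_prediction(conn, 1, "a.mp4", "REAL", 0.91)
save_prediction(conn, 1, "b.avi", "FAKE", 0.77)
save_prediction(conn, 2, "c.mp4", "REAL", 0.88)
history = history_for(conn, 1)
```

Filtering by `user_id` in the query is what keeps one account's uploads out of another account's dashboard.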

What I'd build next

Add Grad-CAM so the app can show which facial regions drove a FAKE verdict — turning a black-box score into something a user can actually trust. After that: live-stream detection and separating manipulation types (Face2Face, FaceSwap) into their own classes.

07 What I took away

The lesson that stuck with me wasn't about model architecture — it was about framing. Treating detection as "recognise the fake" quietly assumes you'll always have examples of the fake. Treating it as "recognise the real" removes that dependency entirely. Same data, same tools, a very different and more durable system — just from asking the question the other way around.

That instinct — interrogate the framing before reaching for the model — is the one I carry into the AI work I do now.
