The Complete Detection Flow
When you visit a webpage with images, Qwip analyzes each image through a multi-step process that
never uploads your images to our servers. Here's exactly what happens:
1. Image Discovery
The extension uses a MutationObserver to detect when images appear on the page (including lazy-loaded images).
Small images (<128×128px) and huge images (>4096px) are automatically skipped to save resources.
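The discovery step can be sketched as a size gate applied to each image the MutationObserver reports. Only the 128px and 4096px cutoffs come from the text; the function name and observer wiring are illustrative:

```typescript
// Size gate for discovered images. The 128px/4096px cutoffs are from the
// docs; the function name and observer wiring below are illustrative.
function shouldAnalyze(width: number, height: number): boolean {
  const MIN_DIM = 128;  // skip icons, avatars, tracking pixels
  const MAX_DIM = 4096; // skip huge images to bound memory and compute
  return (
    width >= MIN_DIM && height >= MIN_DIM &&
    width <= MAX_DIM && height <= MAX_DIM
  );
}

// In the extension, a MutationObserver would feed this predicate as
// images (including lazy-loaded ones) appear in the DOM:
//   new MutationObserver(muts => { /* collect added <img> nodes */ })
//     .observe(document.body, { childList: true, subtree: true });
```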
2. Hash Computation (Local)
Your browser computes 6 cryptographic/perceptual hashes of the image using WebAssembly:
5 perceptual hashes (pHash variants) and 1 BLAKE3 content hash. This happens entirely in your browser.
3. Database Query (Optional)
If enabled, the BLAKE3 hash (just the hash, not the image) is sent to api.qwip.io to check if
this image has been analyzed before by the community. If found, you get instant results from the
community database.
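A minimal sketch of the hash-only lookup, assuming a hypothetical `/v1/lookup` endpoint; the path, query parameter, and response shape are illustrative, not the documented API:

```typescript
// Hypothetical lookup helper: only the 64-character BLAKE3 hex digest
// leaves the browser, never the image. The /v1/lookup path and the
// response shape are assumptions for this sketch.
interface CommunityResult {
  likelyAi: boolean;
  confidence: number;
  voteCount: number;
}

function buildLookupUrl(blake3Hex: string): string {
  if (!/^[0-9a-f]{64}$/.test(blake3Hex)) {
    throw new Error("expected a 64-character BLAKE3 hex digest");
  }
  return `https://api.qwip.io/v1/lookup?blake3=${blake3Hex}`;
}

// The query itself would be a single GET; a 404 simply means the
// community has not analyzed this image yet:
//   const res = await fetch(buildLookupUrl(hash));
//   if (res.status === 404) return null;
//   const result = (await res.json()) as CommunityResult;
```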
4. Color Pre-filter
Before running the ML model, a color-entropy analysis checks whether the image looks photorealistic.
Illustrations, cartoons, icons, and heavily-edited artwork are automatically skipped, reducing
false positives and avoiding wasted compute on non-photographic content.
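One way such a pre-filter can work is a Shannon-entropy check over a pixel histogram: photographs spread values across many bins, while flat cartoons and icons concentrate them. The 64-bin layout and the 3.0-bit threshold here are assumptions for the sketch, not Qwip's actual parameters:

```typescript
// Illustrative entropy pre-filter. Bin count and threshold are assumed;
// the extension's real analysis may differ.
function luminanceEntropy(rgb: Uint8Array): number {
  const bins = new Array(64).fill(0);
  const pixels = rgb.length / 3;
  for (let i = 0; i < rgb.length; i += 3) {
    // Standard luma weights for RGB -> luminance.
    const y = 0.299 * rgb[i] + 0.587 * rgb[i + 1] + 0.114 * rgb[i + 2];
    bins[Math.min(63, Math.floor(y / 4))]++;
  }
  let entropy = 0;
  for (const count of bins) {
    if (count === 0) continue;
    const p = count / pixels;
    entropy -= p * Math.log2(p);
  }
  return entropy; // in bits; a flat single-color image scores 0
}

function looksPhotographic(rgb: Uint8Array): boolean {
  return luminanceEntropy(rgb) > 3.0; // assumed threshold
}
```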
5. Local ML Inference
If the image is not in the database and passes the pre-filter, it is analyzed using ONNX Runtime Web.
The default model is MobileCLIP (256×256 input), with MobileNetV2 and Swin Transformer
also available. Inference runs entirely in your browser via WASM or WebGPU; no image data is transmitted.
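The inference step can be sketched in two parts: preprocessing pixels into the planar float32 layout CLIP-style encoders expect, then running the session. The mean/std values are the widely published CLIP normalization constants; whether Qwip's fine-tuned model uses exactly these, and the model/input names in the commented call, are assumptions:

```typescript
// CHW float32 preprocessing for a CLIP-style encoder. Mean/std are the
// commonly published CLIP constants (assumed to match Qwip's model).
const CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073];
const CLIP_STD = [0.26862954, 0.26130258, 0.27577711];

// Convert interleaved RGB bytes (HWC) into the planar, normalized
// float32 layout (1, 3, size, size) that the model expects.
function toClipTensor(rgb: Uint8Array, size = 256): Float32Array {
  const plane = size * size;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      out[c * plane + i] = (rgb[i * 3 + c] / 255 - CLIP_MEAN[c]) / CLIP_STD[c];
    }
  }
  return out;
}

// Inference via onnxruntime-web (model URL and tensor names assumed):
//   import * as ort from "onnxruntime-web";
//   const session = await ort.InferenceSession.create("mobileclip.onnx", {
//     executionProviders: ["webgpu", "wasm"], // WebGPU first, WASM fallback
//   });
//   const input = new ort.Tensor("float32", toClipTensor(rgb), [1, 3, 256, 256]);
//   const output = await session.run({ pixel_values: input });
```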
6. Visual Annotation
AI-generated images get a red border, while real images get a subtle green checkmark. The extension also
increments a counter showing how many AI images you've encountered.
7. Community Contribution (Optional)
If enabled, your detection result (hash + confidence + model) is anonymously contributed to the
community database to help future users. Multiple detections are aggregated using weighted averaging.
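A hypothetical shape for such a contribution, reflecting the "hash + confidence + model" description above; the field names and the `/v1/contribute` endpoint are assumptions, not the documented API:

```typescript
// Hypothetical contribution payload: a hash, the verdict, and the model
// name only -- no page URL, user ID, or image bytes. Field names and the
// endpoint below are assumptions for this sketch.
interface Contribution {
  blake3: string;
  likelyAi: boolean;
  confidence: number; // 0..1 from the local model
  model: string;      // e.g. "mobileclip"
}

function buildContribution(
  blake3: string, likelyAi: boolean, confidence: number, model: string
): Contribution {
  if (confidence < 0 || confidence > 1) {
    throw new Error("confidence must be in [0, 1]");
  }
  return { blake3, likelyAi, confidence, model };
}

// Sending it would be a single POST:
//   await fetch("https://api.qwip.io/v1/contribute", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify(buildContribution(hash, true, 0.93, "mobileclip")),
//   });
```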
The ML Models
Three models are available, selectable in the extension popup. All run entirely in your browser.
MobileCLIP Fake Detector (Default)
Model Specifications
Architecture: MobileCLIP (CLIP-based vision encoder)
Input Size: 256×256×3 RGB
Normalization: CLIP-style (mean/std per channel)
Output: Single probability (AI likelihood)
Inference Time: ~40ms (WASM) / ~15ms (WebGPU)
Model Size: 43MB
The default model uses a CLIP-based architecture fine-tuned to detect AI-generated images.
Its higher input resolution (256ร256) and semantic understanding make it the most robust option
for modern AI generators.
MobileNetV2 (Fast)
Model Specifications
Architecture: MobileNetV2
Input Size: 224×224×3 RGB
Normalization: ImageNet standard
Output: [AI, Real] class probabilities
Inference Time: ~20ms (WASM) / ~8ms (WebGPU)
Model Size: ~9MB
A lightweight, fast option best suited for lower-powered devices or when you want lower latency.
Slightly less accurate than MobileCLIP on newer generators.
Swin Transformer (High Accuracy)
Model Specifications
Architecture: Swin Transformer
Input Size: 224×224×3 RGB
Normalization: ImageNet standard
Output: [AI, Real] class probabilities
Inference Time: ~200ms (WASM) / ~60ms (WebGPU)
Model Size: 91MB
The most accurate model in the lineup. Recommended for cases where precision matters more than
speed. WebGPU acceleration is strongly recommended; WASM is noticeably slower.
Known Limitation: All models can produce false positives on heavily-edited real images
(vibrant photos, YouTube thumbnails, HDR shots). We're actively expanding training data with pre-2020
edited photography to reduce these cases.
Hash-Based Privacy System
Why Hashes Instead of Images?
Instead of uploading your images to our servers (which would be a privacy nightmare), we compute
mathematical "fingerprints" called hashes. These hashes can be used to identify similar images
without ever seeing the actual image content.
The 6 Hashes We Use
- Mean Hash (aHash-style): Average pixel values in an 8×8 grid
- Gradient Hash: Edge detection-based fingerprint
- Double Gradient Hash: Enhanced gradient with dual passes
- Block Hash: Block-median based hash
- DCT Hash: Discrete Cosine Transform-based
- BLAKE3 Content Hash: Exact cryptographic hash for deduplication
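As an illustration of how the simplest of these works, a mean hash can be computed from a downscaled 8×8 grayscale grid: each bit records whether a cell is brighter than the grid average, giving a 64-bit fingerprint that survives resizing and mild recompression. This sketch assumes the downscaling has already happened:

```typescript
// Illustrative mean-hash over a pre-downscaled 8x8 grayscale grid.
// Each of the 64 bits marks whether a cell exceeds the grid's average.
function meanHash(gray8x8: number[]): bigint {
  if (gray8x8.length !== 64) throw new Error("expected 64 grayscale cells");
  const avg = gray8x8.reduce((a, b) => a + b, 0) / 64;
  let hash = 0n;
  for (const value of gray8x8) {
    hash = (hash << 1n) | (value > avg ? 1n : 0n);
  }
  return hash; // a 64-bit fingerprint, like the decimal values shown below
}
```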
Example Hash Vector
{
  "mean": "18379468920823898112",
  "gradient": "18015498021093556224",
  "doubleGradient": "18374389475892961280",
  "block": "18302628773641904128",
  "dct": "18374673854875439104",
  "blake3": "a7f3d8c9e2b4f1a6..."  // 64 hex characters
}
These hashes allow us to detect if you've seen the same (or very similar) image before without
storing or transmitting the actual image. The BLAKE3 hash is used for exact matches, while the
perceptual hashes can detect near-duplicates and edited versions.
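Near-duplicate matching on perceptual hashes is typically done by Hamming distance, the number of differing bits between two 64-bit fingerprints. The distance threshold here is an illustrative choice, not Qwip's actual cutoff:

```typescript
// Hamming distance between two 64-bit perceptual hashes: small distance
// means visually similar images (e.g. a recompressed or lightly edited
// copy). The threshold of 10 bits is an assumption for this sketch.
function hammingDistance(a: bigint, b: bigint): number {
  let x = a ^ b;
  let bits = 0;
  while (x > 0n) {
    bits += Number(x & 1n);
    x >>= 1n;
  }
  return bits;
}

function isNearDuplicate(a: bigint, b: bigint, threshold = 10): boolean {
  return hammingDistance(a, b) <= threshold;
}
```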
Community Database
How It Works
The community database at api.qwip.io stores detection results contributed by users:
- No images stored: Only hashes and metadata
- No user tracking: Contributions are completely anonymous
- Vote aggregation: Multiple detections are combined using weighted averaging
- Open API: Anyone can query and contribute (rate-limited by IP)
Database Schema
Images Table
CREATE TABLE images (
  blake3_hash VARCHAR(64) PRIMARY KEY,
  hash_mean BIGINT,
  hash_gradient BIGINT,
  hash_double_gradient BIGINT,
  hash_block BIGINT,
  hash_dct BIGINT,
  likely_ai BOOLEAN,
  confidence FLOAT,
  vote_count INTEGER,
  model_used VARCHAR(50),
  first_seen TIMESTAMP,
  last_seen TIMESTAMP
);
When multiple users analyze the same image, their confidence scores are aggregated:
Vote Aggregation Algorithm
new_confidence = (old_confidence × vote_count + new_confidence) ÷ (vote_count + 1)
vote_count = vote_count + 1
This simple weighted average ensures that as more people analyze an image, the confidence score
becomes more reliable.
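The aggregation formula above is an incremental (running) mean, which can be written as a small pure function; the names here are illustrative:

```typescript
// Incremental mean over contributed confidence scores, matching the
// vote aggregation formula above. Names are illustrative.
interface Aggregate {
  confidence: number;
  voteCount: number;
}

function addVote(agg: Aggregate, newConfidence: number): Aggregate {
  return {
    confidence:
      (agg.confidence * agg.voteCount + newConfidence) / (agg.voteCount + 1),
    voteCount: agg.voteCount + 1,
  };
}
```

Because it only stores the running mean and the count, the database never needs to keep individual votes to stay consistent.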
Performance Optimizations
What We Do to Keep Things Fast
- Lazy loading detection: Only processes images when they become visible
- Intelligent caching: Results cached locally to avoid re-processing
- Size filtering: Skips tiny icons and massive images
- WebGPU acceleration: Optional GPU acceleration on Chrome 120+
- Race condition prevention: Atomic checks prevent duplicate processing
- Memory management: Aggressive cleanup of blob URLs and DOM elements
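The caching optimization can be sketched as a map keyed by the BLAKE3 content hash, so the same image is never hashed and classified twice in a session. This minimal in-memory version is an assumption; a real extension would also persist entries (e.g. via `chrome.storage`) and cap the cache size:

```typescript
// Minimal local result cache keyed by BLAKE3 hash. In-memory only;
// persistence and eviction are left out of this sketch.
interface CachedResult {
  likelyAi: boolean;
  confidence: number;
}

class ResultCache {
  private entries = new Map<string, CachedResult>();

  get(blake3: string): CachedResult | undefined {
    return this.entries.get(blake3);
  }

  set(blake3: string, result: CachedResult): void {
    this.entries.set(blake3, result);
  }
}
```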
Typical Performance
Processing Times (Average)
Color pre-filter: ~5ms
Hash computation: 5-10ms
Model inference (MobileCLIP, default): ~40ms WASM / ~15ms WebGPU
Model inference (MobileNetV2, fast): ~20ms WASM / ~8ms WebGPU
Model inference (Swin Transformer, accurate): ~200ms WASM / ~60ms WebGPU
Server query: 50-200ms (if not cached)
Total (typical, MobileCLIP + server): 50-260ms per image
What Happens Offline?
The core detection pipeline works completely offline. Here's what happens when you're not connected to the internet:
- ✅ Local ML inference still works (no internet needed)
- ✅ Images are still analyzed and annotated
- ✅ All privacy protections remain active
- ❌ Can't query the community database for cached results
- ❌ Can't contribute results back to the community
When you reconnect, pending contributions are NOT automatically sent (we never queue data without your knowledge).
Want More Technical Details?
Check out our full documentation and open-source code.