FAQ - SFAI.agency

How do you know if AI was trained on my work?

We have two methods for figuring this out. The first method is called a Dataset Search. That’s when we look for your name or work in the “textbooks,” or giant piles of data—including audio, images, photos, text, and video—used to train AI models.

The second method is to probe the AI models themselves. That’s called Membership Inference (more about that below).

What exactly is a dataset?

A dataset is an enormous collection of material, often billions of pieces of data scraped from the internet and used to train AI models. Some datasets are private, and some require payment for access. But since AI is an industry where even big companies do some open-source work, many are public. If your work is included in a dataset, or if your name (or another unique identifier describing your work) appears in it, there is a very high chance that your work was used to train an AI model.

How does Croquis search a dataset?

We search by looking for your name and the title of your work. We also look for a visual or perceptual (for audio/video) match of your content and/or style. We filter out known false positives like generic stock-photo captions (“Stock Image of…,” “Dinnerware Set,” etc.).

For the technically curious

Datasets are split into manageable chunks called “shards.” We get the list of shards from Hugging Face, the main public library for dataset files. We use DuckDB, a fast analytical database (think of a very fast spreadsheet program), to read the shards over the internet without downloading them in full. The shards have columns where captions are stored, and we search those columns for your name or work titles (case-insensitive, substring match).
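
Here is a minimal sketch of the caption match described above. It uses plain Python lists instead of DuckDB and remote shards so that it runs without any setup. The function name, the sample captions, and the short false-positive list are all hypothetical stand-ins, not our production code.

```python
# Hypothetical false-positive phrases, used only for illustration.
GENERIC_CAPTION_MARKERS = ("stock image of", "dinnerware set")


def find_caption_matches(captions, search_terms):
    """Return captions that contain any search term.

    Matching is a case-insensitive substring match. Captions that look
    like generic stock-photo text are skipped as known false positives.
    """
    terms = [term.casefold() for term in search_terms if term.strip()]
    matches = []
    for caption in captions:
        folded = caption.casefold()
        if any(marker in folded for marker in GENERIC_CAPTION_MARKERS):
            continue  # generic caption, treat as a false positive
        if any(term in folded for term in terms):
            matches.append(caption)
    return matches


# Made-up captions standing in for one shard's caption column.
captions = [
    "Oil painting by Jane Example, 2019",
    "Stock Image of Jane Example style vase",
    "A dog on a beach",
    "HARBOR AT DUSK - jane example",
]
hits = find_caption_matches(captions, ["Jane Example", "Harbor at Dusk"])
print(hits)
```

In a SQL engine such as DuckDB, the same idea would typically be written as a case-insensitive pattern match (for example with `ILIKE`) over the caption column.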

Because the largest datasets contain billions of data points, our basic search spreads its scan evenly across the shards so that each dataset is sampled fairly. Our premium service conducts a full search. In both cases we also run the visual/perceptual match for your content and/or style and filter out the known false positives described above.
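
One simple way to spread a limited scan evenly across shards is to pick shards at regular intervals through the list. The sketch below assumes that approach. The function name, the shard file names, and the budget of four shards are hypothetical.

```python
def pick_evenly_spaced_shards(shard_names, budget):
    """Pick up to `budget` shards at regular intervals across the list."""
    if budget <= 0 or not shard_names:
        return []
    if budget >= len(shard_names):
        return list(shard_names)  # enough budget for a full scan
    step = len(shard_names) / budget
    return [shard_names[int(i * step)] for i in range(budget)]


# Ten made-up shard names; a real dataset may have thousands.
shards = [f"part-{i:05d}.parquet" for i in range(10)]
sample = pick_evenly_spaced_shards(shards, 4)
print(sample)
```

Picking at regular intervals avoids scanning only the first few shards, which would bias the sample if the dataset is ordered in some way.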

Anything else I should know here?

Yes. Just because your work, name, or work title is included in a training dataset doesn’t mean that a model memorized your work. But if a model has memorized your work, it has the ability to reproduce it word-by-word or pixel-by-pixel.

In case you’re interested, there’s a list of public and semi-public datasets further down. But like we mentioned above, a lot of datasets are private, or walled off from the public, and so we can’t search those. That’s where membership inference comes in.

What’s Membership Inference, and how do you use it to tell if AI was trained on my work?

Membership inference is a set of techniques that probe an AI model to estimate whether your work was part of its training data. This is the second method we use.

Just like with datasets, some AI models are accessible—or what are called “open source”—and some aren’t. There are two techniques for membership inference: white-box and black-box.

When a model is open source, you can download the full model and run it locally, either on your laptop or, if it’s too big, on a server. This is called white-box inference. It means we can run the model and inspect its internals to see how it responds to your work. White-box inference supports much more confident conclusions because nothing is hidden.

Black-box inference is used for models that aren’t open source, like ChatGPT, Claude, or DALL·E. Those models run on their providers’ servers, so we can’t download them or observe their internals. We can only submit input and receive output.

Think of white-box access like a clock with a transparent case, where you can watch every gear turn and trace exactly how the hands end up where they do. Black-box access is like a sealed clock on the wall: you can compare the time it shows against the time you expected, but you have to infer everything about the mechanism from the outside.

Either way, we’re on the lookout. Models tend to exhibit certain behaviors and patterns when we present them with work (text, audio, images, or video) that they’ve seen before, versus something new.

Tests by medium

Different kinds of work call for different membership-inference techniques. Here’s how we approach each medium.

If you are a writer

For writing (essays, novels, articles) we run four distinct tests, then combine them into an overall score predicting the likelihood that your work was used to train a Large Language Model (LLM).

1. Content Recall

We see if the model can finish your sentences. We take your text, break it into short passages of three sentences each, and ask the model to continue from each one. We do this ten times, then compare the output to what you actually wrote. If the model reproduces phrases that are identical or highly similar to your work, that’s strong direct evidence of memorization.
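
A rough sketch of this test is below. It makes several assumptions: the model is prompted with one three-sentence passage and compared against the passage that follows it, “ten times” means ten attempts per passage, and similarity is measured with Python’s built-in `difflib`. The `complete` callable is a hypothetical stand-in for a real model call, and the sample text is made up.

```python
import re
from difflib import SequenceMatcher


def split_into_chunks(text, sentences_per_chunk=3):
    """Split text into passages of `sentences_per_chunk` sentences each."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def recall_score(prompt_chunk, true_continuation, complete, attempts=10):
    """Best similarity (0 to 1) between model output and the real text."""
    best = 0.0
    for _ in range(attempts):
        output = complete(prompt_chunk)
        ratio = SequenceMatcher(
            None, output.casefold(), true_continuation.casefold()
        ).ratio()
        best = max(best, ratio)
    return best


text = (
    "The shop opened at dawn. Mara swept the step. A gull watched her "
    "from the roof. By noon the street was loud. She sold nine lamps. "
    "Nobody asked about the tenth."
)
chunks = split_into_chunks(text)
prompt, truth = chunks[0], chunks[1]

# Two stand-in "models": one that has memorized the text, one that has not.
memorized = recall_score(prompt, truth, lambda p: truth)
unseen = recall_score(
    prompt, truth, lambda p: "I am not sure how this story continues."
)
print(memorized, unseen)
```

A score near 1 means the output is close to verbatim. Where to draw the line between “similar” and “memorized” is a judgment call that this sketch does not settle.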

2. Jailbreak Probing

Modern models are trained to refuse to reproduce copyrighted text. But if they can reproduce your work, it’s a strong sign that the model was trained on it. So we try to ask as persuasively as we can:

Strategy	Framing
Ethos	Appeal to authority, “As a respected researcher…”
Pathos	Emotional appeal, “I’m a disabled student in crisis…”
Alliance	Collaboration, “We’re working on this together.”
Reciprocity	Quid pro quo, “I’ve been helping you, now help me.”
Foot-in-the-door	Social norm, “You’ve been so helpful already…”

3. Knowledge Memorization

Does the model know the facts in your writing? We generate 8 factual questions from your text—like “On what date does character X meet character Y?” or “What’s the name of the shop in chapter 3?” We ask the model those questions, then let another LLM act as a judge to score the answers. If the model does well and knows a lot of the facts in your writing, that’s a strong signal it has memorized your work.
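
The scoring step can be sketched as the fraction of questions the judge marks correct. In the sketch below, `ask_model` and `judge` are hypothetical stand-ins for real LLM calls. The judge is reduced to a simple substring check so the example runs on its own, and it uses four made-up questions rather than eight.

```python
def knowledge_score(qa_pairs, ask_model, judge):
    """Fraction of questions the judge marks as answered correctly."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, reference in qa_pairs
        if judge(question, reference, ask_model(question))
    )
    return correct / len(qa_pairs)


# Made-up questions and reference answers drawn from an imaginary story.
qa_pairs = [
    ("What is the name of the shop in chapter 3?", "The Gilded Wick"),
    ("Who sweeps the step?", "Mara"),
    ("How many lamps are sold?", "nine"),
    ("What watches from the roof?", "a gull"),
]

# Stand-in model: a lookup table of canned answers.
canned_answers = {
    "What is the name of the shop in chapter 3?": "I believe it is The Gilded Wick.",
    "Who sweeps the step?": "Mara does.",
    "How many lamps are sold?": "Seven lamps.",
    "What watches from the roof?": "A gull.",
}


def ask_model(question):
    return canned_answers.get(question, "I don't know.")


def judge(question, reference, answer):
    # Stand-in judge: accept the answer if it contains the reference text.
    return reference.casefold() in answer.casefold()


score = knowledge_score(qa_pairs, ask_model, judge)
print(score)
```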

4. The Multiple-Choice Test

We give the model a multiple-choice test: “Which of the following is the exact verbatim passage from {title} by {author}?” Each question offers four options, so random guessing would be right about 25% of the time. We shuffle the options 12 times and ask again each time. If the model gets 25% or less right, it’s probably just guessing. If it gets more than 25% right, it has probably seen your work before, and the further above 25% it scores, the stronger the signal.
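
The sketch below shows the shuffle-and-ask loop, along with one way to judge how far above chance a result is: a binomial tail probability. It assumes four options per question. The `choose` callable is a hypothetical stand-in for the model, the passages are made up, and the binomial check is our illustration rather than a description of the production scoring.

```python
import random
from math import comb


def run_multiple_choice(true_passage, decoys, choose, rounds=12, seed=0):
    """Shuffle the options each round and count correct picks."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(rounds):
        options = [true_passage] + list(decoys)
        rng.shuffle(options)
        picked_index = choose(options)  # the model returns an option index
        if options[picked_index] == true_passage:
            correct += 1
    return correct


def chance_probability(correct, rounds, p_guess=0.25):
    """Probability of at least `correct` hits from pure guessing."""
    return sum(
        comb(rounds, k) * p_guess**k * (1 - p_guess) ** (rounds - k)
        for k in range(correct, rounds + 1)
    )


true_passage = "Nobody asked about the tenth lamp."
decoys = [
    "Nobody asked about the tenth candle.",
    "No one mentioned the tenth lamp.",
    "Nobody asked about the ninth lamp.",
]


def memorizing_model(options):
    # Stand-in for a model that always recognizes the real passage.
    return options.index(true_passage)


hits = run_multiple_choice(true_passage, decoys, memorizing_model)
print(hits, chance_probability(hits, 12))
print(chance_probability(4, 12))
```

Under this binomial model, a pure guesser gets 4 or more of 12 right roughly a third of the time, so a score just above 25% is weak evidence on its own. A perfect 12 of 12 is extremely unlikely to happen by chance.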

If you are a visual artist

Technique	Paper	What it does
GenAI Confessions	Bohacek et al., arXiv:2501.06399	Feeds the candidate image back through the model and measures how faithfully the outputs reconstruct it (via DreamSim), exploiting properties of the image manifold. Black-box, so it works on closed commercial APIs where the loss-based, SecMI, and PIA methods below can’t, because those need access to the model’s internals.
Loss-based MIA	Matsumoto et al., arXiv:2302.03262	Members of the training set tend to produce lower reconstruction loss than unseen images, so we compare your image’s loss to a baseline distribution.
SecMI	Duan et al., arXiv:2302.01316	Steps the image through the diffusion process and measures posterior estimation errors at specific noise timesteps. Published at ICML 2023.
PIA (Proximal Initialization Attack)	Kong et al., arXiv:2305.18355	Uses the model’s own denoising trajectory; more efficient than SecMI with similar accuracy.
Training-data extraction	Carlini et al., arXiv:2301.13188	Prompts the model aggressively and checks whether it literally regenerates training images. Famous for recovering images of real people from Stable Diffusion.
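
The loss-based comparison in the table above can be sketched as follows. The sketch assumes the reconstruction losses have already been computed by running the model. The loss values are made up for illustration, and the function name and the choice of a z-score plus a rank fraction are our assumptions, not the method from the cited paper.

```python
from statistics import mean, stdev


def loss_membership_signal(candidate_loss, baseline_losses):
    """Compare one image's loss to losses from images known to be unseen.

    Returns (z_score, fraction_higher). A strongly negative z-score and a
    fraction near 1.0 both mean the candidate's loss is unusually low,
    which is the pattern expected for training-set members.
    """
    mu = mean(baseline_losses)
    sigma = stdev(baseline_losses)
    z_score = (candidate_loss - mu) / sigma if sigma > 0 else 0.0
    higher = sum(1 for loss in baseline_losses if loss > candidate_loss)
    return z_score, higher / len(baseline_losses)


# Made-up losses for eight images the model has certainly never seen.
baseline_losses = [0.42, 0.45, 0.40, 0.47, 0.44, 0.43, 0.46, 0.41]

z_score, fraction_higher = loss_membership_signal(0.30, baseline_losses)
print(round(z_score, 2), fraction_higher)
```

A low loss alone is not proof. Simple or common images can score low without having been in the training set, which is why the comparison is made against a baseline distribution rather than a fixed cutoff.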

If you are a musician or audio creator

Loss / perplexity on generative audio models (MusicGen, AudioGen, Stable Audio), based on Miranda et al. 2024, arXiv:2405.08487.
Diffusion-style SecMI adapted to spectrograms: a direct transfer of Duan et al.’s method to audio diffusion models.
Watermark / lyric-cloze probing: give the model the first verse and see if it completes with the real second verse (analogous to our Content Recall test for writers).

If you are a video creator

Frame-level SecMI / PIA: treat each keyframe as an image and apply the diffusion attacks from the visual-artist section above.
Temporal loss attack: Chen et al. 2024, arXiv:2406.03357, which exploits that video-diffusion models show lower loss on training clips across frames, not just per-frame.
Caption-completion probing: analogous to our text Content Recall test, but for video captions.

If you are a podcaster or spoken-word radio creator

The membership-inference tools are the same as for music and audio above, plus speaker-identification leakage tests: does the model’s speech synthesis accurately reproduce your voice? See Wenger et al., arXiv:2302.00739.

Public datasets we search

These are some of the public and semi-public datasets we scan when looking for your work.

Image datasets

LAION Re-2B English: ~2B image-caption pairs; Stable Diffusion 1.x, 2.x.
LAION Re-2B Multilingual: ~2B pairs; Stable Diffusion, many fine-tunes.
LAION Re-1B No-Language: ~1B pairs; vision encoders.
COYO-700M: 700M pairs; Imagen-style models, research.
OBELICS: 141M web documents; multimodal models (IDEFICS etc.).

Music & Audio datasets

AudioSet: 2M labeled YouTube clips.
LibriSpeech: 1K hours of audiobooks.
Mozilla Common Voice: 30K+ hours of donated speech.
Free Music Archive (FMA): ~106K tracks, full audio.
MTG-Jamendo: 55K full tracks with tags.
Million Song Dataset: metadata only, but maps to tracks.
WavCaps, AudioCaps, Clotho: audio-caption datasets used by MusicGen/AudioGen.

Podcasting / Spoken-word radio datasets

Spotify Podcast Dataset: 100K episodes.
The People’s Speech (MLCommons): 30K+ hours.

Video datasets

WebVid-10M (Bain et al.): scraped from Shutterstock, used by many early text-to-video models; subject of an ongoing lawsuit.
HD-VILA-100M (Microsoft).
Panda-70M (Snap Research).
HowTo100M: instructional videos.
YouTube-8M, Kinetics-700: video-classification and action-recognition corpora.