LLM · November 21, 2025

Wikipedia’s Signs of AI Writing is a better guide than most AI detectors

Wikipedia built a better AI writing detector than most AI detector startups

Wikipedia’s editors have published something the AI detection industry keeps missing: a practical guide to spotting LLM-written prose that people can actually use.

The page is called Signs of AI writing. It grew out of Project AI Cleanup, a volunteer effort Wikipedia started in 2023 after editors began dealing with a flood of AI-assisted submissions. The useful part is where the guide comes from: patterns editors kept seeing across a large volume of real edits.

That gives it more credibility than most detector products. It comes from editorial failure modes, not vendor slides.

The patterns will be familiar to anyone who’s reviewed AI-heavy docs, SEO copy, autogenerated bios, or messy knowledge base content.

What Wikipedia is flagging

The guide doesn’t fixate on a list of forbidden words. It looks at habits.

A few examples:

  • importance inflation, like “a significant milestone” or “part of a broader movement”
  • sentence endings that drift into -ing clauses, especially ones that explain significance, like “reflecting the continued relevance of…”
  • vague promotional adjectives, such as “breathtaking,” “scenic,” or “clean and modern”
  • long lists of weak media mentions used to manufacture notability

That’s a better frame than black-box detection because it tracks rhetorical structure, not a brittle token list.

Anyone who spends time with current LLM output has seen this style. The model keeps trying to explain why the facts matter, even when nobody asked. It pads. It flatters the subject. It smooths over uncertainty with polished filler. That habit is all over the training data, especially SEO pages, low-grade explainers, PR copy, and marketing sludge. So it shows up in generation too.

Wikipedia cares because that style is poison for encyclopedia writing. Developers should care because the same style leaks into product docs, support content, dataset curation, and internal knowledge systems.

Why these signals persist

A lot of AI detection pitches still treat this as a lexical problem. Find the suspicious word, score the paragraph, raise the alert. That worked badly last year. It still works badly now.

Models have improved at paraphrase. Vocabulary changes fast. Prompting can suppress obvious tells. The patterns that persist sit deeper than word choice.

Three things matter here.

Training data bias

LLMs are trained on a web corpus where neutral, citation-driven writing is badly outnumbered by content built to sell, persuade, summarize, and perform authority. So the model’s default prose often comes out slightly promotional, even when the prompt asks for straight description.

Wikipedia’s editors are documenting the residue of that training mix.

Alignment rewards explanatory overreach

RLHF and related tuning methods push models toward “helpful” behavior. In writing tasks, that often turns into gratuitous framing. The model adds significance, context, and takeaways whether they’re warranted or not.

That’s how you get sentences that end by “underscoring the importance” of something obvious. The system has learned that explicit relevance signals read as useful.

Syntax is harder to scrub than vocabulary

You can tell a model to stop saying “significant” or “notable.” That doesn’t reliably remove the deeper pattern. The sentence often keeps the same shape. You still get adjective-heavy noun phrases, abstract claims about relevance, and those trailing participle clauses that feel smooth on first read and empty on the second.

That’s why Wikipedia’s guide is smarter than most AI detector dashboards. It points to structures that survive paraphrase.

Stylometry, with editorial common sense

There’s a technical term for what Wikipedia is doing: stylometry. Measure the shape of the writing instead of hunting for one smoking gun.

That matters in production systems.

You can turn these cues into lightweight features:

  • adjective-to-noun ratio
  • frequency of abstract importance terms like relevance, significance, impact
  • sentence-final present participle clauses
  • generic opener phrases like “In recent years” or “Throughout history”
  • entity mentions backed by weak citations or trivial media coverage

That gives you a triage score. Not a verdict.

The distinction matters because automated AI detectors are still unreliable in exactly the way operators hate most. They’re overconfident when they’re wrong. They also tend to punish non-native writers, formal prose, and heavily edited text. A transparent feature-based system is less magical and more useful. Reviewers can see why a piece got flagged.

A simple implementation doesn’t need a giant model. spaCy, some regex, a citation-quality layer, and a few weighted heuristics will get you surprisingly far. A basic Python sketch is enough: count participle tails, score phrase matches, estimate adjective density, and add a weak signal for “media mention padding.”

That’s enough for a pre-publish lint pass or a moderation queue.
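
To make that concrete, here is a minimal sketch of that kind of scorer. It assumes spaCy’s en_core_web_sm model is installed; the phrase list, regexes, and weights are illustrative placeholders to calibrate against reviewed examples, not Wikipedia’s rules or a vetted detector.

```python
# Minimal heuristic scorer sketch: rhetorical-shape features, not a verdict.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Phrases that tend to signal importance inflation or generic framing
# (illustrative placeholders, not an exhaustive list).
PUFFERY_PHRASES = [
    "a significant milestone",
    "part of a broader movement",
    "underscoring the importance",
    "in recent years",
    "throughout history",
]

# Sentence-final present-participle clause: ", reflecting ..." / ", highlighting ..."
PARTICIPLE_TAIL = re.compile(r",\s+\w+ing\b[^.!?]*[.!?]\s*$")


def score_text(text: str) -> dict:
    """Return a triage score plus the raw feature values behind it."""
    doc = nlp(text)
    sentences = list(doc.sents)
    if not sentences:
        return {"score": 0.0, "phrase_hits": 0, "participle_tails": 0,
                "adjective_density": 0.0, "mention_hits": 0}

    lower = text.lower()
    phrase_hits = sum(lower.count(p) for p in PUFFERY_PHRASES)

    # Trailing participle clauses per sentence.
    tail_hits = sum(1 for s in sentences if PARTICIPLE_TAIL.search(s.text))

    # Adjective density: adjectives per alphabetic token.
    n_tokens = sum(1 for t in doc if t.is_alpha)
    n_adj = sum(1 for t in doc if t.pos_ == "ADJ")
    adj_density = n_adj / max(n_tokens, 1)

    # Weak proxy for "media mention padding".
    mention_hits = len(re.findall(r"\bfeatured in\b|\bcovered by\b", lower))

    # Illustrative weights; tune them on reviewer-labeled examples.
    score = (
        0.35 * min(phrase_hits / max(len(sentences), 1), 1.0)
        + 0.30 * min(tail_hits / max(len(sentences), 1), 1.0)
        + 0.20 * min(adj_density / 0.15, 1.0)   # 0.15 is an arbitrary reference density
        + 0.15 * min(mention_hits / 3, 1.0)
    )
    return {
        "score": round(score, 3),
        "phrase_hits": phrase_hits,
        "participle_tails": tail_hits,
        "adjective_density": round(adj_density, 3),
        "mention_hits": mention_hits,
    }


if __name__ == "__main__":
    sample = (
        "The festival is a significant milestone for the region, "
        "reflecting the continued relevance of local craft traditions."
    )
    print(score_text(sample))
```

The point of returning the raw feature values alongside the score is exactly the transparency argument above: a reviewer can see why a paragraph was flagged.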

Where this helps right away

Wikipedia’s use case is encyclopedic content, but the pattern carries.

Developer docs

AI-generated docs often explain importance when they should explain behavior. You get lines like “This feature plays a significant role in improving workflow efficiency” when the reader just needs to know what the flag does, what it returns, and what breaks if it’s unset.

That kind of prose wastes space and hides missing specifics. A style checker that flags puffery is useful even if you don’t care whether a human or model wrote the first draft.

CMS and publishing systems

If you run a content platform, the job usually isn’t “detect all AI.” It’s “catch low-quality synthetic text before it ships.”

A composite score built from rhetorical and citation features is a good filter for human review. It scales better than line-by-line manual editing, and it’s easier to defend than opaque classifier scores.

Dataset curation

This is the underrated part.

Teams scraping the web for training or evaluation corpora already know synthetic contamination is getting worse. Filtering obvious LLM fingerprints out of the corpus is basic hygiene now. Wikipedia’s signals help because they’re human-readable and measurable. You can run them before deduping, before quality scoring, before downstream labeling.
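
A sketch of that kind of pre-dedup filter, assuming the score_text() heuristic from the earlier sketch and a threshold you would calibrate on labeled samples from your own corpus:

```python
# Corpus-filtering sketch. Assumes score_text() from the scorer sketch above;
# the threshold is an illustrative placeholder, not a calibrated value.
SYNTHETIC_STYLE_THRESHOLD = 0.5

def filter_corpus(docs):
    """Yield (text, features) for documents that pass the style filter."""
    for text in docs:
        features = score_text(text)  # heuristic scorer from the earlier sketch
        if features["score"] < SYNTHETIC_STYLE_THRESHOLD:
            yield text, features
        # Rejected documents should be logged with their feature values so the
        # filter stays auditable instead of silently dropping data.
```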

Cleaner corpora matter. If your evaluation set is full of AI-padded prose, your model can look better than it is.

Education and research workflows

Detector screenshots have become a dead end in academic settings. They’re easy to challenge, and often deserve to be challenged. A transparent rubric built around citation quality, unsupported importance claims, and repeated rhetorical patterns is still imperfect, but at least it gives instructors and reviewers something concrete.

The limits

This approach is better than detector theater, but it has obvious failure modes.

Good human writers sometimes use polished, formal language. Academic prose is full of abstract nouns. Travel writing uses descriptive adjectives on purpose. Marketing pages are supposed to sound like marketing pages. Apply Wikipedia-style scoring blindly across all genres and you’ll flag plenty of legitimate text.

So domain-specific tuning matters. An encyclopedia article should have a very different threshold from a landing page, release note, or research summary.

There’s an adversarial angle too. Once people know which phrases get flagged, they’ll prompt models to suppress them. Some already do. That won’t erase the problem because syntax and citation behavior are harder to control consistently, but it will wipe out the easy wins.

Then there’s the social problem. Any “AI writing” flag can become a lazy proxy for “bad writing” or worse, “writing I don’t like.” Teams need an appeal path, reviewer training, and logs that show why content was scored the way it was.

If the system affects publishing, grading, moderation, or trust and safety decisions, auditability is mandatory.

How this fits into a content stack

Use it as a quality gate.

A sensible pipeline looks like this (a small routing sketch follows the list):

  1. Run cheap text features at draft or ingest time.
  2. Add source checks for citations and outbound links.
  3. Score paragraphs or sections, not just the full document.
  4. Route high-scoring items to human review.
  5. Feed reviewer outcomes back into thresholds.
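
A minimal sketch of steps 3 and 4, assuming the score_text() heuristic from earlier; the paragraph splitting, threshold, and review-queue hand-off are placeholders for whatever your stack actually uses.

```python
# Paragraph-level triage sketch: score each paragraph and collect the ones
# that should go to human review. Assumes score_text() from the earlier
# sketch; REVIEW_THRESHOLD is an illustrative placeholder.
REVIEW_THRESHOLD = 0.45

def triage(document: str) -> list[dict]:
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    flagged = []
    for i, paragraph in enumerate(paragraphs):
        features = score_text(paragraph)
        if features["score"] >= REVIEW_THRESHOLD:
            flagged.append({"paragraph_index": i, **features})
    return flagged  # hand this list to the human review queue (step 4)
```

Reviewer outcomes on the flagged paragraphs are what you would feed back into the threshold and weights in step 5.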

A few practical checks are low effort and high value; a minimal lint sketch follows the list:

  • warn on sentence endings like “..., reflecting the importance of”
  • flag paragraphs that claim relevance without an independent citation
  • cap adjective density in encyclopedia-style or documentation sections
  • downrank trivial “featured in” media lists from weak outlets
  • log feature values so editors can see patterns over time
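
As a sketch of the first and third checks, with illustrative regexes and an adjective-density value assumed to come from the scorer above; the citation checks would need whatever link metadata your CMS exposes.

```python
import re

# Pre-publish lint sketch for a couple of the checks above. The pattern list
# and density cap are illustrative assumptions, not a vetted rule set.
PARTICIPLE_IMPORTANCE = re.compile(
    r",\s+(reflecting|underscoring|highlighting|demonstrating)\s+the\s+"
    r"(importance|significance|relevance)\b",
    re.IGNORECASE,
)

def lint_paragraph(paragraph: str, adjective_density: float,
                   max_adj_density: float = 0.15) -> list[str]:
    """Return human-readable warnings; log them so editors can see patterns."""
    warnings = []
    if PARTICIPLE_IMPORTANCE.search(paragraph):
        warnings.append("trailing participle clause asserting importance")
    if adjective_density > max_adj_density:
        warnings.append(f"adjective density {adjective_density:.2f} exceeds cap")
    return warnings
```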

That’s boring infrastructure work. Good. This is exactly the kind of problem boring infrastructure should handle.

The broader lesson from Wikipedia is simple. The best defense against synthetic sludge may be a tighter editorial spec, expressed as code where it helps and enforced by humans where it matters.

Wikipedia got there because it had to. The rest of the web is still catching up.

What to watch

The main caveat is that a well-written guide does not prove durable production value. The practical test is whether teams can use these signals reliably, measure the benefit, control the failure modes, and justify the cost once the initial novelty wears off.
