anti-AI measures

mi
2025-11-15 15:43:32 +10:00
parent 1dac042d25
commit 14415dfcd2
5 changed files with 322 additions and 1 deletion

README.md

@@ -24,6 +24,13 @@ A Flask-based webcomic website with server-side rendering using Jinja2 templates
- [SEO Best Practices for Webcomics](#seo-best-practices-for-webcomics)
- [SEO Checklist for Launch](#seo-checklist-for-launch)
- [Common SEO Questions](#common-seo-questions)
- [Content Protection & AI Scraping Prevention](#content-protection--ai-scraping-prevention)
- [Protection Features](#protection-features)
- [Optional: Additional Protection Measures](#optional-additional-protection-measures)
- [Important Limitations](#important-limitations)
- [Customizing Your Terms](#customizing-your-terms)
- [Testing Your Protection](#testing-your-protection)
- [Reporting Violations](#reporting-violations)
- [Project Structure](#project-structure)
- [Setup](#setup)
- [Environment Variables](#environment-variables)
@@ -457,6 +464,143 @@ A: Hashtags don't directly affect search engine SEO, but they help social media
**Q: Should I create a blog for my comic?**
A: Optional, but regular blog content about your comic's development can improve SEO through fresh content and more keywords.
## Content Protection & AI Scraping Prevention
Sunday Comics includes built-in measures to discourage AI web scrapers from using your creative work for training machine learning models without permission.
### Protection Features
#### robots.txt Blocking
The dynamically generated `robots.txt` file blocks known AI crawlers while still allowing legitimate search engines:
**Blocked AI bots:**
- **GPTBot** & **ChatGPT-User** (OpenAI)
- **CCBot** (Common Crawl - used by many AI companies)
- **anthropic-ai** & **Claude-Web** (Anthropic)
- **Google-Extended** (Google's AI training crawler, separate from Googlebot)
- **PerplexityBot** (Perplexity AI)
- **Omgilibot**, **Diffbot**, **Bytespider**, **FacebookBot**, **ImagesiftBot**, **cohere-ai**
**Note:** Regular search engine crawlers (Googlebot, Bingbot, etc.) are still allowed so your comic can be discovered through search.
The robots.txt also includes a reference to your Terms of Service for transparency.
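For reference, a minimal sketch of what such a dynamic route can look like. This is illustrative, not the actual implementation in `app.py`: the `AI_BOTS` list simply mirrors the bots above, and an existing Flask `app` object plus a `SITE_URL` constant are assumed.
```python
# Illustrative sketch only; the real route in app.py may differ.
# Assumes an existing Flask `app` and a SITE_URL constant.
from flask import Response

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web",
    "Google-Extended", "PerplexityBot", "Omgilibot", "Diffbot",
    "Bytespider", "FacebookBot", "ImagesiftBot", "cohere-ai",
]

@app.route('/robots.txt')
def robots_txt():
    """Serve robots.txt: block AI crawlers, allow regular search engines."""
    lines = []
    for bot in AI_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Everyone else (Googlebot, Bingbot, etc.) remains allowed
    lines += ["User-agent: *", "Allow: /", ""]
    # Reference the Terms of Service for transparency
    lines.append(f"# Terms of Service: {SITE_URL}/terms")
    return Response("\n".join(lines), mimetype="text/plain")
```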
#### HTML Meta Tags
Every page includes meta tags that signal to AI scrapers not to use the content:
```html
<meta name="robots" content="noai, noimageai">
<meta name="googlebot" content="noai, noimageai">
```
- `noai` - Asks crawlers not to use the text content for AI training
- `noimageai` - Asks crawlers not to use the images (your comics) for AI training
#### Terms of Service
A comprehensive Terms of Service page at `/terms` legally prohibits:
- Using content for AI training or machine learning
- Scraping or harvesting content for datasets
- Creating derivative works using AI trained on your content
- Text and Data Mining (TDM) without permission
The Terms page is automatically linked in your footer and includes:
- Copyright protection assertions
- DMCA enforcement information
- TDM rights reservation (EU Directive 2019/790 Article 4)
- Clear permitted use guidelines
### Optional: Additional Protection Measures
#### HTTP Headers (Advanced)
To repeat the signal at the HTTP level, where it also covers non-HTML responses such as your comic images, you can add an `X-Robots-Tag` header. Add this to `app.py` after the imports:
```python
@app.after_request
def add_ai_blocking_headers(response):
    """Add headers to discourage AI scraping"""
    response.headers['X-Robots-Tag'] = 'noai, noimageai'
    return response
```
#### TDM Reservation File (Advanced)
Create a `/.well-known/tdmrep.json` endpoint to formally reserve Text and Data Mining rights. The well-known path and the `tdm-reservation` / `tdm-policy` keys follow the W3C TDM Reservation Protocol (TDMRep), so compliant crawlers can find and parse the file:
```python
@app.route('/.well-known/tdmrep.json')
def tdm_reservation():
    """TDM (Text and Data Mining) reservation in W3C TDMRep format"""
    from flask import jsonify
    return jsonify([
        {
            "location": "/",
            "tdm-reservation": 1,
            "tdm-policy": f"{SITE_URL}/terms"
        }
    ])
```
### Important Limitations
**These measures are voluntary** - they only work if AI companies respect them:
✅ **What this does:**
- Signals your intent to protect your content
- Provides legal grounding for DMCA takedowns
- Blocks responsible AI companies that honor robots.txt
- Makes your copyright stance clear to users and crawlers
❌ **What this doesn't do:**
- Cannot physically prevent determined bad actors from scraping
- Cannot remove data already scraped into existing datasets
- No guarantee all AI companies will honor these signals
**Companies that claim to honor robots.txt:**
- OpenAI (GPTBot blocking)
- Anthropic (anthropic-ai blocking)
- Google (Google-Extended blocking, separate from search)
### Customizing Your Terms
Edit `content/terms.md` to customize:
1. **Jurisdiction** - Add your country/state for legal clarity
2. **Permitted use** - Adjust what you allow (fan art, sharing, etc.)
3. **Contact info** - Automatically populated from `comics_data.py`
The Terms page uses Jinja2 template variables that pull from your configuration:
- `{{ copyright_name }}` - From `COPYRIGHT_NAME` in `comics_data.py`
- `{{ social_email }}` - From `SOCIAL_EMAIL` in `comics_data.py`
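As a sketch, the route that renders the page might look roughly like this. The names below are assumptions based on the configuration described above, and the real route may also convert `terms.md` from Markdown first:
```python
# Hypothetical sketch; the actual /terms route in app.py may differ.
# COPYRIGHT_NAME and SOCIAL_EMAIL are the comics_data.py values named above.
from flask import render_template
from comics_data import COPYRIGHT_NAME, SOCIAL_EMAIL

@app.route('/terms')
def terms():
    """Render the Terms of Service with values from comics_data.py."""
    return render_template(
        'terms.html',
        copyright_name=COPYRIGHT_NAME,
        social_email=SOCIAL_EMAIL,
    )
```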
### Testing Your Protection
**Verify robots.txt:**
```bash
curl https://yourcomic.com/robots.txt
```
You should see AI bot blocks and a link to your terms.
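The entries for each blocked bot take roughly this form (exact ordering and comments will depend on your configuration):
```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```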
**Check meta tags:**
View page source and look for:
```html
<meta name="robots" content="noai, noimageai">
```
**Validate Terms page:**
Visit `https://yourcomic.com/terms` to ensure it renders correctly.
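To run all three checks in one pass, a small script along these lines can help. This is a sketch that assumes the `requests` package is installed and that `BASE_URL` points at your deployed site:
```python
# Hypothetical smoke test; adjust BASE_URL to your deployed site.
import requests

BASE_URL = "https://yourcomic.com"

# robots.txt should block at least one known AI bot
robots = requests.get(f"{BASE_URL}/robots.txt", timeout=10).text
assert "GPTBot" in robots, "robots.txt is missing the AI bot blocks"

# every page should carry the noai/noimageai meta tags
home = requests.get(BASE_URL, timeout=10).text
assert 'content="noai, noimageai"' in home, "noai meta tags not found"

# the Terms page should render
assert requests.get(f"{BASE_URL}/terms", timeout=10).ok, "Terms page failed"
print("All protection checks passed")
```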
### Reporting Violations
If you discover your work in an AI training dataset or being used without permission:
1. **Document the violation** - Screenshots, URLs, timestamps
2. **Review their TOS** - Many AI services have content dispute processes
3. **Send DMCA takedown** - Your Terms of Service provides legal standing
4. **Contact the platform** - Use your `SOCIAL_EMAIL` from the Terms page
Resources:
- [US Copyright Office DMCA](https://www.copyright.gov/dmca/)
- [EU Copyright Directive](https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation)
## Project Structure
```