anti-AI measures

mi
2025-11-15 15:43:32 +10:00
parent 1dac042d25
commit 14415dfcd2
5 changed files with 322 additions and 1 deletion

README.md

@@ -24,6 +24,13 @@ A Flask-based webcomic website with server-side rendering using Jinja2 templates
- [SEO Best Practices for Webcomics](#seo-best-practices-for-webcomics)
- [SEO Checklist for Launch](#seo-checklist-for-launch)
- [Common SEO Questions](#common-seo-questions)
- [Content Protection & AI Scraping Prevention](#content-protection--ai-scraping-prevention)
- [Protection Features](#protection-features)
- [Optional: Additional Protection Measures](#optional-additional-protection-measures)
- [Important Limitations](#important-limitations)
- [Customizing Your Terms](#customizing-your-terms)
- [Testing Your Protection](#testing-your-protection)
- [Reporting Violations](#reporting-violations)
- [Project Structure](#project-structure)
- [Setup](#setup)
- [Environment Variables](#environment-variables)
@@ -457,6 +464,143 @@ A: Hashtags don't directly affect search engine SEO, but they help social media
**Q: Should I create a blog for my comic?**
A: Optional, but regular blog content about your comic's development can improve SEO through fresh content and more keywords.
## Content Protection & AI Scraping Prevention
Sunday Comics includes built-in measures to discourage AI web scrapers from using your creative work for training machine learning models without permission.
### Protection Features
#### robots.txt Blocking
The dynamically generated `robots.txt` file blocks known AI crawlers while still allowing legitimate search engines:
**Blocked AI bots:**
- **GPTBot** & **ChatGPT-User** (OpenAI)
- **CCBot** (Common Crawl - used by many AI companies)
- **anthropic-ai** & **Claude-Web** (Anthropic)
- **Google-Extended** (Google's AI training crawler, separate from Googlebot)
- **PerplexityBot** (Perplexity AI)
- **Omgilibot**, **Diffbot**, **Bytespider**, **FacebookBot**, **ImagesiftBot**, **cohere-ai**
**Note:** Regular search engine crawlers (Googlebot, Bingbot, etc.) are still allowed so your comic can be discovered through search.
The robots.txt also includes a reference to your Terms of Service for transparency.
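For reference, a minimal sketch of what such a dynamic route can look like. This is illustrative, not the actual implementation in `app.py`: the `AI_BOTS` list simply mirrors the bots above, and an existing Flask `app` object plus a `SITE_URL` constant are assumed.
```python
# Illustrative sketch only; the real route in app.py may differ.
# Assumes an existing Flask `app` and a SITE_URL constant.
from flask import Response

AI_BOTS = [
    "GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web",
    "Google-Extended", "PerplexityBot", "Omgilibot", "Diffbot",
    "Bytespider", "FacebookBot", "ImagesiftBot", "cohere-ai",
]

@app.route('/robots.txt')
def robots_txt():
    """Serve robots.txt: block AI crawlers, allow regular search engines."""
    lines = []
    for bot in AI_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Everyone else (Googlebot, Bingbot, etc.) remains allowed
    lines += ["User-agent: *", "Allow: /", ""]
    # Reference the Terms of Service for transparency
    lines.append(f"# Terms of Service: {SITE_URL}/terms")
    return Response("\n".join(lines), mimetype="text/plain")
```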
#### HTML Meta Tags
Every page includes meta tags that signal to AI scrapers not to use the content:
```html
<meta name="robots" content="noai, noimageai">
<meta name="googlebot" content="noai, noimageai">
```
- `noai` - Asks crawlers not to use the text content for AI training
- `noimageai` - Asks crawlers not to use the images (your comics) for AI training
#### Terms of Service
A comprehensive Terms of Service page at `/terms` legally prohibits:
- Using content for AI training or machine learning
- Scraping or harvesting content for datasets
- Creating derivative works using AI trained on your content
- Text and Data Mining (TDM) without permission
The Terms page is automatically linked in your footer and includes:
- Copyright protection assertions
- DMCA enforcement information
- TDM rights reservation (EU Directive 2019/790 Article 4)
- Clear permitted use guidelines
### Optional: Additional Protection Measures
#### HTTP Headers (Advanced)
To repeat the signal at the HTTP level, where it also covers non-HTML responses such as your comic images, you can add an `X-Robots-Tag` header. Add this to `app.py` after the imports:
```python
@app.after_request
def add_ai_blocking_headers(response):
    """Add headers to discourage AI scraping"""
    response.headers['X-Robots-Tag'] = 'noai, noimageai'
    return response
```
#### TDM Reservation File (Advanced)
Create a `/.well-known/tdmrep.json` endpoint to formally reserve Text and Data Mining rights. The well-known path and the `tdm-reservation` / `tdm-policy` keys follow the W3C TDM Reservation Protocol (TDMRep), so compliant crawlers can find and parse the file:
```python
@app.route('/.well-known/tdmrep.json')
def tdm_reservation():
    """TDM (Text and Data Mining) reservation in W3C TDMRep format"""
    from flask import jsonify
    return jsonify([
        {
            "location": "/",
            "tdm-reservation": 1,
            "tdm-policy": f"{SITE_URL}/terms"
        }
    ])
```
### Important Limitations
**These measures are voluntary** - they only work if AI companies respect them:
✅ **What this does:**
- Signals your intent to protect your content
- Provides legal grounding for DMCA takedowns
- Blocks responsible AI companies that honor robots.txt
- Makes your copyright stance clear to users and crawlers
❌ **What this doesn't do:**
- Cannot physically prevent determined bad actors from scraping
- Cannot remove data already scraped into existing datasets
- No guarantee all AI companies will honor these signals
**Companies that claim to honor robots.txt:**
- OpenAI (GPTBot blocking)
- Anthropic (anthropic-ai blocking)
- Google (Google-Extended blocking, separate from search)
### Customizing Your Terms
Edit `content/terms.md` to customize:
1. **Jurisdiction** - Add your country/state for legal clarity
2. **Permitted use** - Adjust what you allow (fan art, sharing, etc.)
3. **Contact info** - Automatically populated from `comics_data.py`
The Terms page uses Jinja2 template variables that pull from your configuration:
- `{{ copyright_name }}` - From `COPYRIGHT_NAME` in `comics_data.py`
- `{{ social_email }}` - From `SOCIAL_EMAIL` in `comics_data.py`
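As a sketch, the route that renders the page might look roughly like this. The names below are assumptions based on the configuration described above, and the real route may also convert `terms.md` from Markdown first:
```python
# Hypothetical sketch; the actual /terms route in app.py may differ.
# COPYRIGHT_NAME and SOCIAL_EMAIL are the comics_data.py values named above.
from flask import render_template
from comics_data import COPYRIGHT_NAME, SOCIAL_EMAIL

@app.route('/terms')
def terms():
    """Render the Terms of Service with values from comics_data.py."""
    return render_template(
        'terms.html',
        copyright_name=COPYRIGHT_NAME,
        social_email=SOCIAL_EMAIL,
    )
```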
### Testing Your Protection
**Verify robots.txt:**
```bash
curl https://yourcomic.com/robots.txt
```
You should see AI bot blocks and a link to your terms.
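The entries for each blocked bot take roughly this form (exact ordering and comments will depend on your configuration):
```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```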
**Check meta tags:**
View page source and look for:
```html
<meta name="robots" content="noai, noimageai">
```
**Validate Terms page:**
Visit `https://yourcomic.com/terms` to ensure it renders correctly.
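To run all three checks in one pass, a small script along these lines can help. This is a sketch that assumes the `requests` package is installed and that `BASE_URL` points at your deployed site:
```python
# Hypothetical smoke test; adjust BASE_URL to your deployed site.
import requests

BASE_URL = "https://yourcomic.com"

# robots.txt should block at least one known AI bot
robots = requests.get(f"{BASE_URL}/robots.txt", timeout=10).text
assert "GPTBot" in robots, "robots.txt is missing the AI bot blocks"

# every page should carry the noai/noimageai meta tags
home = requests.get(BASE_URL, timeout=10).text
assert 'content="noai, noimageai"' in home, "noai meta tags not found"

# the Terms page should render
assert requests.get(f"{BASE_URL}/terms", timeout=10).ok, "Terms page failed"
print("All protection checks passed")
```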
### Reporting Violations
If you discover your work in an AI training dataset or being used without permission:
1. **Document the violation** - Screenshots, URLs, timestamps
2. **Review their TOS** - Many AI services have content dispute processes
3. **Send DMCA takedown** - Your Terms of Service provides legal standing
4. **Contact the platform** - Use your `SOCIAL_EMAIL` from the Terms page
Resources:
- [US Copyright Office DMCA](https://www.copyright.gov/dmca/)
- [EU Copyright Directive](https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation)
## Project Structure
```