❎ anti-AI measures

2025-11-15 15:43:32 +10:00
parent 1dac042d25
commit 14415dfcd2
5 changed files with 322 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -24,6 +24,13 @@ A Flask-based webcomic website with server-side rendering using Jinja2 templates
  - [SEO Best Practices for Webcomics](#seo-best-practices-for-webcomics)
  - [SEO Checklist for Launch](#seo-checklist-for-launch)
  - [Common SEO Questions](#common-seo-questions)
 - [Content Protection & AI Scraping Prevention](#content-protection--ai-scraping-prevention)
  - [Protection Features](#protection-features)
  - [Optional: Additional Protection Measures](#optional-additional-protection-measures)
  - [Important Limitations](#important-limitations)
  - [Customizing Your Terms](#customizing-your-terms)
  - [Testing Your Protection](#testing-your-protection)
  - [Reporting Violations](#reporting-violations)
 - [Project Structure](#project-structure)
 - [Setup](#setup)
 - [Environment Variables](#environment-variables)
@@ -457,6 +464,143 @@ A: Hashtags don't directly affect search engine SEO, but they help social media
 **Q: Should I create a blog for my comic?**
 A: Optional, but regular blog content about your comic's development can improve SEO through fresh content and more keywords.
 ## Content Protection & AI Scraping Prevention
 Sunday Comics includes built-in measures to discourage AI web scrapers from using your creative work for training machine learning models without permission.
 ### Protection Features
 #### robots.txt Blocking
 The dynamically generated `robots.txt` file blocks known AI crawlers while still allowing legitimate search engines:
 **Blocked AI bots:**
 - **GPTBot** & **ChatGPT-User** (OpenAI)
 - **CCBot** (Common Crawl - used by many AI companies)
 - **anthropic-ai** & **Claude-Web** (Anthropic)
 - **Google-Extended** (Google's AI training crawler, separate from Googlebot)
 - **PerplexityBot** (Perplexity AI)
 - **Omgilibot**, **Diffbot**, **Bytespider**, **FacebookBot**, **ImagesiftBot**, **cohere-ai**
 **Note:** Regular search engine crawlers (Googlebot, Bingbot, etc.) are still allowed so your comic can be discovered through search.
 The robots.txt also includes a reference to your Terms of Service for transparency.
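As a rough illustration of the blocking rules described above, the per-crawler groups can be generated from a list rather than written out by hand. This is a minimal sketch, not how `app.py` does it (the route hard-codes the rules); the names `AI_USER_AGENTS` and `build_ai_block_rules` are hypothetical.

```python
# Hypothetical helper: app.py hard-codes these rules inside the robots() route.
AI_USER_AGENTS = [
    "GPTBot", "ChatGPT-User", "CCBot", "anthropic-ai", "Claude-Web",
    "Google-Extended", "PerplexityBot", "Omgilibot", "Diffbot",
    "Bytespider", "FacebookBot", "ImagesiftBot", "cohere-ai",
]


def build_ai_block_rules(user_agents):
    """Return one 'User-agent / Disallow: /' group per crawler name."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in user_agents]
    # Groups are separated by a blank line, as in a hand-written robots.txt.
    return "\n\n".join(groups) + "\n"


print(build_ai_block_rules(AI_USER_AGENTS))
```

Keeping the names in one list would make it easier to add a crawler later without touching the template string.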
 #### HTML Meta Tags
 Every page includes meta tags that signal to AI scrapers not to use the content:
 ```html
 <meta name="robots" content="noai, noimageai">
 <meta name="googlebot" content="noai, noimageai">
 ```
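 - `noai` - Asks crawlers not to use the page's text for AI training
 - `noimageai` - Asks crawlers not to use the page's images (your comics) for AI training
 **Note:** `noai` and `noimageai` are non-standard directives; crawlers that don't recognize them will ignore them.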
 #### Terms of Service
 A Terms of Service page at `/terms` prohibits:
 - Using content for AI training or machine learning
 - Scraping or harvesting content for datasets
 - Creating derivative works using AI trained on your content
 - Text and Data Mining (TDM) without permission
 The Terms page is automatically linked in your footer and includes:
 - Copyright protection assertions
 - DMCA enforcement information
 - TDM rights reservation (EU Directive 2019/790 Article 4)
 - Clear permitted use guidelines
 ### Optional: Additional Protection Measures
 #### HTTP Headers (Advanced)
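 To send the same signal with every response (including image files, if Flask serves them), you can add an HTTP header. Add this to `app.py` after the `app` object is created, since the decorator needs it: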
 ```python
@app.after_request
 def add_ai_blocking_headers(response):
    """Add headers to discourage AI scraping"""
    response.headers['X-Robots-Tag'] = 'noai, noimageai'
    return response
 ```
 #### TDM Reservation File (Advanced)
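 Serve a `/.well-known/tdmrep.json` file to formally reserve Text and Data Mining rights. The TDM Reservation Protocol (TDMRep) expects the file at that well-known path, containing a list of rules:
 ```python
 from flask import jsonify

 @app.route('/.well-known/tdmrep.json')
 def tdm_reservation():
    """TDM (Text and Data Mining) reservation for the whole site"""
    # One rule covering every path; "tdm-reservation": 1 means rights are reserved
    return jsonify([
        {
            "location": "/",
            "tdm-reservation": 1,
            "tdm-policy": f"{SITE_URL}/terms"
        }
    ])
 ```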
 ### Important Limitations
 **These measures are voluntary** - they only work if AI companies respect them:
 ✅ **What this does:**
 - Signals your intent to protect your content
 - Documents your terms, which supports DMCA takedown requests
 - Keeps out crawlers from AI companies that honor robots.txt
 - Makes your copyright stance clear to users and crawlers
 ❌ **What this doesn't do:**
 - Cannot physically prevent determined bad actors from scraping
 - Cannot remove already-scraped historical data from existing datasets
 - No guarantee all AI companies will honor these signals
 **Companies that claim to honor robots.txt:**
 - OpenAI (GPTBot blocking)
 - Anthropic (anthropic-ai blocking)
 - Google (Google-Extended blocking, separate from search)
 ### Customizing Your Terms
 Edit `content/terms.md` to customize:
 1. **Jurisdiction** - Add your country/state for legal clarity
 2. **Permitted use** - Adjust what you allow (fan art, sharing, etc.)
 3. **Contact info** - Automatically populated from `comics_data.py`
 The Terms page uses Jinja2 template variables that pull from your configuration:
 - `{{ copyright_name }}` - From `COPYRIGHT_NAME` in `comics_data.py`
 - `{{ social_email }}` - From `SOCIAL_EMAIL` in `comics_data.py`
 ### Testing Your Protection
 **Verify robots.txt:**
 ```bash
 curl https://yourcomic.com/robots.txt
 ```
 You should see AI bot blocks and a link to your terms.
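The same check can be scripted with Python's standard `urllib.robotparser`. The sketch below parses a small sample instead of fetching the live file; `SAMPLE_ROBOTS_TXT` and `can_crawl` are illustrative names, and the sample only mirrors the structure of the generated file, so paste the real `curl` output in its place when testing your site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative sample only; replace with the output of the curl command.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = can_crawl(SAMPLE_ROBOTS_TXT, agent, "https://yourcomic.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

This only confirms what the file says; it cannot tell you whether a given crawler actually obeys it.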
 **Check meta tags:**
 View page source and look for:
 ```html
 <meta name="robots" content="noai, noimageai">
 ```
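If you would rather not read the page source by eye, a small script can pull the directives out of the HTML. This is a sketch using the standard `html.parser` module; `RobotsMetaParser`, `robots_directives`, and `SAMPLE_HTML` are hypothetical names, and the sample string stands in for a page you have downloaded.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # Directives are comma-separated; normalize case and whitespace.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Stand-in for a downloaded page.
SAMPLE_HTML = '<head><meta name="robots" content="noai, noimageai"></head>'
found = robots_directives(SAMPLE_HTML)
print("noai" in found and "noimageai" in found)
```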
 **Validate Terms page:**
 Visit `https://yourcomic.com/terms` to ensure it renders correctly.
 ### Reporting Violations
 If you discover your work in an AI training dataset or being used without permission:
 1. **Document the violation** - Screenshots, URLs, timestamps
 2. **Review their TOS** - Many AI services have content dispute processes
 3. **Send DMCA takedown** - Your copyright, backed by the published Terms of Service, is the basis for the notice
 4. **Contact the platform** - Write from the `SOCIAL_EMAIL` address listed on your Terms page
 Resources:
 - [US Copyright Office DMCA](https://www.copyright.gov/dmca/)
 - [EU Copyright Directive](https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation)
 ## Project Structure
 ```
--- a/app.py
+++ b/app.py
@@ -217,6 +217,28 @@ def about():
    return render_template('page.html', title='About', content=html_content)
@app.route('/terms')
 def terms():
    """Terms of Service page"""
    from jinja2 import Template
    # Read and render the markdown file with template variables
    terms_path = os.path.join(os.path.dirname(__file__), 'content', 'terms.md')
    try:
        with open(terms_path, 'r', encoding='utf-8') as f:
            content = f.read()
        # First render as Jinja template to substitute variables
        template = Template(content)
        rendered_content = template.render(
            copyright_name=COPYRIGHT_NAME,
            social_email=SOCIAL_EMAIL if SOCIAL_EMAIL else '[Contact Email]'
        )
        # Then convert markdown to HTML
        html_content = markdown.markdown(rendered_content)
    except FileNotFoundError:
        html_content = '<p>Terms of Service content not found.</p>'
    return render_template('page.html', title='Terms of Service', content=html_content)
@app.route('/api/comics')
 def api_comics():
    """API endpoint - returns all comics as JSON"""
@@ -244,6 +266,9 @@ def robots():
    """Generate robots.txt dynamically with correct SITE_URL"""
    from flask import Response
    robots_txt = f"""# Sunday Comics - Robots.txt
 # Content protected by copyright. AI training prohibited.
 # See terms: {SITE_URL}/terms
 User-agent: *
 Allow: /
@@ -252,6 +277,46 @@ Sitemap: {SITE_URL}/sitemap.xml
 # Disallow API endpoints from indexing
 Disallow: /api/
 # Block AI crawlers and scrapers
 User-agent: GPTBot
 Disallow: /
 User-agent: ChatGPT-User
 Disallow: /
 User-agent: CCBot
 Disallow: /
 User-agent: anthropic-ai
 Disallow: /
 User-agent: Claude-Web
 Disallow: /
 User-agent: Google-Extended
 Disallow: /
 User-agent: PerplexityBot
 Disallow: /
 User-agent: Omgilibot
 Disallow: /
 User-agent: Diffbot
 Disallow: /
 User-agent: Bytespider
 Disallow: /
 User-agent: FacebookBot
 Disallow: /
 User-agent: ImagesiftBot
 Disallow: /
 User-agent: cohere-ai
 Disallow: /
 """
    return Response(robots_txt, mimetype='text/plain')
--- a/content/terms.md
+++ b/content/terms.md
@@ -0,0 +1,93 @@
 # Terms of Service
 **Last Updated:** January 2025
 By accessing and using this website, you agree to be bound by these Terms of Service. If you do not agree to these terms, please do not use this site.
 ## Copyright and Ownership
 All comics, artwork, text, graphics, and other content on this website are protected by copyright and owned by {{ copyright_name }}. All rights reserved.
 ## Permitted Use
 **Personal Use:** You may:
 - Read and enjoy the comics for personal, non-commercial purposes
 - Share links to individual comic pages on social media
 - Embed comics on personal websites with proper attribution and a link back to the original
 **Attribution Required:** When sharing or embedding, you must:
 - Provide clear credit to {{ copyright_name }}
 - Include a link back to this website
 - Not alter, crop, or modify the comic images
 ## Prohibited Use
 You are **expressly prohibited** from:
 ### AI Training and Machine Learning
 - Using any content from this site for training artificial intelligence models
 - Scraping, crawling, or harvesting content for machine learning purposes
 - Including any images, text, or data in AI training datasets
 - Using content to develop, train, or improve generative AI systems
 - Creating derivative works using AI trained on this content
 ### Commercial Use
 - Reproducing, distributing, or selling comics without explicit written permission
 - Using comics or artwork for commercial purposes without a license
 - Printing comics on merchandise (t-shirts, mugs, etc.) without authorization
 ### Modification and Redistribution
 - Altering, editing, or creating derivative works from the comics
 - Removing watermarks, signatures, or attribution
 - Rehosting images on other servers or websites
 - Claiming comics as your own work
 ## Data Mining and Web Scraping
 **Automated Access Prohibition:** Automated scraping, crawling, or systematic downloading of content is strictly prohibited without prior written consent. This includes but is not limited to:
 - Web scrapers and bots (except authorized search engines)
 - Automated downloads of images or data
 - RSS feed abuse or bulk downloading
 - Any form of data harvesting for commercial purposes
 **Text and Data Mining (TDM) Reservation:** We formally reserve all rights under applicable copyright law regarding text and data mining, including but not limited to EU Directive 2019/790 Article 4. No TDM exceptions apply to this content.
 ## DMCA and Copyright Enforcement
 Unauthorized use of copyrighted material from this site may violate copyright law and be subject to legal action under the Digital Millennium Copyright Act (DMCA) and other applicable laws.
 If you discover unauthorized use of content from this site, please report it to {{ social_email }}.
 ## Fair Use
 Limited use for purposes of commentary, criticism, news reporting, teaching, or research may qualify as fair use. If you believe your use qualifies as fair use, please contact us first.
 ## License Requests
 If you wish to use content in ways not permitted by these terms, please contact us to discuss licensing arrangements.
 ## Privacy
 We respect your privacy. This site may use cookies for basic functionality and analytics. We do not sell personal information to third parties.
 ## External Links
 This site may contain links to external websites. We are not responsible for the content or practices of third-party sites.
 ## Modifications to Terms
 We reserve the right to modify these Terms of Service at any time. Changes will be posted on this page with an updated "Last Updated" date.
 ## Contact
 For questions about these terms, licensing requests, or to report copyright violations:
 {{ social_email }}
 ## Governing Law
 These Terms of Service are governed by applicable copyright law and the laws of [Your Jurisdiction].
 ---
 **Summary:** You can read and share links to comics, but you cannot use them for AI training, scrape the site, use them commercially, or create modified versions without permission.
--- a/static/css/style.css
+++ b/static/css/style.css
@@ -754,7 +754,8 @@ main {
        gap: var(--space-sm);
    }
-    .footer-bottom p {
+    .footer-bottom p,
    .footer-terms {
        flex-basis: 100%;
        text-align: center;
    }
@@ -963,6 +964,18 @@ footer {
    text-decoration: underline;
 }
 .footer-terms {
    color: var(--color-text);
    text-decoration: none;
    font-size: var(--font-size-md);
    transition: opacity 0.2s ease;
 }
 .footer-terms:hover {
    text-decoration: underline;
    opacity: 0.8;
 }
 /* Compact Footer Mode */
 footer.compact-footer {
    border-top: none;
--- a/templates/base.html
+++ b/templates/base.html
@@ -9,6 +9,10 @@
    <meta name="description" content="{% block meta_description %}A webcomic about life, the universe, and everything{% endblock %}">
    <link rel="canonical" href="{% block canonical %}{{ site_url }}{{ request.path }}{% endblock %}">
    <!-- AI Scraping Prevention -->
    <meta name="robots" content="noai, noimageai">
    <meta name="googlebot" content="noai, noimageai">
    <!-- Open Graph / Facebook -->
    <meta property="og:type" content="website">
    <meta property="og:url" content="{% block meta_url %}{{ site_url }}{{ request.path }}{% endblock %}">
@@ -164,6 +168,8 @@
            <div class="footer-bottom">
                <p>&copy; {{ current_year }} {{ copyright_name }}. All rights reserved.</p>
                <span class="footer-divider" aria-hidden="true">|</span>
                <a href="{{ url_for('terms') }}" class="footer-terms">Terms of Service</a>
                <span class="footer-divider" aria-hidden="true">|</span>
                <div class="site-credit">
                    <a href="https://git.puercito.net/mi/sunday" target="_blank" rel="noopener noreferrer" aria-label="Sunday Comics - Webcomic platform">
                        <img src="{{ url_for('static', filename='images/sunday.jpg') }}" alt="Sunday Comics" class="credit-image">