anti-AI measures

mi
2025-11-15 15:43:32 +10:00
parent 1dac042d25
commit 14415dfcd2
5 changed files with 322 additions and 1 deletions

144
README.md

@@ -24,6 +24,13 @@ A Flask-based webcomic website with server-side rendering using Jinja2 templates
- [SEO Best Practices for Webcomics](#seo-best-practices-for-webcomics)
- [SEO Checklist for Launch](#seo-checklist-for-launch)
- [Common SEO Questions](#common-seo-questions)
- [Content Protection & AI Scraping Prevention](#content-protection--ai-scraping-prevention)
- [Protection Features](#protection-features)
- [Optional: Additional Protection Measures](#optional-additional-protection-measures)
- [Important Limitations](#important-limitations)
- [Customizing Your Terms](#customizing-your-terms)
- [Testing Your Protection](#testing-your-protection)
- [Reporting Violations](#reporting-violations)
- [Project Structure](#project-structure)
- [Setup](#setup)
- [Environment Variables](#environment-variables)
@@ -457,6 +464,143 @@ A: Hashtags don't directly affect search engine SEO, but they help social media
**Q: Should I create a blog for my comic?**
A: Optional, but regular blog content about your comic's development can improve SEO through fresh content and more keywords.
## Content Protection & AI Scraping Prevention
Sunday Comics includes built-in measures to discourage AI web scrapers from using your creative work for training machine learning models without permission.
### Protection Features
#### robots.txt Blocking
The dynamically generated `robots.txt` file asks known AI crawlers to stay out of the whole site while still allowing regular search engines:
**Blocked AI bots:**
- **GPTBot** & **ChatGPT-User** (OpenAI)
- **CCBot** (Common Crawl - used by many AI companies)
- **anthropic-ai** & **Claude-Web** (Anthropic)
- **Google-Extended** (Google's AI training crawler, separate from Googlebot)
- **PerplexityBot** (Perplexity AI)
- **Omgilibot**, **Diffbot**, **Bytespider**, **FacebookBot**, **ImagesiftBot**, **cohere-ai**
**Note:** Regular search engine crawlers (Googlebot, Bingbot, etc.) are still allowed so your comic can be discovered through search.
The robots.txt also includes a reference to your Terms of Service for transparency.
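As a rough check of the rules above, the sketch below builds a robots.txt body from a list of crawler names and reads it back with Python's standard `urllib.robotparser`. The helper names `build_robots_txt` and `is_allowed`, the shortened bot list, and the example path are assumptions for illustration; this is not the code in `app.py`.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative subset of the blocked user agents listed above.
AI_BOTS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(bots):
    """Return a robots.txt body that allows everyone except the given bots."""
    lines = ["User-agent: *", "Allow: /", ""]
    for bot in bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines)


def is_allowed(robots_txt, user_agent, path):
    """Ask the standard-library parser whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


robots_txt = build_robots_txt(AI_BOTS)
print(is_allowed(robots_txt, "GPTBot", "/comic/1"))     # False
print(is_allowed(robots_txt, "Googlebot", "/comic/1"))  # True
```

This only shows how a compliant parser reads the file; a crawler that ignores robots.txt is unaffected.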
#### HTML Meta Tags
Every page includes meta tags that signal to AI scrapers not to use the content:
```html
<meta name="robots" content="noai, noimageai">
<meta name="googlebot" content="noai, noimageai">
```
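- `noai` - Asks crawlers not to use text content for AI training
- `noimageai` - Asks crawlers not to use images (your comics) for AI training

Both directives are non-standard, so they only have an effect on crawlers that choose to recognize them.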
#### Terms of Service
A comprehensive Terms of Service page at `/terms` legally prohibits:
- Using content for AI training or machine learning
- Scraping or harvesting content for datasets
- Creating derivative works using AI trained on your content
- Text and Data Mining (TDM) without permission
The Terms page is automatically linked in your footer and includes:
- Copyright protection assertions
- DMCA enforcement information
- TDM rights reservation (EU Directive 2019/790 Article 4)
- Clear permitted use guidelines
### Optional: Additional Protection Measures
#### HTTP Headers (Advanced)
To send the same signal on every response, including images and other non-HTML files, you can add an HTTP header. Add this to `app.py` after the `app` object is created:
```python
@app.after_request
def add_ai_blocking_headers(response):
    """Add headers to discourage AI scraping"""
    response.headers['X-Robots-Tag'] = 'noai, noimageai'
    return response
```
#### TDM Reservation File (Advanced)
Create a `/tdmrep.json` endpoint to formally reserve Text and Data Mining rights:
```python
@app.route('/tdmrep.json')
def tdm_reservation():
    """TDM (Text and Data Mining) reservation"""
    from flask import jsonify
    return jsonify({
        "tdm": {
            "reservation": 1,
            "policy": f"{SITE_URL}/terms"
        }
    })
```
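For comparison, the TDM Reservation Protocol (TDMRep) report from the W3C community group describes, as far as we can tell, a file served at `/.well-known/tdmrep.json` that holds a list of rules keyed by `location`, `tdm-reservation`, and `tdm-policy`, which differs from the shape above. The sketch below builds such a payload with the standard library only. `build_tdmrep` and the example URL are hypothetical, and the field names should be checked against the current specification before use.

```python
import json


def build_tdmrep(site_url, location="/"):
    """Build a TDMRep-style payload: one rule reserving TDM rights for a path.

    Assumption: field names follow the W3C TDMRep community group report.
    """
    return [
        {
            "location": location,
            "tdm-reservation": 1,
            "tdm-policy": f"{site_url.rstrip('/')}/terms",
        }
    ]


# Hypothetical site URL, for illustration only.
payload = build_tdmrep("https://yourcomic.com")
print(json.dumps(payload, indent=2))
```

Pointing `tdm-policy` at the Terms page mirrors the snippet above. The report may expect a machine-readable policy at that URL, so treat the value as a placeholder.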
### Important Limitations
**These measures are voluntary** - they only work if AI companies respect them:
✅ **What this does:**
- Signals your intent to protect your content
- Documents your stated restrictions, which can support a DMCA takedown notice
- Blocks responsible AI companies that honor robots.txt
- Makes your copyright stance clear to users and crawlers
❌ **What this doesn't do:**
- Cannot physically prevent determined bad actors from scraping
- Cannot remove already-scraped historical data from existing datasets
- No guarantee all AI companies will honor these signals
**Companies that claim to honor robots.txt:**
- OpenAI (GPTBot blocking)
- Anthropic (anthropic-ai blocking)
- Google (Google-Extended blocking, separate from search)
### Customizing Your Terms
Edit `content/terms.md` to customize:
1. **Jurisdiction** - Add your country/state for legal clarity
2. **Permitted use** - Adjust what you allow (fan art, sharing, etc.)
3. **Contact info** - Automatically populated from `comics_data.py`
The Terms page uses Jinja2 template variables that pull from your configuration:
- `{{ copyright_name }}` - From `COPYRIGHT_NAME` in `comics_data.py`
- `{{ social_email }}` - From `SOCIAL_EMAIL` in `comics_data.py`
### Testing Your Protection
**Verify robots.txt:**
```bash
curl https://yourcomic.com/robots.txt
```
You should see AI bot blocks and a link to your terms.
**Check meta tags:**
View page source and look for:
```html
<meta name="robots" content="noai, noimageai">
```
**Validate Terms page:**
Visit `https://yourcomic.com/terms` to ensure it renders correctly.
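The manual checks above could also be scripted. The following sketch uses only the standard library to pull `robots` directives out of an HTML string. The class name `RobotsMetaParser`, the helper `robots_directives`, and the sample markup are assumptions; in practice the HTML would come from fetching one of your own pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Sample markup for illustration only.
sample = '<head><meta name="robots" content="noai, noimageai"></head>'
found = robots_directives(sample)
print({"noai", "noimageai"} <= found)  # True
```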
### Reporting Violations
If you discover your work in an AI training dataset or being used without permission:
1. **Document the violation** - Screenshots, URLs, timestamps
2. **Review their TOS** - Many AI services have content dispute processes
3. **Send a DMCA takedown notice** - Your copyright in the work is the basis for the notice; your Terms of Service documents the restrictions you stated
4. **Contact the platform** - Use your `SOCIAL_EMAIL` from the Terms page
Resources:
- [US Copyright Office DMCA](https://www.copyright.gov/dmca/)
- [EU Copyright Directive](https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation)
## Project Structure
```

65
app.py

@@ -217,6 +217,28 @@ def about():
    return render_template('page.html', title='About', content=html_content)

@app.route('/terms')
def terms():
    """Terms of Service page"""
    from jinja2 import Template
    # Read and render the markdown file with template variables
    terms_path = os.path.join(os.path.dirname(__file__), 'content', 'terms.md')
    try:
        with open(terms_path, 'r', encoding='utf-8') as f:
            content = f.read()
        # First render as Jinja template to substitute variables
        template = Template(content)
        rendered_content = template.render(
            copyright_name=COPYRIGHT_NAME,
            social_email=SOCIAL_EMAIL if SOCIAL_EMAIL else '[Contact Email]'
        )
        # Then convert markdown to HTML
        html_content = markdown.markdown(rendered_content)
    except FileNotFoundError:
        html_content = '<p>Terms of Service content not found.</p>'
    return render_template('page.html', title='Terms of Service', content=html_content)

@app.route('/api/comics')
def api_comics():
    """API endpoint - returns all comics as JSON"""
@@ -244,6 +266,9 @@ def robots():
    """Generate robots.txt dynamically with correct SITE_URL"""
    from flask import Response
    robots_txt = f"""# Sunday Comics - Robots.txt
# Content protected by copyright. AI training prohibited.
# See terms: {SITE_URL}/terms
User-agent: *
Allow: /
@@ -252,6 +277,46 @@ Sitemap: {SITE_URL}/sitemap.xml
# Disallow API endpoints from indexing
Disallow: /api/
# Block AI crawlers and scrapers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: cohere-ai
Disallow: /
"""
    return Response(robots_txt, mimetype='text/plain')

93
content/terms.md Normal file

@@ -0,0 +1,93 @@
# Terms of Service
**Last Updated:** January 2025
By accessing and using this website, you agree to be bound by these Terms of Service. If you do not agree to these terms, please do not use this site.
## Copyright and Ownership
All comics, artwork, text, graphics, and other content on this website are protected by copyright and owned by {{ copyright_name }}. All rights reserved.
## Permitted Use
**Personal Use:** You may:
- Read and enjoy the comics for personal, non-commercial purposes
- Share links to individual comic pages on social media
- Embed comics on personal websites with proper attribution and a link back to the original
**Attribution Required:** When sharing or embedding, you must:
- Provide clear credit to {{ copyright_name }}
- Include a link back to this website
- Not alter, crop, or modify the comic images
## Prohibited Use
You are **expressly prohibited** from:
### AI Training and Machine Learning
- Using any content from this site for training artificial intelligence models
- Scraping, crawling, or harvesting content for machine learning purposes
- Including any images, text, or data in AI training datasets
- Using content to develop, train, or improve generative AI systems
- Creating derivative works using AI trained on this content
### Commercial Use
- Reproducing, distributing, or selling comics without explicit written permission
- Using comics or artwork for commercial purposes without a license
- Printing comics on merchandise (t-shirts, mugs, etc.) without authorization
### Modification and Redistribution
- Altering, editing, or creating derivative works from the comics
- Removing watermarks, signatures, or attribution
- Rehosting images on other servers or websites
- Claiming comics as your own work
## Data Mining and Web Scraping
**Automated Access Prohibition:** Automated scraping, crawling, or systematic downloading of content is strictly prohibited without prior written consent. This includes but is not limited to:
- Web scrapers and bots (except authorized search engines)
- Automated downloads of images or data
- RSS feed abuse or bulk downloading
- Any form of data harvesting for commercial purposes
**Text and Data Mining (TDM) Reservation:** We formally reserve all rights under applicable copyright law regarding text and data mining, including but not limited to EU Directive 2019/790 Article 4. No TDM exceptions apply to this content.
## DMCA and Copyright Enforcement
Unauthorized use of copyrighted material from this site may violate copyright law and be subject to legal action under the Digital Millennium Copyright Act (DMCA) and other applicable laws.
If you discover unauthorized use of content from this site, please report it to {{ social_email }}.
## Fair Use
Limited use for purposes of commentary, criticism, news reporting, teaching, or research may qualify as fair use. If you believe your use qualifies as fair use, please contact us first.
## License Requests
If you wish to use content in ways not permitted by these terms, please contact us to discuss licensing arrangements.
## Privacy
We respect your privacy. This site may use cookies for basic functionality and analytics. We do not sell personal information to third parties.
## External Links
This site may contain links to external websites. We are not responsible for the content or practices of third-party sites.
## Modifications to Terms
We reserve the right to modify these Terms of Service at any time. Changes will be posted on this page with an updated "Last Updated" date.
## Contact
For questions about these terms, licensing requests, or to report copyright violations:
{{ social_email }}
## Governing Law
These Terms of Service are governed by applicable copyright law and the laws of [Your Jurisdiction].
---
**Summary:** You can read and share links to comics, but you cannot use them for AI training, scrape the site, use them commercially, or create modified versions without permission.

View File

@@ -754,7 +754,8 @@ main {
    gap: var(--space-sm);
}
.footer-bottom p,
.footer-terms {
    flex-basis: 100%;
    text-align: center;
}
@@ -963,6 +964,18 @@ footer {
    text-decoration: underline;
}
.footer-terms {
    color: var(--color-text);
    text-decoration: none;
    font-size: var(--font-size-md);
    transition: opacity 0.2s ease;
}
.footer-terms:hover {
    text-decoration: underline;
    opacity: 0.8;
}
/* Compact Footer Mode */
footer.compact-footer {
    border-top: none;

View File

@@ -9,6 +9,10 @@
<meta name="description" content="{% block meta_description %}A webcomic about life, the universe, and everything{% endblock %}">
<link rel="canonical" href="{% block canonical %}{{ site_url }}{{ request.path }}{% endblock %}">
<!-- AI Scraping Prevention -->
<meta name="robots" content="noai, noimageai">
<meta name="googlebot" content="noai, noimageai">
<!-- Open Graph / Facebook -->
<meta property="og:type" content="website">
<meta property="og:url" content="{% block meta_url %}{{ site_url }}{{ request.path }}{% endblock %}">
@@ -164,6 +168,8 @@
<div class="footer-bottom">
<p>&copy; {{ current_year }} {{ copyright_name }}. All rights reserved.</p>
<span class="footer-divider" aria-hidden="true">|</span>
<a href="{{ url_for('terms') }}" class="footer-terms">Terms of Service</a>
<span class="footer-divider" aria-hidden="true">|</span>
<div class="site-credit">
<a href="https://git.puercito.net/mi/sunday" target="_blank" rel="noopener noreferrer" aria-label="Sunday Comics - Webcomic platform">
<img src="{{ url_for('static', filename='images/sunday.jpg') }}" alt="Sunday Comics" class="credit-image">