From 14415dfcd2b6f16f79c70dcf5733c839bbae77ba Mon Sep 17 00:00:00 2001
From: mi
Date: Sat, 15 Nov 2025 15:43:32 +1000
Subject: [PATCH] :negative_squared_cross_mark: anti-AI measures

---
 README.md            | 144 +++++++++++++++++++++++++++++++++++++++++++
 app.py               |  65 +++++++++++++++++++
 content/terms.md     |  93 ++++++++++++++++++++++++++++
 static/css/style.css |  15 ++++-
 templates/base.html  |   6 ++
 5 files changed, 322 insertions(+), 1 deletion(-)
 create mode 100644 content/terms.md

diff --git a/README.md b/README.md
index 980b459..1eb43e6 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,13 @@ A Flask-based webcomic website with server-side rendering using Jinja2 templates
   - [SEO Best Practices for Webcomics](#seo-best-practices-for-webcomics)
   - [SEO Checklist for Launch](#seo-checklist-for-launch)
   - [Common SEO Questions](#common-seo-questions)
+- [Content Protection & AI Scraping Prevention](#content-protection--ai-scraping-prevention)
+  - [Protection Features](#protection-features)
+  - [Optional: Additional Protection Measures](#optional-additional-protection-measures)
+  - [Important Limitations](#important-limitations)
+  - [Customizing Your Terms](#customizing-your-terms)
+  - [Testing Your Protection](#testing-your-protection)
+  - [Reporting Violations](#reporting-violations)
 - [Project Structure](#project-structure)
 - [Setup](#setup)
   - [Environment Variables](#environment-variables)
@@ -457,6 +464,143 @@ A: Hashtags don't directly affect search engine SEO, but they help social media
 **Q: Should I create a blog for my comic?**
 A: Optional, but regular blog content about your comic's development can improve SEO through fresh content and more keywords.
 
+## Content Protection & AI Scraping Prevention
+
+Sunday Comics includes built-in measures to discourage AI web scrapers from using your creative work for training machine learning models without permission.
+
+### Protection Features
+
+#### robots.txt Blocking
+The dynamically generated `robots.txt` file blocks known AI crawlers while still allowing legitimate search engines:
+
+**Blocked AI bots:**
+- **GPTBot** & **ChatGPT-User** (OpenAI)
+- **CCBot** (Common Crawl - used by many AI companies)
+- **anthropic-ai** & **Claude-Web** (Anthropic)
+- **Google-Extended** (Google's AI training crawler, separate from Googlebot)
+- **PerplexityBot** (Perplexity AI)
+- **Omgilibot**, **Diffbot**, **Bytespider**, **FacebookBot**, **ImagesiftBot**, **cohere-ai**
+
+**Note:** Regular search engine crawlers (Googlebot, Bingbot, etc.) are still allowed so your comic can be discovered through search.
+
+The robots.txt also includes a reference to your Terms of Service for transparency.
+
+#### HTML Meta Tags
+Every page includes meta tags that signal to AI scrapers not to use the content:
+
+```html
+
+
+```
+
+- `noai` - Prevents AI training on text content
+- `noimageai` - Prevents AI training on images (your comics)
+
+#### Terms of Service
+A comprehensive Terms of Service page at `/terms` legally prohibits:
+- Using content for AI training or machine learning
+- Scraping or harvesting content for datasets
+- Creating derivative works using AI trained on your content
+- Text and Data Mining (TDM) without permission
+
+The Terms page is automatically linked in your footer and includes:
+- Copyright protection assertions
+- DMCA enforcement information
+- TDM rights reservation (EU Directive 2019/790 Article 4)
+- Clear permitted use guidelines
+
+### Optional: Additional Protection Measures
+
+#### HTTP Headers (Advanced)
+For stronger enforcement, you can add HTTP headers. Add this to `app.py` after the imports:
+
+```python
+@app.after_request
+def add_ai_blocking_headers(response):
+    """Add headers to discourage AI scraping"""
+    response.headers['X-Robots-Tag'] = 'noai, noimageai'
+    return response
+```
+
+#### TDM Reservation File (Advanced)
+Create a `/tdmrep.json` endpoint to formally reserve Text and Data Mining rights:
+
+```python
+@app.route('/tdmrep.json')
+def tdm_reservation():
+    """TDM (Text and Data Mining) reservation"""
+    from flask import jsonify
+    return jsonify({
+        "tdm": {
+            "reservation": 1,
+            "policy": f"{SITE_URL}/terms"
+        }
+    })
+```
+
+### Important Limitations
+
+**These measures are voluntary** - they only work if AI companies respect them:
+
+✅ **What this does:**
+- Signals your intent to protect your content
+- Provides legal grounding for DMCA takedowns
+- Blocks responsible AI companies that honor robots.txt
+- Makes your copyright stance clear to users and crawlers
+
+❌ **What this doesn't do:**
+- Cannot physically prevent determined bad actors from scraping
+- Cannot remove already-scraped historical data from existing datasets
+- No guarantee all AI companies will honor these signals
+
+**Companies that claim to honor robots.txt:**
+- OpenAI (GPTBot blocking)
+- Anthropic (anthropic-ai blocking)
+- Google (Google-Extended blocking, separate from search)
+
+### Customizing Your Terms
+
+Edit `content/terms.md` to customize:
+
+1. **Jurisdiction** - Add your country/state for legal clarity
+2. **Permitted use** - Adjust what you allow (fan art, sharing, etc.)
+3. **Contact info** - Automatically populated from `comics_data.py`
+
+The Terms page uses Jinja2 template variables that pull from your configuration:
+- `{{ copyright_name }}` - From `COPYRIGHT_NAME` in `comics_data.py`
+- `{{ social_email }}` - From `SOCIAL_EMAIL` in `comics_data.py`
+
+### Testing Your Protection
+
+**Verify robots.txt:**
+```bash
+curl https://yourcomic.com/robots.txt
+```
+
+You should see AI bot blocks and a link to your terms.
+
+**Check meta tags:**
+View page source and look for:
+```html
+
+```
+
+**Validate Terms page:**
+Visit `https://yourcomic.com/terms` to ensure it renders correctly.
+
+### Reporting Violations
+
+If you discover your work in an AI training dataset or being used without permission:
+
+1. **Document the violation** - Screenshots, URLs, timestamps
+2. **Review their TOS** - Many AI services have content dispute processes
+3. **Send DMCA takedown** - Your Terms of Service provides legal standing
+4. **Contact the platform** - Use your `SOCIAL_EMAIL` from the Terms page
+
+Resources:
+- [US Copyright Office DMCA](https://www.copyright.gov/dmca/)
+- [EU Copyright Directive](https://digital-strategy.ec.europa.eu/en/policies/copyright-legislation)
+
 ## Project Structure
 
 ```
diff --git a/app.py b/app.py
index cb58073..060108a 100644
--- a/app.py
+++ b/app.py
@@ -217,6 +217,28 @@ def about():
     return render_template('page.html', title='About', content=html_content)
 
 
+@app.route('/terms')
+def terms():
+    """Terms of Service page"""
+    from jinja2 import Template
+    # Read and render the markdown file with template variables
+    terms_path = os.path.join(os.path.dirname(__file__), 'content', 'terms.md')
+    try:
+        with open(terms_path, 'r', encoding='utf-8') as f:
+            content = f.read()
+        # First render as Jinja template to substitute variables
+        template = Template(content)
+        rendered_content = template.render(
+            copyright_name=COPYRIGHT_NAME,
+            social_email=SOCIAL_EMAIL if SOCIAL_EMAIL else '[Contact Email]'
+        )
+        # Then convert markdown to HTML
+        html_content = markdown.markdown(rendered_content)
+    except FileNotFoundError:
+        html_content = '<p>Terms of Service content not found.</p>'
+    return render_template('page.html', title='Terms of Service', content=html_content)
+
+
 @app.route('/api/comics')
 def api_comics():
     """API endpoint - returns all comics as JSON"""
@@ -244,6 +266,9 @@ def robots():
     """Generate robots.txt dynamically with correct SITE_URL"""
     from flask import Response
     robots_txt = f"""# Sunday Comics - Robots.txt
+# Content protected by copyright. AI training prohibited.
+# See terms: {SITE_URL}/terms
+
 User-agent: *
 Allow: /
 
@@ -252,6 +277,46 @@ Sitemap: {SITE_URL}/sitemap.xml
 
 # Disallow API endpoints from indexing
 Disallow: /api/
+
+# Block AI crawlers and scrapers
+User-agent: GPTBot
+Disallow: /
+
+User-agent: ChatGPT-User
+Disallow: /
+
+User-agent: CCBot
+Disallow: /
+
+User-agent: anthropic-ai
+Disallow: /
+
+User-agent: Claude-Web
+Disallow: /
+
+User-agent: Google-Extended
+Disallow: /
+
+User-agent: PerplexityBot
+Disallow: /
+
+User-agent: Omgilibot
+Disallow: /
+
+User-agent: Diffbot
+Disallow: /
+
+User-agent: Bytespider
+Disallow: /
+
+User-agent: FacebookBot
+Disallow: /
+
+User-agent: ImagesiftBot
+Disallow: /
+
+User-agent: cohere-ai
+Disallow: /
 """
     return Response(robots_txt, mimetype='text/plain')
 
diff --git a/content/terms.md b/content/terms.md
new file mode 100644
index 0000000..959e417
--- /dev/null
+++ b/content/terms.md
@@ -0,0 +1,93 @@
+# Terms of Service
+
+**Last Updated:** January 2025
+
+By accessing and using this website, you agree to be bound by these Terms of Service. If you do not agree to these terms, please do not use this site.
+
+## Copyright and Ownership
+
+All comics, artwork, text, graphics, and other content on this website are protected by copyright and owned by {{ copyright_name }}. All rights reserved.
+
+## Permitted Use
+
+**Personal Use:** You may:
+- Read and enjoy the comics for personal, non-commercial purposes
+- Share links to individual comic pages on social media
+- Embed comics on personal websites with proper attribution and a link back to the original
+
+**Attribution Required:** When sharing or embedding, you must:
+- Provide clear credit to {{ copyright_name }}
+- Include a link back to this website
+- Not alter, crop, or modify the comic images
+
+## Prohibited Use
+
+You are **expressly prohibited** from:
+
+### AI Training and Machine Learning
+- Using any content from this site for training artificial intelligence models
+- Scraping, crawling, or harvesting content for machine learning purposes
+- Including any images, text, or data in AI training datasets
+- Using content to develop, train, or improve generative AI systems
+- Creating derivative works using AI trained on this content
+
+### Commercial Use
+- Reproducing, distributing, or selling comics without explicit written permission
+- Using comics or artwork for commercial purposes without a license
+- Printing comics on merchandise (t-shirts, mugs, etc.) without authorization
+
+### Modification and Redistribution
+- Altering, editing, or creating derivative works from the comics
+- Removing watermarks, signatures, or attribution
+- Rehosting images on other servers or websites
+- Claiming comics as your own work
+
+## Data Mining and Web Scraping
+
+**Automated Access Prohibition:** Automated scraping, crawling, or systematic downloading of content is strictly prohibited without prior written consent. This includes but is not limited to:
+- Web scrapers and bots (except authorized search engines)
+- Automated downloads of images or data
+- RSS feed abuse or bulk downloading
+- Any form of data harvesting for commercial purposes
+
+**Text and Data Mining (TDM) Reservation:** We formally reserve all rights under applicable copyright law regarding text and data mining, including but not limited to EU Directive 2019/790 Article 4. No TDM exceptions apply to this content.
+
+## DMCA and Copyright Enforcement
+
+Unauthorized use of copyrighted material from this site may violate copyright law and be subject to legal action under the Digital Millennium Copyright Act (DMCA) and other applicable laws.
+
+If you discover unauthorized use of content from this site, please report it to {{ social_email }}.
+
+## Fair Use
+
+Limited use for purposes of commentary, criticism, news reporting, teaching, or research may qualify as fair use. If you believe your use qualifies as fair use, please contact us first.
+
+## License Requests
+
+If you wish to use content in ways not permitted by these terms, please contact us to discuss licensing arrangements.
+
+## Privacy
+
+We respect your privacy. This site may use cookies for basic functionality and analytics. We do not sell personal information to third parties.
+
+## External Links
+
+This site may contain links to external websites. We are not responsible for the content or practices of third-party sites.
+
+## Modifications to Terms
+
+We reserve the right to modify these Terms of Service at any time. Changes will be posted on this page with an updated "Last Updated" date.
+
+## Contact
+
+For questions about these terms, licensing requests, or to report copyright violations:
+
+{{ social_email }}
+
+## Governing Law
+
+These Terms of Service are governed by applicable copyright law and the laws of [Your Jurisdiction].
+
+---
+
+**Summary:** You can read and share links to comics, but you cannot use them for AI training, scrape the site, use them commercially, or create modified versions without permission.
diff --git a/static/css/style.css b/static/css/style.css
index 3b3e9f4..a7dd522 100644
--- a/static/css/style.css
+++ b/static/css/style.css
@@ -754,7 +754,8 @@ main {
         gap: var(--space-sm);
     }
 
-    .footer-bottom p {
+    .footer-bottom p,
+    .footer-terms {
         flex-basis: 100%;
         text-align: center;
     }
@@ -963,6 +964,18 @@ footer {
     text-decoration: underline;
 }
 
+.footer-terms {
+    color: var(--color-text);
+    text-decoration: none;
+    font-size: var(--font-size-md);
+    transition: opacity 0.2s ease;
+}
+
+.footer-terms:hover {
+    text-decoration: underline;
+    opacity: 0.8;
+}
+
 /* Compact Footer Mode */
 footer.compact-footer {
     border-top: none;
diff --git a/templates/base.html b/templates/base.html
index 5ca7854..4079964 100644
--- a/templates/base.html
+++ b/templates/base.html
@@ -9,6 +9,10 @@
+
+
+
+
@@ -164,6 +168,8 @@
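
Review note (not part of the patch): the `robots.txt` rules added in `app.py` can be sanity-checked offline with Python's standard `urllib.robotparser`. The sketch below rests on some assumptions: `ROBOTS_TXT` is a hand-copied excerpt of the generated file, `blocked_agents` is a hypothetical helper that does not exist in the repository, and `https://yourcomic.com/` stands in for `SITE_URL`.

```python
from urllib.robotparser import RobotFileParser

# Hand-copied excerpt of the generated robots.txt (assumption: the real
# output of the robots() view contains more groups than shown here).
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://yourcomic.com/"):
    """Return the user agents from `agents` that may not fetch `url`.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Googlebot falls through to the "*" group and stays allowed.
    print(blocked_agents(ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"]))
    # ['GPTBot', 'Google-Extended']
```

`urllib.robotparser` is only one interpretation of the format, and real crawlers may parse the file differently, so a passing check is a smoke test rather than proof that a given bot will stay away.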