Technical Documentation
Complete guide to understanding, deploying, and extending the PHT Strategy 5 AI content pipeline.
System Architecture
PHT Strategy 5 is built as a modular pipeline with three main components (briefs, drafts, and feedback) plus a shared utilities package:
src/
├── briefs/                  # Topic discovery and brief generation
│   ├── fetch_feeds.py       # RSS/API data collection
│   ├── cluster_topics.py    # ML-based topic clustering
│   └── write_brief.py       # Structured brief creation
├── drafts/                  # Article generation and optimization
│   ├── expand_article.py    # LLM-based article writing
│   ├── fact_check.py        # RAG-based verification
│   ├── seo_enrich.py        # Metadata and schema generation
│   └── image_prompts.py     # Visual content suggestions
├── feedback/                # Analytics and improvement
│   ├── ingest_cf_logs.py    # Cloudflare log processing
│   ├── score_topics.py      # Performance analysis
│   └── open_tasks.py        # Improvement task generation
└── utils/                   # Shared utilities
    ├── gh.py                # GitHub API integration
    ├── rag_store.py         # Knowledge base management
    └── prompts/             # LLM prompt templates
Data Flow
- Collection: RSS feeds and APIs → Raw topic data
- Processing: Topic clustering → Content gap analysis
- Generation: Structured briefs → GitHub Issues
- Approval: Human review → Approved topics
- Expansion: LLM generation → Full articles
- Verification: Fact-checking → Quality assurance
- Publishing: SEO optimization → Pull requests
- Feedback: Analytics → Improvement suggestions
Topic Discovery & Clustering
The system continuously monitors multiple sources to identify trending privacy topics:
Data Sources
# config/sources.yml
feeds:
  - name: eff
    url: https://www.eff.org/rss/updates.xml
  - name: mozilla
    url: https://blog.mozilla.org/en/feed/
  - name: proton
    url: https://proton.me/blog/feed
# Additional sources (configurable):
# - Reddit communities (r/privacy, r/security)
# - Google Trends privacy categories
# - Hacker News privacy discussions
# - Government policy RSS feeds
Clustering Algorithm
Topics are clustered using machine learning to identify related content:
- Embeddings: Text content converted to vector representations
- Similarity: Cosine similarity between topic vectors
- Clustering: DBSCAN or HDBSCAN for automatic cluster detection
- Gap Analysis: Compare clusters to existing SERP results
# Example clustering output
{
  "cluster_id": "android-privacy-2025",
  "topics": [
    "Android 15 privacy features",
    "Google Play privacy changes",
    "Android app permissions update"
  ],
  "search_gap": "Missing comprehensive guide for new Android 15 privacy settings",
  "confidence": 0.87
}
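The clustering steps above can be sketched in plain Python. This is a minimal, dependency-free DBSCAN over cosine distance, not the pipeline's actual `cluster_topics.py`, which is not shown here. The function names, the toy 2-D vectors standing in for real embeddings, and the mapping of `eps` and `min_samples` onto the `eps: 0.3` and `min_cluster_size: 3` config values are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero vectors count as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors: Sequence[Sequence[float]], eps: float = 0.3,
           min_samples: int = 3) -> List[int]:
    """Label each vector with a cluster id; -1 marks noise."""
    n = len(vectors)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        # A point is always its own neighbor (distance 0).
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return [label if label is not None else -1 for label in labels]


# Toy "embeddings": two tight groups and one outlier.
vectors = [[1, 0], [0.99, 0.1], [0.98, 0.05],
           [0, 1], [0.1, 0.99], [0.05, 0.98],
           [-1, -1]]
labels = dbscan(vectors, eps=0.3, min_samples=3)
```

In practice a library implementation (for example scikit-learn's `DBSCAN` with `metric="cosine"`) would likely replace the quadratic neighbor search used here.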
Brief Generation Process
Each topic cluster is transformed into a structured brief containing all information needed for article creation:
# Example brief structure
{
  "slug": "android-15-privacy-guide-2025",
  "pillar": "mobile-privacy",
  "title": "Complete Android 15 Privacy Guide: New Features & Settings",
  "primary_keyword": "android 15 privacy",
  "secondary_keywords": ["android privacy settings", "google privacy controls"],
  "search_intent": "howto",
  "search_gap": "No comprehensive guide covering all Android 15 privacy features",
  "outline": [
    {
      "section": "Introduction",
      "summary": "Overview of Android 15 privacy improvements"
    },
    {
      "section": "New Privacy Features",
      "summary": "Detailed walkthrough of new privacy controls"
    },
    {
      "section": "Step-by-Step Setup",
      "summary": "How to configure optimal privacy settings"
    }
  ],
  "sources_hint": ["developer.android.com", "support.google.com"],
  "estimated_length": 2500,
  "difficulty": "beginner"
}
Brief Quality Criteria
- Clear search intent identification (HowTo, Explainer, Review)
- Quantified search gap with competitive analysis
- Realistic scope and target word count
- Authoritative source suggestions for fact-checking
- SEO keyword analysis and difficulty assessment
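Some of these criteria can be checked mechanically before a brief is posted for review. The sketch below is one possible validator, not the project's own: `validate_brief`, the list of required fields, and the choice to reuse the 1200 to 5000 word range from `config/thresholds.yml` as the "realistic scope" test are assumptions for illustration.

```python
from typing import Any, Dict, List

# Intent values taken from the brief example and the criteria list above.
ALLOWED_INTENTS = {"howto", "explainer", "review"}

# Hypothetical set of fields treated as mandatory.
REQUIRED_FIELDS = (
    "slug", "title", "primary_keyword", "search_intent",
    "search_gap", "outline", "sources_hint", "estimated_length",
)


def validate_brief(brief: Dict[str, Any], min_words: int = 1200,
                   max_words: int = 5000) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    for field in REQUIRED_FIELDS:
        if not brief.get(field):  # missing, empty string, empty list, or 0
            problems.append(f"missing field: {field}")
    intent = brief.get("search_intent")
    if intent and intent not in ALLOWED_INTENTS:
        problems.append(f"unknown search_intent: {intent}")
    length = brief.get("estimated_length")
    if isinstance(length, int) and length and not (min_words <= length <= max_words):
        problems.append(
            f"estimated_length {length} outside {min_words}-{max_words} words"
        )
    return problems


good_brief = {
    "slug": "android-15-privacy-guide-2025",
    "title": "Complete Android 15 Privacy Guide: New Features & Settings",
    "primary_keyword": "android 15 privacy",
    "search_intent": "howto",
    "search_gap": "No comprehensive guide covering all Android 15 privacy features",
    "outline": [{"section": "Introduction", "summary": "Overview"}],
    "sources_hint": ["developer.android.com"],
    "estimated_length": 2500,
}
bad_brief = dict(good_brief, search_intent="listicle",
                 estimated_length=300, sources_hint=[])
```

Criteria that need judgment, such as whether the search gap is real, still belong to the human review step.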
Human Approval Workflow
Every brief requires human approval before article generation begins:
GitHub Issues Integration
- Brief generated and posted as GitHub Issue with "brief" label
- Human editor reviews topic relevance and quality
- Editor adds "approved" label to selected briefs
- GitHub Action triggered automatically on label addition
- Approved brief enters article generation pipeline
# GitHub Issue Template
Title: [BRIEF] Complete Android 15 Privacy Guide: New Features & Settings
Labels: brief, mobile-privacy, android
Body:
**Primary Keyword:** android 15 privacy
**Search Gap:** No comprehensive guide covering all Android 15 privacy features
**Estimated Length:** 2500 words
**Difficulty:** Beginner
**Outline:**
1. Introduction - Overview of Android 15 privacy improvements
2. New Privacy Features - Detailed walkthrough of new privacy controls
3. Step-by-Step Setup - How to configure optimal privacy settings
**Sources:** developer.android.com, support.google.com
---
To approve this brief for article generation, add the "approved" label.
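As a rough illustration of this workflow, the sketch below renders a brief dictionary into the issue layout shown above and checks whether a webhook payload represents the approval step. `render_issue` and `is_approval_event` are hypothetical helpers, not functions from `utils/gh.py`. The payload shape assumes GitHub's `issues` event with `action: "labeled"` and a `label.name` field; topic labels beyond the pillar (such as "android") are not derived here.

```python
from typing import Any, Dict


def render_issue(brief: Dict[str, Any]) -> Dict[str, Any]:
    """Build the title, body, and labels for a brief issue."""
    outline_lines = [
        f"{number}. {item['section']} - {item['summary']}"
        for number, item in enumerate(brief["outline"], start=1)
    ]
    body = "\n".join([
        f"**Primary Keyword:** {brief['primary_keyword']}",
        f"**Search Gap:** {brief['search_gap']}",
        f"**Estimated Length:** {brief['estimated_length']} words",
        f"**Difficulty:** {brief['difficulty'].capitalize()}",
        "**Outline:**",
        *outline_lines,
        f"**Sources:** {', '.join(brief['sources_hint'])}",
        "---",
        'To approve this brief for article generation, add the "approved" label.',
    ])
    return {
        "title": f"[BRIEF] {brief['title']}",
        "body": body,
        "labels": ["brief", brief["pillar"]],
    }


def is_approval_event(event: Dict[str, Any]) -> bool:
    """True when a label event added the "approved" label."""
    label = event.get("label") or {}
    return event.get("action") == "labeled" and label.get("name") == "approved"


brief = {
    "title": "Complete Android 15 Privacy Guide: New Features & Settings",
    "pillar": "mobile-privacy",
    "primary_keyword": "android 15 privacy",
    "search_gap": "No comprehensive guide covering all Android 15 privacy features",
    "estimated_length": 2500,
    "difficulty": "beginner",
    "outline": [
        {"section": "Introduction",
         "summary": "Overview of Android 15 privacy improvements"},
    ],
    "sources_hint": ["developer.android.com", "support.google.com"],
}
issue = render_issue(brief)
```

The returned dictionary can be passed to whichever GitHub client the project uses to create the issue.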
Approval Metrics
The system tracks approval rates to improve brief quality:
- Brief approval percentage by topic category
- Time from brief creation to approval decision
- Editor feedback patterns and preferences
- Correlation between brief quality scores and approval rates
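The first two metrics are simple aggregates. A minimal sketch follows, assuming each brief is tracked as a small record; the `BriefRecord` type and its field names are hypothetical and not part of the documented API.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple, Optional


class BriefRecord(NamedTuple):
    """Hypothetical tracking record for one brief."""
    category: str
    created_at: datetime
    decided_at: Optional[datetime]  # None while the brief awaits review
    approved: bool


def approval_rate_by_category(records: Iterable[BriefRecord]) -> Dict[str, float]:
    """Share of decided briefs that were approved, per category."""
    decided: Dict[str, int] = defaultdict(int)
    approved: Dict[str, int] = defaultdict(int)
    for record in records:
        if record.decided_at is None:
            continue  # undecided briefs do not count either way
        decided[record.category] += 1
        if record.approved:
            approved[record.category] += 1
    return {category: approved[category] / decided[category]
            for category in decided}


def mean_hours_to_decision(records: Iterable[BriefRecord]) -> Optional[float]:
    """Average time from creation to decision, in hours."""
    durations: List[float] = [
        (r.decided_at - r.created_at).total_seconds() / 3600
        for r in records if r.decided_at is not None
    ]
    return sum(durations) / len(durations) if durations else None


records = [
    BriefRecord("mobile-privacy", datetime(2025, 1, 1, 0, 0),
                datetime(2025, 1, 1, 12, 0), True),
    BriefRecord("mobile-privacy", datetime(2025, 1, 2, 0, 0),
                datetime(2025, 1, 3, 0, 0), False),
    BriefRecord("browser", datetime(2025, 1, 4, 0, 0), None, False),
]
```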
Article Expansion & Generation
Approved briefs are expanded into full articles using large language models:
Generation Process
- Section-by-Section: Articles written incrementally for better quality control
- Template Adherence: Consistent structure with intro, key takeaways, instructions, FAQ
- Style Guidelines: Short paragraphs, bullet lists, clear headings
- Citation Integration: Automatic source attribution and link insertion
# Article generation prompt template
You are writing a comprehensive privacy guide for technically-minded users.
Article Details:
- Title: {title}
- Primary Keyword: {primary_keyword}
- Target Length: {target_length} words
- Audience: {difficulty_level}
Content Requirements:
- Short paragraphs (2-3 sentences max)
- Numbered lists for step-by-step instructions
- Bullet points for feature lists and benefits
- Include "Key Takeaways" section near the top
- Add FAQ section at the end
- Cite sources using [source: domain.com] format
Section to write: {section_title}
Section summary: {section_summary}
Write this section now:
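Filling the template for one section might look like the sketch below. The template string is abbreviated (the content requirements are omitted), and `build_section_prompt` is a hypothetical helper. Mapping the brief's `estimated_length` and `difficulty` fields onto the `{target_length}` and `{difficulty_level}` placeholders is an assumption.

```python
from typing import Any, Dict

# Abbreviated copy of the prompt template; placeholders match the original.
SECTION_PROMPT = """You are writing a comprehensive privacy guide for technically-minded users.
Article Details:
- Title: {title}
- Primary Keyword: {primary_keyword}
- Target Length: {target_length} words
- Audience: {difficulty_level}
Section to write: {section_title}
Section summary: {section_summary}
Write this section now:"""


def build_section_prompt(brief: Dict[str, Any], section: Dict[str, str]) -> str:
    """Fill the template for a single outline entry."""
    return SECTION_PROMPT.format(
        title=brief["title"],
        primary_keyword=brief["primary_keyword"],
        target_length=brief["estimated_length"],   # assumed field mapping
        difficulty_level=brief["difficulty"],      # assumed field mapping
        section_title=section["section"],
        section_summary=section["summary"],
    )


brief = {
    "title": "Complete Android 15 Privacy Guide: New Features & Settings",
    "primary_keyword": "android 15 privacy",
    "estimated_length": 2500,
    "difficulty": "beginner",
}
section = {"section": "Introduction",
           "summary": "Overview of Android 15 privacy improvements"}
prompt = build_section_prompt(brief, section)
```

Calling this once per outline entry matches the section-by-section generation described above.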
Quality Controls
- Length Validation: Meets minimum word count thresholds
- Readability: Flesch Reading Ease score 55-75
- Structure: Proper heading hierarchy and formatting
- Links: Minimum internal links and appropriate external citations
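The readability check can be approximated with the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The sketch below is an illustration rather than the pipeline's implementation: the syllable counter is a rough vowel-group heuristic, and the function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Compute the Flesch Reading Ease score for English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def within_target(score: float, low: float = 55.0, high: float = 75.0) -> bool:
    """Check the score against the 55-75 band used in this guide."""
    return low <= score <= high


score = flesch_reading_ease("The cat sat on the mat.")
```

Because the syllable count is heuristic, scores near the edges of the band are best treated as a prompt for human review rather than a hard failure.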
Fact-Checking System
Every article undergoes automated fact-checking using Retrieval-Augmented Generation (RAG):
Knowledge Base
The RAG system maintains a curated database of authoritative privacy sources:
- EFF: Blog posts and privacy guides
- Mozilla: Developer documentation and privacy policies
- Standards: IETF RFCs and W3C specifications
- Vendors: Official Apple, Google, Microsoft documentation
- Legal: GDPR, CCPA, and other privacy regulations
# Fact-checking process
1. Extract claims from generated article
2. Query RAG knowledge base for relevant sources
3. Compare claims against retrieved passages
4. Flag unverifiable or contradictory statements
5. Insert HTML comments for human review
# Example flagged content
<!-- VERIFY: This claim about Android 15 permissions
could not be verified against developer.android.com
documentation. Please confirm accuracy. -->
Verification Levels
- Verified: Direct match with authoritative source
- Supported: Consistent with similar authoritative content
- Unverified: No supporting evidence found - flagged for review
- Contradicted: Conflicts with known authoritative source
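One way to represent these levels and turn them into review comments is sketched below. The `FactCheck` fields are assumptions (the API reference names the type but does not list its fields), and `annotate` is a hypothetical helper that uses exact string matching, which a real claim extractor would likely replace.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class VerificationLevel(Enum):
    VERIFIED = "verified"
    SUPPORTED = "supported"
    UNVERIFIED = "unverified"
    CONTRADICTED = "contradicted"


# Levels that should be surfaced to a human editor.
NEEDS_REVIEW = {VerificationLevel.UNVERIFIED, VerificationLevel.CONTRADICTED}


@dataclass
class FactCheck:
    """Hypothetical result for one extracted claim."""
    claim: str
    level: VerificationLevel
    source: str  # domain the claim was checked against


def annotate(markdown: str, checks: Iterable[FactCheck]) -> str:
    """Insert an HTML review comment after each claim that needs attention."""
    for check in checks:
        if check.level not in NEEDS_REVIEW or check.claim not in markdown:
            continue
        if check.level is VerificationLevel.CONTRADICTED:
            reason = f"conflicts with {check.source}"
        else:
            reason = f"could not be verified against {check.source}"
        note = (f"<!-- VERIFY: The preceding claim {reason}. "
                "Please confirm accuracy. -->")
        # Annotate only the first occurrence of the claim text.
        markdown = markdown.replace(check.claim, f"{check.claim} {note}", 1)
    return markdown


article = "Claim one about permissions. Claim two about settings."
checks = [
    FactCheck("Claim one about permissions.",
              VerificationLevel.UNVERIFIED, "developer.android.com"),
    FactCheck("Claim two about settings.",
              VerificationLevel.VERIFIED, "support.google.com"),
]
annotated = annotate(article, checks)
```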
SEO & Metadata Optimization
Every article is optimized for both traditional search engines and AI systems:
SEO Elements
# Automated SEO optimization
- Meta title (50-60 characters)
- Meta description (150-160 characters)
- H1, H2, H3 hierarchy with target keywords
- Alt text for all images
- Internal linking to related content
- JSON-LD schema markup (Article, HowTo, FAQPage)
- OpenGraph and Twitter Card metadata
- Canonical URLs and redirects
Schema Markup Examples
{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "Complete Android 15 Privacy Guide",
  "description": "Step-by-step guide to configure privacy settings in Android 15",
  "image": "https://example.com/android-privacy-guide.jpg",
  "supply": [
    {
      "@type": "HowToSupply",
      "name": "Android 15 device"
    }
  ],
  "step": [
    {
      "@type": "HowToStep",
      "name": "Open Privacy Settings",
      "text": "Navigate to Settings > Privacy",
      "image": "https://example.com/step1.jpg"
    }
  ]
}
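Markup like this can be generated from an article's steps. The sketch below is a minimal builder, not the contents of `seo_enrich.py`; `build_howto_schema` and `to_script_tag` are hypothetical names, and image fields are left out for brevity.

```python
import json
from typing import Any, Dict, Iterable, Sequence, Tuple


def build_howto_schema(name: str, description: str,
                       steps: Iterable[Tuple[str, str]],
                       supplies: Sequence[str] = ()) -> Dict[str, Any]:
    """Build a schema.org HowTo object from (step name, step text) pairs."""
    schema: Dict[str, Any] = {
        "@context": "https://schema.org",
        "@type": "HowTo",
        "name": name,
        "description": description,
        "step": [
            {"@type": "HowToStep", "name": step_name, "text": step_text}
            for step_name, step_text in steps
        ],
    }
    if supplies:
        schema["supply"] = [
            {"@type": "HowToSupply", "name": supply} for supply in supplies
        ]
    return schema


def to_script_tag(schema: Dict[str, Any]) -> str:
    """Serialize the schema for embedding in an HTML page."""
    payload = json.dumps(schema, indent=2)
    # Escape "</" so text inside the JSON cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


schema = build_howto_schema(
    "Complete Android 15 Privacy Guide",
    "Step-by-step guide to configure privacy settings in Android 15",
    steps=[("Open Privacy Settings", "Navigate to Settings > Privacy")],
    supplies=["Android 15 device"],
)
tag = to_script_tag(schema)
```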
AI Optimization
Content is structured so that AI systems can extract and summarize it:
- Key Takeaways: Prominent summary section for AI extraction
- FAQ Format: Question-answer pairs for voice search
- Structured Data: Tables and lists for data extraction
- Citation Format: Clear source attribution for AI verification
Analytics & Feedback Loop
Content is continuously improved using Cloudflare analytics and performance data:
Metrics Collection
# Key performance indicators
- Page views and unique visitors
- Average time on page
- Bounce rate and exit rate
- Search engine impressions and clicks
- Social media shares and engagement
- Internal link click-through rates
- Conversion metrics (if applicable)
Performance Thresholds
# config/thresholds.yml
seo:
  min_lighthouse_score: 90
  min_word_count: 1200
  max_external_links: 20
feedback:
  min_pageviews: 100       # Monthly threshold
  min_time_on_page: 45.0   # Seconds
  max_bounce_rate: 70      # Percentage
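Applying the `feedback` thresholds to a page's metrics could look like the following sketch. The threshold values come from the configuration above; the metric key names (`pageviews`, `avg_time_on_page`, `bounce_rate`) and the `check_page` helper are assumptions for illustration.

```python
from typing import Dict, List, Mapping

# Values copied from the feedback section of config/thresholds.yml.
FEEDBACK_THRESHOLDS: Dict[str, float] = {
    "min_pageviews": 100,      # monthly
    "min_time_on_page": 45.0,  # seconds
    "max_bounce_rate": 70,     # percent
}


def check_page(metrics: Mapping[str, float],
               thresholds: Mapping[str, float] = FEEDBACK_THRESHOLDS) -> List[str]:
    """Return a description of every threshold the page fails."""
    failures: List[str] = []
    if metrics["pageviews"] < thresholds["min_pageviews"]:
        failures.append(
            f"pageviews {metrics['pageviews']} below {thresholds['min_pageviews']}"
        )
    if metrics["avg_time_on_page"] < thresholds["min_time_on_page"]:
        failures.append(
            f"time on page {metrics['avg_time_on_page']}s below "
            f"{thresholds['min_time_on_page']}s"
        )
    if metrics["bounce_rate"] > thresholds["max_bounce_rate"]:
        failures.append(
            f"bounce rate {metrics['bounce_rate']}% above "
            f"{thresholds['max_bounce_rate']}%"
        )
    return failures


page = {"pageviews": 80, "avg_time_on_page": 50.0, "bounce_rate": 75}
failures = check_page(page)
```

A page with no failures is a candidate for expansion, while repeated failures suggest an update, merge, or SEO pass.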
Automated Improvements
The system automatically identifies optimization opportunities:
- Expand: High-performing content gets additional sections
- Update: Outdated content flagged for refresh
- Merge: Similar low-performing content consolidated
- Optimize: SEO improvements for underperforming pages
Configuration Management
The system uses YAML configuration files for easy customization:
Sources Configuration
# config/sources.yml
feeds:
  - name: eff
    url: https://www.eff.org/rss/updates.xml
    weight: 1.0
    category: advocacy
  - name: mozilla
    url: https://blog.mozilla.org/en/feed/
    weight: 0.8
    category: browser
reddit:
  subreddits:
    - privacy
    - security
    - privacytoolsio
  min_score: 50
google_trends:
  categories:
    - "Internet & Telecom/Internet/Web Services"
    - "Computers & Electronics/Software/Internet Software"
Quality Thresholds
# config/thresholds.yml
content:
  min_word_count: 1200
  max_word_count: 5000
  flesch_reading_ease:
    min: 55
    max: 75
seo:
  min_lighthouse_score: 90
  max_external_links: 20
  min_internal_links: 2
ml:
  clustering:
    min_cluster_size: 3
    eps: 0.3
    similarity_threshold: 0.75
Deployment & Setup
Complete guide to deploying PHT Strategy 5 on GitHub Actions and Cloudflare Pages:
Prerequisites
- GitHub repository with Actions enabled
- Cloudflare account with Pages access
- OpenAI or Anthropic API key (or local Ollama setup)
- Python 3.11+ for local development
Environment Variables
# Required GitHub Secrets
OPENAI_API_KEY=sk-...
GITHUB_TOKEN=ghp_... # Needed for local runs; GitHub Actions provides GITHUB_TOKEN to workflows automatically
CLOUDFLARE_API_TOKEN=...
# Optional
ANTHROPIC_API_KEY=...
REDDIT_CLIENT_ID=...
REDDIT_CLIENT_SECRET=...
GOOGLE_TRENDS_API_KEY=...
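A script can fail fast when a required variable is missing. The sketch below is a hypothetical startup check, not part of the repository; it treats the three variables listed as required above as mandatory and accepts any mapping so it can be tested without touching the real environment.

```python
import os
from typing import List, Mapping, Sequence

# Names taken from the "Required GitHub Secrets" list above.
REQUIRED_VARS = ("OPENAI_API_KEY", "GITHUB_TOKEN", "CLOUDFLARE_API_TOKEN")


def missing_env_vars(env: Mapping[str, str] = os.environ,
                     required: Sequence[str] = REQUIRED_VARS) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the process environment.
example_env = {"OPENAI_API_KEY": "sk-test", "GITHUB_TOKEN": ""}
missing = missing_env_vars(example_env)
```

At startup, a non-empty result would typically be reported and the run aborted before any API call is made.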
Local Development Setup
# Clone and setup
git clone https://github.com/michaeljensen/pht-strategy5.git
cd pht-strategy5
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with your API keys
# Test the pipeline
python src/briefs/fetch_feeds.py
python src/briefs/cluster_topics.py
Development Guide
Guidelines for extending and customizing the system:
Code Structure
- Modular Design: Each component is independently testable
- Configuration-Driven: Behavior controlled via YAML files
- Type Hints: Full type annotation for better IDE support
- Error Handling: Graceful degradation and retry logic
# Example module structure
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import List

import yaml


@dataclass
class FeedItem:
    """Normalized representation of a topic candidate."""

    source: str
    title: str
    summary: str
    url: str
    published_at: datetime


def load_sources(config_path: Path) -> List[str]:
    """Load RSS feed URLs from the YAML configuration."""
    # An empty file makes safe_load return None, so fall back to an empty dict.
    with config_path.open("r", encoding="utf-8") as f:
        data = yaml.safe_load(f) or {}
    return [entry["url"] for entry in data.get("feeds", [])]
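The retry logic mentioned under Error Handling is not shown in this guide. One common shape for it is exponential backoff, sketched below. `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the delay can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
    raise ValueError("attempts must be at least 1")


# Demonstration: a call that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


recorded_delays = []
result = with_retries(flaky_fetch, attempts=3, base_delay=1.0,
                      sleep=recorded_delays.append)
```

Production code would usually narrow the caught exception types so that programming errors are not retried.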
Testing Strategy
- Unit Tests: Individual function testing with mocked APIs
- Integration Tests: End-to-end pipeline testing
- Performance Tests: Content quality and generation speed
- A/B Tests: Prompt optimization and output comparison
API Reference
Key functions and classes for system extension:
Core Classes
class FeedItem:
    """Represents a discovered topic candidate."""
    source: str                 # RSS feed or API source
    title: str                  # Original headline
    summary: str                # Brief description
    url: str                    # Source URL
    published_at: datetime

class Brief:
    """Structured brief for article generation."""
    slug: str                   # URL-friendly identifier
    pillar: str                 # Content category
    title: str                  # Proposed article title
    primary_keyword: str
    secondary_keywords: List[str]
    search_intent: str          # "howto", "explainer", "review"
    outline: List[Section]
    sources_hint: List[str]

class Article:
    """Generated article with metadata."""
    brief: Brief
    content: str                # Markdown content
    metadata: dict              # SEO and schema data
    fact_check_results: List[FactCheck]
Key Functions
# Topic Discovery
def fetch_sources(sources: List[str]) -> List[FeedItem]
def cluster_topics(items: List[FeedItem]) -> List[TopicCluster]
def analyze_search_gap(cluster: TopicCluster) -> SearchGapAnalysis
# Brief Generation
def generate_brief(cluster: TopicCluster) -> Brief
def create_github_issue(brief: Brief) -> Issue
# Article Generation
def expand_article(brief: Brief) -> Article
def fact_check_article(article: Article) -> List[FactCheck]
def optimize_seo(article: Article) -> Article
# Analytics
def ingest_cf_logs(log_path: str) -> List[AnalyticsEvent]
def score_content(article: Article, events: List[AnalyticsEvent]) -> ContentScore
def suggest_improvements(score: ContentScore) -> List[Improvement]