Our Methodology
How we turn 35+ sources into actionable intelligence
Automated Source Crawling
Every 24 hours at 00:00 UTC
A GitHub Actions workflow triggers our Python crawler, which fetches RSS feeds, API data, and public sources from 35+ global publishers:
- RSS Feeds: Standard XML parsing with feedparser library
- GitHub API: Real-time trending repos filtered by AI topics
- YouTube Data API: Latest videos from curated creators
- Fallback Strategy: If primary RSS fails, query Google News for that source
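The fallback strategy can be sketched as below. This is a minimal illustration, not the crawler's actual code: `fetch_with_fallback`, `fetch_rss`, and `fetch_news` are hypothetical names, and treating an empty feed as a failure is an assumption. In practice `fetch_rss` would wrap the feedparser call and `fetch_news` would wrap the Google News query.

```python
import logging
from typing import Callable

# A fetcher takes one string (feed URL or source name) and returns parsed items.
Fetcher = Callable[[str], list]


def fetch_with_fallback(
    source_name: str,
    feed_url: str,
    fetch_rss: Fetcher,
    fetch_news: Fetcher,
) -> list:
    """Try the primary RSS feed first; fall back to a news search for the source."""
    try:
        items = fetch_rss(feed_url)
        if items:  # assumption: an empty feed counts as a failed fetch
            return items
        logging.warning("Empty feed for %s, using fallback", source_name)
    except Exception as exc:  # network errors, malformed XML, blocked requests
        logging.warning("RSS failed for %s (%s), using fallback", source_name, exc)
    return fetch_news(source_name)
```

Passing the fetchers in as arguments keeps the fallback logic testable without network access.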
Rate Limits: We respect all API limits. GitHub: 5,000/hour. YouTube: 10,000 units/day. All within free tier quotas.
User-Agent Spoofing: We use Chrome headers to bypass CloudFlare/Akamai blocking, ensuring reliable 24/7 uptime.
Deduplication & Quality Filtering
Remove noise before AI enrichment
URL-Based Deduplication: Each link is checked against the existing database. If the URL already exists, the item is skipped.
Title Similarity Matching: Use SequenceMatcher (>85% threshold) to catch near-duplicates like:
- "OpenAI Releases GPT-5" vs "GPT-5 Released by OpenAI"
- "Google Announces Gemini 2.0" vs "Gemini 2.0 Unveiled"
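A minimal sketch of the two deduplication passes, using `difflib.SequenceMatcher` from the standard library. The lowercase, sorted-token normalization in `normalize_title` is an assumption added here so that reordered titles compare well; the field names `url` and `title` are also hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the >85% threshold described above


def normalize_title(title: str) -> str:
    """Lowercase and sort tokens so word order does not affect the score (assumption)."""
    return " ".join(sorted(title.lower().split()))


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio()


def deduplicate(items: list) -> list:
    """Keep the first occurrence of each URL and of each near-identical title."""
    seen_urls = set()
    seen_titles = []
    kept = []
    for item in items:
        if item["url"] in seen_urls:
            continue  # exact URL already stored
        if any(title_similarity(item["title"], t) > SIMILARITY_THRESHOLD for t in seen_titles):
            continue  # near-duplicate headline
        seen_urls.add(item["url"])
        seen_titles.append(item["title"])
        kept.append(item)
    return kept
```

Under this particular normalization the first example pair scores above 0.85. Pairs that differ by whole words, such as "Announces" versus "Unveiled", score lower, so the production pipeline may normalize differently.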
Content Filtering (Moderate): We filter entertainment/events but keep edge cases:
- ✅ Keep: "AI startup raises $50M", "OpenAI releases API", "ArXiv paper on reasoning"
- ❌ Filter: "Toy Story AI movie", "Chocolate AI marketing", "TechCrunch Disrupt registration"
Result: ~40% reduction in volume, focusing on signal vs noise.
AI-Powered Enrichment
Google Gemini analyzes each article
Multi-Model Fallback Chain:
- Gemini 2.5 Flash Lite (fastest, highest free quota) - Primary
- Gemini 2.5 Flash (balanced) - Fallback 1
- Gemini 2.0 Flash (proven stability) - Fallback 2
- Gemma 3 27B (open model, separate quota pool) - Fallback 3
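The chain above amounts to trying each model in order until one succeeds. The sketch below assumes a hypothetical `call_model(model, prompt)` function that wraps the real API client and raises on quota or server errors; the model identifier strings are also assumptions and should be checked against the provider's current model list.

```python
from typing import Callable, Sequence, Tuple

# Ordered from primary to last fallback. Identifier strings are assumed.
MODEL_CHAIN = (
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash",
    "gemini-2.0-flash",
    "gemma-3-27b-it",
)


def enrich_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    models: Sequence[str] = MODEL_CHAIN,
) -> Tuple[str, str]:
    """Return (model_used, response) from the first model that answers."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # quota exhausted, timeout, server error
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

Returning the model name alongside the response makes it easy to log how often each fallback is actually used.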
What AI Extracts:
- Summary: 10-15 word concise description (not copied from article)
- Persona: One of 8 categories (Developer, Enterprise, Research, Startup, Creator, Education, Healthcare, Policy & Law)
- Industries: Tags from fixed whitelist (Tech, Healthcare, Finance, etc.) - prevents hallucinations
- Companies: Mentioned organizations (e.g., OpenAI, Microsoft, Mayo Clinic)
- Job Titles: Relevant professional roles (e.g., CTO, ML Engineer, Data Scientist)
- Locations: Geographic regions (countries only, no cities)
- Importance Score: 1-10 scale where 9-10 = major launches, 5-6 = notable news, 1-2 = minor updates
- AI Relevance: Boolean flag - is this genuinely about AI/ML, or just keyword spam?
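One way to hold and check these extracted fields is a small dataclass with a validation step. This is a sketch under assumptions: the field names and the `validate` helper are hypothetical, and `INDUSTRY_WHITELIST` contains only the three industries named above because the full whitelist is not given here.

```python
from dataclasses import dataclass

PERSONAS = {
    "Developer", "Enterprise", "Research", "Startup",
    "Creator", "Education", "Healthcare", "Policy & Law",
}
# Only the industries named in the text; the real whitelist is longer.
INDUSTRY_WHITELIST = {"Tech", "Healthcare", "Finance"}


@dataclass
class Enrichment:
    summary: str
    persona: str
    industries: list
    companies: list
    job_titles: list
    locations: list
    importance: int
    ai_relevant: bool


def validate(raw: dict) -> Enrichment:
    """Reject out-of-range values and drop industry tags outside the whitelist."""
    if raw["persona"] not in PERSONAS:
        raise ValueError(f"unknown persona: {raw['persona']!r}")
    importance = int(raw["importance"])
    if not 1 <= importance <= 10:
        raise ValueError(f"importance out of range: {importance}")
    return Enrichment(
        summary=str(raw["summary"]).strip(),
        persona=raw["persona"],
        industries=[t for t in raw.get("industries", []) if t in INDUSTRY_WHITELIST],
        companies=list(raw.get("companies", [])),
        job_titles=list(raw.get("job_titles", [])),
        locations=list(raw.get("locations", [])),
        importance=importance,
        ai_relevant=bool(raw["ai_relevant"]),
    )
```

Filtering tags against a fixed set, rather than trusting model output, is what keeps invented industry labels out of the database.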
Batch Processing: 20 items per API call, with a 4-second pause between batches to stay under the 15 RPM limit across 3 API keys.
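The batching arithmetic is simple: one call every 4 seconds is at most 60 / 4 = 15 calls per minute. A minimal sketch follows, with a hypothetical `enrich_batch` callable standing in for the API call. Key rotation across the 3 API keys is not shown, and the `sleep` parameter exists only so the pause can be replaced in tests.

```python
import time
from typing import Callable, Iterator

BATCH_SIZE = 20      # items per API call
PAUSE_SECONDS = 4.0  # 60 s / 4 s = at most 15 calls per minute


def batched(items: list, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    items: list,
    enrich_batch: Callable[[list], list],
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Enrich items batch by batch, pausing between calls (not before the first)."""
    results = []
    for index, batch in enumerate(batched(items)):
        if index > 0:
            sleep(PAUSE_SECONDS)
        results.extend(enrich_batch(batch))
    return results
```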
Quality Validation: After AI enrichment, we canonicalize values:
- Companies: "MSFT" → "Microsoft", "OpenAI Inc" → "OpenAI"
- Locations: "SF" → "San Francisco, United States" → "United States" (country only)
- Job Titles: Fuzzy match to known titles database (grows over time)
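A sketch of the canonicalization step, using alias tables for companies and locations and `difflib.get_close_matches` for job titles. The tables hold only the examples given above; the function names, the lowercase lookup keys, and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches

# Lookup keys are lowercased; values are the canonical forms.
COMPANY_ALIASES = {"msft": "Microsoft", "openai inc": "OpenAI"}
LOCATION_TO_COUNTRY = {
    "sf": "United States",
    "san francisco, united states": "United States",
}
KNOWN_JOB_TITLES = ["CTO", "ML Engineer", "Data Scientist"]


def canonical_company(name: str) -> str:
    """Map a known alias to its canonical name; otherwise return the trimmed input."""
    cleaned = name.strip()
    return COMPANY_ALIASES.get(cleaned.lower(), cleaned)


def canonical_location(name: str) -> str:
    """Collapse a city or 'city, country' string to its country when known."""
    cleaned = name.strip()
    return LOCATION_TO_COUNTRY.get(cleaned.lower(), cleaned)


def canonical_job_title(title: str, cutoff: float = 0.8) -> str:
    """Snap a title to the closest known title if it is similar enough."""
    cleaned = title.strip()
    matches = get_close_matches(cleaned, KNOWN_JOB_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned
```

Titles with no close match are returned unchanged here. A database that grows over time would presumably add such titles to the known list after review.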
Metric Extraction & Trending Detection
Real-time engagement scores
Engagement Metrics (when available):
- GitHub: Star count via GitHub API
- YouTube: View count via YouTube Data API
- Others: Base score of 10 (neutral)
Trending Algorithm (Hacker News Formula):
gravity = (metrics + 10) / (age_hours + 2)^1.5
If gravity > 1.5 → Mark as TRENDING 🔥
Example:
- New repo (1 hour old) with 50 stars: gravity = 60 / 3^1.5 = 11.5 ✅ TRENDING
- Old repo (72 hours old) with 50 stars: gravity = 60 / 74^1.5 = 0.09 ❌ NOT TRENDING
Result: Fresh, high-engagement content bubbles to the top.
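The gravity formula and the two worked examples above translate directly into code. The function and constant names are illustrative.

```python
TRENDING_THRESHOLD = 1.5


def gravity(metrics: float, age_hours: float) -> float:
    """Engagement divided by age raised to 1.5, so older items decay quickly."""
    return (metrics + 10) / (age_hours + 2) ** 1.5


def is_trending(metrics: float, age_hours: float) -> bool:
    return gravity(metrics, age_hours) > TRENDING_THRESHOLD


print(round(gravity(50, 1), 1))   # 11.5 -> trending
print(round(gravity(50, 72), 2))  # 0.09 -> not trending
```

The `+ 2` in the denominator keeps brand-new items from dividing by a value near zero, and the 1.5 exponent controls how quickly a score fades with age.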
Database Publishing
Static JSON for instant loading
Hot Database: database.json contains items from the last 365 days (max 1M items)
Cold Archive: archive.json stores items older than 365 days for historical analysis
Raw Signals: raw_signals.json captures pre-AI data for debugging/compliance
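The hot/cold split can be sketched as a partition on publication date. The `published` field name and its ISO 8601 format with an explicit UTC offset are assumptions; writing the two lists to database.json and archive.json is left out.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=365)


def split_hot_cold(items: list, now: datetime) -> tuple:
    """Partition items into (hot, cold) by whether they fall inside the 365-day window."""
    cutoff = now - HOT_WINDOW
    hot, cold = [], []
    for item in items:
        # Assumed format: "2025-01-31T00:00:00+00:00" (timezone-aware ISO 8601)
        published = datetime.fromisoformat(item["published"])
        (hot if published >= cutoff else cold).append(item)
    return hot, cold
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the function deterministic and easy to test.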
File Size: Typical daily database: ~500KB compressed, loads in <1s on 3G.
Update Frequency: Every 24 hours at 00:00 UTC. Users see "Refreshed at: YYYY-MM-DD HH:MM UTC"
Frontend Filtering & Presentation
Client-side performance
Alpine.js reactive UI: All filtering happens in browser (no backend queries)
Multi-Dimensional Filters:
- By Persona: Developer, Enterprise, Research, Startup, etc. (OR logic)
- By Industry: Tech, Healthcare, Finance, etc. (OR logic)
- By Company: OpenAI, Anthropic, Google, etc. (OR logic)
- By Job Title: CTO, ML Engineer, Data Scientist, etc. (OR logic)
- By Location: Countries only (OR logic)
- By Timeframe: Last 24h / 7 days / 14 days / 30 days
- By Search: Full-text search across title, summary, source, and all tags
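The filter logic is shown here in Python for consistency with the crawler, although the real implementation runs in the browser. It is a sketch under assumptions: the field names are hypothetical, values within one filter are combined with OR as listed above, and separate filters are assumed to combine with AND. The timeframe filter is omitted.

```python
TAG_FIELDS = ("persona", "industries", "companies", "job_titles", "locations")
TEXT_FIELDS = ("title", "summary", "source")


def _as_list(value) -> list:
    """Treat a missing value as empty and a single string as a one-element list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def matches(item: dict, selected: dict, query: str = "") -> bool:
    """True if the item passes every active filter and the search query."""
    for field_name in TAG_FIELDS:
        wanted = selected.get(field_name)
        # OR within a filter: any overlap between selected and item values passes.
        if wanted and not set(wanted) & set(_as_list(item.get(field_name))):
            return False
    if query:
        parts = [item.get(name, "") for name in TEXT_FIELDS]
        for field_name in TAG_FIELDS:
            parts.extend(_as_list(item.get(field_name)))
        return query.lower() in " ".join(parts).lower()
    return True
```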
Performance: Filtering 1,000 items takes ~10ms. LocalStorage saves preferences.
Sort Order: Importance score (descending), then trending items first, then newest first
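This three-level sort order can be expressed as a single tuple key. The sketch is in Python for illustration, since the frontend itself is JavaScript, and the field names `importance`, `trending`, and `published_ts` (a Unix timestamp) are assumptions.

```python
def sort_key(item: dict) -> tuple:
    """Importance descending, then trending before non-trending, then newest first."""
    return (
        -item["importance"],    # higher score sorts earlier
        not item["trending"],   # False sorts before True, so trending comes first
        -item["published_ts"],  # larger (newer) timestamp sorts earlier
    )


def sort_items(items: list) -> list:
    return sorted(items, key=sort_key)
```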
🔍 Transparency & Open Source
✅ All code is open source on GitHub
✅ No editorial bias - AI categorizes content, not humans
✅ Always link to originals - We never copy full articles
✅ No data sales - We don't sell or license aggregated data
✅ Privacy-first analytics - Cookieless tracking, no PII collected
✅ Fair use summaries - AI-generated, transformative content