Our Methodology

How we turn 35+ sources into actionable intelligence

1

Automated Source Crawling

Every 24 hours at 00:00 UTC

A GitHub Actions workflow triggers our Python crawler, which fetches RSS feeds, API data, and public sources from 35+ global publishers:

  • RSS Feeds: Standard XML parsing with the feedparser library
  • GitHub API: Currently trending repositories, filtered by AI topics
  • YouTube Data API: Latest videos from curated creators
  • Fallback Strategy: If a source's primary RSS feed fails, we query Google News for that source
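
The fallback strategy above can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the `fetch` callable, the function names, and the `site:` query form of the Google News RSS search URL are all assumptions made for the example.

```python
from typing import Callable
from urllib.parse import urlencode

# Hypothetical type: a fetcher takes a URL and returns a list of entries,
# raising an exception when the request or the parse fails.
Fetcher = Callable[[str], list]


def google_news_url(publisher_domain: str) -> str:
    """Build a Google News RSS search URL limited to one publisher (assumed query form)."""
    return "https://news.google.com/rss/search?" + urlencode({"q": f"site:{publisher_domain}"})


def fetch_with_fallback(feed_url: str, publisher_domain: str, fetch: Fetcher) -> list:
    """Try the primary feed first; on an error or an empty result, use the fallback."""
    try:
        entries = fetch(feed_url)
        if entries:
            return entries
    except Exception:
        pass  # fall through to the fallback source
    return fetch(google_news_url(publisher_domain))
```

In a real crawler, `fetch` could be a thin wrapper that returns `feedparser.parse(url).entries`; passing it in as a parameter keeps the fallback logic testable without network access.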

Rate Limits: We respect all API limits. GitHub: 5,000 requests/hour. YouTube: 10,000 quota units/day. Both stay within free-tier quotas.

User-Agent Spoofing: We send Chrome-style request headers so that Cloudflare/Akamai bot protection does not block the crawler, which keeps the daily crawl reliable.

2

Deduplication & Quality Filtering

Remove noise before AI enrichment

URL-Based Deduplication: Each link is checked against the existing database. If the URL already exists, the item is skipped.
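
A minimal sketch of this check, assuming items are dictionaries with a `url` key and that known URLs are held in a set. The normalization rules (lowercased host, no fragment, no trailing slash) are an assumption added for illustration; the text above only says that existing URLs are skipped.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def drop_known_urls(items: list, known_urls: set) -> list:
    """Keep only items whose normalized URL is new, and record it as seen."""
    fresh = []
    for item in items:
        key = normalize_url(item["url"])
        if key in known_urls:
            continue  # already in the database: skip
        known_urls.add(key)
        fresh.append(item)
    return fresh
```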

Title Similarity Matching: We compare titles with SequenceMatcher and treat a similarity ratio above 0.85 as a near-duplicate, to catch pairs like:

  • "OpenAI Releases GPT-5" vs "GPT-5 Released by OpenAI"
  • "Google Announces Gemini 2.0" vs "Gemini 2.0 Unveiled"
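
A hedged sketch of title matching with Python's `difflib.SequenceMatcher`. The lowercasing and token sorting are assumptions added here, not something the text above confirms: a raw character-level comparison of reordered titles scores well below 0.85, so some word-order normalization is needed for a pair like the first example to pass the threshold. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str, sort_tokens: bool = True) -> float:
    """Return a similarity ratio between 0 and 1 for two titles."""
    def prepare(title: str) -> str:
        tokens = title.lower().split()
        if sort_tokens:
            tokens = sorted(tokens)  # assumption: ignore word order
        return " ".join(tokens)

    return SequenceMatcher(None, prepare(a), prepare(b)).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag two titles as near-duplicates when their ratio exceeds the threshold."""
    return title_similarity(a, b) > threshold
```

Token sorting helps with reordered wording but not with paraphrases that swap in different words, so a production filter may need further normalization.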

Content Filtering (Moderate): We filter out entertainment and event content but keep borderline items that still carry AI news:

  • Keep: "AI startup raises $50M", "OpenAI releases API", "ArXiv paper on reasoning"
  • Filter: "Toy Story AI movie", "Chocolate AI marketing", "TechCrunch Disrupt registration"

Result: ~40% reduction in volume, so later stages process more signal and less noise.

3

AI-Powered Enrichment

Google Gemini analyzes each article

Multi-Model Fallback Chain:

  1. Gemini 2.5 Flash Lite (fastest, highest free quota) - Primary
  2. Gemini 2.5 Flash (balanced) - Fallback 1
  3. Gemini 2.0 Flash (proven stability) - Fallback 2
  4. Gemma 3 27B (open model, separate quota pool) - Fallback 3
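
The chain above amounts to trying each model in order until one succeeds. The sketch below assumes a caller-supplied `call_model(model, prompt)` function that raises on quota or API errors; the model identifier strings are guesses derived from the names listed above and are not confirmed by this page.

```python
from typing import Callable, Sequence

# Assumed identifiers for the four models named in the chain.
MODEL_CHAIN = (
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash",
    "gemini-2.0-flash",
    "gemma-3-27b-it",
)


def enrich_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    models: Sequence[str] = MODEL_CHAIN,
) -> tuple:
    """Try each model in order; return (model_name, response) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # e.g. quota exhausted or a transient API error
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```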

What AI Extracts:

  • Summary: 10-15 word concise description (not copied from article)
  • Persona: One of 8 categories (Developer, Enterprise, Research, Startup, Creator, Education, Healthcare, Policy & Law)
  • Industries: Tags from a fixed whitelist (Tech, Healthcare, Finance, etc.), which prevents hallucinated tags
  • Companies: Mentioned organizations (e.g., OpenAI, Microsoft, Mayo Clinic)
  • Job Titles: Relevant professional roles (e.g., CTO, ML Engineer, Data Scientist)
  • Locations: Geographic regions (countries only, no cities)
  • Importance Score: 1-10 scale where 9-10 = major launches, 5-6 = notable news, 1-2 = minor updates
  • AI Relevance: Boolean flag - is this genuinely about AI/ML, or just keyword spam?
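
One way to picture the extracted fields is as a small validated record. The field names, the `validate` helper, and the three-entry industry whitelist below are illustrative assumptions; the full whitelist is not shown on this page.

```python
from dataclasses import dataclass, field

PERSONAS = {"Developer", "Enterprise", "Research", "Startup",
            "Creator", "Education", "Healthcare", "Policy & Law"}
# Only three industries are named in the text; the real whitelist is longer.
INDUSTRY_WHITELIST = {"Tech", "Healthcare", "Finance"}


@dataclass
class Enrichment:
    summary: str
    persona: str
    industries: list = field(default_factory=list)
    companies: list = field(default_factory=list)
    job_titles: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    importance: int = 5
    ai_relevant: bool = False


def validate(raw: dict) -> Enrichment:
    """Coerce a raw model response into an Enrichment, dropping out-of-whitelist tags."""
    persona = raw.get("persona")
    if persona not in PERSONAS:
        raise ValueError(f"unknown persona: {persona!r}")
    importance = max(1, min(10, int(raw.get("importance", 5))))  # clamp to 1..10
    industries = [tag for tag in raw.get("industries", []) if tag in INDUSTRY_WHITELIST]
    return Enrichment(
        summary=str(raw.get("summary", "")).strip(),
        persona=persona,
        industries=industries,
        companies=list(raw.get("companies", [])),
        job_titles=list(raw.get("job_titles", [])),
        locations=list(raw.get("locations", [])),
        importance=importance,
        ai_relevant=bool(raw.get("ai_relevant", False)),
    )
```

Filtering against the whitelist after the model responds means an invented tag is dropped rather than stored.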

Batch Processing: We send 20 items per API call and pause 4 seconds between batches, which keeps us under the 15 RPM limit across 3 API keys.
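
A minimal sketch of this batching loop. The `handle_batch` callable stands in for the real API call and is hypothetical, as is the injectable `sleep` parameter, which exists only so the pause can be tested without waiting. A 4-second pause allows at most 15 calls per minute.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(items: Sequence, size: int = 20) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    items: Sequence,
    handle_batch: Callable[[Sequence], list],
    size: int = 20,
    pause_seconds: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Run handle_batch on each batch, pausing between batches but not before the first."""
    results = []
    for index, batch in enumerate(batches(items, size)):
        if index > 0:
            sleep(pause_seconds)
        results.extend(handle_batch(batch))
    return results
```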

Quality Validation: After AI enrichment, we canonicalize values:

  • Companies: "MSFT" → "Microsoft", "OpenAI Inc" → "OpenAI"
  • Locations: "SF" → "San Francisco, United States" → "United States" (country only)
  • Job Titles: Fuzzy match to known titles database (grows over time)
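
The canonicalization rules above can be sketched with lookup tables plus `difflib.get_close_matches` for the fuzzy step. The table contents come from the examples in the list; the function names, the 0.8 cutoff, and the short list of known titles are assumptions made for illustration.

```python
from difflib import get_close_matches

# Alias tables seeded with the examples from the text; real tables would be larger.
COMPANY_ALIASES = {"msft": "Microsoft", "openai inc": "OpenAI"}
LOCATION_TO_COUNTRY = {
    "sf": "United States",
    "san francisco, united states": "United States",
}
KNOWN_JOB_TITLES = ["CTO", "ML Engineer", "Data Scientist"]


def canonical_company(name: str) -> str:
    """Map a known alias to its canonical company name; otherwise keep the input."""
    cleaned = name.strip()
    return COMPANY_ALIASES.get(cleaned.lower(), cleaned)


def canonical_country(place: str) -> str:
    """Reduce a known place name to its country; otherwise keep the input."""
    cleaned = place.strip()
    return LOCATION_TO_COUNTRY.get(cleaned.lower(), cleaned)


def canonical_job_title(title: str, cutoff: float = 0.8) -> str:
    """Snap a title to the closest known title, or keep it when nothing is close enough."""
    cleaned = title.strip()
    matches = get_close_matches(cleaned, KNOWN_JOB_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned
```

Titles that match nothing are returned unchanged, so they could be appended to the known list, which fits a database that grows over time.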

4

Metric Extraction & Trending Detection

Engagement scores captured at crawl time

Engagement Metrics (when available):

  • GitHub: Star count via GitHub API
  • YouTube: View count via YouTube Data API
  • Others: Base score of 10 (neutral)

Trending Algorithm (Hacker News-style formula):

gravity = (metrics + 10) / (age_hours + 2)^1.5

If gravity > 1.5 → Mark as TRENDING 🔥

Example:

  • New repo (1 hour old) with 50 stars: gravity = 60 / 3^1.5 = 11.5 ✅ TRENDING
  • Old repo (72 hours old) with 50 stars: gravity = 60 / 74^1.5 = 0.09 ❌ NOT TRENDING

Result: Fresh, high-engagement content bubbles to the top.
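
The formula and threshold above translate directly into a few lines of Python. The function and parameter names are hypothetical; the constants (+10, +2, exponent 1.5, threshold 1.5) are the ones stated in this section.

```python
def gravity(metric: float, age_hours: float) -> float:
    """Engagement score decayed by age: (metric + 10) / (age_hours + 2) ** 1.5."""
    return (metric + 10) / (age_hours + 2) ** 1.5


def is_trending(metric: float, age_hours: float, threshold: float = 1.5) -> bool:
    """An item is marked trending when its gravity exceeds the threshold."""
    return gravity(metric, age_hours) > threshold
```

For the two repos in the example, `is_trending(50, 1)` is true and `is_trending(50, 72)` is false.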

5

Database Publishing

Static JSON for instant loading

Hot Database: database.json contains items from the last 365 days (max 1M items)

Cold Archive: archive.json stores items older than 365 days for historical analysis

Raw Signals: raw_signals.json captures pre-AI data for debugging/compliance
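
A sketch of how items might be split between the hot database and the cold archive, assuming each item carries a timezone-aware ISO 8601 `published` string. That field name and the function name are hypothetical; only the 365-day boundary comes from the text above.

```python
from datetime import datetime, timedelta, timezone


def split_hot_cold(items: list, now: datetime, hot_days: int = 365) -> tuple:
    """Partition items into (hot, cold) around a cutoff of `hot_days` before `now`."""
    cutoff = now - timedelta(days=hot_days)
    hot, cold = [], []
    for item in items:
        published = datetime.fromisoformat(item["published"])
        (hot if published >= cutoff else cold).append(item)
    return hot, cold
```

The two lists would then be written to database.json and archive.json respectively, for example with `json.dump`.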

File Size: A typical daily database build is ~500KB compressed and loads in under 1s on 3G.

Update Frequency: Every 24 hours at 00:00 UTC. Users see a "Refreshed at: YYYY-MM-DD HH:MM UTC" timestamp.

6

Frontend Filtering & Presentation

Client-side performance

Alpine.js reactive UI: All filtering happens in the browser (no backend queries)

Multi-Dimensional Filters:

  • By Persona: Developer, Enterprise, Research, Startup, etc. (OR logic)
  • By Industry: Tech, Healthcare, Finance, etc. (OR logic)
  • By Company: OpenAI, Anthropic, Google, etc. (OR logic)
  • By Job Title: CTO, ML Engineer, Data Scientist, etc. (OR logic)
  • By Location: Countries only (OR logic)
  • By Timeframe: Last 24h / 7 days / 14 days / 30 days
  • By Search: Full-text search across title, summary, source, and all tags
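
The tag filters above can be illustrated with a short sketch. The real filtering runs in Alpine.js in the browser; Python is used here only to show the logic. OR matching within a filter is stated above, while combining different filters with AND is an assumption, as are the dictionary keys.

```python
def matches(item: dict, selected: dict) -> bool:
    """Return True when the item satisfies every active filter dimension.

    Within a dimension, any one selected value is enough (OR logic).
    Across dimensions, all active filters must pass (assumed AND logic).
    """
    for dimension, wanted in selected.items():
        if not wanted:
            continue  # an empty selection places no constraint
        values = item.get(dimension, [])
        if isinstance(values, str):
            values = [values]  # single-valued fields such as persona
        if not set(values) & set(wanted):
            return False
    return True
```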

Performance: Filtering 1,000 items takes ~10ms. Filter preferences are saved in localStorage.

Sort Order: Importance score DESC → Trending first → Newest first
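
This three-level ordering maps onto a single composite sort key. The sketch below is in Python for illustration only (the frontend itself is Alpine.js) and assumes each item has a numeric `importance`, a boolean `trending`, and a Unix `timestamp`; those field names are hypothetical.

```python
def sort_items(items: list) -> list:
    """Order by importance (high to low), then trending first, then newest first."""
    return sorted(
        items,
        key=lambda item: (-item["importance"], not item["trending"], -item["timestamp"]),
    )
```

Because `False` sorts before `True`, `not item["trending"]` places trending items first among items with equal importance.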

🔍 Transparency & Open Source

All code is open source on GitHub

No editorial bias - AI categorizes content, not humans

Always link to originals - We never copy full articles

No data sales - We don't sell or license aggregated data

Privacy-first analytics - Cookieless tracking, no PII collected

Fair use summaries - AI-generated, transformative content