How does AI Sentia curate AI news?

AI Sentia uses automated crawling via GitHub Actions to fetch RSS feeds and API data from 35+ sources every 24 hours. Each article is analyzed by Google Gemini AI to generate summaries, assign personas, and extract metadata like industries, companies, and locations.

How accurate are AI-generated summaries?

Summaries are generated by Google Gemini models through a multi-model fallback chain, with Gemini 2.5 Flash Lite as the primary model. AI enrichment completes successfully for roughly 95% of articles. All summaries are original text written by the model, not copied from source articles.

What makes AI Sentia different from other news aggregators?

AI Sentia uses LLM-powered analysis to categorize content by persona (such as Developer, Research, and Enterprise) and to extract structured metadata. We monitor 35+ specialized AI sources and detect trending content using a Hacker News-style algorithm.

How often is the database updated?

The database updates automatically every 24 hours at midnight UTC via GitHub Actions. New articles are crawled, deduplicated, enriched with AI summaries, and published within 5-10 minutes.

Our Methodology

How we turn 35+ sources into actionable intelligence

Automated Source Crawling

Every 24 hours at 00:00 UTC

GitHub Actions workflow triggers our Python crawler to fetch RSS feeds, API data, and public sources from 35+ global publishers:

RSS Feeds: Standard XML parsing with feedparser library
GitHub API: Real-time trending repos filtered by AI topics
YouTube Data API: Latest videos from curated creators
Fallback Strategy: If primary RSS fails, query Google News for that source
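
The fallback strategy can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's crawler: `fetch_primary` and `fetch_fallback` are hypothetical callables standing in for the RSS request and the Google News query, and the feed is parsed with the standard library's `xml.etree.ElementTree` instead of feedparser so the sketch has no dependencies.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def parse_rss(xml_text: str) -> list[dict]:
    """Extract title/link pairs from the <item> elements of an RSS document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


def fetch_with_fallback(fetch_primary: Callable[[], str],
                        fetch_fallback: Callable[[], str]) -> list[dict]:
    """Try the primary feed first; on failure or an empty feed, try the fallback."""
    for fetch in (fetch_primary, fetch_fallback):
        try:
            items = parse_rss(fetch())
        except Exception:
            continue  # network error or malformed XML: move on to the next source
        if items:
            return items
    return []  # both sources failed or were empty
```

Passing the fetchers in as callables keeps the fallback logic testable without network access.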

Rate Limits: We respect all API limits. GitHub: 5,000/hour. YouTube: 10,000 units/day. All within free tier quotas.

User-Agent Spoofing: We send Chrome-style request headers to avoid being blocked by Cloudflare/Akamai bot protection, which keeps the daily crawl reliable.

Deduplication & Quality Filtering

Remove noise before AI enrichment

URL-Based Deduplication: Each link is checked against the existing database. If the URL already exists, the item is skipped.

Title Similarity Matching: We use Python's difflib SequenceMatcher (similarity threshold >85%) to catch near-duplicates like:

"OpenAI Releases GPT-5" vs "GPT-5 Released by OpenAI"
"Google Announces Gemini 2.0" vs "Gemini 2.0 Unveiled"

Content Filtering (Moderate): We filter entertainment/events but keep edge cases:

✅ Keep: "AI startup raises $50M", "OpenAI releases API", "ArXiv paper on reasoning"
❌ Filter: "Toy Story AI movie", "Chocolate AI marketing", "TechCrunch Disrupt registration"

Result: ~40% reduction in volume, focusing on signal vs noise.
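
The two deduplication checks in this section (exact URL match, then title similarity) can be sketched as below. This is an illustration only: the `url` and `title` field names and the lowercasing step are assumptions, and the 0.85 constant mirrors the ">85%" threshold stated above.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.85  # the ">85%" threshold from the text


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; lowercasing first is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(item: dict, existing: list[dict]) -> bool:
    """True if the URL is already stored or a stored title is a near-match."""
    seen_urls = {entry["url"] for entry in existing}
    if item["url"] in seen_urls:
        return True
    return any(
        title_similarity(item["title"], entry["title"]) > SIMILARITY_THRESHOLD
        for entry in existing
    )
```

A plain character-level ratio is sensitive to word order, so this sketch assumes any title normalization (for example, sorting tokens) happens before comparison if heavily reordered headlines are to be caught.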

AI-Powered Enrichment

Google Gemini analyzes each article

Multi-Model Fallback Chain:

Gemini 2.5 Flash Lite (fastest, highest free quota) - Primary
Gemini 2.5 Flash (balanced) - Fallback 1
Gemini 2.0 Flash (proven stability) - Fallback 2
Gemma 3 27B (open model, separate quota pool) - Fallback 3
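
A fallback chain like the one listed above might be structured as follows. This is a sketch, not the production code: `call_model` is a hypothetical callable wrapping whatever client is used, and the model strings are illustrative labels rather than confirmed API identifiers.

```python
from typing import Callable, Sequence

# Illustrative labels in the order given above; real API model IDs may differ.
MODEL_CHAIN = (
    "gemini-2.5-flash-lite",
    "gemini-2.5-flash",
    "gemini-2.0-flash",
    "gemma-3-27b",
)


def enrich_with_fallback(prompt: str,
                         call_model: Callable[[str, str], str],
                         models: Sequence[str] = MODEL_CHAIN) -> tuple[str, str]:
    """Return (model_used, response) from the first model that succeeds."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # e.g. quota exhausted or a transient API error
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the response makes it possible to record which tier produced each summary.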

What AI Extracts:

Summary: 10-15 word concise description (not copied from article)
Persona: One of 8 categories (Developer, Enterprise, Research, Startup, Creator, Education, Healthcare, Policy & Law)
Industries: Tags from fixed whitelist (Tech, Healthcare, Finance, etc.) - prevents hallucinations
Companies: Mentioned organizations (e.g., OpenAI, Microsoft, Mayo Clinic)
Job Titles: Relevant professional roles (e.g., CTO, ML Engineer, Data Scientist)
Locations: Geographic regions (countries only, no cities)
Importance Score: 1-10 scale where 9-10 = major launches, 5-6 = notable news, 1-2 = minor updates
AI Relevance: Boolean flag - is this genuinely about AI/ML, or just keyword spam?
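
To illustrate how a fixed whitelist guards against hallucinated values, here is a minimal validation sketch for a subset of the fields above. The field names, the `Enrichment` class, and the three-entry industry whitelist are assumptions made for the example; the actual schema and full whitelist are not given here.

```python
from dataclasses import dataclass, field

PERSONAS = {
    "Developer", "Enterprise", "Research", "Startup",
    "Creator", "Education", "Healthcare", "Policy & Law",
}
# Illustrative subset only; the full whitelist is not listed in the text.
INDUSTRY_WHITELIST = {"Tech", "Healthcare", "Finance"}


@dataclass
class Enrichment:
    summary: str
    persona: str
    importance: int
    ai_relevant: bool
    industries: list[str] = field(default_factory=list)


def validate(raw: dict) -> Enrichment:
    """Reject out-of-range values and drop industry tags outside the whitelist."""
    if raw["persona"] not in PERSONAS:
        raise ValueError(f"unknown persona: {raw['persona']!r}")
    importance = int(raw["importance"])
    if not 1 <= importance <= 10:
        raise ValueError(f"importance out of range: {importance}")
    industries = [tag for tag in raw.get("industries", []) if tag in INDUSTRY_WHITELIST]
    return Enrichment(
        summary=raw["summary"].strip(),
        persona=raw["persona"],
        importance=importance,
        ai_relevant=bool(raw["ai_relevant"]),
        industries=industries,
    )
```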

Batch Processing: 20 items per API call, with a 4-second pause between batches to stay under the 15 RPM limit across 3 API keys.
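
The batching scheme can be sketched as follows. A 4-second pause caps the call rate at 60 / 4 = 15 calls per minute. `enrich_batch` is a hypothetical callable standing in for one API call, and key rotation across the 3 API keys is left out of this sketch.

```python
import time
from typing import Callable, Iterator, Sequence

BATCH_SIZE = 20      # items per API call, as stated above
PAUSE_SECONDS = 4.0  # pause between batches, as stated above


def batched(items: Sequence, size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(items: Sequence,
                       enrich_batch: Callable[[Sequence], list],
                       sleep: Callable[[float], None] = time.sleep) -> list:
    """Send items in batches, pausing between (not before) batches."""
    results = []
    for index, batch in enumerate(batched(items)):
        if index > 0:
            sleep(PAUSE_SECONDS)
        results.extend(enrich_batch(batch))
    return results
```

The `sleep` parameter is injectable so the pacing can be verified without actually waiting.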

Quality Validation: After AI enrichment, we canonicalize values:

Companies: "MSFT" → "Microsoft", "OpenAI Inc" → "OpenAI"
Locations: "SF" → "San Francisco, United States" → "United States" (country only)
Job Titles: Fuzzy match to known titles database (grows over time)
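
Canonicalization of this kind can be sketched with alias tables plus a fuzzy match from the standard library. The tables below contain only the examples from the text, and the function names and the 0.8 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# Alias tables seeded with the examples from the text; real tables would be larger.
COMPANY_ALIASES = {"msft": "Microsoft", "openai inc": "OpenAI"}
LOCATION_TO_COUNTRY = {
    "sf": "United States",
    "san francisco, united states": "United States",
}
KNOWN_TITLES = ["CTO", "ML Engineer", "Data Scientist"]


def canonical_company(name: str) -> str:
    """Map a known alias to its canonical name; otherwise keep the input."""
    cleaned = name.strip()
    return COMPANY_ALIASES.get(cleaned.lower(), cleaned)


def canonical_location(name: str) -> str:
    """Reduce a known place name to its country; otherwise keep the input."""
    cleaned = name.strip()
    return LOCATION_TO_COUNTRY.get(cleaned.lower(), cleaned)


def canonical_job_title(title: str, cutoff: float = 0.8) -> str:
    """Snap a title to the closest known title if it is similar enough."""
    cleaned = title.strip()
    matches = get_close_matches(cleaned, KNOWN_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned
```

Titles with no close match are returned unchanged, which is one way a known-titles list could grow over time.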

Metric Extraction & Trending Detection

Real-time engagement scores

Engagement Metrics (when available):

GitHub: Star count via GitHub API
YouTube: View count via YouTube Data API
Others: Base score of 10 (neutral)

Trending Algorithm (Hacker News-style formula):

gravity = (metrics + 10) / (age_hours + 2)^1.5

If gravity > 1.5 → Mark as TRENDING 🔥

Example:

New repo (1 hour old) with 50 stars: gravity = 60 / 3^1.5 = 11.5 ✅ TRENDING
Old repo (72 hours old) with 50 stars: gravity = 60 / 74^1.5 = 0.09 ❌ NOT TRENDING

Result: Fresh, high-engagement content bubbles to the top.
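
The formula and the two worked examples above can be reproduced with a few lines of Python. The function and constant names are chosen for this sketch.

```python
TRENDING_THRESHOLD = 1.5  # threshold from the text


def gravity(metrics: float, age_hours: float) -> float:
    """gravity = (metrics + 10) / (age_hours + 2)^1.5"""
    return (metrics + 10) / (age_hours + 2) ** 1.5


def is_trending(metrics: float, age_hours: float) -> bool:
    return gravity(metrics, age_hours) > TRENDING_THRESHOLD


print(round(gravity(50, 1), 1), is_trending(50, 1))    # 11.5 True
print(round(gravity(50, 72), 2), is_trending(50, 72))  # 0.09 False
```

Because the denominator grows with age raised to the power 1.5, an item needs steadily more engagement to stay above the threshold as it gets older.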

Database Publishing

Static JSON for instant loading

Hot Database: database.json contains items from the last 365 days (max 1M items)

Cold Archive: archive.json stores items older than 365 days for historical analysis

Raw Signals: raw_signals.json captures pre-AI data for debugging/compliance
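
The hot/cold split can be sketched as a partition on publication date. The `published` field name and its ISO 8601 format are assumptions; the 365-day window comes from the text.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=365)


def partition_items(items: list[dict], now: datetime) -> tuple[list[dict], list[dict]]:
    """Split items into (hot, cold) by whether they fall inside the 365-day window."""
    cutoff = now - HOT_WINDOW
    hot, cold = [], []
    for item in items:
        # Assumes timezone-aware ISO 8601 timestamps, e.g. "2025-05-01T00:00:00+00:00".
        published = datetime.fromisoformat(item["published"])
        (hot if published >= cutoff else cold).append(item)
    return hot, cold
```

In this sketch the hot list would be written to database.json and the cold list appended to archive.json.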

File Size: Typical daily database: ~500KB compressed, loads in <1s on 3G.

Update Frequency: Every 24 hours at 00:00 UTC. Users see "Refreshed at: YYYY-MM-DD HH:MM UTC"

Frontend Filtering & Presentation

Client-side performance

Alpine.js reactive UI: All filtering happens in the browser (no backend queries)

Multi-Dimensional Filters:

By Persona: Developer, Enterprise, Research, Startup, etc. (OR logic)
By Industry: Tech, Healthcare, Finance, etc. (OR logic)
By Company: OpenAI, Anthropic, Google, etc. (OR logic)
By Job Title: CTO, ML Engineer, Data Scientist, etc. (OR logic)
By Location: Countries only (OR logic)
By Timeframe: Last 24h / 7 days / 14 days / 30 days
By Search: Full-text search across title, summary, source, and all tags
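
The filter semantics can be sketched in a few lines. The production filter runs in the browser with Alpine.js; Python is used here only to illustrate the logic. The field names are assumptions, as is combining different filter dimensions with AND (the text only specifies OR within a dimension). The timeframe filter is omitted.

```python
TAG_FIELDS = ("persona", "industries", "companies", "job_titles", "locations")


def _as_set(value) -> set:
    """Normalize a missing value, a single string, or a list to a set."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def matches(item: dict, filters: dict, query: str = "") -> bool:
    """OR within each filter dimension; AND across dimensions (assumed)."""
    for field_name, selected in filters.items():
        if selected and not (_as_set(item.get(field_name)) & set(selected)):
            return False
    if query:
        # Full-text search across title, summary, source, and all tags.
        parts = [item.get("title", ""), item.get("summary", ""), item.get("source", "")]
        for field_name in TAG_FIELDS:
            parts.extend(_as_set(item.get(field_name)))
        if query.lower() not in " ".join(parts).lower():
            return False
    return True
```

An empty selection for a dimension is treated as "no filter", so the item passes that dimension.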

Performance: Filtering 1,000 items takes ~10ms. LocalStorage saves preferences.

Sort Order: Importance score DESC → Trending first → Newest first
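
This three-level sort order maps onto a single tuple key. The sketch below assumes hypothetical field names (`importance`, `trending`, `published`) and that timestamps are ISO 8601 strings in one consistent format, so that string order equals chronological order.

```python
def sort_items(items: list[dict]) -> list[dict]:
    """Importance descending, then trending items first, then newest first."""
    return sorted(
        items,
        key=lambda item: (item["importance"], item["trending"], item["published"]),
        reverse=True,  # True sorts above False, so trending items come first
    )
```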

🔍 Transparency & Open Source

✅ All code is open source on GitHub

✅ No editorial bias - AI categorizes content, not humans

✅ Always link to originals - We never copy full articles

✅ No data sales - We don't sell or license aggregated data

✅ Privacy-first analytics - Cookieless tracking, no PII collected

✅ Fair use summaries - AI-generated, transformative content
