How I Built a Custom SEO Automation Stack to Track 500+ Keywords (And What Broke Along the Way)


Most SEO professionals rely on third-party rank tracking tools — Ahrefs, Semrush, SERPWatcher. They’re fine. But if you’re running an agency or building your own SEO platform, there’s a point where you start wondering: what if I just owned this data myself?

That’s exactly what I’ve been building. Today I want to share what a custom SEO automation and keyword rank tracking system actually looks like in production — the data model, the API integrations, the bugs that cost real money to fix, and the lessons that came out of it.

Why Build Your Own Rank Tracking Software?

The honest answer: control and cost at scale. When you’re tracking 500+ keywords across multiple client sites, the math on SaaS rank tracking tools gets ugly fast. More importantly, when the data lives in your own database, you can do things no off-the-shelf tool lets you do — cross-reference keyword difficulty against actual conversion data, build custom dashboards, trigger automations when rankings change.

For Master Control Press, I needed granular SEO rank tracking for 501 keywords: not just position, but search volume, keyword difficulty, CPC data, monthly search trends, competition scores, and top-of-page bid ranges. That’s a lot of data to manage. Here’s how I structured it.

The Data Architecture

The core schema lives in PostgreSQL (via Supabase) with two main tables:

tracked_keywords
├── id, domain, keyword
├── search_volume_monthly (integer)
├── keyword_difficulty (0-100)
├── competition, competition_index
├── cpc (decimal)
├── low_top_of_page_bid, high_top_of_page_bid
├── monthly_searches (JSONB — 12-month trend array)
└── current_rank, rank_checked_at

domain_metrics
├── id, domain, metric_type
├── data (JSONB — flexible storage for traffic estimates, competitors)
└── fetched_at

The JSONB column on monthly_searches is key. Google Ads API (via DataForSEO) returns a 12-month trend array for each keyword — storing that as structured JSON means I can query trends without denormalizing into 12 separate columns per keyword.
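
As a sketch of what that buys you: once the 12-month array is in one column, trend logic is a small pure function. The helper below is illustrative (not from the production codebase) and assumes DataForSEO-style entries shaped like `{ year, month, search_volume }`; it compares the latest three months against the prior three for a rough momentum signal.

```javascript
// Hypothetical helper: classify a keyword's trend from its
// monthly_searches JSONB array (entries like { year, month, search_volume }).
function trendDirection(monthlySearches) {
  // Sort oldest → newest so slicing is predictable
  const sorted = [...monthlySearches].sort(
    (a, b) => a.year - b.year || a.month - b.month
  );
  if (sorted.length < 6) return 'insufficient-data';

  const avg = (rows) =>
    rows.reduce((sum, r) => sum + r.search_volume, 0) / rows.length;

  const recent = avg(sorted.slice(-3)); // latest 3 months
  const prior = avg(sorted.slice(-6, -3)); // the 3 months before those

  if (recent > prior * 1.1) return 'rising';
  if (recent < prior * 0.9) return 'falling';
  return 'flat';
}
```

The 10% band is an arbitrary threshold; tune it to taste.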

The domain_metrics table uses a flexible metric_type + data JSONB pattern. Traffic estimates, competitor lists, domain authority signals — they all have different shapes, and this avoids a migration every time I add a new metric type.
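
The payoff of that pattern is that reading code never needs to know each metric's shape. A minimal sketch (names are illustrative, not the production helpers): given rows shaped like the domain_metrics table, pull the most recently fetched payload for a given metric type.

```javascript
// Illustrative helper over domain_metrics-shaped rows:
// { domain, metric_type, data, fetched_at }.
// Returns the freshest row for the requested metric type, or null.
function latestMetric(rows, domain, metricType) {
  return (
    rows
      .filter((r) => r.domain === domain && r.metric_type === metricType)
      // newest fetched_at first
      .sort((a, b) => new Date(b.fetched_at) - new Date(a.fetched_at))[0] ?? null
  );
}
```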

Populating the Data: DataForSEO API

I’m using the DataForSEO API for all keyword intelligence. It’s the most comprehensive programmatic SEO data source I’ve found — it covers Google Ads search volume, keyword difficulty, SERP results, backlinks, and more through a single pay-per-call REST API.

For 501 keywords, here’s what the data pull looked like:

  • Search volume + monthly trends (501 keywords): $0.075
  • Keyword difficulty scores (498 keywords): $0.060
  • Traffic estimate for the domain: $0.010
  • Competitor discovery: $0.012
  • SERP rank checks (538 keywords × live results): ~$7.53

Total cost to populate a complete keyword database for one site: under $8. That's a one-time cost. Subsequent rank checks are what add up over time, but even at $0.014 per keyword per SERP check, weekly tracking for 500 keywords runs about $7/week.
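
For budgeting, the ongoing cost is simple arithmetic on the per-call price quoted above. A back-of-envelope model (the $0.014 figure comes from this post; check DataForSEO's current pricing before relying on it):

```javascript
// Back-of-envelope SERP tracking cost model.
const COST_PER_SERP_CHECK = 0.014; // USD per keyword, per live SERP check

function weeklyTrackingCost(keywordCount, checksPerWeek = 1) {
  return keywordCount * checksPerWeek * COST_PER_SERP_CHECK;
}

function annualTrackingCost(keywordCount, checksPerWeek = 1) {
  return weeklyTrackingCost(keywordCount, checksPerWeek) * 52;
}
```

At this rate, even daily checks on 500 keywords stay well under what most SaaS rank trackers charge for the same volume.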

The Bug That Cost 476 API Calls

Here’s the part no one talks about when they write “how I built X” posts: the embarrassing bugs.

I wrote a bulk rank check script that used Node’s execSync to shell out to curl for each SERP request. The script ran, returned “success” messages, wrote files — everything looked fine. Except when I went to parse the results, 476 out of 538 files contained auth errors:

{"status_code": 40100, "status_message": "AuthorizationError"}

The problem: the credentials never made it to the child processes. execSync() in Node.js does forward the parent's environment by default, but supply an env option and it replaces the child's environment wholesale rather than merging with it. My DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD env vars were set in the parent process, yet the curl commands were running blind.

The fix was straightforward: replace execSync(curl) with native fetch() in Node, reading credentials directly from process.env. No more child process env var inheritance issues.
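
A sketch of the fetch-based replacement, assuming Node 18+ (global fetch) and DataForSEO's Basic auth scheme. The endpoint path and request parameters here are illustrative; check DataForSEO's docs for the exact SERP route and payload you need.

```javascript
// Build the Basic auth header from credentials in the current process —
// no child process, no environment hand-off to get wrong.
function buildAuthHeader(login, password) {
  return 'Basic ' + Buffer.from(`${login}:${password}`).toString('base64');
}

// Illustrative SERP check (endpoint path and payload fields are assumptions —
// verify against DataForSEO's API reference).
async function checkSerp(keyword) {
  const res = await fetch(
    'https://api.dataforseo.com/v3/serp/google/organic/live/advanced',
    {
      method: 'POST',
      headers: {
        Authorization: buildAuthHeader(
          process.env.DATAFORSEO_LOGIN,
          process.env.DATAFORSEO_PASSWORD
        ),
        'Content-Type': 'application/json',
      },
      body: JSON.stringify([{ keyword, language_code: 'en', location_code: 2840 }]),
    }
  );
  return res.json();
}
```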

The lesson: never use execSync to shell out to curl when you can just use fetch. And always validate API responses before logging “success” — a 200 status with an error body in the payload is still a failure.
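
That second lesson is worth a few lines of code. DataForSEO reports success inside the response body (status_code 20000 per their convention; verify against their docs), so a guard like this belongs in front of every "success" log:

```javascript
// Treat anything other than DataForSEO's in-body success code as a failure,
// even if the HTTP layer returned 200.
const DATAFORSEO_OK = 20000;

function isSuccessfulPayload(body) {
  return Boolean(body) && body.status_code === DATAFORSEO_OK;
}
```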

The API Client Pattern That Fixed Everything

Beyond the env var issue, I also hit a consistent problem in the Next.js dashboard: API routes that worked in development returned empty data in production. The culprit was using Supabase’s REST API (which requires SUPABASE_SERVICE_KEY) in an environment where only DATABASE_URL was configured.

The rule I’ve settled on: always use the database client directly (Drizzle ORM + DATABASE_URL) rather than the Supabase REST layer. It’s one fewer dependency, one fewer authentication surface, and it works consistently across environments.

// ✅ Use this
import { db } from '@/lib/db/client' // Drizzle + DATABASE_URL

// ❌ Not this in production
import { createClient } from '@supabase/supabase-js'
const supabase = createClient(url, SUPABASE_SERVICE_KEY) // key not always set

What the Finished Dashboard Looks Like

The client-facing reporting dashboard on seobandwagon.dev now includes:

  • Per-client Google Search Console integration (clicks, impressions, CTR, position)
  • Performance charts with 7d/28d/3-month date range selection
  • Top queries and top pages tables
  • PDF export with branded styling
  • Individual client views — no portfolio averages, just clean per-client data

The keyword tracking layer (the 501-keyword database described above) feeds into a separate tracked keywords view showing rank changes, difficulty scores, and opportunity gaps. That’s the part that makes this more than a pretty GSC wrapper — it’s actual SEO automation infrastructure.

Using AI Agents for SEO Automation

Worth mentioning: this entire system was built and debugged with AI assistance. I’m running OpenClaw as my AI agent layer — it handles everything from writing the DataForSEO API scripts, to debugging the failing SERP checks, to populating the database and building the Next.js dashboard components.

The workflow isn’t “AI does everything” — it’s closer to a skilled developer who works fast but needs QA. The agent catches its own mistakes when given clear feedback loops. The SERP bug got fixed because I set up verification steps: after each batch run, check a sample of output files for error payloads before calling it done.

That verification discipline is the most important thing I’d pass on. AI-assisted SEO automation is genuinely powerful. But “the script ran” is not “the script worked.” Always verify the data, not just the execution.

Key Takeaways

  • Building your own keyword rank tracking is viable at scale — DataForSEO makes the data accessible at a cost that beats SaaS tools once you’re past ~200 keywords
  • JSONB columns are underrated for SEO data storage — flexible schema without endless migrations
  • Never execSync curl when you can fetch() — child-process environment handling is a silent killer
  • Success messages lie — validate API response payloads, not just HTTP status codes
  • Use your database client directly — fewer auth surfaces, more consistent behavior across environments
  • Paid API data should be saved to files first, then parsed into the database — if the DB write fails, you still have the raw data

If you’re building your own SEO automation stack and want to dig into any of this — the schema, the DataForSEO integration patterns, or the Next.js dashboard architecture — drop a comment below. Happy to share more specifics.
