Every data product has a skeleton-in-the-closet moment. That moment when you look at your shiny dashboard and realize the data underneath it is… less than shiny.
Today was that moment for the SEO Bandwagon wiki analysis page — and then we fixed it.
The Problem: A Hollow Shell
The wiki analysis page at seobandwagon.dev looked like a real data product. It had a table. It had columns. It had 41 rows representing 41 Wikipedia articles in the digital marketing category.
What it did not have: keyword data for basically any of them.
One article out of 41 had SERP data populated. One. The table was 97.6% vibes. If this had gone live in that state, we would have published an SEO tool that could not tell you anything about SEO keywords. The irony would have been delicious. The embarrassment would have been worse.
Beyond the keyword gap, the external link analysis was a disaster in a different way. All 41 articles had external links scraped — 2,503 of them — but they were a raw pile. Links to Wikidata. Links to MediaWiki. Links to the Wikimedia Foundation. Links to web.archive.org for articles that no longer exist. All of these sitting in the dataset alongside actual editorial links to Search Engine Land and Google Developers, as if they were the same kind of thing.
They are not the same kind of thing. Not even close.
The Struggle: Taxonomy Work Is Unglamorous and Necessary
The first fix was the SERP data. We ran keyword queries for all 41 articles and backfilled the table. The results were immediately interesting:
- Google_Ads — ~101,000 estimated monthly searches
- Conversion_rate_optimization — ~84,000 estimated monthly searches
- Lead_generation — ~24,000 estimated monthly searches
- Digital_marketing — ~17,000 estimated monthly searches
- Backlink — ~13,000 estimated monthly searches
Good data. Immediately useful. Problem one: solved.
Problem two — the link taxonomy mess — required more thought. The goal of the external link analysis is to understand editorial authority signals: what sources do Wikipedia editors cite when writing about digital marketing concepts? That is a legitimate and interesting SEO research question.
But you cannot answer that question if your dataset includes Wikidata references and archived links to dead pages. So we built the taxonomy:
- Infrastructure domains excluded entirely: wikidata.org, mediawiki.org, wikimediafoundation.org, creativecommons.org — these are Wikipedia’s own plumbing. Not editorial citations.
- Archived URLs flagged: links to web.archive.org are dead-URL citations. The original page died; someone added an archive link to preserve context. That is not a vote for archive.org as an authority — it is a vote for the domain that used to be at that URL. Flagging them as `is_archived: true` lets us filter them in analysis without losing them from the dataset.
- Section tagging added: each link is now tagged with its source section — body, references, external_links, or further_reading. Body links are editorial; reference links are citations. They tell different stories.
This is the part of data work that nobody wants to do and everybody skips. It is also what makes the difference between a dataset and a trustworthy dataset.
What the Clean Data Actually Shows
After running the re-scrape and taxonomy pass, here is what the editorial link landscape looks like across all 41 Wikipedia digital marketing articles:
- 2,503 total external links across 661 unique domains
- Google.com leads all domains with 36 body-section links — Wikipedia editors cite Google’s own documentation constantly when writing about digital marketing
- developers.google.com is a separate entry (and a stronger authority signal — that is the technical documentation)
- Search Engine Land — 31 references-section citations. The canonical trade press source for Wikipedia editors.
- Search Engine Journal — 20 references-section citations.
For SEO practitioners, this is genuinely useful data. If you want to build topical authority in digital marketing, you can see which sources Wikipedia considers authoritative, which domains earn body-text links versus citation links, and where there might be gaps that a well-sourced piece could fill.
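Once links carry domain, section, and archive flags, leaderboards like the ones above fall out of a single pass with a counter. A sketch, assuming each link is a dict with `domain`, `section`, and `is_archived` keys (hypothetical field names):

```python
from collections import Counter

def domain_leaderboard(links: list[dict], section: str, top: int = 5):
    """Count live (non-archived) links per domain within one section."""
    counts = Counter(
        link["domain"]
        for link in links
        if link["section"] == section and not link["is_archived"]
    )
    return counts.most_common(top)
```

Running it once with `section="body"` surfaces the editorial leaders (the google.com-style body links), and again with `section="references"` surfaces the citation leaders (the trade-press sources) — the two different stories the section tags were added to tell.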
The Resolution: A Feature Roadmap and a Real Launch Date
After the data cleanup, we sat down and made the call: the right thing to do is not launch yet.
Yes, the data is better. But “better data in a static table” is still just a static table. The page has potential to be something genuinely useful — an interactive research tool — and shipping it half-built does not serve anyone.
So we locked in a feature build order before going public:
- Expandable rows — click an article to see its SERP keywords with position and search volume
- Traffic chart — all 41 articles sorted by estimated monthly traffic, at a glance
- CSV export — gated behind account creation (this is a lead magnet, not a giveaway)
- N-gram analysis for all articles — currently only one article has the deep text analysis; this extends it to the full corpus
- Opportunity scoring — articles with low external link counts but high search volume represent potential link-building targets. Surface that signal automatically.
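The opportunity-scoring item is the only one on that list with real modeling in it, and even that can start as a one-line heuristic. A sketch of the idea — demand relative to how heavily an article is already sourced. The function names, field names, and the ratio itself are assumptions; the shipped scoring may weight things differently:

```python
def rank_opportunities(articles: list[dict]) -> list[dict]:
    """Sort articles so high-demand, lightly-sourced ones come first.

    Each article dict carries (hypothetical keys):
      monthly_searches -- estimated search volume
      external_links   -- count of editorial external links
    The +1 guards against division by zero for link-less articles.
    """
    return sorted(
        articles,
        key=lambda a: a["monthly_searches"] / (a["external_links"] + 1),
        reverse=True,
    )
```

An article with modest volume but almost no external links can outrank a giant that is already saturated with citations — which is exactly the signal a link builder wants surfaced automatically.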
Formal launch: Monday, March 23rd.
Meanwhile, the homepage also got a rebuild today — new services section covering the full product portfolio (including the SaaS tool and Chrome extension), a new “What Is SEO” component, a 7-question FAQ mapped to the customer journey from awareness to decision, and keyword-aligned metadata throughout. Three commits sitting on GitHub waiting on a Hostinger deploy trigger.
The Lesson (Again, Still)
Data products require taxonomy work. You can ship a table with numbers in it in a day. You can ship a table with trustworthy, categorized, filterable numbers in it after you do the boring work of defining what each field actually means and what should and should not be in it.
2,503 links. 661 domains. 41 articles. One taxonomy pass that made all of it actually usable.
Monday.