How to Automate Keyword Clustering with Scraped Data
Introduction
Manual keyword grouping is slow, subjective, and often wrong. Two keywords that share words may target completely different search intents. The solution is automation. By scraping SERP data and clustering keywords based on the URLs Google ranks, you can build topic clusters that reflect what search engines actually reward — not what humans assume belongs together.
Why SERP-Based Clustering Outperforms Text-Based Grouping
Traditional keyword grouping tools match keywords by shared words or phrases. This approach fails systematically. The keywords “best running shoes” and “best running trails” share the words “best running,” but Google ranks completely different pages for each query because they serve different user intents.
Text-based clustering would merge these keywords into one group. SERP-based clustering keeps them separate.
SERP-based keyword clustering works on a simple principle: when two or more keywords return the same ranked URLs in Google, those keywords belong to the same topical cluster. You are not guessing which keywords are related. You are reading the signal Google publishes on every search results page.
This approach aligns content strategy with Google’s own algorithmic interpretation of topics and search intent, not with human assumptions about keyword similarity.
The Core Data Source: Scraping SERPs for URL Overlap
To automate SERP-based clustering, you first need scraped SERP data for every keyword in your list. For each keyword, you extract the top-ranking organic URLs — typically positions 1 through 10 or 20.
Similarity between keywords is measured with Jaccard similarity or a comparable overlap metric. If Keyword A and Keyword B share 4 out of 10 ranking URLs, they are considered closely related. If they share zero URLs, they belong to different clusters.
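As a worked instance of the "4 out of 10" case, assuming each keyword has exactly 10 tracked URLs and 4 of them appear in both lists, the Jaccard score (shared URLs divided by all distinct URLs across both lists) comes out as:

```latex
\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{4}{10 + 10 - 4} = \frac{4}{16} = 0.25
\]
```

So a 40% overlap relative to one top-10 list corresponds to a Jaccard score of 0.25. Keep that in mind when choosing a threshold, since the two scales are easy to confuse.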
Keyword Cupid, a semantic keyword clustering tool, scrapes Google search results at the moment of each query, trains an ensemble of unsupervised AI models on the fly, and groups related keywords by algorithmic intent rather than surface-level text matching.
Step-by-Step Automation Workflow
Automated keyword clustering follows a repeatable pipeline. Each stage can be scripted or integrated into low-code platforms.
Stage 1: Gather Keyword Data
Start with a comprehensive list of keywords around your primary topic. Pull data from keyword research tools like Ahrefs, Semrush, Moz, or Google Search Console.
Apply basic filters to keep the dataset manageable: monthly search volume greater than zero, relevant language, and focus on your target markets. The goal is a broad dataset that captures the full topical landscape.
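The filtering step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the column names `keyword` and `volume` and the `load_keywords` helper are assumptions, so adjust them to whatever your keyword tool actually exports.

```python
import csv
import io


def load_keywords(csv_text, min_volume=1):
    """Return de-duplicated keyword rows whose search volume is >= min_volume."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        keyword = row["keyword"].strip().lower()  # normalize case and whitespace
        volume = int(row["volume"] or 0)          # treat a blank volume as zero
        if not keyword or keyword in seen or volume < min_volume:
            continue
        seen.add(keyword)
        rows.append({"keyword": keyword, "volume": volume})
    return rows


# Hypothetical export: one duplicate and one zero-volume keyword get dropped.
sample = """keyword,volume
best running shoes,12000
Best Running Shoes,12000
best running trails,900
running shoe museum,0
"""
keywords = load_keywords(sample)
```

Language and market filters would follow the same pattern if the export carries those columns.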
Stage 2: Scrape SERPs for Each Keyword
For every keyword in your list, scrape the top-ranking organic URLs. You can use SERP APIs like Serper.dev (approximately $1 per 1,000 keywords) or build custom scrapers with tools like Scrape.do.
Using a managed SERP API is generally more reliable than custom scraping for production workflows. APIs handle proxy rotation, CAPTCHA solving, and parser maintenance automatically.
For multi-market keyword research across the USA, Germany, United Kingdom, France, Italy, Russia, Spain, Netherlands, Switzerland, Poland, Ireland, Australia, Canada, Thailand, and Hong Kong, run separate SERP scrapes with appropriate country parameters. Google search results and SERP intent vary by location and device.
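A rough sketch of the scraping stage, using only the Python standard library. Everything provider-specific here is an assumption to verify against the current Serper.dev documentation: the endpoint URL, the `X-API-KEY` header, the `q`, `gl`, and `num` request fields, and the `organic` and `link` response fields. The helper names are hypothetical.

```python
import json
import urllib.request

SERPER_URL = "https://google.serper.dev/search"  # assumed endpoint, check provider docs


def build_request(keyword, api_key, country="us", num=10):
    """Build (but do not send) a POST request for one keyword in one market."""
    payload = json.dumps({"q": keyword, "gl": country, "num": num}).encode("utf-8")
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(SERPER_URL, data=payload, headers=headers, method="POST")


def extract_organic_urls(response_json, limit=10):
    """Pull organic result links, in the order returned, from a parsed response."""
    results = response_json.get("organic", [])
    return [item["link"] for item in results if "link" in item][:limit]


def fetch_top_urls(keyword, api_key, country="us", num=10):
    """Send the request and return the top organic URLs (performs a network call)."""
    request = build_request(keyword, api_key, country, num)
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_organic_urls(json.load(response), limit=num)


# Offline demonstration with a hand-written payload shaped like the assumed response.
fake_response = {"organic": [
    {"position": 1, "link": "https://example.com/a"},
    {"position": 2, "link": "https://example.com/b"},
]}
urls = extract_organic_urls(fake_response)
```

For multi-market runs, calling `fetch_top_urls` once per keyword and country code, and storing results keyed by both, keeps each market's SERPs separate for the later stages.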
Stage 3: Calculate URL Overlap Similarity
With SERP data collected, calculate the overlap between every pair of keywords. The standard approach uses Jaccard similarity:
similarity = |A ∩ B| / |A ∪ B|
Where A and B are the sets of ranking URLs for Keyword A and Keyword B.
This score ranges from 0 (no overlap) to 1 (identical ranking sets). Higher scores indicate closer topical relationships.
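The calculation is small enough to sketch directly. The example below assumes the scraped data is already held as a dictionary mapping each keyword to its list of ranking URLs; the `u1`, `u2` placeholders stand in for real URLs, and the function names are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two URL collections: shared URLs over all distinct URLs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(serps):
    """Score every unordered keyword pair; keys are (keyword_1, keyword_2) tuples."""
    return {
        (k1, k2): jaccard(serps[k1], serps[k2])
        for k1, k2 in combinations(sorted(serps), 2)
    }


serps = {
    "best running shoes": ["u1", "u2", "u3", "u4"],
    "top running shoes": ["u1", "u2", "u3", "u5"],
    "best running trails": ["u6", "u7", "u8", "u9"],
}
scores = pairwise_similarity(serps)
```

Here the two shoe keywords share three of five distinct URLs, while the trails keyword shares none with either. In practice, normalizing URLs first (for example, stripping tracking parameters and trailing slashes) avoids undercounting overlap.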
Stage 4: Apply Hierarchical Clustering
With similarity scores calculated, apply agglomerative hierarchical clustering. This algorithm starts by treating each keyword as its own cluster, then merges clusters based on similarity thresholds.
The GitHub repository by kbradbery implements this exact approach using Streamlit for the interface, SQLite for data storage, and NetworkX for graph-based clustering.
You control the clustering granularity through a minimum overlap threshold. A higher threshold creates finer, more specific clusters. A lower threshold creates broader, more general clusters.
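To make the threshold behavior concrete, here is a small pure-Python sketch of agglomerative clustering. It is not the repository's implementation: average linkage, the `agglomerative_clusters` name, and the example thresholds are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two URL collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerative_clusters(serps, threshold=0.3):
    """Average-linkage clustering that stops once no pair reaches `threshold`."""
    clusters = [[kw] for kw in sorted(serps)]  # every keyword starts alone

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(jaccard(serps[a], serps[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # the closest remaining pair is too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


serps = {
    "best running shoes": ["u1", "u2", "u3", "u4"],
    "top running shoes": ["u1", "u2", "u3", "u5"],
    "running shoes for men": ["u1", "u2", "u6", "u7"],
    "best running trails": ["u8", "u9"],
}
broad = agglomerative_clusters(serps, threshold=0.3)
fine = agglomerative_clusters(serps, threshold=0.5)
```

With the lower threshold the three shoe keywords end up together and the trails keyword stays alone. Raising the threshold to 0.5 splits "running shoes for men" into its own cluster, which matches the finer-versus-broader behavior described above. This loop is quadratic or worse in the number of keywords, so large lists would likely need a library implementation.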
Stage 5: Add Intent Classification (Optional)
To enrich clusters further, add intent classification. Using Sentence Transformers or similar models, analyze the titles of top-ranking pages to determine whether user intent is informational, commercial, navigational, or transactional.
This step adds depth but increases processing time. The Sentence Transformer model is powerful but resource-intensive. For large keyword lists, make intent classification optional or run it after initial clustering.
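One way to structure this step is nearest-prototype classification: embed a short description of each intent, embed each page title, and assign the title to the closest intent by cosine similarity. The sketch below assumes that design. The prototype phrases are hypothetical, and the embedding function is passed in so the example runs with a toy bag-of-words embedder instead of downloading a model.

```python
import math
from collections import Counter

INTENT_PROTOTYPES = {  # hypothetical prototype phrases, tune for your niche
    "informational": "how to guide what is tutorial explained",
    "commercial": "best top review comparison vs",
    "transactional": "buy price discount order shop",
    "navigational": "login official site homepage",
}


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def classify_intent(titles, embed):
    """Label each title with its nearest prototype and return the majority label."""
    titles = list(titles)
    if not titles:
        return "unknown"
    labels = list(INTENT_PROTOTYPES)
    vectors = embed(list(INTENT_PROTOTYPES.values()) + titles)
    proto_vecs, title_vecs = vectors[:len(labels)], vectors[len(labels):]
    votes = Counter(
        max(labels, key=lambda lab: cosine(vec, proto_vecs[labels.index(lab)]))
        for vec in title_vecs
    )
    return votes.most_common(1)[0][0]


def toy_embed(texts):
    """Bag-of-words stand-in so the sketch runs without a model download."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    return [[t.lower().split().count(w) for w in vocab] for t in texts]


titles = [
    "How to choose running shoes",
    "What is pronation: a guide",
    "Best running shoes review",
]
intent = classify_intent(titles, toy_embed)
```

To use a real model, you could pass something like `SentenceTransformer("all-MiniLM-L6-v2").encode` as `embed`. The model choice is an assumption, and running it only on cluster representatives keeps the cost down for large lists.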
Stage 6: Export Structured Clusters
The final output should include cluster assignments, aggregated metrics per cluster (total search volume, average keyword difficulty, combined CPC), the dominant intent for each cluster, and recommended heading structures based on top-ranking pages.
Keyword Cupid outputs an interactive hierarchical mindmap, a downloadable Excel file containing keyword cluster assignments with aggregated search volume, keyword difficulty, and CPC data, and a structured topical silo architecture that maps keyword groups to pages and pages to silos.
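For a scripted pipeline, the per-cluster aggregation and CSV export might look like the sketch below. The field names are hypothetical, and "combined CPC" is interpreted here as the sum of member CPCs, which is an assumption. Swap in an average if that suits your reporting better.

```python
import csv
import io
from collections import defaultdict


def summarize_clusters(rows):
    """Aggregate keyword-level rows into one summary row per cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row)
    summary = []
    for cluster_id, members in sorted(groups.items()):
        summary.append({
            "cluster": cluster_id,
            "keywords": len(members),
            "total_volume": sum(m["volume"] for m in members),
            "avg_difficulty": round(sum(m["difficulty"] for m in members) / len(members), 1),
            "combined_cpc": round(sum(m["cpc"] for m in members), 2),
        })
    return summary


def to_csv(summary):
    """Serialize a non-empty summary list to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(summary[0]))
    writer.writeheader()
    writer.writerows(summary)
    return buffer.getvalue()


rows = [
    {"keyword": "best running shoes", "cluster": 0, "volume": 12000, "difficulty": 40, "cpc": 1.20},
    {"keyword": "top running shoes", "cluster": 0, "volume": 3000, "difficulty": 30, "cpc": 0.80},
    {"keyword": "best running trails", "cluster": 1, "volume": 900, "difficulty": 10, "cpc": 0.30},
]
summary = summarize_clusters(rows)
csv_text = to_csv(summary)
```

Dominant intent and recommended headings can be appended as extra columns once those stages have run.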
Tools for Automating Keyword Clustering
Several tools automate SERP-based clustering for different use cases and budgets.
Keyword Cupid
Keyword Cupid is a machine learning clustering tool that scrapes Google search results in real time and trains unsupervised AI models on demand. A single report handles thousands of keywords.
Key features include geo-targeting by country and city, device targeting across mobile, desktop, and tablet, SERP Spy on-page data including average content length from top-ranking pages, and support for Google and Yandex.
Pricing is not publicly listed. The tool supports Bring Your Own Data uploads in CSV and Excel formats.
KeyClusters
KeyClusters starts at $9.99 per month. The tool uses real-time Google SERP data to identify which pages are ranking for which keywords, groups similar keywords into topics, and shows how to interlink them for SEO outcomes.
Open Source Python Solutions
For teams with engineering resources, open source Python solutions provide maximum control.
The SEO Keyword Clustering repository by kbradbery offers a complete Streamlit application. The workflow: upload keywords, set parameters, scrape SERPs via the Serper.dev API, run agglomerative clustering, optionally classify intent, and export results as CSV.
The keyword_clustering_easy_demo repository by evemilano uses SentenceTransformer for generating semantic embeddings and BERTopic for advanced topic modeling. This approach is particularly effective for multilingual datasets, including Italian.
Low-Code Automation with n8n
For teams preferring low-code, n8n offers a workflow template that categorizes SEO keywords and creates content strategies with AI. The workflow automatically sorts keywords into strategic buckets (Quick Wins, Authority Builders, Emerging Topics, Intent Signals, and Semantic Topics), creates content blueprints with titles and descriptions, and integrates with Airtable for structured output.
Multi-Market Keyword Clustering Considerations
For businesses operating across multiple countries, keyword clustering must account for geographic variation in search intent.
The same keyword list run through Keyword Cupid with geo-targeting set to New York versus Los Angeles versus London can produce different cluster structures because Google’s ranking URLs vary by location.
For accurate multi-market clustering:
Run separate clustering analyses for each target country including the USA, Germany, United Kingdom, France, Italy, Russia, Spain, Netherlands, Switzerland, Poland, Ireland, Australia, Canada, Thailand, and Hong Kong.
Use proxies closest to each location to scrape local results. Keyword Cupid routes each SERP scrape through a proxy closest to the selected location.
Compare cluster structures across markets to identify universal topics (clusters that appear in all markets), regional variations (clusters that differ by location), and market-specific opportunities (clusters unique to one country).
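The cross-market comparison can be sketched as follows. It assumes a deliberately strict rule: two clusters count as "the same" only if they contain exactly the same keywords. Real pipelines may want a fuzzier match, and the function name and market codes are illustrative.

```python
from collections import defaultdict


def compare_markets(clusters_by_market):
    """Bucket clusters by how many markets contain the identical keyword set."""
    seen_in = defaultdict(set)
    for market, clusters in clusters_by_market.items():
        for cluster in clusters:
            seen_in[frozenset(cluster)].add(market)

    all_markets = set(clusters_by_market)
    report = {"universal": [], "regional": [], "market_specific": []}
    for cluster, markets in seen_in.items():
        if markets == all_markets:
            bucket = "universal"        # appears in every market
        elif len(markets) == 1:
            bucket = "market_specific"  # unique to one country
        else:
            bucket = "regional"         # appears in some markets only
        report[bucket].append((sorted(cluster), sorted(markets)))
    return report


clusters_by_market = {
    "us": [["running shoes", "best running shoes"], ["trail running"]],
    "gb": [["running shoes", "best running shoes"], ["trail running", "fell running"]],
    "de": [["running shoes", "best running shoes"], ["trail running"]],
}
report = compare_markets(clusters_by_market)
```

In this toy data the shoes cluster is universal, "trail running" on its own is shared by two markets, and the combined trail and fell running cluster shows up only in the UK results.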
Avoiding Keyword Cannibalization Through Clustering
Keyword cannibalization occurs when multiple pages on your site target the same keyword or closely related keywords, causing them to compete against each other in search results.
Proper clustering prevents cannibalization by ensuring each keyword group maps to a single target page. If two keywords belong to the same cluster, they should be optimized on the same page. If they belong to different clusters, they need separate pages.
KeyClusters explicitly addresses this use case, helping users identify relevant keywords to target and group them to avoid cannibalization issues.
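The one-cluster-one-page rule lends itself to an automated check. The sketch below assumes you already have two mappings, keyword to cluster ID and keyword to the page currently targeting it. Both structures and the function name are hypothetical.

```python
from collections import defaultdict


def find_cannibalization(keyword_cluster, keyword_page):
    """Return clusters whose keywords are targeted by more than one page."""
    pages_per_cluster = defaultdict(set)
    for keyword, url in keyword_page.items():
        cluster = keyword_cluster.get(keyword)
        if cluster is not None:  # skip keywords that were never clustered
            pages_per_cluster[cluster].add(url)
    return {
        cluster: sorted(urls)
        for cluster, urls in pages_per_cluster.items()
        if len(urls) > 1
    }


keyword_cluster = {
    "best running shoes": 0,
    "top running shoes": 0,
    "best running trails": 1,
}
keyword_page = {
    "best running shoes": "/shoes",
    "top running shoes": "/blog/top-shoes",
    "best running trails": "/trails",
}
conflicts = find_cannibalization(keyword_cluster, keyword_page)
```

Each flagged cluster is a candidate for consolidating its pages or re-targeting one of them at a different cluster.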
Integrating Clusters into Content Strategy
Raw keyword clusters become valuable when mapped to content production.
For each cluster, identify the primary keyword (highest search volume or strongest commercial intent). Use the cluster’s aggregated metrics (total search volume, average difficulty, combined CPC) to prioritize which clusters to tackle first.
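A simple version of this prioritization, assuming search volume is the only ranking signal (commercial intent, difficulty, and CPC are left out to keep the sketch short), with hypothetical field names:

```python
def prioritize(clusters):
    """Pick each cluster's highest-volume keyword and rank clusters by total volume."""
    plan = []
    for name, keywords in clusters.items():
        primary = max(keywords, key=lambda kw: kw["volume"])
        plan.append({
            "cluster": name,
            "primary": primary["keyword"],
            "total_volume": sum(kw["volume"] for kw in keywords),
        })
    return sorted(plan, key=lambda item: item["total_volume"], reverse=True)


clusters = {
    "trails": [{"keyword": "best running trails", "volume": 900}],
    "shoes": [
        {"keyword": "best running shoes", "volume": 12000},
        {"keyword": "top running shoes", "volume": 3000},
    ],
}
plan = prioritize(clusters)
```

A weighted score that divides volume by difficulty, or boosts clusters with commercial intent, would slot into the sort key in place of `total_volume`.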
Extract heading patterns from the top-ranking URLs in each cluster. If 4 out of 5 competitors use question-style H2s, your content should too. Use SERP Spy data including average content length to match the formatting patterns that search engines reward.
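The "4 out of 5 competitors" check can be approximated with a small heuristic. The sketch below assumes the H2 headings have already been extracted per competitor page. The question-word list and the "more than half of headings" rule are illustrative assumptions, not an established standard.

```python
QUESTION_WORDS = (
    "how", "what", "why", "when", "which", "who",
    "where", "can", "does", "is", "are", "should",
)


def is_question(heading):
    """Treat a heading as question-style if it ends in '?' or opens with a question word."""
    text = heading.strip().lower()
    return text.endswith("?") or text.split(" ", 1)[0] in QUESTION_WORDS


def question_style_share(pages):
    """Fraction of pages where more than half of the H2 headings read as questions."""
    def mostly_questions(h2s):
        return bool(h2s) and sum(is_question(h) for h in h2s) > len(h2s) / 2

    flagged = sum(mostly_questions(h2s) for h2s in pages.values())
    return flagged / len(pages) if pages else 0.0


# Hypothetical H2 headings pulled from five competitor pages.
pages = {
    "a": ["How do I pick a size?", "What is heel drop?"],
    "b": ["Why cushioning matters", "Which shoe lasts longest?"],
    "c": ["Our top picks", "Pricing"],
    "d": ["What to look for", "How we tested", "Verdict"],
    "e": ["Can you run in trainers?"],
}
share = question_style_share(pages)
```

Four of the five sample pages qualify, so the share is 0.8, which would support using question-style H2s for this cluster.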
Build hub-and-spoke architectures: pillar pages target broad clusters (hub), supporting posts target long-tail variations within the cluster (spokes). The n8n workflow template includes a Hub and Spoke Mapper that identifies main topics and supporting content opportunities.
Why Hir Infotech Automates Keyword Clustering
At Hir Infotech, we have built our data intelligence practice around delivering actionable SEO insights to B2B teams. With over 13 years of experience and 2,745+ satisfied clients across the USA, Europe, and Australia, we have deployed SERP extraction and keyword clustering for hundreds of content strategy use cases.
Our approach to automated keyword clustering focuses on three core capabilities:
First, we scrape SERP data at scale across all target markets. We extract organic ranking URLs, SERP features, and People Also Ask questions for any keyword list using premium proxy networks and SERP APIs.
Second, we perform URL overlap analysis and agglomerative clustering. Our Python-based pipelines calculate Jaccard similarity scores, apply hierarchical clustering algorithms, and output structured cluster assignments with aggregated metrics.
Third, we deliver actionable outputs including cluster maps, topical silo architectures, and content brief templates. Delivery options include CSV, Excel, API, or direct integration with content planning tools.
We do not sell software subscriptions. We deliver structured, decision-ready cluster data that feeds directly into content strategy workflows. For organizations ready to move beyond manual keyword grouping and build topical authority through data-driven clustering, we provide the infrastructure and expertise to automate the entire pipeline across every market you serve.
Frequently Asked Questions
What is SERP-based keyword clustering?
SERP-based clustering groups keywords by comparing the URLs that rank on Google’s first page for each keyword. When two keywords share multiple ranking URLs, they belong to the same topical cluster. This approach aligns with Google’s own understanding of keyword relationships rather than text-based similarity.
How is this different from text-based clustering?
Text-based clustering matches keywords that share words, incorrectly merging “best running shoes” with “best running trails.” SERP-based clustering separates them because Google ranks different pages for each query, confirming distinct search intents.
What data do I need to start clustering?
You need a list of keywords (exported from Ahrefs, Semrush, Google Search Console, or similar tools) and SERP data for each keyword, including the top 10 to 20 ranking organic URLs.
Can I cluster keywords across multiple countries?
Yes. You must run separate clustering analyses for each country because Google’s ranking URLs and search intent vary by location. Tools like Keyword Cupid support geo-targeting by country and city through proxy routing.
What is the cheapest way to automate keyword clustering?
For low-volume needs, the open-source Streamlit app using the Serper.dev API costs approximately $1 per 1,000 keywords for SERP data, plus computational costs for clustering. For higher volume, tools like KeyClusters start at $9.99 per month.
Conclusion
Automated keyword clustering powered by scraped SERP data transforms how SEO teams build topic clusters and content strategies. Instead of manually grouping keywords by word matching — which fails systematically — you let Google’s own ranking URLs reveal which keywords belong together. The workflow is repeatable: gather keywords, scrape SERPs, calculate URL overlap, apply hierarchical clustering, optionally classify intent, and export structured clusters. Implementation options range from open source Python scripts to managed tools like Keyword Cupid and KeyClusters to low-code automation through n8n. For multi-market operations, separate clustering analyses per country capture regional intent variations. The output drives cannibalization prevention, hub-and-spoke architecture, and data-backed content briefs. For organizations ready to move beyond guesswork and build topical authority systematically, Hir Infotech delivers automated keyword clustering pipelines across the USA, Germany, United Kingdom, France, Italy, Russia, Spain, Netherlands, Switzerland, Poland, Ireland, Australia, Canada, Thailand, and Hong Kong — turning scraped SERP data into scalable content intelligence.