Autonomous AI agents are browsing the open web — and without a reliable categorization layer, they have zero awareness of where they are navigating. Our 102 million domain database gives your agent harness pre-classified URLs for up to 20 page types per domain. We traversed the link structure of every domain and analyzed over 10 billion individual pages through our multi-step AI pipeline to identify login pages, pricing pages, careers pages, and 17 more page types — returning the actual, verified URLs so your agents never have to guess.
Without a URL categorization layer, autonomous agents have no mechanism to distinguish between a benign product page and a corporate admin panel.
When an AI agent receives an instruction like "research competitor pricing," it needs to visit dozens of websites. Without URL categorization data, the agent has no way to know whether it is landing on a public marketing page, a login portal, a payment checkout flow, or an internal HR portal. Every uncategorized navigation event is a potential compliance incident, a data exposure risk, or a brand safety violation.
Our 102 million domain database provides pre-classified, verified URLs for up to 20 page types per domain. For each domain, we traversed its complete link structure — navigation menus, footer links, sitemaps, and internal references — then analyzed each discovered page through a multi-step AI classification pipeline. The result: when your agent needs the login page, pricing page, or careers page of any domain, the database returns the actual URL, not a pattern guess.
This is fundamentally different from checking whether "pricing.php" or "/login" exists in a URL path. A company's pricing page might live at /solutions/enterprise-plans, their login at /portal/access, or their careers page at /join-our-team. Our AI models examined the content and structure of each page to identify its type regardless of URL format. The database delivers these verified URLs so your agent harness can make instant allow/block decisions without runtime crawling or classification.
Three integration patterns that turn a static database into a dynamic agent control plane
For each of our 102 million domains, we traverse the entire link structure — navigation menus, footer links, sitemaps, breadcrumbs, and internal references. We follow the actual link graph to discover every reachable page, building a comprehensive map of each website's architecture before classification begins.
Each discovered page passes through our sophisticated multi-step AI pipeline. The models analyze page content, HTML structure, form elements, and contextual signals to classify the page as one of 20 types: login, pricing, careers, contact, checkout, admin, settings, about, blog, documentation, and more. URL patterns alone don't determine classification — a login page at /welcome is identified just as accurately as one at /login.
The database returns the actual, verified URLs for each page type per domain. Query "login page for example.com" and receive the real URL that was discovered and classified — not a guessed path. Your agent harness consumes these pre-classified URLs in microseconds, skipping the entire discovery and classification phase that would otherwise take seconds to minutes per domain.
Production-ready snippets to plug URL categorization into your agent harness
import http.client
import json
from urllib.parse import urlencode


class AgentURLFilter:
    """Middleware that checks every URL before an AI agent navigates."""

    BLOCKED_PAGE_TYPES = ["login", "checkout", "settings", "admin"]
    BLOCKED_CATEGORIES = ["Adult", "Illegal Content", "Malware"]

    def __init__(self, api_key):
        self.api_key = api_key
        self.conn = http.client.HTTPSConnection(
            "www.websitecategorizationapi.com"
        )

    def classify_url(self, target_url):
        # urlencode escapes "&", "=", and "?" inside the target URL so they
        # cannot corrupt the form-encoded request body.
        payload = urlencode({
            "query": target_url,
            "api_key": self.api_key,
            "data_type": "url",
            "expanded_categories": "1",
        })
        headers = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        self.conn.request(
            "POST",
            "/api/iab/iab_web_content_filtering.php",
            payload,
            headers
        )
        res = self.conn.getresponse()
        return json.loads(res.read().decode("utf-8"))

    def should_allow(self, target_url):
        data = self.classify_url(target_url)
        # Strip the "Category name: " prefix; entries without the prefix
        # are kept unchanged instead of raising an IndexError.
        categories = [
            c[0].replace("Category name: ", "")
            for c in data.get("iab_classification", [])
        ]
        page_type = data.get("page_type", "unknown")
        if page_type in self.BLOCKED_PAGE_TYPES:
            return False, f"Blocked page type: {page_type}"
        for cat in categories:
            for blocked in self.BLOCKED_CATEGORIES:
                if blocked.lower() in cat.lower():
                    return False, f"Blocked category: {cat}"
        return True, "Navigation approved"


# Usage in agent harness (named url_filter to avoid shadowing the built-in filter)
url_filter = AgentURLFilter(api_key="your_api_key")
allowed, reason = url_filter.should_allow("https://example.com/admin")
if not allowed:
    print(f"Agent blocked: {reason}")
async function agentNavigationGuard(targetURL, policyRules) {
  const response = await fetch(
    "https://www.websitecategorizationapi.com" +
      "/api/iab/iab_web_content_filtering.php",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/x-www-form-urlencoded"
      },
      body: new URLSearchParams({
        query: targetURL,
        api_key: policyRules.apiKey,
        data_type: "url",
        expanded_categories: "1"
      })
    }
  );
  const classification = await response.json();
  const filterCategory =
    classification.filtering_taxonomy?.[0]?.[0]
      ?.replace("Category name: ", "") || "Unknown";
  const decision = {
    url: targetURL,
    category: filterCategory,
    action: "allow",
    timestamp: new Date().toISOString()
  };
  if (policyRules.blockedCategories.includes(filterCategory)) {
    decision.action = "block";
  }
  return decision;
}
Having pre-classified URLs for 20 page types across 102 million domains at the start of any agent task means your agents skip the discovery phase entirely. The result: orders of magnitude faster task completion.
Purpose-built domain databases for AI agent filtering. Includes IAB categories, 20 page types, reputation scores, and popularity rankings. One-time purchase with perpetual license.
10 Million Domains with Page-Type Intelligence
One-time purchase: Perpetual license | Optional Updates: $1,599/year
20 Million Domains with Full Intelligence Suite
One-time purchase: Perpetual license | Optional Updates: $2,999/year
50 Million Domains with Complete Intelligence Suite
One-time purchase: Perpetual license | Optional Updates: $4,999/year
Also available: Enterprise URL Database up to 102M domains from $2,499.
Search any IAB or Web Filtering category to see how many domains are in our 102M Enterprise Database — the same data your AI agent filtering rules will reference.
How 102 million domains from our main Enterprise Database are distributed across IAB v3 taxonomy classifications
Spanning Tier 1 through Tier 4 classifications from our 102M Enterprise Database
Charts display domain counts for the top 50 out of 700+ categories in our 102M Enterprise Database. To check the number of domains for the remaining 650+ categories, use the Category Counter tool above.
The shift from chat-based AI to agentic AI means language models are no longer passively answering questions — they are actively navigating websites, clicking buttons, filling forms, and making decisions on behalf of users. This transition creates an entirely new threat surface. A chatbot that hallucinates a URL is annoying; an agent that navigates to that URL and submits credentials is a security incident.
URL categorization databases address this gap by providing the structured metadata that agents lack natively. When an agent receives a URL — whether from its own web search, a user instruction, or a tool call — the categorization layer instantly resolves it to a known category, page type, and reputation score. This resolution happens deterministically, without model inference, which means zero hallucination risk in the decision path.
Our page-type identification process is fundamentally different from simple URL pattern matching. We do not check whether a domain has a file called "pricing.php" or a path containing "/login". Instead, for each of our 102 million domains, we traverse the complete link structure of the website — following navigation menus, footer links, sitemaps, breadcrumbs, and internal references — to discover every reachable page.
Each discovered page then passes through our multi-step AI classification pipeline. The pipeline examines page content, HTML structure, form elements, heading hierarchies, and contextual signals across multiple model stages to determine whether the page is a login page, pricing page, careers page, contact page, checkout page, admin panel, settings page, about page, blog, documentation, API reference, support page, FAQ, forum, product page, legal page, privacy policy, terms of service, signup page, or homepage. A pricing page at /solutions/enterprise-plans is identified just as accurately as one at /pricing.php. A login page at /portal/access is identified just as accurately as one at /login.
The database returns the actual verified URLs for each identified page type per domain. When your agent or agent harness queries for "login page of example.com," it receives the real, confirmed URL that was discovered during link traversal and classified by our AI pipeline — not a guess, not a pattern match, but a verified link.
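As a rough illustration of that query, here is a minimal Python sketch of a page-type lookup. The in-memory `PAGE_INDEX` dictionary, its layout, and the `get_page_url` helper are assumptions made for this example; the shipped database's actual schema may differ.

```python
from typing import Optional

# Hypothetical in-memory slice of the database:
# domain -> page type -> verified URL. Paths are illustrative only.
PAGE_INDEX = {
    "example.com": {
        "login": "https://example.com/portal/access",
        "pricing": "https://example.com/solutions/enterprise-plans",
    },
}


def get_page_url(domain: str, page_type: str) -> Optional[str]:
    """Return the recorded URL for a page type, or None if none is recorded."""
    return PAGE_INDEX.get(domain.lower(), {}).get(page_type)


print(get_page_url("example.com", "login"))
print(get_page_url("example.com", "careers"))
```

Returning `None` for a missing page type, rather than constructing a likely path, keeps the agent from navigating to a URL that was never discovered.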
Without a pre-classified database, an AI agent tasked with finding the pricing page of 1,000 companies must crawl each website, follow links, load pages, analyze content, and determine which page is the pricing page. This process takes seconds to minutes per domain, with each step consuming LLM tokens, compute cycles, and API calls. For 1,000 domains, the agent might spend hours and hundreds of dollars completing the task.
With our database, the same task completes in milliseconds. The agent queries the database for the pricing page URL of each domain and receives verified URLs instantly. No crawling, no runtime classification, no token consumption for page analysis. The total cost is effectively zero per query because the database is a one-time purchase. This is not a marginal improvement — it is several orders of magnitude faster and cheaper than having agents discover page types from scratch.
Pre-classified URLs also remove URL hallucination from the lookup step. When an LLM guesses a URL, it frequently fabricates paths that do not exist (/pricing, /plans, /packages), leading to 404 errors, wasted retries, and agents getting lost. Our database returns only URLs that were actually discovered and verified to exist.
The IAB Content Taxonomy v3 organizes websites into a hierarchical structure with four tiers of increasing specificity. Tier 1 categories like "Technology & Computing" or "Business and Finance" provide broad domain awareness. Tier 4 categories like "Artificial Intelligence > Machine Learning > Natural Language Processing" provide granular topic resolution.
For agent filtering, the most effective approach is to define policy rules at multiple tiers simultaneously. Block all Tier 1 categories related to sensitive content (Adult, Illegal, Gambling). Allow specific Tier 2 categories that match the agent's task scope (e.g., "Business and Finance > Financial Services" for a financial research agent). Flag Tier 3 and Tier 4 categories for logging when they represent edge cases that may require human review.
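The multi-tier approach described above might be sketched as follows. This is an illustrative sketch only: the `evaluate` function, the " > " path format, the "Retail Banking" label, and the default-deny behavior for categories outside the task scope are all assumptions, not part of the database.

```python
from typing import List, Tuple

# Tier 1 labels to block outright (taken from the examples in the text).
BLOCKED_TIER1 = {"Adult", "Illegal", "Gambling"}
# Tier 2 paths in scope for a hypothetical financial research agent.
ALLOWED_TIER2 = {("Business and Finance", "Financial Services")}


def split_tiers(path: str) -> List[str]:
    """Split 'Tier 1 > Tier 2 > ...' into a list of tier labels."""
    return [part.strip() for part in path.split(">")]


def evaluate(path: str) -> Tuple[str, bool]:
    """Return (action, needs_review) for a category path."""
    tiers = split_tiers(path)
    if tiers[0] in BLOCKED_TIER1:
        return "block", False
    # Tier 3 and Tier 4 classifications are flagged for logging and review.
    needs_review = len(tiers) >= 3
    if tuple(tiers[:2]) in ALLOWED_TIER2:
        return "allow", needs_review
    # Assumption: anything outside the task scope is denied by default.
    return "block", needs_review


print(evaluate("Business and Finance > Financial Services"))
print(evaluate("Gambling"))
```

Checking the Tier 1 block list first means a sensitive top-level category can never be re-allowed by a more specific rule further down.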
In addition to IAB taxonomy, our database includes web filtering categories specifically designed for security and compliance use cases. These categories — such as Malware, Phishing, Spam, Adult, Gambling, Weapons, and Drugs — map directly to the blocking rules that enterprise web proxies and CASBs already enforce for human users. Extending these same categories to AI agents creates a consistent security posture across your entire organization.
The 102M domain database ships as a flat file — CSV or JSON — that you can ingest into any data store. Common deployment patterns include loading the data into Redis for sub-millisecond lookups, importing into PostgreSQL for SQL-based policy queries, or embedding a SQLite file directly alongside your agent runtime. For cloud-native deployments, teams often load the data into DynamoDB or Cloud Firestore for serverless agent architectures.
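For the embedded SQLite option, a minimal loading sketch could look like the following. The column names (`domain`, `iab_category`, `filter_category`, `login_url`) and the sample rows are hypothetical; check the delivered file for its real header before adapting this.

```python
import csv
import io
import sqlite3

# Hypothetical CSV layout; the real file's columns may differ.
SAMPLE_CSV = """domain,iab_category,filter_category,login_url
example.com,Business and Finance,Business,https://example.com/portal/access
example.org,Technology & Computing,Technology,
"""


def load_into_sqlite(csv_text: str) -> sqlite3.Connection:
    """Load CSV rows into an in-memory SQLite table keyed by domain."""
    conn = sqlite3.connect(":memory:")  # pass a file path for a persistent store
    conn.execute(
        "CREATE TABLE domains ("
        "domain TEXT PRIMARY KEY, iab_category TEXT, "
        "filter_category TEXT, login_url TEXT)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO domains VALUES "
        "(:domain, :iab_category, :filter_category, :login_url)",
        rows,
    )
    conn.commit()
    return conn


def lookup(conn: sqlite3.Connection, domain: str):
    """Return (iab_category, filter_category, login_url) or None."""
    cur = conn.execute(
        "SELECT iab_category, filter_category, login_url "
        "FROM domains WHERE domain = ?",
        (domain,),
    )
    return cur.fetchone()


conn = load_into_sqlite(SAMPLE_CSV)
print(lookup(conn, "example.com"))
```

The primary key on `domain` gives indexed lookups without extra setup, which is the access pattern an agent harness needs.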
Regardless of the storage backend, the integration pattern is the same: intercept the agent's navigation intent, extract the target URL, query the database, evaluate the result against your policy rules, and either allow or block the navigation before the agent's HTTP request fires.
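The steps above can be sketched in a backend-agnostic way by passing the lookup in as a function. The names `guard_navigation` and `extract_domain`, the record field `filter_category`, the "Business" label, and the choice to block unknown domains are assumptions for illustration.

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

# Hypothetical policy: web filtering categories this agent may never visit.
BLOCKED_CATEGORIES = {"Malware", "Phishing", "Adult"}


def extract_domain(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def guard_navigation(url: str, lookup: Callable[[str], Optional[dict]]) -> dict:
    """Decide allow/block for a URL before any HTTP request is made."""
    domain = extract_domain(url)
    record = lookup(domain)
    if record is None:
        # Assumption: unknown domains are blocked (fail closed).
        return {"url": url, "action": "block", "reason": "unknown domain"}
    category = record["filter_category"]
    if category in BLOCKED_CATEGORIES:
        return {"url": url, "action": "block", "reason": f"category: {category}"}
    return {"url": url, "action": "allow", "reason": f"category: {category}"}


# Stand-in for the real data store (Redis, PostgreSQL, SQLite, and so on).
SAMPLE_DB = {
    "example.com": {"filter_category": "Business"},
    "bad.test": {"filter_category": "Malware"},
}

print(guard_navigation("https://www.example.com/pricing", SAMPLE_DB.get))
```

Because `lookup` is just a callable, the same guard works whether the records live in a dictionary, a Redis client wrapper, or a SQL query function.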
No static database covers every domain on the internet. New domains are registered at a rate of approximately 50,000 per day. To handle the long tail of newly registered, rarely visited, or dynamically generated URLs, pair the offline database with our real-time API. When a URL lookup returns no match in the local database, the agent's middleware sends the URL to the API for on-demand classification. The API response includes the same IAB categories, page types, and reputation signals as the database — ensuring consistent policy evaluation regardless of the data source.
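A minimal sketch of that database-then-API fallback is shown below. The API call is replaced by a stub so the example runs offline; `classify_with_fallback`, the record shape, and the small result cache (added here so repeated misses do not trigger repeated API calls) are assumptions, not documented behavior.

```python
from typing import Callable, Dict, Optional, Tuple

Record = Dict[str, str]


def classify_with_fallback(
    domain: str,
    local_lookup: Callable[[str], Optional[Record]],
    api_lookup: Callable[[str], Record],
    cache: Dict[str, Record],
) -> Tuple[Record, str]:
    """Return (record, source), trying the local database first."""
    record = local_lookup(domain)
    if record is not None:
        return record, "database"
    if domain in cache:
        return cache[domain], "cache"
    record = api_lookup(domain)  # a network call in a real deployment
    cache[domain] = record
    return record, "api"


# Stand-ins so the sketch runs without network access.
LOCAL_DB = {"example.com": {"iab_category": "Business and Finance"}}
api_calls = []


def fake_api_lookup(domain: str) -> Record:
    """Stub for the real-time API; records which domains were requested."""
    api_calls.append(domain)
    return {"iab_category": "Unknown"}


cache: Dict[str, Record] = {}
print(classify_with_fallback("example.com", LOCAL_DB.get, fake_api_lookup, cache))
print(classify_with_fallback("new-site.test", LOCAL_DB.get, fake_api_lookup, cache))
```

Returning the source alongside the record lets the policy engine log whether a decision came from the offline data or from on-demand classification.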
Whether you are building on LangChain, CrewAI, AutoGen, or a custom agent framework, the integration pattern follows the same middleware approach. In LangChain, implement a custom Tool that wraps the database lookup and returns a structured allow/block decision. In CrewAI, add a pre-navigation hook to the agent's browsing tool that checks the database before each HTTP request. In AutoGen, register a function call that the agent invokes before every URL visit. The key principle is that the categorization check must execute before the navigation — not after.
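The "check before navigation" principle can be illustrated without tying it to any one framework. The sketch below wraps an arbitrary browsing function in a decorator; `pre_navigation_check`, `NavigationBlocked`, and the stub `is_allowed` rule are hypothetical names, and each framework's own hook or tool interface would replace the decorator in practice.

```python
import functools
from typing import Callable


class NavigationBlocked(Exception):
    """Raised when policy forbids visiting a URL."""


def pre_navigation_check(is_allowed: Callable[[str], bool]):
    """Build a decorator that runs the policy check before the wrapped tool."""
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(url, *args, **kwargs):
            if not is_allowed(url):
                # The wrapped function never runs, so no HTTP request fires.
                raise NavigationBlocked(url)
            return tool_fn(url, *args, **kwargs)
        return wrapper
    return decorator


calls = []  # records which URLs the browsing tool actually visited


def is_allowed(url: str) -> bool:
    """Stand-in for the database lookup: block anything ending in /admin."""
    return not url.endswith("/admin")


@pre_navigation_check(is_allowed)
def browse(url: str) -> str:
    """Stub browsing tool; a real one would issue the HTTP request here."""
    calls.append(url)
    return f"fetched {url}"


print(browse("https://example.com/pricing"))
```

Raising an exception instead of returning an error string makes it harder for the agent loop to ignore a blocked navigation by accident.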
The market for agent filtering is broad and growing rapidly as organizations move from pilot AI agent deployments to production. The primary buyers include enterprise security teams deploying browser-using agents like Anthropic's Computer Use, OpenAI's Operator, or Google's Project Mariner. These teams need to enforce the same URL filtering policies on agents that they already enforce on employees via web proxies and CASBs.
Platform vendors building agent orchestration tools need categorization data to offer their customers built-in governance controls. Without this data, their platforms ship with a "deploy and hope" security model that enterprise buyers will not accept.
Managed service providers operating AI agents on behalf of clients need URL categorization to prove compliance with client security policies and regulatory requirements. The database provides the audit trail: every domain the agent visited, its category, its page type, and the policy decision that was made.
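One way to capture such an audit trail is to emit one JSON line per policy decision. The field names and the `audit_record` helper below are assumptions for illustration; the database supplies the category and page type, while the logging itself is the harness's job.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    domain: str,
    category: str,
    page_type: str,
    decision: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one policy decision as a single JSON line."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "timestamp": timestamp,
            "domain": domain,
            "category": category,
            "page_type": page_type,
            "decision": decision,
        },
        sort_keys=True,
    )


print(audit_record("example.com", "Business and Finance", "pricing", "allow"))
```

Appending these lines to a write-once log gives auditors a chronological record without needing access to the agent's internal state.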
An agent filtering database is only as good as its coverage. If 20% of the URLs an agent encounters return "unknown" from the database, your policy engine defaults to either blocking (which halts the agent's workflow) or allowing (which defeats the purpose of filtering). Our 102M domain database covers 99.5% of the active internet as measured by the Google Chrome User Experience Report. This means that for virtually every domain an agent will encounter in normal operation, the database already has a classification ready.
The remaining 0.5% — newly registered domains, parked pages, and extremely niche sites — are handled by the real-time API fallback, ensuring 100% coverage in practice.
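To see how often the unknown-domain default would apply in your own workload, you can measure the miss rate over the domains your agents actually visit. The `unknown_rate` helper and the sample data below are hypothetical and only illustrate the calculation.

```python
from typing import Callable, Iterable, Optional


def unknown_rate(
    domains: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
) -> float:
    """Fraction of visited domains with no record in the local database."""
    visited = list(domains)
    if not visited:
        return 0.0
    misses = sum(1 for d in visited if lookup(d) is None)
    return misses / len(visited)


# Hypothetical sample: four known domains and one unknown one.
KNOWN = {
    "a.test": {"category": "News"},
    "b.test": {"category": "Shopping"},
    "c.test": {"category": "Travel"},
    "d.test": {"category": "Sports"},
}
VISITED = ["a.test", "b.test", "c.test", "d.test", "new.test"]

print(unknown_rate(VISITED, KNOWN.get))
```

A rising miss rate is a signal to refresh the offline data or to route more lookups through the real-time fallback.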
Deploy URL categorization as the foundation of your AI agent governance strategy. One-time purchase, perpetual license, 102 million domains classified and ready.