WebsiteCategorizationAPI

URL Categorization Database Built for AI Agent Filtering

Autonomous AI agents are browsing the open web — and without a reliable categorization layer, they have zero awareness of where they are navigating. Our 102 million domain database gives your agent harness pre-classified URLs for up to 20 page types per domain. We traversed the link structure of every domain and analyzed over 10 billion individual pages through our multi-step AI pipeline to identify login pages, pricing pages, careers pages, and 17 more page types — returning the actual, verified URLs so your agents never have to guess.

102M Classified Domains
700+ IAB Categories
20+ Page Types
99.5% Internet Coverage

The Problem: AI Agents Navigate Blind

Without a URL categorization layer, autonomous agents have no mechanism to distinguish between a benign product page and a corporate admin panel.

Unfiltered Agent Access Is a Liability

When an AI agent receives an instruction like "research competitor pricing," it needs to visit dozens of websites. Without URL categorization data, the agent has no way to know whether it is landing on a public marketing page, a login portal, a payment checkout flow, or an internal HR portal. Every uncategorized navigation event is a potential compliance incident, a data exposure risk, or a brand safety violation.

  • Login page access: Agents stumble into SSO portals and authentication screens, triggering security alerts and potentially locking accounts
  • Financial page navigation: Without page-type awareness, agents can reach banking dashboards, payment gateways, and trading interfaces
  • Sensitive content exposure: Agents may browse adult, gambling, or extremist content — a direct brand safety violation for enterprise deployments
  • Shadow IT creation: Every untracked domain visit by an agent creates a shadow IT footprint your security team cannot audit

The Solution: Pre-Classified Page-Type URLs as Your Agent's Map

Our 102 million domain database provides pre-classified, verified URLs for up to 20 page types per domain. For each domain, we traversed its complete link structure — navigation menus, footer links, sitemaps, and internal references — then analyzed each discovered page through a multi-step AI classification pipeline. The result: when your agent needs the login page, pricing page, or careers page of any domain, the database returns the actual URL, not a pattern guess.

This is fundamentally different from checking whether "pricing.php" or "/login" exists in a URL path. A company's pricing page might live at /solutions/enterprise-plans, their login at /portal/access, or their careers page at /join-our-team. Our AI models examined the content and structure of each page to identify its type regardless of URL format. The database delivers these verified URLs so your agent harness can make instant allow/block decisions without runtime crawling or classification.
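
The contrast between path-pattern guessing and a verified lookup can be sketched in a few lines. This is only an illustration: the `VERIFIED_URLS` table, its keys, and the regular expression are hypothetical stand-ins and do not reflect the actual database schema.

```python
import re
from typing import Optional

# Naive approach: infer the page type from the URL path alone.
PRICING_PATTERN = re.compile(r"/(pricing|plans)(\.php)?/?$")


def guess_is_pricing(url: str) -> bool:
    """Return True only when the path ends in a conventional pricing name."""
    return bool(PRICING_PATTERN.search(url))


# Hypothetical excerpt of pre-classified rows: (domain, page_type) -> URL.
VERIFIED_URLS = {
    ("example.com", "pricing"): "https://example.com/solutions/enterprise-plans",
    ("example.com", "login"): "https://example.com/portal/access",
    ("example.com", "careers"): "https://example.com/join-our-team",
}


def verified_url(domain: str, page_type: str) -> Optional[str]:
    """Look up the stored URL for a page type, or None if not recorded."""
    return VERIFIED_URLS.get((domain, page_type))


pricing_url = verified_url("example.com", "pricing")
print(pricing_url)
print(guess_is_pricing(pricing_url))  # False: the pattern misses this path
```

The pattern check fails on the unconventional path, while the lookup returns it directly because the page type was recorded when the page was classified.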

Domain Classification Network

Visualizing how 102M domains map to IAB categories in real-time

How Domain Categorization Powers Agent Filtering

Three pipeline stages that turn a static database into a dynamic agent control plane

Link Structure Traversal

For each of our 102 million domains, we traverse the entire link structure — navigation menus, footer links, sitemaps, breadcrumbs, and internal references. We follow the actual link graph to discover every reachable page, building a comprehensive map of each website's architecture before classification begins.

Multi-Step AI Page Classification

Each discovered page passes through our sophisticated multi-step AI pipeline. The models analyze page content, HTML structure, form elements, and contextual signals to classify the page as one of 20 types: login, pricing, careers, contact, checkout, admin, settings, about, blog, documentation, and more. URL patterns alone don't determine classification — a login page at /welcome is identified just as accurately as one at /login.

Verified URL Delivery

The database returns the actual, verified URLs for each page type per domain. Query "login page for example.com" and receive the real URL that was discovered and classified — not a guessed path. Your agent harness consumes these pre-classified URLs in microseconds, skipping the entire discovery and classification phase that would otherwise take seconds to minutes per domain.

Agent Policy Decision Flow

URL → Classify → Evaluate Policy → Allow/Block/Review
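
A minimal sketch of the decision flow above, assuming the classification step has already produced a category and a page type. The `Action` enum, the `Classification` record, and the blocked sets are illustrative choices, not vendor defaults.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    REVIEW = "review"


@dataclass(frozen=True)
class Classification:
    category: str   # e.g. a web filtering category name
    page_type: str  # e.g. "login", "pricing", or "unknown"


# Illustrative policy values; tune these to your own deployment.
BLOCKED_CATEGORIES = {"Adult", "Gambling", "Malware"}
BLOCKED_PAGE_TYPES = {"login", "checkout", "admin", "settings"}


def evaluate_policy(c: Classification) -> Action:
    """Map a classification to allow, block, or review."""
    if c.category in BLOCKED_CATEGORIES or c.page_type in BLOCKED_PAGE_TYPES:
        return Action.BLOCK
    # Anything the classifier could not resolve goes to human review.
    if c.category == "Unknown" or c.page_type == "unknown":
        return Action.REVIEW
    return Action.ALLOW


print(evaluate_policy(Classification("Business and Finance", "pricing")))
```

Routing unresolved results to review rather than silently allowing or blocking them is one possible default; a stricter deployment might block them instead.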

Deep Link Analysis Pipeline

Over 10 Billion Links Individually Analyzed

To identify up to 20 key page types for each of our 102 million domains, we traversed and individually analyzed over 10 billion links using our sophisticated multi-step AI page-type identification pipeline. Each link was visited, its content examined, and its page type classified through multiple AI model stages.

10B+ Links Traversed
102M Domains Covered
20 Page Types Identified
Multi-Step AI Classification

Link Structure Traversal

For each domain, we crawl the complete link structure (navigation menus, footer links, sitemaps, and internal references) to discover every reachable page. This is not a simple URL pattern check; we follow the actual link graph of each website.

Multi-Step AI Page-Type Identification

Each discovered page is analyzed through a multi-stage AI pipeline that examines page content, structure, and context to determine its type. A login page might live at /welcome, /portal, or /access; our models identify it regardless of URL format.

Actual URLs Returned Per Domain

The database returns the real, verified URLs for each identified page type per domain. When you query for "login page of example.com," you get the actual URL: not a guess, not a pattern match, but the confirmed link discovered and classified by our pipeline.

Integration Code for Agent Filtering

Production-ready snippets to plug URL categorization into your agent harness

Python — Agent URL Filter Middleware

import http.client
import json
from urllib.parse import urlencode


class AgentURLFilter:
    """Middleware that checks every URL before an AI agent navigates."""

    BLOCKED_PAGE_TYPES = ["login", "checkout", "settings", "admin"]
    BLOCKED_CATEGORIES = ["Adult", "Illegal Content", "Malware"]

    def __init__(self, api_key):
        self.api_key = api_key
        self.conn = http.client.HTTPSConnection(
            "www.websitecategorizationapi.com"
        )

    def classify_url(self, target_url):
        # urlencode escapes characters such as "&" and "=" inside the
        # target URL so they cannot corrupt the form-encoded body.
        payload = urlencode({
            "query": target_url,
            "api_key": self.api_key,
            "data_type": "url",
            "expanded_categories": 1,
        })
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        self.conn.request(
            "POST",
            "/api/iab/iab_web_content_filtering.php",
            payload,
            headers,
        )
        res = self.conn.getresponse()
        return json.loads(res.read().decode("utf-8"))

    def should_allow(self, target_url):
        data = self.classify_url(target_url)
        # Each entry's first element is expected to read
        # "Category name: <name>"; taking the last split piece also
        # tolerates an entry without that prefix.
        categories = [
            c[0].split("Category name: ")[-1]
            for c in data.get("iab_classification", [])
        ]
        page_type = data.get("page_type", "unknown")

        if page_type in self.BLOCKED_PAGE_TYPES:
            return False, f"Blocked page type: {page_type}"
        for cat in categories:
            for blocked in self.BLOCKED_CATEGORIES:
                if blocked.lower() in cat.lower():
                    return False, f"Blocked category: {cat}"
        return True, "Navigation approved"


# Usage in agent harness (named url_filter to avoid shadowing the
# built-in filter function)
url_filter = AgentURLFilter(api_key="your_api_key")
allowed, reason = url_filter.should_allow("https://example.com/admin")
if not allowed:
    print(f"Agent blocked: {reason}")

JavaScript — Real-Time Agent Gateway

async function agentNavigationGuard(targetURL, policyRules) {
  const response = await fetch(
    "https://www.websitecategorizationapi.com" +
      "/api/iab/iab_web_content_filtering.php",
    {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: new URLSearchParams({
        query: targetURL,
        api_key: policyRules.apiKey,
        data_type: "url",
        expanded_categories: "1"
      })
    }
  );
  const classification = await response.json();
  const filterCategory =
    classification.filtering_taxonomy?.[0]?.[0]
      ?.replace("Category name: ", "") || "Unknown";

  const decision = {
    url: targetURL,
    category: filterCategory,
    action: "allow",
    timestamp: new Date().toISOString()
  };
  if (policyRules.blockedCategories.includes(filterCategory)) {
    decision.action = "block";
  }
  return decision;
}

Real-Time Classification Pipeline

102 million domains flowing through IAB taxonomy classification

Pre-Classified Page-Type URLs

Why Pre-Classified URLs for 102M Domains Change Everything for AI Agents

Having pre-classified URLs for 20 page types across 102 million domains at the start of any agent task means your agents skip the discovery phase entirely. The result: orders of magnitude faster task completion.

Orders of Magnitude Faster

Without pre-classified data, an agent must crawl each domain, follow links, load pages, and analyze content to find a login or pricing page. That takes seconds to minutes per domain. With our database, the agent gets the exact URL in under 1ms — a local lookup instead of a live crawl.

From minutes per domain to microseconds

Dramatically Lower Cost

Live crawling and AI classification at runtime burns tokens, compute, and API calls. Every page an agent visits to discover structure costs $0.01–$0.05 in LLM inference. Multiply by thousands of domains and the bill explodes. A one-time database purchase eliminates all per-query classification costs.

One-time cost vs. per-query billing
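
To make the arithmetic concrete, here is a rough worked estimate using the $0.01 to $0.05 per-page figure quoted above. The domain count matches the 1,000-company example used elsewhere on this page; the figure of 20 pages visited per domain is purely an assumption for illustration, and real crawls may touch far more or fewer pages.

```python
def runtime_discovery_cost(
    domains: int, pages_per_domain: int, cost_per_page: float
) -> float:
    """Estimated LLM inference spend for discovering pages at runtime."""
    return round(domains * pages_per_domain * cost_per_page, 2)


DOMAINS = 1_000
PAGES_PER_DOMAIN = 20  # assumption, not a measured value

low = runtime_discovery_cost(DOMAINS, PAGES_PER_DOMAIN, 0.01)
high = runtime_discovery_cost(DOMAINS, PAGES_PER_DOMAIN, 0.05)
print(f"${low:,.2f} to ${high:,.2f}")  # $200.00 to $1,000.00
```

Under these assumptions, a single 1,000-domain task lands in the hundreds of dollars, and the estimate scales linearly with the number of pages the agent has to load per domain.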

Zero Hallucination Risk

When agents guess URLs, they hallucinate. An LLM asked to find a company's pricing page might fabricate /pricing, /plans, or /packages even when none of those paths exists on the target site. Our database provides verified, real URLs that were actually discovered and classified, which takes guessed paths out of agent navigation.

Verified URLs, not AI guesses
1000x faster lookups
Zero per-query cost
100% verified URLs

AI Agent Database Pricing

Purpose-built domain databases for AI agent filtering. Includes IAB categories, 20+ page types, reputation scores, and popularity rankings. One-time purchase with perpetual license.

AI Agent Database
AI Agent Domain Database 10M
$7,999

10 Million Domains with Page-Type Intelligence

One-time purchase: Perpetual license  |  Optional Updates: $1,599/year

  • 10M+ Categorized Domains
  • IAB Taxonomies v2 & v3
  • 20+ Page Type Labels
  • Web Filtering Categories
  • OpenPageRank Scores
  • Global Popularity Rankings
  • Priority Enterprise Support

Popular
AI Agent Domain Database 20M
$14,999

20 Million Domains with Full Intelligence Suite

One-time purchase: Perpetual license  |  Optional Updates: $2,999/year

  • 20M+ Categorized Domains
  • IAB Taxonomies v2 & v3
  • 20+ Page Type Labels
  • Web Filtering Categories
  • OpenPageRank Scores
  • Global & Country Rankings
  • Dedicated Account Manager

Maximum Coverage
AI Agent Domain Database 50M
$24,999

50 Million Domains with Complete Intelligence Suite

One-time purchase: Perpetual license  |  Optional Updates: $4,999/year

  • 50M+ Categorized Domains
  • IAB Taxonomies v2 & v3
  • 20+ Page Type Labels
  • Web Filtering Categories
  • OpenPageRank Scores
  • Global & Country Rankings
  • Dedicated Account Manager

Also available: Enterprise URL Database up to 102M domains from $2,499. View all database tiers →

How Many Domains in Each Category?

Search any IAB or Web Filtering category to see how many domains are in our 102M Enterprise Database — the same data your AI agent filtering rules will reference.

Database Analytics

Domain Distribution by Category in Our 102M Enterprise Database

How 102 million domains from our main Enterprise Database are distributed across IAB v3 taxonomy classifications

Top 50 IAB v3 Categories

Spanning Tier 1 through Tier 4 classifications from our 102M Enterprise Database

IAB v3

Charts display domain counts for the top 50 out of 700+ categories in our 102M Enterprise Database. To check the number of domains for the remaining 650+ categories, use the Category Counter tool above.

IAB Taxonomy Classification Tree

700+ categories organized across 4 taxonomy tiers

Why Every Agent Harness Needs a URL Categorization Layer

The shift from chat-based AI to agentic AI means language models are no longer passively answering questions — they are actively navigating websites, clicking buttons, filling forms, and making decisions on behalf of users. This transition creates an entirely new threat surface. A chatbot that hallucinates a URL is annoying; an agent that navigates to that URL and submits credentials is a security incident.

URL categorization databases address this gap by providing the structured metadata that agents lack natively. When an agent receives a URL — whether from its own web search, a user instruction, or a tool call — the categorization layer instantly resolves it to a known category, page type, and reputation score. This resolution happens deterministically, without model inference, which means zero hallucination risk in the decision path.
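
The deterministic resolution step might look roughly like the sketch below: normalize the URL to a host, then read category, page type, and reputation from stored data with no model call. The two tables, the `Resolution` record, and the sample values are hypothetical; the shipped database layout may differ.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class Resolution(NamedTuple):
    category: str
    page_type: str
    reputation: float


# Hypothetical in-memory stand-ins for database rows.
DOMAIN_INFO = {"example.com": ("Technology & Computing", 0.92)}
PAGE_TYPES = {"example.com": {"/portal/access": "login"}}


def normalize_host(url: str) -> str:
    """Lowercase the host and drop a leading 'www.' label."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def resolve(url: str) -> Optional[Resolution]:
    """Resolve a URL to stored metadata, or None for an unknown domain."""
    host = normalize_host(url)
    info = DOMAIN_INFO.get(host)
    if info is None:
        return None
    path = urlsplit(url).path.rstrip("/") or "/"
    page_type = PAGE_TYPES.get(host, {}).get(path, "unknown")
    return Resolution(info[0], page_type, info[1])


print(resolve("https://WWW.Example.com/portal/access/?next=/home"))
```

Because the function is a pure lookup, the same URL always yields the same result, which is what keeps model inference out of the decision path.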

How We Identify Page Types: Link Traversal + Multi-Step AI Classification

Our page-type identification process is fundamentally different from simple URL pattern matching. We do not check whether a domain has a file called "pricing.php" or a path containing "/login". Instead, for each of our 102 million domains, we traverse the complete link structure of the website — following navigation menus, footer links, sitemaps, breadcrumbs, and internal references — to discover every reachable page.

Each discovered page then passes through our multi-step AI classification pipeline. The pipeline examines page content, HTML structure, form elements, heading hierarchies, and contextual signals across multiple model stages to determine whether the page is a login page, pricing page, careers page, contact page, checkout page, admin panel, settings page, about page, blog, documentation, API reference, support page, FAQ, forum, product page, legal page, privacy policy, terms of service, signup page, or homepage. A pricing page at /solutions/enterprise-plans is identified just as accurately as one at /pricing.php. A login page at /portal/access is identified just as accurately as one at /login.

The database returns the actual verified URLs for each identified page type per domain. When your agent or agent harness queries for "login page of example.com," it receives the real, confirmed URL that was discovered during link traversal and classified by our AI pipeline — not a guess, not a pattern match, but a verified link.

Why Pre-Classified URLs Mean Orders of Magnitude Faster Agent Completion

Without a pre-classified database, an AI agent tasked with finding the pricing page of 1,000 companies must crawl each website, follow links, load pages, analyze content, and determine which page is the pricing page. This process takes seconds to minutes per domain, with each step consuming LLM tokens, compute cycles, and API calls. For 1,000 domains, the agent might spend hours and hundreds of dollars completing the task.

With our database, the same task completes in milliseconds. The agent queries the database for the pricing page URL of each domain and receives verified URLs instantly. No crawling, no runtime classification, no token consumption for page analysis. The total cost is effectively zero per query because the database is a one-time purchase. This is not a marginal improvement — it is several orders of magnitude faster and cheaper than having agents discover page types from scratch.

Additionally, pre-classified URLs eliminate the hallucination problem entirely. When an LLM guesses a URL, it frequently fabricates paths that do not exist — /pricing, /plans, /packages — leading to 404 errors, wasted retries, and agents getting lost. Our database provides only URLs that were actually discovered and verified to exist, eliminating this entire failure mode.

Mapping IAB Categories to Agent Policy Rules

The IAB Content Taxonomy v3 organizes websites into a hierarchical structure with four tiers of increasing specificity. Tier 1 categories like "Technology & Computing" or "Business and Finance" provide broad domain awareness. Tier 4 categories like "Artificial Intelligence > Machine Learning > Natural Language Processing" provide granular topic resolution.

For agent filtering, the most effective approach is to define policy rules at multiple tiers simultaneously. Block all Tier 1 categories related to sensitive content (Adult, Illegal, Gambling). Allow specific Tier 2 categories that match the agent's task scope (e.g., "Business and Finance > Financial Services" for a financial research agent). Flag Tier 3 and Tier 4 categories for logging when they represent edge cases that may require human review.
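
One way to express multi-tier rules is to treat each category as a path of tiers and match prefixes. The sketch below assumes categories arrive as "Tier 1 > Tier 2 > ..." strings; the rule sets, the "Cryptocurrency" edge-case label, and the "review" default for out-of-scope categories are illustrative assumptions rather than real taxonomy entries or vendor behavior.

```python
from typing import Tuple

BLOCKED_TIER1 = {"Adult", "Illegal", "Gambling"}
ALLOWED_TIER2 = {("Business and Finance", "Financial Services")}
# Hypothetical deeper path that should be logged for human review.
FLAGGED_PATHS = {
    ("Business and Finance", "Financial Services", "Cryptocurrency"),
}


def parse_path(category_path: str) -> Tuple[str, ...]:
    """Split 'A > B > C' into ('A', 'B', 'C')."""
    return tuple(part.strip() for part in category_path.split(">"))


def evaluate(category_path: str) -> Tuple[str, bool]:
    """Return (action, flagged_for_logging) for a category path."""
    tiers = parse_path(category_path)
    if tiers[0] in BLOCKED_TIER1:
        return "block", False
    # Flag when the path starts with any configured edge-case prefix.
    flagged = any(tiers[: len(p)] == p for p in FLAGGED_PATHS)
    if tiers[:2] in ALLOWED_TIER2:
        return "allow", flagged
    return "review", flagged


print(evaluate("Business and Finance > Financial Services"))
```

Checking the Tier 1 block list first means a sensitive top-level category can never be re-allowed by a narrower rule further down.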

Web Filtering Categories for Security-First Agent Deployments

In addition to IAB taxonomy, our database includes web filtering categories specifically designed for security and compliance use cases. These categories — such as Malware, Phishing, Spam, Adult, Gambling, Weapons, and Drugs — map directly to the blocking rules that enterprise web proxies and CASBs already enforce for human users. Extending these same categories to AI agents creates a consistent security posture across your entire organization.

Deploying the Database in Your Existing Agent Stack

The 102M domain database ships as a flat file — CSV or JSON — that you can ingest into any data store. Common deployment patterns include loading the data into Redis for sub-millisecond lookups, importing into PostgreSQL for SQL-based policy queries, or embedding a SQLite file directly alongside your agent runtime. For cloud-native deployments, teams often load the data into DynamoDB or Cloud Firestore for serverless agent architectures.

Regardless of the storage backend, the integration pattern is the same: intercept the agent's navigation intent, extract the target URL, query the database, evaluate the result against your policy rules, and either allow or block the navigation before the agent's HTTP request fires.
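
As a rough sketch of the SQLite option, the following loads a CSV into an in-memory database and checks a domain before navigation. The column names (`domain`, `iab_category`, `filtering_category`) and the sample rows are assumptions made for illustration; check the delivered file for its actual headers.

```python
import csv
import io
import sqlite3

# Tiny stand-in for the delivered flat file; real headers may differ.
SAMPLE_CSV = """domain,iab_category,filtering_category
example.com,Technology & Computing,Information Technology
casino.example,Gambling,Gambling
"""


def load_database(csv_text: str) -> sqlite3.Connection:
    """Ingest CSV rows into an in-memory SQLite table keyed by domain."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE domains ("
        "domain TEXT PRIMARY KEY, iab_category TEXT, filtering_category TEXT)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO domains VALUES "
        "(:domain, :iab_category, :filtering_category)",
        rows,
    )
    conn.commit()
    return conn


def check_navigation(conn: sqlite3.Connection, domain: str, blocked: set) -> str:
    """Return 'allow', 'block', or 'unknown' for a target domain."""
    row = conn.execute(
        "SELECT filtering_category FROM domains WHERE domain = ?", (domain,)
    ).fetchone()
    if row is None:
        return "unknown"
    return "block" if row[0] in blocked else "allow"


conn = load_database(SAMPLE_CSV)
print(check_navigation(conn, "casino.example", {"Gambling"}))
```

For a file-backed deployment, replacing `":memory:"` with a file path keeps the same code while letting the database sit alongside the agent runtime.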

Addressing the Long Tail with Real-Time API Fallback

No static database covers every domain on the internet. New domains are registered at a rate of approximately 50,000 per day. To handle the long tail of newly registered, rarely visited, or dynamically generated URLs, pair the offline database with our real-time API. When a URL lookup returns no match in the local database, the agent's middleware sends the URL to the API for on-demand classification. The API response includes the same IAB categories, page types, and reputation signals as the database — ensuring consistent policy evaluation regardless of the data source.
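
The fallback logic can be sketched as a local lookup followed by an API call only on a miss. In this sketch the API is passed in as a callable so the example runs offline; `stub_api`, the record fields, and the sample data are hypothetical placeholders for the real-time API client.

```python
from typing import Callable, Dict, Optional

Record = Dict[str, str]

UNKNOWN: Record = {"category": "Unknown", "page_type": "unknown"}


def classify(
    domain: str,
    local_db: Dict[str, Record],
    api_lookup: Callable[[str], Optional[Record]],
) -> Record:
    """Prefer the local database; call the API only when it has no match."""
    record = local_db.get(domain)
    source = "database"
    if record is None:
        record = api_lookup(domain)
        source = "api"
    if record is None:
        record, source = UNKNOWN, "none"
    # Tag the result so policy logs show where the data came from.
    return {**record, "source": source}


LOCAL_DB = {
    "example.com": {"category": "Technology & Computing", "page_type": "homepage"},
}


def stub_api(domain: str) -> Optional[Record]:
    """Stand-in for the real-time API call (no network access here)."""
    fresh = {"new-site.example": {"category": "Shopping", "page_type": "homepage"}}
    return fresh.get(domain)


print(classify("new-site.example", LOCAL_DB, stub_api))
```

Both paths return the same record shape, so the policy engine does not need to know which source answered.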

Common Integration Patterns for Popular Agent Frameworks

Whether you are building on LangChain, CrewAI, AutoGen, or a custom agent framework, the integration pattern follows the same middleware approach. In LangChain, implement a custom Tool that wraps the database lookup and returns a structured allow/block decision. In CrewAI, add a pre-navigation hook to the agent's browsing tool that checks the database before each HTTP request. In AutoGen, register a function call that the agent invokes before every URL visit. The key principle is that the categorization check must execute before the navigation — not after.
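
The "check before navigation" principle can be shown without depending on any particular framework. The sketch below uses a plain Python decorator around a navigation function; it deliberately avoids LangChain, CrewAI, and AutoGen APIs, and the names `guarded`, `NavigationBlocked`, and `fetch_page` are hypothetical.

```python
import functools
from typing import Callable
from urllib.parse import urlsplit


class NavigationBlocked(Exception):
    """Raised when policy rejects a URL before any request is sent."""


def guarded(is_allowed: Callable[[str], bool]):
    """Wrap a navigation function so the policy check always runs first."""
    def decorator(navigate):
        @functools.wraps(navigate)
        def wrapper(url, *args, **kwargs):
            host = urlsplit(url).hostname or ""
            if not is_allowed(host):
                raise NavigationBlocked(f"Policy blocked navigation to {host}")
            return navigate(url, *args, **kwargs)
        return wrapper
    return decorator


BLOCKED_HOSTS = {"casino.example"}  # stand-in for a database lookup
visited = []


@guarded(lambda host: host not in BLOCKED_HOSTS)
def fetch_page(url: str) -> str:
    # A real tool would issue the HTTP request here.
    visited.append(url)
    return f"fetched {url}"


print(fetch_page("https://example.com/"))
```

Because the check lives in the wrapper, the wrapped function body never executes for a blocked URL, which is the ordering guarantee the paragraph above describes.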

Who Needs URL Categorization for Agent Filtering

The market for agent filtering is broad and growing rapidly as organizations move from pilot AI agent deployments to production. The primary buyers include enterprise security teams deploying browser-using agents like Anthropic's Computer Use, OpenAI's Operator, or Google's Project Mariner. These teams need to enforce the same URL filtering policies on agents that they already enforce on employees via web proxies and CASBs.

Platform vendors building agent orchestration tools need categorization data to offer their customers built-in governance controls. Without this data, their platforms ship with a "deploy and hope" security model that enterprise buyers will not accept.

Managed service providers operating AI agents on behalf of clients need URL categorization to prove compliance with client security policies and regulatory requirements. The database provides the audit trail: every domain the agent visited, its category, its page type, and the policy decision that was made.
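
An audit trail of this kind could be as simple as one JSON line per navigation decision. The field names below mirror the items listed above (domain, category, page type, policy decision), but the exact record layout is an assumption for illustration.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    domain: str,
    category: str,
    page_type: str,
    decision: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one navigation decision as a JSON line."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "timestamp": timestamp,
            "domain": domain,
            "category": category,
            "page_type": page_type,
            "decision": decision,
        },
        sort_keys=True,
    )


print(audit_record("example.com", "Technology & Computing", "login", "block"))
```

Appending these lines to a write-once log gives reviewers a per-visit record they can filter by domain, category, or decision.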

Coverage Matters: Why 102 Million Domains

An agent filtering database is only as good as its coverage. If 20% of the URLs an agent encounters return "unknown" from the database, your policy engine defaults to either blocking (which halts the agent's workflow) or allowing (which defeats the purpose of filtering). Our 102M domain database covers 99.5% of the active internet as measured by the Google Chrome User Experience Report. This means that for virtually every domain an agent will encounter in normal operation, the database already has a classification ready.

The remaining 0.5% — newly registered domains, parked pages, and extremely niche sites — are handled by the real-time API fallback, ensuring 100% coverage in practice.
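
Before choosing a default for unknown URLs, it can help to measure how often your own agent traffic misses the database. The sketch below computes that miss rate for a sample of visited domains; the domain lists are made-up examples, not real coverage data.

```python
from typing import Iterable, Set


def unknown_rate(visited_domains: Iterable[str], known_domains: Set[str]) -> float:
    """Fraction of visited domains with no entry in the database."""
    visited = list(visited_domains)
    if not visited:
        return 0.0
    misses = sum(1 for d in visited if d not in known_domains)
    return misses / len(visited)


KNOWN = {"example.com", "example.org", "example.net", "shop.example"}
SAMPLE_TRAFFIC = [
    "example.com",
    "example.org",
    "example.net",
    "shop.example",
    "brand-new.example",
]

print(unknown_rate(SAMPLE_TRAFFIC, KNOWN))  # 0.2
```

A measured miss rate close to zero suggests a block-by-default policy will rarely interrupt the agent, while a high one argues for wiring in the API fallback first.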

Enterprise Security Layer

Shield your infrastructure from uncontrolled agent navigation

Start Filtering Agent Traffic Today

Deploy URL categorization as the foundation of your AI agent governance strategy. One-time purchase, perpetual license, 102 million domains classified and ready.

View AI Agent Database View 102M Enterprise Database