An LLM browsing agent can read a page, summarize it, and act on it — but it cannot tell whether the site hosting that page is trustworthy. Reputation signals like OpenPageRank scores, global popularity rankings, and web filtering categories transform raw URLs into trust-scored destinations that your agent harness can evaluate before every click.
LLMs excel at language comprehension but have no built-in mechanism to assess whether the source of information is authoritative, popular, or potentially malicious.
When a browsing agent retrieves search results and begins navigating through them, it treats every page equally. A product review on a PageRank 8 domain with millions of monthly visitors carries the same weight as a review on a newly registered domain with no traffic history. The agent cannot differentiate between a well-established industry publication and a content farm created last week specifically to manipulate AI agent behavior. This source blindness leads directly to poisoned outputs, where the agent synthesizes information from untrustworthy sources and presents it as reliable.
Our 102 million domain database includes three reputation dimensions that LLM browsing agents can consume directly. OpenPageRank scores (0 to 10) measure domain authority based on link graph analysis — higher scores indicate sites that are widely referenced by other established domains. Global popularity rankings place every domain on a scale from the top 1,000 most visited sites down to the long tail, derived from the Google Chrome User Experience Report. Web filtering categories flag domains associated with malware, phishing, spam, or other security threats.
Together, these three signals give your agent harness a composite trust score for every URL before the agent loads the page. A PageRank 7 domain in the global top 50,000 with a clean web filtering category is almost certainly a legitimate, authoritative source. A PageRank 0 domain with no popularity ranking and a web filtering flag for spam is almost certainly not. The agent harness makes this evaluation deterministically, in under one millisecond, with no model inference required.
Each dimension captures a different facet of domain trustworthiness
OpenPageRank computes a 0-to-10 authority score for every domain based on the structure of the internet's link graph. A score of 8 or higher indicates a domain that is referenced by thousands of other high-authority sites — think major news outlets, government portals, and tier-1 technology companies. A score of 3 or below indicates a domain with minimal inbound link authority. Your agent harness can set a minimum PageRank threshold — for example, only trust content from domains scoring 5 or higher for factual research tasks.
Derived from the Google Chrome User Experience Report, global popularity rankings tell your agent how many real human users visit a domain. A domain in the top 10,000 receives millions of visits per month and has been validated by broad human usage. A domain outside the top 10 million receives negligible traffic and may exist solely to serve automated crawlers. Popularity rankings serve as a proxy for real-world validation — if millions of humans trust a site enough to visit it regularly, the content is more likely to be legitimate.
Web filtering categories identify domains associated with specific threat types: malware distribution, phishing campaigns, spam networks, botnet command-and-control servers, and cryptojacking scripts. These categories are maintained by security research teams and updated continuously. When a domain carries a threat category flag, your agent harness blocks the navigation unconditionally — regardless of PageRank or popularity. A high-authority domain that has been compromised and is actively distributing malware must still be blocked.
Production-ready snippets to add reputation scoring to your agent's navigation pipeline
import http.client
import json


class ReputationTrustEngine:
    """Evaluates domain trust using PageRank, popularity, and threat signals."""

    THREAT_CATEGORIES = ["Malware", "Phishing", "Spam", "Botnet", "Cryptomining"]

    def __init__(self, api_key, min_pagerank=3, max_popularity_rank=5000000):
        self.api_key = api_key
        self.min_pagerank = min_pagerank
        self.max_popularity = max_popularity_rank
        self.conn = http.client.HTTPSConnection(
            "www.websitecategorizationapi.com"
        )

    def get_reputation(self, domain):
        """Fetch PageRank, global rank, and web filtering data for a domain."""
        payload = (
            f"query={domain}"
            f"&api_key={self.api_key}"
            f"&data_type=domain"
        )
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        self.conn.request(
            "POST",
            "/api/iab/iab_web_content_filtering.php",
            payload, headers
        )
        res = self.conn.getresponse()
        return json.loads(res.read().decode("utf-8"))

    def compute_trust_score(self, domain):
        data = self.get_reputation(domain)
        pagerank = data.get("open_pagerank", 0)
        global_rank = data.get("global_rank")
        filtering = data.get("filtering_taxonomy", [])

        # Hard block: any threat category zeroes the score unconditionally
        for entry in filtering:
            cat_name = entry[0].replace("Category name: ", "")
            if any(t.lower() in cat_name.lower()
                   for t in self.THREAT_CATEGORIES):
                return {"domain": domain, "trust": "blocked",
                        "reason": f"Threat: {cat_name}", "score": 0}

        # Authority: PageRank contributes up to 50 points
        score = min(pagerank / 10.0, 1.0) * 50

        # Popularity: up to 50 points, shrinking linearly as the rank grows
        if global_rank and global_rank <= self.max_popularity:
            score += max(0, 1 - global_rank / self.max_popularity) * 50

        # Domains below the configured PageRank floor never rate above "low"
        if pagerank < self.min_pagerank:
            trust_level = "low"
        else:
            trust_level = ("high" if score >= 60
                           else "medium" if score >= 30 else "low")
        return {"domain": domain, "trust": trust_level,
                "score": round(score, 1), "pagerank": pagerank,
                "global_rank": global_rank}


# Usage
engine = ReputationTrustEngine(api_key="your_api_key", min_pagerank=4)
result = engine.compute_trust_score("example.com")
print(f"Trust: {result['trust']} (score: {result['score']})")
async function evaluateDomainTrust(domain, apiKey, opts = {}) {
  const minPageRank = opts.minPageRank || 3;
  const threatCategories = ["Malware", "Phishing", "Spam", "Botnet"];

  const res = await fetch(
    "https://www.websitecategorizationapi.com" +
      "/api/iab/iab_web_content_filtering.php",
    {
      method: "POST",
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: new URLSearchParams({
        query: domain, api_key: apiKey, data_type: "domain"
      })
    }
  );
  const data = await res.json();

  const pageRank = data.open_pagerank || 0;
  const globalRank = data.global_rank || null;
  const filterCat = data.filtering_taxonomy?.[0]?.[0]
    ?.replace("Category name: ", "") || "";

  // Hard block on threat categories, regardless of authority or popularity
  if (threatCategories.some(t =>
      filterCat.toLowerCase().includes(t.toLowerCase()))) {
    return { domain, trust: "blocked", reason: filterCat };
  }

  // Composite score: up to 50 points each for PageRank and popularity
  let score = (pageRank / 10) * 50;
  if (globalRank && globalRank <= 1000000) {
    score += Math.max(0, 1 - globalRank / 1000000) * 50;
  }

  // Domains below the minimum PageRank threshold never rate above "low"
  const trust = pageRank < minPageRank
    ? "low"
    : score >= 60 ? "high" : score >= 30 ? "medium" : "low";

  return {
    domain,
    trust,
    score: Math.round(score * 10) / 10,
    pageRank,
    globalRank
  };
}

// Usage:
// evaluateDomainTrust("example.com", "your_api_key")
//   .then(r => console.log(`Trust: ${r.trust} (score: ${r.score})`));
Purpose-built domain databases with reputation signals for LLM browsing agents. Includes PageRank scores, popularity rankings, IAB categories, and page types. One-time purchase with perpetual license.
10 Million Domains with Reputation Intelligence
One-time purchase: Perpetual license | Optional Updates: $1,599/year
20 Million Domains with Full Reputation Suite
One-time purchase: Perpetual license | Optional Updates: $2,999/year
50 Million Domains with Complete Intelligence Suite
One-time purchase: Perpetual license | Optional Updates: $4,999/year
Also available: Enterprise URL Database up to 102M domains from $2,499. View all database tiers →
Search any IAB or Web Filtering category to see how many domains are in our 102M Enterprise Database — including PageRank distribution and popularity tiers for each category.
[Charts: distribution of the 102M Enterprise Database across IAB v3 taxonomy classifications, Tier 1 through Tier 4. The charts display domain counts for the top 50 of 700+ categories; for the remaining 650+ categories, use the Category Counter tool above.]
Every major LLM provider ships agents with content safety filters that operate on the text the model generates. These filters catch harmful outputs — hate speech, violence, dangerous instructions. What they do not catch is harmful inputs. When a browsing agent navigates to a low-reputation domain and ingests content from it, the content passes through the model's context window and influences subsequent reasoning. If the content is deliberately misleading, factually incorrect, or crafted to manipulate the agent's behavior, the safety filters on the output side cannot retroactively undo the influence on the model's reasoning.
Reputation data addresses this gap by filtering inputs before they enter the model's context. A domain with a PageRank of 1, no global popularity ranking, and a web filtering flag for "spam" is almost certainly not a source the agent should be ingesting. Blocking that domain before the agent reads a single byte from it is categorically more effective than trying to detect the influence of bad content after it has already shaped the agent's reasoning chain.
OpenPageRank is an openly published reimplementation of Google's original PageRank algorithm. It computes a score from 0 to 10 for every domain on the internet based on the link graph — the network of hyperlinks connecting websites to each other. A domain earns a high PageRank when many other high-PageRank domains link to it. This recursive definition means that PageRank is not just a measure of how many links a site has, but a measure of how authoritative those linking sites are.
For AI agent trust decisions, PageRank serves as a proxy for editorial endorsement. When thousands of established websites link to a domain, they are implicitly endorsing its content as valuable enough to reference. A PageRank of 7 or higher places a domain in the top 0.1% of the internet by authority — these are major news outlets, government agencies, tier-1 technology companies, established academic institutions, and leading industry publications. A PageRank of 3 or below indicates a domain with minimal link authority — it may be new, niche, or deliberately obscure.
Setting a minimum PageRank threshold for agent navigation is the simplest reputation-based filter to implement. A threshold of 4 allows the agent to browse approximately 2 million domains with moderate to high authority. A threshold of 6 restricts the agent to approximately 200,000 highly authoritative domains. The right threshold depends on the agent's task — research tasks benefit from broader access, while financial decision-making tasks benefit from stricter authority requirements.
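A minimal sketch of that filter, reusing the ReputationTrustEngine shown earlier (the should_navigate helper and the task-to-threshold mapping are illustrative, not part of the database or API):

TASK_THRESHOLDS = {"research": 4, "financial": 6}  # illustrative policy values

def should_navigate(engine, domain, task="research"):
    """Gate navigation on the task's minimum PageRank requirement."""
    result = engine.compute_trust_score(domain)
    if result["trust"] == "blocked":
        return False  # threat flags always win
    return result["pagerank"] >= TASK_THRESHOLDS[task]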
While PageRank measures link-based authority, global popularity rankings measure actual human usage. Derived from the Google Chrome User Experience Report, these rankings reflect how many real Chrome users visit each domain on a regular basis. A domain in the global top 1,000 receives tens of millions of visits per month. A domain in the top 100,000 receives hundreds of thousands. A domain outside the top 10 million receives negligible traffic.
For agent trust decisions, popularity rankings provide a signal that PageRank alone cannot: real-world validation by human users. A domain can have a high PageRank due to historical link accumulation but be effectively abandoned — no humans visit it anymore, but the links still exist. Popularity rankings catch this case by reflecting current human usage patterns, not historical link structures. Combining PageRank with popularity ranking produces a composite signal that is significantly more robust than either dimension alone.
The practical application is straightforward: your agent harness checks both the PageRank and the popularity ranking before allowing navigation. A domain with PageRank 6 and a global rank in the top 50,000 is almost certainly a trustworthy source. A domain with PageRank 6 but no popularity ranking may have accumulated links through artificial means and warrants additional scrutiny. A domain with no PageRank and no popularity ranking should trigger the strictest evaluation — either block the navigation or require human approval before proceeding.
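Expressed as harness logic, those three cases might look like the following sketch (the action names and the 50,000-rank cutoff mirror the example above; they are illustrative policy choices, not database fields):

def navigation_policy(pagerank, global_rank):
    """Map the PageRank/popularity combination to a navigation action."""
    if pagerank >= 6 and global_rank and global_rank <= 50000:
        return "allow"              # authoritative and validated by real usage
    if pagerank >= 6 and not global_rank:
        return "extra_scrutiny"     # links may be artificial; verify further
    if not pagerank and not global_rank:
        return "block_or_escalate"  # no signals: block or ask a human
    return "score_normally"         # fall through to the composite score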
The third reputation dimension in our database is web filtering categories, which function as threat intelligence labels for domains. Unlike IAB content categories that describe what a site is about, web filtering categories describe what a site does — specifically, whether it engages in activities that pose security risks. Categories include Malware (domains that distribute malicious software), Phishing (domains that impersonate legitimate services to steal credentials), Spam (domains that exist primarily to distribute unsolicited content), Cryptomining (domains that hijack visitor CPU resources), and Botnet (domains that serve as command-and-control infrastructure for compromised networks).
For agent browsing, web filtering categories represent hard blocks. Unlike PageRank or popularity thresholds, which are configurable preferences, a domain flagged for malware distribution must be blocked unconditionally. Even if the agent's task legitimately requires visiting a domain in a specific category, the security risk of allowing navigation to a known threat domain outweighs any task completion benefit. The agent should report the block to the user and request an alternative source rather than proceeding to a flagged domain.
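In harness terms, that behavior might look like the sketch below, where notify_user and request_alternative_source are hypothetical stand-ins for whatever reporting hooks your harness exposes:

def navigate_safely(engine, domain, notify_user, request_alternative_source):
    """Enforce the hard block before a single byte is fetched."""
    result = engine.compute_trust_score(domain)
    if result["trust"] == "blocked":
        notify_user(f"Navigation blocked for {domain}: {result['reason']}")
        return request_alternative_source(domain)  # ask for another source
    return result  # safe to proceed; pass the trust score downstream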
The three reputation dimensions — PageRank authority, popularity ranking, and web filtering threat status — combine into a composite trust score that provides a single, actionable metric for the agent harness. The scoring algorithm assigns weighted points: PageRank contributes up to 50 points (scaled linearly from the 0-10 score), popularity ranking contributes up to 50 points (decreasing linearly as the rank value approaches the configured cap), and any web filtering threat flag immediately zeroes the score and triggers a hard block.
A composite score of 60 or higher indicates a high-trust domain — the agent can browse freely. A score between 30 and 59 indicates a medium-trust domain — the agent can browse but with enhanced logging. A score below 30 indicates a low-trust domain — the agent either blocks navigation or requests human confirmation. This three-tier trust model gives operators fine-grained control over how aggressively their agents filter sources.
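A worked example under that weighting, using illustrative values: a domain with PageRank 6 ranked 100,000th against a 1,000,000 rank cap earns 30 authority points plus 45 popularity points, landing in the high-trust tier:

# Illustrative composite-score arithmetic with the weights described above
pagerank, global_rank, rank_cap = 6, 100000, 1000000
score = (pagerank / 10) * 50                # 30.0 authority points
score += (1 - global_rank / rank_cap) * 50  # +45.0 popularity points
tier = "high" if score >= 60 else "medium" if score >= 30 else "low"
print(score, tier)  # 75.0 high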
Beyond browsing agents, reputation data improves the quality of retrieval-augmented generation (RAG) pipelines that pull information from the open web. When a RAG pipeline retrieves documents from multiple sources, reputation scores allow the pipeline to weight authoritative sources more heavily in the generation prompt. A document from a PageRank 8 domain can be prioritized over a document from a PageRank 2 domain, even if the lower-authority document appears more relevant based on keyword matching alone.
This reputation-weighted retrieval reduces the risk of RAG poisoning attacks, where adversaries publish content specifically designed to be retrieved by AI systems and injected into their context windows. By filtering out low-reputation sources before they enter the retrieval pool, the pipeline maintains a higher baseline of source quality across all generated outputs.
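A sketch of reputation-weighted retrieval, assuming each retrieved document carries a domain field and a keyword-relevance score between 0 and 1 (the rerank_by_reputation helper and the 60/40 blending weight are illustrative):

def rerank_by_reputation(docs, engine, min_trust_score=30, alpha=0.6):
    """Drop low-reputation sources, then blend relevance with trust."""
    scored = []
    for doc in docs:  # doc: {"domain": ..., "relevance": 0..1, "text": ...}
        trust = engine.compute_trust_score(doc["domain"])
        if trust["trust"] == "blocked" or trust["score"] < min_trust_score:
            continue  # keep low-reputation content out of the retrieval pool
        blended = alpha * doc["relevance"] + (1 - alpha) * trust["score"] / 100
        scored.append((blended, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]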
The 102M domain database includes PageRank scores and popularity rankings for every domain, pre-computed and ready for lookup. In a Redis deployment, the entire reputation index consumes approximately 4GB of memory and delivers sub-millisecond lookups. In a PostgreSQL deployment, a B-tree index on the domain column supports thousands of concurrent queries per second. For edge deployments where memory is constrained, a SQLite file containing only the reputation columns for the top 10 million domains by popularity fits in under 500MB.
The key architectural principle is that reputation lookups must complete faster than the agent's navigation latency. If the reputation check takes longer than the time needed to load the target page, operators will be tempted to skip it for performance reasons. Sub-millisecond lookups from a local database eliminate this tradeoff — the reputation check adds effectively zero latency to the agent's workflow.
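A minimal local-lookup sketch using SQLite (the reputation.db filename, table, and column names are assumptions about how you load the export, not a fixed schema):

import sqlite3

conn = sqlite3.connect("reputation.db")  # local file: no network round trip

def lookup_reputation(domain):
    """Sub-millisecond lookup against a locally indexed reputation table."""
    row = conn.execute(
        "SELECT open_pagerank, global_rank FROM domains WHERE domain = ?",
        (domain,),
    ).fetchone()
    return {"open_pagerank": row[0], "global_rank": row[1]} if row else None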
Domain reputation is not static. A legitimate website can be compromised and begin distributing malware. A new domain can accumulate authority and popularity over time, graduating from low-trust to high-trust. Our optional annual update subscription ensures your reputation data reflects the current state of the internet, with quarterly refreshes that update PageRank scores, recalculate popularity rankings, and incorporate the latest web filtering intelligence from security research feeds.
Regulatory frameworks including the EU AI Act, NIST AI Risk Management Framework, and sector-specific regulations like FINRA for financial services increasingly require organizations to demonstrate that their AI systems use reliable and traceable data sources. Reputation-scored domain intelligence provides the traceability layer: every piece of information the agent ingests is tagged with the source domain's authority score, popularity ranking, and threat status. This metadata flows through the agent's output pipeline and into compliance reports, giving regulators and auditors a clear view of source quality across the agent's operational history.
For organizations operating in regulated industries, this source-quality audit trail is not optional — it is a compliance requirement that reputation data satisfies directly and efficiently.
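A sketch of how that provenance metadata might be attached at ingestion time (the field names are illustrative, not a mandated report format):

def tag_ingested_content(text, domain, engine):
    """Attach source-quality metadata to content entering the context window."""
    rep = engine.compute_trust_score(domain)
    return {
        "content": text,
        "source_domain": domain,
        "pagerank": rep.get("pagerank"),
        "global_rank": rep.get("global_rank"),
        "trust": rep["trust"],  # flows into audit logs and compliance reports
    }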
Deploy domain reputation data as the trust foundation for your LLM browsing agents. PageRank scores, popularity rankings, and threat intelligence for 102 million domains. One-time purchase, perpetual license.