Complete technical documentation for integrating the Enterprise Domain Database into your systems. Supports CSV and JSON formats with examples in all major programming languages.
The Enterprise Domain Database provides comprehensive categorization data for millions of domains. Each domain includes multiple classification systems:
Comma-separated values file. Best for spreadsheet applications, database imports, and ETL pipelines.
JavaScript Object Notation. Best for web applications, APIs, and NoSQL databases.
Complete field reference for the enterprise database:
| Field Name | Data Type | Description | Example |
|---|---|---|---|
| Domain | String (Required) | The domain name without protocol (http/https) or www prefix | example.com |
| IAB v2 Tier 1 Category | String | Top-level IAB Content Taxonomy v2 category | Technology & Computing |
| IAB v2 Tier 2 Category | String | Second-level IAB v2 category (subcategory of Tier 1) | Computing |
| IAB v2 Tier 3 Category | String | Third-level IAB v2 category (may be empty) | Internet |
| IAB v2 Tier 4 Category | String | Fourth-level IAB v2 category (may be empty) | Cloud Computing |
| IAB v3 Tier 1 Category | String | Top-level IAB Content Taxonomy v3 category | Technology & Computing |
| IAB v3 Tier 2 Category | String | Second-level IAB v3 category (subcategory of Tier 1) | Computing |
| IAB v3 Tier 3 Category | String | Third-level IAB v3 category (may be empty) | Internet |
| IAB v3 Tier 4 Category | String | Fourth-level IAB v3 category (may be empty) | Cloud Computing |
| Web Filtering Category | String | Security/content filtering category for network policies | Computers & Technology |
| Personas | String | Target buyer personas/audience for marketing applications | Tech Enthusiasts, IT Professionals |
| OpenPageRank | Float | Domain authority score (0-10 scale) for quality assessment | 7.85 |
| Country | String | Geographic location or target market of the domain | United States |
| Simplified Category | String | Simplified broad category (33 possible values) - Additional field | Technology |
The Simplified Category field uses a simplified taxonomy of 33 broad categories designed for easy integration and filtering:
Import the entire dataset into your database for high-performance lookups. Best for applications requiring sub-millisecond response times.
Recommended for: Ad-tech platforms, content filtering systems, security appliances
Load the dataset into memory using a hash map or dictionary for instant lookups. Requires ~2-4GB RAM depending on implementation.
Recommended for: Real-time bidding, high-frequency trading, CDN edge nodes
Process the file in chunks for enrichment pipelines, data warehouses, or analytics platforms.
Recommended for: ETL pipelines, data lakes, business intelligence
Keep the file on disk and perform lookups using binary search or indexed access. Lower memory footprint but slower lookups.
Recommended for: Desktop applications, embedded systems, low-resource environments
import pandas as pd
from typing import Optional, Dict
class DomainCategorizer:
def __init__(self, csv_path: str):
"""Load the domain database into memory."""
# Load CSV with optimized dtypes for memory efficiency
self.df = pd.read_csv(csv_path, dtype={
'Domain': 'string',
'IAB v2 Tier 1 Category': 'category',
'IAB v2 Tier 2 Category': 'category',
'IAB v2 Tier 3 Category': 'category',
'IAB v2 Tier 4 Category': 'category',
'IAB v3 Tier 1 Category': 'category',
'IAB v3 Tier 2 Category': 'category',
'IAB v3 Tier 3 Category': 'category',
'IAB v3 Tier 4 Category': 'category',
'Web Filtering Category': 'category',
'Personas': 'string',
'OpenPageRank': 'float64',
'Country': 'category',
'Simplified Category': 'category'
})
# Create lookup dictionary for O(1) access
self.lookup = self.df.set_index('Domain').to_dict('index')
def get_category(self, domain: str) -> Optional[Dict]:
"""Get all categories for a domain."""
# Normalize domain (lowercase, no www)
domain = domain.lower().replace('www.', '')
return self.lookup.get(domain)
def get_simplified_category(self, domain: str) -> Optional[str]:
"""Get simplified category."""
result = self.get_category(domain)
return result['Simplified Category'] if result else None
# Usage
categorizer = DomainCategorizer('enterprise_domains.csv')
# Single lookup - access all fields
result = categorizer.get_category('google.com')
print(f"IAB v2: {result['IAB v2 Tier 1 Category']}")
print(f"IAB v3: {result['IAB v3 Tier 1 Category']}")
print(f"Web Filtering: {result['Web Filtering Category']}")
print(f"Simplified: {result['Simplified Category']}")
print(f"Personas: {result['Personas']}")
print(f"PageRank: {result['OpenPageRank']}")
print(f"Country: {result['Country']}")
# Batch processing with chunks (for very large files)
def process_in_chunks(csv_path: str, chunk_size: int = 100000):
for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
for _, row in chunk.iterrows():
# Process each domain
print(f"{row['Domain']}: {row['Simplified Category']}")
const fs = require('fs');
const readline = require('readline');
const { parse } = require('csv-parse');
class DomainCategorizer {
constructor() {
this.lookup = new Map();
}
async loadCSV(filePath) {
return new Promise((resolve, reject) => {
const parser = parse({
columns: true,
skip_empty_lines: true
});
fs.createReadStream(filePath)
.pipe(parser)
.on('data', (row) => {
const domain = row.Domain.toLowerCase().replace('www.', '');
this.lookup.set(domain, {
iabV2Tier1: row['IAB v2 Tier 1 Category'],
iabV2Tier2: row['IAB v2 Tier 2 Category'],
iabV2Tier3: row['IAB v2 Tier 3 Category'],
iabV2Tier4: row['IAB v2 Tier 4 Category'],
iabV3Tier1: row['IAB v3 Tier 1 Category'],
iabV3Tier2: row['IAB v3 Tier 2 Category'],
iabV3Tier3: row['IAB v3 Tier 3 Category'],
iabV3Tier4: row['IAB v3 Tier 4 Category'],
webFiltering: row['Web Filtering Category'],
personas: row['Personas'],
openPageRank: parseFloat(row['OpenPageRank']) || 0,
country: row['Country'],
simplified_category: row['Simplified Category']
});
})
.on('end', () => resolve())
.on('error', reject);
});
}
getCategory(domain) {
const normalized = domain.toLowerCase().replace('www.', '');
return this.lookup.get(normalized) || null;
}
getSimplifiedCategory(domain) {
const result = this.getCategory(domain);
return result ? result.simplified_category : null;
}
}
// Usage
async function main() {
const categorizer = new DomainCategorizer();
await categorizer.loadCSV('enterprise_domains.csv');
const result = categorizer.getCategory('google.com');
console.log(`IAB v2: ${result.iabV2Tier1}`);
console.log(`IAB v3: ${result.iabV3Tier1}`);
console.log(`Simplified: ${result.simplified_category}`);
console.log(`PageRank: ${result.openPageRank}`);
console.log(`Country: ${result.country}`);
// Batch lookup
const domains = ['facebook.com', 'amazon.com', 'github.com'];
domains.forEach(domain => {
const cat = categorizer.getSimplifiedCategory(domain);
console.log(`${domain}: ${cat}`);
});
}
main().catch(console.error);
<?php
class DomainCategorizer {
private array $lookup = [];
public function __construct(string $csvPath) {
$this->loadCSV($csvPath);
}
private function loadCSV(string $path): void {
if (($handle = fopen($path, 'r')) !== false) {
$headers = fgetcsv($handle);
while (($row = fgetcsv($handle)) !== false) {
if (count($row) === count($headers)) {
$data = array_combine($headers, $row);
$domain = strtolower(str_replace('www.', '', $data['Domain']));
$this->lookup[$domain] = $data;
}
}
fclose($handle);
}
}
public function getCategory(string $domain): ?array {
$normalized = strtolower(str_replace('www.', '', $domain));
return $this->lookup[$normalized] ?? null;
}
public function getSimplifiedCategory(string $domain): ?string {
$result = $this->getCategory($domain);
return $result['Simplified Category'] ?? null;
}
// Memory-efficient streaming for large files
public static function processInChunks(string $path, callable $callback): void {
if (($handle = fopen($path, 'r')) !== false) {
$headers = fgetcsv($handle);
while (($row = fgetcsv($handle)) !== false) {
if (count($row) === count($headers)) {
$data = array_combine($headers, $row);
$callback($data);
}
}
fclose($handle);
}
}
}
// Usage
$categorizer = new DomainCategorizer('enterprise_domains.csv');
$result = $categorizer->getCategory('google.com');
echo "Category: " . $result['Simplified Category'] . "\n";
// Streaming large files
DomainCategorizer::processInChunks('enterprise_domains.csv', function($row) {
echo $row['Domain'] . ': ' . $row['Simplified Category'] . "\n";
});
import java.io.*;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
public class DomainCategorizer {
private Map<String, DomainData> lookup = new ConcurrentHashMap<>();
public static class DomainData {
public String iabTier1, iabTier2, iabTier3, iabTier4;
public String webFiltering, simplifiedCategory;
}
public void loadCSV(String filePath) throws IOException {
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
String headerLine = reader.readLine();
String[] headers = headerLine.split(",");
String line;
while ((line = reader.readLine()) != null) {
String[] values = parseCSVLine(line);
if (values.length >= 7) {
DomainData data = new DomainData();
data.iabTier1 = values[1];
data.iabTier2 = values[2];
data.iabTier3 = values[3];
data.iabTier4 = values[4];
data.webFiltering = values[5];
data.simplifiedCategory = values[6];
String domain = values[0].toLowerCase().replace("www.", "");
lookup.put(domain, data);
}
}
}
}
public DomainData getCategory(String domain) {
String normalized = domain.toLowerCase().replace("www.", "");
return lookup.get(normalized);
}
public String getSimplifiedCategory(String domain) {
DomainData result = getCategory(domain);
return result != null ? result.simplifiedCategory : null;
}
private String[] parseCSVLine(String line) {
// Handle quoted CSV fields
List<String> result = new ArrayList<>();
StringBuilder current = new StringBuilder();
boolean inQuotes = false;
for (char c : line.toCharArray()) {
if (c == '"') {
inQuotes = !inQuotes;
} else if (c == ',' && !inQuotes) {
result.add(current.toString().trim());
current = new StringBuilder();
} else {
current.append(c);
}
}
result.add(current.toString().trim());
return result.toArray(new String[0]);
}
public static void main(String[] args) throws IOException {
DomainCategorizer categorizer = new DomainCategorizer();
categorizer.loadCSV("enterprise_domains.csv");
DomainData result = categorizer.getCategory("google.com");
System.out.println("Category: " + result.simplifiedCategory);
}
}
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;
public class DomainCategorizer
{
private Dictionary<string, DomainData> _lookup = new();
public class DomainData
{
public string IabTier1 { get; set; }
public string IabTier2 { get; set; }
public string IabTier3 { get; set; }
public string IabTier4 { get; set; }
public string WebFiltering { get; set; }
public string SimplifiedCategory { get; set; }
}
public void LoadCSV(string filePath)
{
using var parser = new TextFieldParser(filePath);
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
parser.HasFieldsEnclosedInQuotes = true;
// Skip header
string[] headers = parser.ReadFields();
while (!parser.EndOfData)
{
string[] fields = parser.ReadFields();
if (fields.Length >= 7)
{
var data = new DomainData
{
IabTier1 = fields[1],
IabTier2 = fields[2],
IabTier3 = fields[3],
IabTier4 = fields[4],
WebFiltering = fields[5],
SimplifiedCategory = fields[6]
};
string domain = fields[0].ToLower().Replace("www.", "");
_lookup[domain] = data;
}
}
}
public DomainData GetCategory(string domain)
{
string normalized = domain.ToLower().Replace("www.", "");
return _lookup.TryGetValue(normalized, out var data) ? data : null;
}
public string GetSimplifiedCategory(string domain)
{
var result = GetCategory(domain);
return result?.SimplifiedCategory;
}
}
// Usage
class Program
{
static void Main()
{
var categorizer = new DomainCategorizer();
categorizer.LoadCSV("enterprise_domains.csv");
var result = categorizer.GetCategory("google.com");
Console.WriteLine($"Category: {result?.SimplifiedCategory}");
}
}
package main
import (
"encoding/csv"
"fmt"
"os"
"strings"
"sync"
)
type DomainData struct {
IABTier1 string
IABTier2 string
IABTier3 string
IABTier4 string
WebFiltering string
SimplifiedCategory string
}
type DomainCategorizer struct {
lookup map[string]DomainData
mu sync.RWMutex
}
func NewDomainCategorizer() *DomainCategorizer {
return &DomainCategorizer{
lookup: make(map[string]DomainData),
}
}
func (dc *DomainCategorizer) LoadCSV(filePath string) error {
file, err := os.Open(filePath)
if err != nil {
return err
}
defer file.Close()
reader := csv.NewReader(file)
// Skip header
_, err = reader.Read()
if err != nil {
return err
}
for {
record, err := reader.Read()
if err != nil {
break
}
if len(record) >= 7 {
domain := strings.ToLower(strings.Replace(record[0], "www.", "", 1))
dc.lookup[domain] = DomainData{
IABTier1: record[1],
IABTier2: record[2],
IABTier3: record[3],
IABTier4: record[4],
WebFiltering: record[5],
SimplifiedCategory: record[6],
}
}
}
return nil
}
func (dc *DomainCategorizer) GetCategory(domain string) (DomainData, bool) {
dc.mu.RLock()
defer dc.mu.RUnlock()
normalized := strings.ToLower(strings.Replace(domain, "www.", "", 1))
data, exists := dc.lookup[normalized]
return data, exists
}
func (dc *DomainCategorizer) GetSimplifiedCategory(domain string) string {
if data, exists := dc.GetCategory(domain); exists {
return data.SimplifiedCategory
}
return ""
}
func main() {
categorizer := NewDomainCategorizer()
err := categorizer.LoadCSV("enterprise_domains.csv")
if err != nil {
panic(err)
}
if data, exists := categorizer.GetCategory("google.com"); exists {
fmt.Printf("Category: %s\n", data.SimplifiedCategory)
}
}
require 'csv'
class DomainCategorizer
def initialize(csv_path)
@lookup = {}
load_csv(csv_path)
end
def load_csv(path)
CSV.foreach(path, headers: true) do |row|
domain = row['Domain'].downcase.gsub('www.', '')
@lookup[domain] = {
iab_tier1: row['IAB v3 Tier 1 Category'],
iab_tier2: row['IAB v3 Tier 2 Category'],
iab_tier3: row['IAB v3 Tier 3 Category'],
iab_tier4: row['IAB v3 Tier 4 Category'],
web_filtering: row['Web Filtering Category'],
simplified_category: row['Simplified Category']
}
end
end
def get_category(domain)
normalized = domain.downcase.gsub('www.', '')
@lookup[normalized]
end
def get_simplified_category(domain)
result = get_category(domain)
result&.dig(:simplified_category)
end
# Memory-efficient streaming
def self.process_in_chunks(path, &block)
CSV.foreach(path, headers: true) do |row|
yield row.to_h
end
end
end
# Usage
categorizer = DomainCategorizer.new('enterprise_domains.csv')
result = categorizer.get_category('google.com')
puts "Category: #{result[:simplified_category]}"
# Streaming
DomainCategorizer.process_in_chunks('enterprise_domains.csv') do |row|
puts "#{row['Domain']}: #{row['Simplified Category']}"
end
[
{
"domain": "google.com",
"iab_v2_tier1": "Technology & Computing",
"iab_v2_tier2": "Computing",
"iab_v2_tier3": "Internet",
"iab_v2_tier4": "Search",
"iab_v3_tier1": "Technology & Computing",
"iab_v3_tier2": "Computing",
"iab_v3_tier3": "Internet",
"iab_v3_tier4": "Search",
"web_filtering": "Search Engines & Platforms",
"personas": "Tech Enthusiasts, IT Professionals",
"open_page_rank": 10.0,
"country": "United States",
"simplified_category": "Technology"
},
{
"domain": "amazon.com",
"iab_v2_tier1": "Shopping",
"iab_v2_tier2": "Consumer Electronics",
"iab_v2_tier3": "",
"iab_v2_tier4": "",
"iab_v3_tier1": "Shopping",
"iab_v3_tier2": "Consumer Electronics",
"iab_v3_tier3": "",
"iab_v3_tier4": "",
"web_filtering": "Shopping",
"personas": "Online Shoppers, Deal Seekers",
"open_page_rank": 10.0,
"country": "United States",
"simplified_category": "Shopping"
}
]
import json
import ijson # For streaming large JSON files
class DomainCategorizerJSON:
def __init__(self, json_path: str):
"""Load JSON database into memory."""
with open(json_path, 'r', encoding='utf-8') as f:
data = json.load(f)
self.lookup = {
item['domain'].lower(): item
for item in data
}
def get_category(self, domain: str):
normalized = domain.lower().replace('www.', '')
return self.lookup.get(normalized)
# Streaming for large files (NDJSON)
@staticmethod
def stream_ndjson(path: str):
with open(path, 'r', encoding='utf-8') as f:
for line in f:
if line.strip():
yield json.loads(line)
# Streaming large JSON array with ijson
@staticmethod
def stream_json_array(path: str):
with open(path, 'rb') as f:
for item in ijson.items(f, 'item'):
yield item
# Usage
categorizer = DomainCategorizerJSON('enterprise_domains.json')
result = categorizer.get_category('google.com')
print(f"Category: {result['simplified_category']}")
# Stream processing
for domain in DomainCategorizerJSON.stream_ndjson('domains.ndjson'):
print(f"{domain['domain']}: {domain['simplified_category']}")
const fs = require('fs');
const readline = require('readline');
class DomainCategorizerJSON {
constructor() {
this.lookup = new Map();
}
async loadJSON(filePath) {
const data = JSON.parse(fs.readFileSync(filePath, 'utf8'));
data.forEach(item => {
this.lookup.set(item.domain.toLowerCase(), item);
});
}
getCategory(domain) {
const normalized = domain.toLowerCase().replace('www.', '');
return this.lookup.get(normalized) || null;
}
// Stream NDJSON (newline-delimited JSON)
async *streamNDJSON(filePath) {
const rl = readline.createInterface({
input: fs.createReadStream(filePath),
crlfDelay: Infinity
});
for await (const line of rl) {
if (line.trim()) {
yield JSON.parse(line);
}
}
}
}
// Usage
async function main() {
const categorizer = new DomainCategorizerJSON();
await categorizer.loadJSON('enterprise_domains.json');
const result = categorizer.getCategory('google.com');
console.log(`Category: ${result.simplified_category}`);
// Stream processing
for await (const domain of categorizer.streamNDJSON('domains.ndjson')) {
console.log(`${domain.domain}: ${domain.simplified_category}`);
}
}
main();
<?php
class DomainCategorizerJSON {
private array $lookup = [];
public function __construct(string $jsonPath) {
$data = json_decode(file_get_contents($jsonPath), true);
foreach ($data as $item) {
$domain = strtolower($item['domain']);
$this->lookup[$domain] = $item;
}
}
public function getCategory(string $domain): ?array {
$normalized = strtolower(str_replace('www.', '', $domain));
return $this->lookup[$normalized] ?? null;
}
// Stream large JSON files
public static function streamJSON(string $path, callable $callback): void {
$handle = fopen($path, 'r');
$buffer = '';
$inObject = false;
$depth = 0;
while (!feof($handle)) {
$char = fgetc($handle);
if ($char === '{') {
$inObject = true;
$depth++;
}
if ($inObject) {
$buffer .= $char;
if ($char === '}') {
$depth--;
if ($depth === 0) {
$item = json_decode($buffer, true);
if ($item) $callback($item);
$buffer = '';
$inObject = false;
}
}
}
}
fclose($handle);
}
}
// Usage
$categorizer = new DomainCategorizerJSON('enterprise_domains.json');
$result = $categorizer->getCategory('google.com');
echo "Category: " . $result['simplified_category'] . "\n";
# Pretty print and filter with jq
cat enterprise_domains.json | jq '.[] | select(.simplified_category == "Technology")'
# Extract specific fields (all categories)
cat enterprise_domains.json | jq -r '.[] | "\(.domain),\(.iab_v2_tier1),\(.iab_v3_tier1),\(.simplified_category)"'
# Filter by country and PageRank
cat enterprise_domains.json | jq '.[] | select(.country == "United States" and .open_page_rank > 7)'
# Convert to NDJSON for streaming
cat enterprise_domains.json | jq -c '.[]' > domains.ndjson
# Count domains per Simplified category
cat enterprise_domains.json | jq -r '.[].simplified_category' | sort | uniq -c | sort -rn
# Count domains per country
cat enterprise_domains.json | jq -r '.[].country' | sort | uniq -c | sort -rn
CREATE TABLE domain_categories (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
domain VARCHAR(255) NOT NULL,
iab_v2_tier1 VARCHAR(100),
iab_v2_tier2 VARCHAR(100),
iab_v2_tier3 VARCHAR(100),
iab_v2_tier4 VARCHAR(100),
iab_v3_tier1 VARCHAR(100),
iab_v3_tier2 VARCHAR(100),
iab_v3_tier3 VARCHAR(100),
iab_v3_tier4 VARCHAR(100),
web_filtering VARCHAR(100),
personas TEXT,
open_page_rank DECIMAL(4,2),
country VARCHAR(100),
simplified_category VARCHAR(50),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY idx_domain (domain),
KEY idx_simplified_category (simplified_category),
KEY idx_web_filtering (web_filtering),
KEY idx_iab_v2_tier1 (iab_v2_tier1),
KEY idx_iab_v3_tier1 (iab_v3_tier1),
KEY idx_country (country)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
-- Import from CSV
LOAD DATA INFILE '/path/to/enterprise_domains.csv'
INTO TABLE domain_categories
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(domain, iab_v2_tier1, iab_v2_tier2, iab_v2_tier3, iab_v2_tier4,
iab_v3_tier1, iab_v3_tier2, iab_v3_tier3, iab_v3_tier4,
web_filtering, personas, open_page_rank, country, simplified_category);
-- Query examples
SELECT * FROM domain_categories WHERE domain = 'google.com';
SELECT simplified_category, COUNT(*) as count FROM domain_categories GROUP BY simplified_category ORDER BY count DESC;
SELECT * FROM domain_categories WHERE country = 'United States' AND open_page_rank > 5;
CREATE TABLE domain_categories (
id BIGSERIAL PRIMARY KEY,
domain VARCHAR(255) NOT NULL UNIQUE,
iab_v2_tier1 VARCHAR(100),
iab_v2_tier2 VARCHAR(100),
iab_v2_tier3 VARCHAR(100),
iab_v2_tier4 VARCHAR(100),
iab_v3_tier1 VARCHAR(100),
iab_v3_tier2 VARCHAR(100),
iab_v3_tier3 VARCHAR(100),
iab_v3_tier4 VARCHAR(100),
web_filtering VARCHAR(100),
personas TEXT,
open_page_rank DECIMAL(4,2),
country VARCHAR(100),
simplified_category VARCHAR(50),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_domain_categories_simplified ON domain_categories(simplified_category);
CREATE INDEX idx_domain_categories_filtering ON domain_categories(web_filtering);
CREATE INDEX idx_domain_categories_country ON domain_categories(country);
CREATE INDEX idx_domain_categories_pagerank ON domain_categories(open_page_rank);
-- Import from CSV
COPY domain_categories(domain, iab_v2_tier1, iab_v2_tier2, iab_v2_tier3, iab_v2_tier4,
iab_v3_tier1, iab_v3_tier2, iab_v3_tier3, iab_v3_tier4,
web_filtering, personas, open_page_rank, country, simplified_category)
FROM '/path/to/enterprise_domains.csv'
DELIMITER ','
CSV HEADER;
// Create collection with schema validation
db.createCollection("domains", {
validator: {
$jsonSchema: {
bsonType: "object",
required: ["domain"],
properties: {
domain: { bsonType: "string" },
iab_v2_tier1: { bsonType: "string" },
iab_v2_tier2: { bsonType: "string" },
iab_v2_tier3: { bsonType: "string" },
iab_v2_tier4: { bsonType: "string" },
iab_v3_tier1: { bsonType: "string" },
iab_v3_tier2: { bsonType: "string" },
iab_v3_tier3: { bsonType: "string" },
iab_v3_tier4: { bsonType: "string" },
web_filtering: { bsonType: "string" },
personas: { bsonType: "string" },
open_page_rank: { bsonType: "double" },
country: { bsonType: "string" },
simplified_category: { bsonType: "string" }
}
}
}
});
// Create indexes
db.domains.createIndex({ domain: 1 }, { unique: true });
db.domains.createIndex({ simplified_category: 1 });
db.domains.createIndex({ web_filtering: 1 });
db.domains.createIndex({ country: 1 });
db.domains.createIndex({ open_page_rank: -1 });
// Import using mongoimport
// mongoimport --db categorization --collection domains --type json --file enterprise_domains.json --jsonArray
// Query examples
db.domains.findOne({ domain: "google.com" });
db.domains.find({ country: "United States", open_page_rank: { $gt: 7 } });
db.domains.aggregate([
{ $group: { _id: "$simplified_category", count: { $sum: 1 } } },
{ $sort: { count: -1 } }
]);
# Python example for Redis import
import redis
import json
import csv
r = redis.Redis(host='localhost', port=6379, db=0)
pipe = r.pipeline()
# Import from CSV
with open('enterprise_domains.csv', 'r') as f:
reader = csv.DictReader(f)
for i, row in enumerate(reader):
domain = row['Domain'].lower()
data = json.dumps({
'iab_v2_tier1': row['IAB v2 Tier 1 Category'],
'iab_v3_tier1': row['IAB v3 Tier 1 Category'],
'web_filtering': row['Web Filtering Category'],
'personas': row['Personas'],
'open_page_rank': row['OpenPageRank'],
'country': row['Country'],
'simplified_category': row['Simplified Category']
})
pipe.set(f"domain:{domain}", data)
# Execute in batches
if i % 10000 == 0:
pipe.execute()
pipe = r.pipeline()
pipe.execute()
# Query
result = r.get("domain:google.com")
data = json.loads(result) if result else None
print(f"IAB v2: {data['iab_v2_tier1']}, Simplified: {data['simplified_category']}")
Use categorical data types (pandas) or interned strings to reduce memory usage by 60-80%. For millions of domains, expect ~2GB RAM with optimization vs ~8GB without.
Always create indexes on the domain column (unique) and category columns. Use hash indexes for exact lookups and B-tree for range queries.
When processing the full dataset, use streaming/chunked reading instead of loading everything into memory. Process in batches of 100K-500K records.
Store compressed (gzip) copies for archival. Decompress on-the-fly during import. Typical compression ratio: 85-90% size reduction.
For monthly updates, use diff-based updates or UPSERT operations rather than full table replacement to minimize downtime.
The Enterprise Database is updated monthly with:
Updates are delivered via:
Need help with integration? Our technical team is here to assist:
Dedicated integration assistance available