Custom Preprocessing

Learn how to create custom preprocessing strategies for domain-specific log formats.

Overview

Preprocessing transforms raw log lines before template extraction. It's crucial for:

Masking variable data (IDs, tokens, values)
Normalizing inconsistent formatting
Handling domain-specific patterns
Improving compression quality

Default Preprocessing

logpare includes built-in patterns for common variables:

import { DEFAULT_PATTERNS } from 'logpare';

console.log(DEFAULT_PATTERNS);
// {
//   ipv4: /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/g,
//   ipv6: /([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}/g,
//   uuid: /[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/gi,
//   timestamp: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/g,
//   hexId: /0x[0-9a-fA-F]+/g,
//   filepath: /\/[\w\-\.\/]+/g,
//   url: /https?:\/\/[^\s]+/g,
//   number: /\b\d+\b/g,
//   // ... and more
// }

These patterns automatically replace matching values with <*>.

Creating Custom Strategies

Use defineStrategy() to create a custom preprocessing strategy:

import { defineStrategy } from 'logpare';

const customStrategy = defineStrategy({
  preprocess(line: string): string {
    // Transform the line
    return line;
  },

  tokenize(line: string): string[] {
    // Split line into tokens
    return line.split(/\s+/).filter(Boolean);
  },

  getSimThreshold(depth: number): number {
    // Return similarity threshold for this depth
    return 0.4;
  }
});

All three methods are optional. Only override what you need.

Common Patterns

Adding Custom ID Patterns

Mask application-specific identifiers:

import { defineStrategy, DEFAULT_PATTERNS, WILDCARD } from 'logpare';

const strategy = defineStrategy({
  preprocess(line: string): string {
    let result = line;

    // Apply default patterns first
    for (const [, pattern] of Object.entries(DEFAULT_PATTERNS)) {
      result = result.replace(pattern, WILDCARD);
    }

    // Add custom patterns
    result = result.replace(/order-[A-Z0-9]{8}/g, WILDCARD);
    result = result.replace(/user_\d+/g, WILDCARD);
    result = result.replace(/session-[a-f0-9]{32}/gi, WILDCARD);
    result = result.replace(/REQ-\d{10}/g, WILDCARD);

    return result;
  }
});

compress(logs, { preprocessing: strategy });

E-commerce Logs

const ecommerceStrategy = defineStrategy({
  preprocess(line: string): string {
    return line
      // Order IDs
      .replace(/order-[A-Z0-9]{8}/g, '<*>')
      // Product SKUs
      .replace(/SKU-\d{6}/g, '<*>')
      // Prices
      .replace(/\$\d+\.\d{2}/g, '<*>')
      // Customer IDs
      .replace(/cust_[a-z0-9]{16}/g, '<*>')
      // Cart IDs
      .replace(/cart-[A-Z0-9]{12}/g, '<*>')
      // Apply defaults
      .replace(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/g, '<*>');
  }
});

Multi-tenant SaaS Logs

const saasStrategy = defineStrategy({
  preprocess(line: string): string {
    return line
      // Tenant IDs
      .replace(/tenant-[a-z0-9]{16}/g, '<*>')
      // Organization IDs
      .replace(/org_[A-Z0-9]{12}/g, '<*>')
      // Workspace IDs
      .replace(/workspace_\d+/g, '<*>')
      // API keys (partial)
      .replace(/sk_live_[A-Za-z0-9]{24}/g, '<*>')
      .replace(/pk_live_[A-Za-z0-9]{24}/g, '<*>')
      // User emails
      .replace(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b/g, '<*>');
  }
});

Kubernetes/Container Logs

const k8sStrategy = defineStrategy({
  preprocess(line: string): string {
    return line
      // Pod names
      .replace(/\b[a-z0-9-]+-[a-z0-9]{8,10}-[a-z0-9]{5}\b/g, '<*>')
      // Container IDs
      .replace(/[0-9a-f]{64}/g, '<*>')
      // Deployment revision
      .replace(/revision=\d+/g, 'revision=<*>')
      // Resource versions
      .replace(/resourceVersion:\s*"\d+"/g, 'resourceVersion:"<*>"')
      // Image tags
      .replace(/(:\s*v?\d+\.\d+\.\d+(-[\w\.]+)?)/g, ':<*>');
  }
});

Custom Tokenization

CSV Logs

Split on commas instead of whitespace:

const csvStrategy = defineStrategy({
  tokenize(line: string): string[] {
    return line.split(',').map(token => token.trim());
  }
});

Tab-Separated Logs

const tsvStrategy = defineStrategy({
  tokenize(line: string): string[] {
    return line.split('\t').filter(Boolean);
  }
});

JSON Logs

Extract specific fields for tokenization:

const jsonStrategy = defineStrategy({
  preprocess(line: string): string {
    try {
      const parsed = JSON.parse(line);
      // Create a normalized format
      return `${parsed.level} ${parsed.message || ''} ${parsed.context || ''}`;
    } catch {
      // Fallback for non-JSON lines
      return line;
    }
  },

  tokenize(line: string): string[] {
    return line.split(/\s+/).filter(Boolean);
  }
});

Depth-Based Similarity Thresholds

Adjust matching strictness by tree depth:

const adaptiveStrategy = defineStrategy({
  getSimThreshold(depth: number): number {
    // More lenient for shallow depths (first few tokens)
    if (depth <= 2) return 0.3;

    // Default for middle depths
    if (depth <= 4) return 0.4;

    // Stricter for deeper levels
    return 0.5;
  }
});

Use case: When initial tokens are highly variable but later tokens are consistent.

Testing Custom Strategies

Verify your strategy works as expected:

import { defineStrategy, WILDCARD } from 'logpare';

const strategy = defineStrategy({
  preprocess(line: string): string {
    return line.replace(/order-[A-Z0-9]{8}/g, WILDCARD);
  }
});

// Test preprocessing
const input = 'Processing order-ABC12345 for user 123';
const output = strategy.preprocess(input);

console.log(output);
// Expected: "Processing <*> for user 123"

// Test with compress
const result = compress([
  'Processing order-ABC12345 for user 123',
  'Processing order-XYZ98765 for user 456',
], { preprocessing: strategy });

console.log(result.templates[0].pattern);
// Expected: "Processing <*> for user <*>"

Best Practices

Apply defaults first - Start with DEFAULT_PATTERNS then add custom patterns
Test incrementally - Add patterns one at a time and verify results
Be specific - Use precise regex to avoid over-matching
Cache patterns - Compile regex once, reuse many times
Document patterns - Comment what each pattern matches
Validate input - Handle malformed logs gracefully
Monitor performance - Complex regex can slow processing

Debugging Tips

Inspect Preprocessing Output

const strategy = defineStrategy({
  preprocess(line: string): string {
    const result = line.replace(/custom-pattern/g, '<*>');
    console.log(`Before: ${line}`);
    console.log(`After: ${result}`);
    return result;
  }
});

Check Pattern Matches

const testPattern = /order-[A-Z0-9]{8}/g;
const testLine = 'Processing order-ABC12345';

const matches = testLine.match(testPattern);
console.log('Matches:', matches);
// Output: ["order-ABC12345"]

Compare Results

// Without custom preprocessing
const result1 = compress(logs);
console.log(`Templates: ${result1.stats.uniqueTemplates}`);

// With custom preprocessing
const result2 = compress(logs, { preprocessing: customStrategy });
console.log(`Templates: ${result2.stats.uniqueTemplates}`);
console.log(`Improvement: ${result1.stats.uniqueTemplates - result2.stats.uniqueTemplates}`);