Spam Classifier

Find it on GitHub: /edgee-cloud/spam-classifier-component

The Spam Classifier component is a high-performance machine learning edge function that provides real-time spam detection using a Naive Bayes classifier. This Wasm-based component runs at the edge, offering fast and accurate content classification without requiring external services or additional backend infrastructure.

What is the Spam Classifier Component?

The Spam Classifier component is a security-focused edge function that:

Uses Naive Bayes algorithms with Finite State Transducers (FST) for optimal performance
Provides real-time text classification with confidence scores
Supports multi-language content analysis
Runs entirely at the edge as a Wasm component
Returns detailed classification results with spam probabilities
Requires no external API calls or backend dependencies

Getting Started

Access Component Library

Open the Edgee console and navigate to your project’s Components section.

Add Spam Classifier Component

Select “Add a component” and choose edgee/spam-classifier from the list of available components.

Configure Component Settings

Set up the component configuration:

Endpoint path: Configure the URL path (e.g., /classify or /spam-check)
Spam classification threshold (optional): Set the probability threshold (default: 0.80 works well for most cases)
Laplace smoothing factor (optional): Configure Laplace smoothing (default: 1.0 provides good balance)

Deploy Component

Click Save to deploy the component to your edge infrastructure.

The component will be available at your configured endpoint within minutes.

Configuration

When adding the Spam Classifier component to your project through the Edgee console, you can customize its behavior with these settings:

path (string, required, default: "/classify")

The URL path where the spam classifier will be accessible. This endpoint will receive POST requests with text content for classification.

spam_threshold (number, optional, default: 0.80)

Probability threshold at or above which content is classified as spam. Range: 0.0-1.0. The default of 0.80 works well for most use cases. Higher values flag less content as spam, which reduces false positives but may miss subtle spam. Lower values catch more spam but increase false positives.

laplace_smoothing_factor (number, optional, default: 1.0)

Smoothing parameter for the Naive Bayes classifier that handles unseen tokens. Range: 0.0 and above. The default of 1.0 provides a good balance for most content types. Higher values reduce the influence of individual tokens and so produce more conservative classifications.

Configuration changes take effect immediately without requiring component redeployment. You can adjust these values based on your content patterns and false positive tolerance.
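To illustrate what the smoothing factor does, here is a minimal sketch of additive (Laplace) smoothing as it is commonly defined for Naive Bayes. The function name, the counts, and the vocabulary size are hypothetical and are not taken from the component's source; the component's internal formula may differ in detail.

```javascript
// Additive (Laplace) smoothing for a token's class-conditional probability.
// All names and numbers here are illustrative assumptions, not the component's API.
function smoothedTokenProbability(tokenCount, totalTokens, vocabularySize, alpha) {
  // (count + alpha) / (total + alpha * V)
  return (tokenCount + alpha) / (totalTokens + alpha * vocabularySize);
}

// Hypothetical class statistics: 1,000 tokens seen, 500 distinct tokens in the vocabulary.
const totalTokens = 1000;
const vocabularySize = 500;

// Probability assigned to a token that was never seen in this class:
const unseenNoSmoothing = smoothedTokenProbability(0, totalTokens, vocabularySize, 0.0);
const unseenDefault = smoothedTokenProbability(0, totalTokens, vocabularySize, 1.0);
const unseenHeavy = smoothedTokenProbability(0, totalTokens, vocabularySize, 10.0);

console.log({ unseenNoSmoothing, unseenDefault, unseenHeavy });
```

With a factor of 0, a single unseen token drives the whole class probability to zero. Larger factors pull every token probability toward 1 / vocabularySize, so no single token dominates the result.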

API Reference

Request Parameters

input (string, required)

The text content to classify for spam detection.

Example Request

curl -X POST https://yourdomain.com/classify \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, how are you today? I hope you are having a great day!"
  }'

Response Fields

spam_probability (float)

Probability that the content is spam, ranging from 0.0 to 1.0. Values closer to 1.0 indicate higher likelihood of spam.

ham_probability (float)

Probability that the content is legitimate (ham), ranging from 0.0 to 1.0. Always equals 1.0 - spam_probability.

is_spam (boolean)

Whether the content meets or exceeds the configured spam threshold: true if spam_probability >= spam_threshold.

confidence (float)

Classification confidence level, calculated as the absolute difference between spam and ham probabilities.

text (string)

The original input text that was classified, echoed back in the response.

Example Response

{
  "text": "Hello, how are you today? I hope you are having a great day!",
  "spam_probability": 0.23,
  "ham_probability": 0.77,
  "is_spam": false,
  "confidence": 0.54
}
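The relationships between the response fields can be checked with a few lines of code. This is a sketch based only on the field descriptions above; `deriveFields` is a hypothetical helper and not part of the component.

```javascript
// Derive the documented response fields from a spam probability and a threshold.
// `deriveFields` is a hypothetical helper for illustration only.
function deriveFields(spamProbability, spamThreshold = 0.8) {
  const hamProbability = 1.0 - spamProbability;
  return {
    spam_probability: spamProbability,
    ham_probability: hamProbability,
    // Flagged when the probability meets or exceeds the threshold.
    is_spam: spamProbability >= spamThreshold,
    // Absolute difference between the two class probabilities.
    confidence: Math.abs(spamProbability - hamProbability),
  };
}

const derived = deriveFields(0.23);
console.log(derived.is_spam);               // false
console.log(derived.confidence.toFixed(2)); // 0.54
```

A confidence near 0 means the two probabilities are close to 0.5 each, so the classifier is effectively undecided regardless of the value of is_spam.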

Usage Examples

const classifyText = async (text) => {
  try {
    const response = await fetch('/classify', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ input: text })
    });
    
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    
    const result = await response.json();
    
    if (result.is_spam) {
      console.log(`Spam detected (spam probability: ${(result.spam_probability * 100).toFixed(1)}%)`);
    } else {
      console.log(`Content is legitimate with ${(result.confidence * 100).toFixed(1)}% confidence`);
    }
    
    return result;
  } catch (error) {
    console.error('Classification error:', error);
    throw error;
  }
};

// Example usage with different content types
// (top-level await requires an ES module; otherwise wrap these calls in an async function)
await classifyText("Hello, how are you today?");
await classifyText("Buy now! Limited time offer! Click here for amazing deals!");

Form Validation Integration

// Simple form validation with spam detection
document.getElementById('messageForm').addEventListener('submit', async (e) => {
  e.preventDefault();
  
  const messageInput = document.getElementById('messageInput');
  const message = messageInput.value;
  
  try {
    // Check for spam
    const response = await fetch('/classify', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input: message })
    });
    
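    // Treat non-2xx responses as a failed spam check
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const result = await response.json();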
    
    if (result.is_spam) {
      alert(`Message appears to be spam (spam probability: ${(result.spam_probability * 100).toFixed(1)}%)`);
      return;
    }
    
    // Submit the form if content is legitimate
    console.log('Message validated, submitting form');
    // Your form submission logic here
    
  } catch (error) {
    console.error('Validation error:', error);
    // Proceed with submission if spam check fails
  }
});

Performance Characteristics

The Spam Classifier component delivers exceptional performance at the edge:

Benchmark Results (x86, native)

Short text (~5 words): ~28 µs processing time (72K tokens/sec)
Medium text (~15 words): ~66 µs processing time (227K tokens/sec)
Long text (~62 words): ~128 µs processing time (484K tokens/sec)

Key Performance Features

Token lookup using Finite State Transducers, with cost proportional to token length rather than vocabulary size
64-bit packed counters for memory efficiency
Log-space calculations to prevent numerical underflow
Static model embedding eliminates file I/O overhead
Sub-millisecond response times for most content
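The reason for working in log-space can be shown with a small experiment. The sketch below uses made-up token probabilities and a hypothetical helper name; it illustrates the general technique, not the component's actual code.

```javascript
// Compare a direct product of token probabilities with a sum of logarithms.
// The probabilities are made-up values chosen only to show the numerical problem.
const tokenProbabilities = new Array(400).fill(1e-3);

// Direct product: the true value (1e-1200) is far below the smallest positive
// double, so the running product collapses to 0.
const directProduct = tokenProbabilities.reduce((acc, p) => acc * p, 1);

// Log-space: the same quantity is a finite sum of logs.
const logSum = tokenProbabilities.reduce((acc, p) => acc + Math.log(p), 0);

// Turn two class log-scores into a spam probability by applying a logistic
// function to their difference, so the tiny raw products are never formed.
function spamProbabilityFromLogScores(logSpamScore, logHamScore) {
  return 1 / (1 + Math.exp(logHamScore - logSpamScore));
}

console.log(directProduct);                                // 0
console.log(Number.isFinite(logSum));                      // true
console.log(spamProbabilityFromLogScores(logSum, logSum)); // 0.5
```

Once both class products have collapsed to 0, they can no longer be compared, whereas the difference of the two log-scores remains meaningful.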

Technical Details

Machine Learning Architecture

The component uses a Naive Bayes implementation optimized for edge computing:

Text Processing Pipeline: Advanced Unicode tokenization with multi-language support
Feature Engineering: Alphanumeric token filtering and text normalization
Classification Algorithm: Naive Bayes with configurable Laplace smoothing
Performance Optimization: FST-based token lookup and log-space probability calculations

Text Processing Features

Advanced Unicode-aware tokenization that handles:

Multi-language text processing
Special character normalization
Stemming and lowercase conversion
Stop word filtering
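As a rough illustration of the steps listed above, the following sketch tokenizes text with Unicode property escapes, lowercases it, and drops stop words. The stop-word list is a tiny hypothetical sample, the use of NFKC normalization is an assumption, and stemming is intentionally left out; the component's own tokenizer is not shown in this document and may behave differently.

```javascript
// A minimal Unicode-aware tokenizer sketch. The stop-word list is a small
// hypothetical sample, and stemming is deliberately omitted.
const STOP_WORDS = new Set(["the", "a", "an", "and", "is", "are", "you", "to"]);

function tokenize(text) {
  // NFKC folds compatibility forms (for example full-width letters) before lowercasing.
  const normalized = text.normalize("NFKC").toLowerCase();
  // \p{L} matches letters and \p{N} matches numeric characters in any script.
  const tokens = normalized.match(/[\p{L}\p{N}]+/gu) || [];
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

console.log(tokenize("Hello, how are YOU today?")); // [ 'hello', 'how', 'today' ]
console.log(tokenize("Привет, мир! 100% gratuit")); // [ 'привет', 'мир', '100', 'gratuit' ]
```

Punctuation and symbols never reach the classifier in this sketch, which is one simple way to read "alphanumeric token filtering".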

Use Cases

The Spam Classifier component is ideal for:

Content Moderation

Comment Systems: Filter spam comments on blogs and forums
User-Generated Content: Moderate posts, reviews, and submissions
Social Media: Detect spam in messages and posts

Security Applications

Form Protection: Prevent spam submissions in contact forms
API Security: Filter malicious content in API requests
Email Systems: Pre-filter messages before processing

Quality Control

Content Quality: Ensure high-quality user contributions
Automated Triage: Route suspicious content for human review
Compliance: Meet anti-spam regulatory requirements

Limitations and Considerations

This edge-optimized classifier prioritizes speed and simplicity over maximum accuracy. For applications requiring the highest precision spam detection, consider dedicated spam filtering services or more sophisticated machine learning solutions.

Compared to Enterprise Solutions

Simpler feature set: Uses basic Naive Bayes with token-based analysis
No behavioral analysis: Lacks sender reputation, link analysis, or pattern recognition
Limited training data: Smaller model size optimized for edge deployment
No real-time updates: Model updates require component redeployment

Best Fit Use Cases

This classifier works well for:

Basic content filtering where speed is prioritized over perfect accuracy
First-line defense in multi-layer spam protection strategies
Edge computing scenarios where low latency is critical
Privacy-focused applications that avoid external API calls

Consider combining this component with other security measures like rate limiting, CAPTCHA, or human moderation for comprehensive protection.
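One way to use the classifier as a first line of defense is to route borderline content to human review instead of making a binary decision. The sketch below assumes a hypothetical review band that starts at 0.5; that bound and the `triage` function are illustrative choices, not component settings.

```javascript
// Route content using the classifier result plus a hypothetical "review" band.
// The 0.5 lower bound is an assumption for illustration, not a component setting.
function triage(result, reviewLowerBound = 0.5) {
  if (result.is_spam) return "block";
  if (result.spam_probability >= reviewLowerBound) return "review";
  return "allow";
}

console.log(triage({ spam_probability: 0.93, is_spam: true }));  // block
console.log(triage({ spam_probability: 0.62, is_spam: false })); // review
console.log(triage({ spam_probability: 0.23, is_spam: false })); // allow
```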

Error Handling

The component implements robust error handling with proper HTTP status codes:

Status Codes

200 OK: Successful classification with valid JSON input
400 Bad Request: Invalid request body or malformed JSON
500 Internal Server Error: Component processing error

Error Response Format

When errors occur, the component returns a structured JSON error response:

{
  "error": "Error message describing what went wrong"
}
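On the client side, the status codes and error format above might be handled as in the following sketch. `interpretResponse` and the `retryable` flag are hypothetical names introduced for illustration; only the status codes and the "error" field come from this page.

```javascript
// Interpret a classifier HTTP response given its status code and parsed JSON body.
// `interpretResponse` is a hypothetical client-side helper, not part of the component.
function interpretResponse(status, body) {
  if (status === 200) {
    return body;
  }
  // Error responses are documented as { "error": "..." }; fall back if the field is missing.
  const message = body && typeof body.error === "string" ? body.error : "Unknown error";
  const error = new Error(`Classification failed (${status}): ${message}`);
  error.status = status;
  // A 400 points at the request itself, so retrying it unchanged will not help.
  error.retryable = status >= 500;
  throw error;
}

try {
  interpretResponse(400, { error: "Malformed JSON" });
} catch (err) {
  console.error(err.message, "retryable:", err.retryable);
}
```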

Best Practices

Start with a spam threshold of 0.80 and adjust based on your content patterns. Higher thresholds reduce false positives but may miss some spam.
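A simple way to choose a threshold is to replay labeled samples and count errors at several candidate values. The sketch below uses a small set of hypothetical samples (a spam probability returned by the classifier plus a human label); the data and the `evaluateThreshold` function are illustrative only.

```javascript
// Count false positives and false negatives at several candidate thresholds.
// `samples` is hypothetical labeled data, not real measurements.
const samples = [
  { spam_probability: 0.95, isActuallySpam: true },
  { spam_probability: 0.85, isActuallySpam: false },
  { spam_probability: 0.75, isActuallySpam: true },
  { spam_probability: 0.40, isActuallySpam: false },
  { spam_probability: 0.10, isActuallySpam: false },
];

function evaluateThreshold(data, threshold) {
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const { spam_probability, isActuallySpam } of data) {
    const flagged = spam_probability >= threshold;
    if (flagged && !isActuallySpam) falsePositives += 1;
    if (!flagged && isActuallySpam) falseNegatives += 1;
  }
  return { threshold, falsePositives, falseNegatives };
}

for (const threshold of [0.7, 0.8, 0.9]) {
  console.log(evaluateThreshold(samples, threshold));
}
```

On this toy data, raising the threshold from 0.7 to 0.9 removes the one false positive but lets one spam message through, which is the trade-off described above.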

The component processes text content only. Binary data, HTML tags, and special formatting are normalized during tokenization.

For optimal performance, consider batching multiple short texts into single requests when processing large volumes of content.

Model Information

The embedded classification model is trained on diverse, multilingual datasets including:

Email spam detection datasets
Comment spam collections
Social media spam samples
Multilingual content examples

The model supports incremental updates and can be retrained with domain-specific data for improved accuracy in specialized use cases.

Related Components

DataDome Bot Protection

Advanced bot detection and protection for comprehensive security coverage.
