Document Processing

How SiteWhiz AI processes and indexes your documents.


Overview

When you upload documents, SiteWhiz automatically:

  1. Extracts text - Reads all content from the document
  2. Performs OCR - Converts scanned pages and images to text, when needed
  3. Creates embeddings - Prepares content for AI search
  4. Indexes content - Makes everything searchable

Processing Pipeline

Stage 1: Upload

  • File is received and stored securely
  • Basic validation checks file type and size
  • Document appears in list as "Processing"

Stage 2: Text Extraction

For text-based documents:

  • PDF text is extracted directly
  • Word documents are parsed
  • Excel data is converted to searchable text

Stage 3: OCR (If Needed)

For image-based content:

  • Scanned pages are analyzed
  • Text is recognized using OCR
  • Tables and figures are identified
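
How SiteWhiz decides that a page needs OCR is not documented here. A common heuristic, sketched below, is to fall back to OCR when direct extraction returns almost no text for a page. The function names and the 20-character threshold are illustrative assumptions, not SiteWhiz settings.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when direct extraction produced too little text to trust.

    min_chars is a hypothetical threshold chosen for illustration.
    """
    # Count only non-whitespace characters so blank lines do not count as content.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return the 1-based page numbers whose extracted text is below the threshold."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needs_ocr(text, min_chars)
    ]
```

For a three-page file where only the first page has a text layer, `pages_needing_ocr(["Full page of ordinary extracted text.", "", "  \n "])` flags pages 2 and 3.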

Stage 4: AI Processing

  • Content is analyzed for meaning
  • Semantic embeddings are created
  • Related concepts are linked

Stage 5: Indexing

  • Full-text search index is built
  • Document is marked as "Ready"
  • Content is available for AI queries
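
The five stages above can be modeled as a small sketch that relates each stage to the status shown in the Documents list. This is an illustration of the flow as described on this page, not SiteWhiz's actual implementation. The names `Stage`, `planned_stages`, and `status_after` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    UPLOAD = "Upload"
    TEXT_EXTRACTION = "Text Extraction"
    OCR = "OCR"
    AI_PROCESSING = "AI Processing"
    INDEXING = "Indexing"


def planned_stages(has_scanned_pages: bool) -> list[Stage]:
    """List the stages a document passes through, skipping OCR when it is not needed."""
    return [s for s in Stage if s is not Stage.OCR or has_scanned_pages]


def status_after(stage: Stage, succeeded: bool = True) -> str:
    """Map the most recently finished stage to the status shown in the list."""
    if not succeeded:
        return "Error"
    # Only a completed Indexing stage makes the document "Ready".
    return "Ready" if stage is Stage.INDEXING else "Processing"
```

A text-based PDF would run four stages (no OCR) and show "Processing" until indexing completes, at which point it shows "Ready".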

Processing Time

Document Type                  Typical Time
Text PDF (10 pages)            30 seconds
Scanned PDF (10 pages)         2-3 minutes
Large document (100+ pages)    5-10 minutes
Excel spreadsheet              1-2 minutes

Processing Queue

Multiple documents process in parallel. Large batches may take longer during peak times.


Document Status

Checking Status

In the Documents list, each file shows its status:

Status        Meaning
Uploading     File transfer in progress
Processing    AI extraction in progress
Ready         Fully searchable
Error         Processing failed

Status Details

Click on a document to see detailed status:

  • Processing stage
  • Time elapsed
  • Any warnings or errors

Quality Factors

Best Results

Documents process best when they are:

  • Text-based (not scanned images)
  • High resolution (300 DPI minimum for scans)
  • Properly oriented (not rotated)
  • Clean (no handwriting over text)
  • Set in standard fonts

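To check a scan against the 300 DPI guideline before uploading, divide the image's pixel dimensions by the physical page size in inches. The helper below is a minimal sketch of that arithmetic. The function names are illustrative, and you supply the page size yourself.

```python
def scan_dpi(width_px: int, height_px: int, width_in: float, height_in: float) -> float:
    """Return the effective resolution of a scan in dots per inch.

    The lower of the horizontal and vertical values is used, since the
    weaker axis limits how well small text can be recognized.
    """
    return min(width_px / width_in, height_px / height_in)


def meets_minimum(width_px: int, height_px: int, width_in: float, height_in: float,
                  minimum_dpi: float = 300.0) -> bool:
    """Return True when the scan reaches the recommended minimum resolution."""
    return scan_dpi(width_px, height_px, width_in, height_in) >= minimum_dpi
```

For example, a US Letter page (8.5 x 11 inches) scanned at 2550 x 3300 pixels works out to 300 DPI, while the same page at 1275 x 1650 pixels is only 150 DPI and would benefit from rescanning.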
Challenging Documents

These may have reduced accuracy:

  • Low-quality scans
  • Handwritten content
  • Complex tables
  • Multi-column layouts
  • Unusual fonts or languages

Reprocessing Documents

When to Reprocess

Consider reprocessing if:

  • The original was poor quality
  • A better version is available
  • Processing failed

How to Reprocess

  1. Delete the document from SiteWhiz
  2. Upload the document again
  3. Wait for processing to complete


Supported Languages

Text extraction supports:

Language         Support Level
English          Full
Dutch            Full
German           Full
French           Full
Spanish          Full
Other Western    Good

Data Security

During Processing

  • Documents are encrypted in transit
  • Processing happens on secure servers
  • No data is shared with third parties

After Processing

  • Original files are stored securely
  • Extracted text is encrypted
  • Access is controlled by permissions

Troubleshooting

Processing Stuck

If a document stays in "Processing" much longer than the typical times listed under Processing Time:

  1. Wait at least 15 minutes for large files
  2. Refresh the page to check status
  3. Delete and re-upload if still stuck
  4. Contact support for persistent issues
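
If you track document status from a script, the "wait, then check again" steps above can be expressed as a polling loop with a deadline. This page does not describe a SiteWhiz API, so `get_status` below is a hypothetical callable that you would supply. The "Stuck" return value is also an invention of this sketch, not a SiteWhiz status.

```python
import time
from typing import Callable


def wait_until_done(
    get_status: Callable[[], str],
    timeout_s: float = 15 * 60,   # the 15-minute guideline for large files
    interval_s: float = 30.0,     # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is "Ready" or "Error", or return "Stuck" on timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("Ready", "Error"):
            return status
        if clock() >= deadline:
            # Time to delete and re-upload, or to contact support.
            return "Stuck"
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting. In normal use the defaults are enough.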

Poor Text Quality

If extracted text has errors:

  1. Check original document quality
  2. Ensure scan resolution is adequate
  3. Try a cleaner copy of the document
  4. Expect some formatting, such as complex tables or multi-column layouts, to convert imperfectly

Processing Failed

If processing fails:

  1. Check that the file isn't corrupted (it should open normally on your computer)
  2. Verify format is supported
  3. Try converting to PDF first
  4. Contact support with error details
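
A quick local check for the first two steps is to look at the file's leading bytes. PDF files begin with `%PDF-`, and modern Word and Excel files (.docx, .xlsx) are ZIP containers that begin with `PK\x03\x04`. The sketch below covers only these two signatures. The function names are illustrative, and a matching signature does not prove the rest of the file is intact.

```python
def sniff_format(header: bytes) -> str:
    """Classify a file by its first bytes: "pdf", "zip", or "unknown"."""
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"PK\x03\x04"):
        # ZIP container, as used by .docx and .xlsx files.
        return "zip"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file on disk and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

If a file named `report.pdf` comes back as "unknown", it is likely mislabeled or damaged, and re-exporting it to PDF from the original application is worth trying before contacting support.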