# Document Ingestion Guide

Complete guide for ingesting documents into the Research Development Framework.

---

## Overview

The ingestion pipeline processes documents through these stages:

1. **Intake** - File detection and validation
2. **Extraction** - Text and metadata extraction
3. **Quality Assessment** - OCR quality evaluation
4. **Conversion** - Markdown generation with YAML front matter
5. **Registration** - Database record creation
6. **Organization** - File archival and organization

---

## Where Is My Research Stored?

After ingestion, your research is stored in four locations:

| Location | Purpose | Contents |
|----------|---------|----------|
| `library/MARKDOWN_LIBRARY/` | **Working copies** | Converted markdown files for searching and processing |
| `library/ORGANIZED/` | **Archive** | Original files organized by category/author |
| `library/NEW_DOCS/completed/` | **Backup** | Original files as received |
| **PostgreSQL database** | **Index** | Metadata, chunks, embeddings, concepts |

### Finding Your Documents

```bash
# List all markdown (searchable) versions
ls library/MARKDOWN_LIBRARY/

# List originals by category
ls library/ORGANIZED/

# Search the database
./rdf search "your topic" --limit 10

# Check total document count
./rdf health --format json
```

### Cleaning Up Test Data

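To remove all ingested research and start fresh, run the steps below. They permanently delete database records and library files, so use them only on test data: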

```bash
# 1. Clear database records
python3 -c "
import psycopg2
from pipeline.config import DB_CONFIG
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()
for table in ['search_history', 'chunk_concepts', 'document_concepts', 'document_topics', 'chunks', 'documents', 'concepts', 'topics', 'authors']:
    cur.execute(f'DELETE FROM {table}')
conn.commit()
conn.close()
print('Database cleared')
"

# 2. Clear library files
rm -rf library/ORGANIZED/*
rm -rf library/NEW_DOCS/completed/*
rm -rf library/MARKDOWN_LIBRARY/*

# 3. Clear any book projects
rm -rf projects/books/BOOK_*/

# 4. Verify
./rdf health --format json
```

---

## Quick Start

### 1. Place Documents

Copy your documents to the import folder:

```bash
# Primary location (recommended)
cp your_document.pdf NEW_DOCS/
```

### 2. Run Ingestion

```bash
./rdf ingest NEW_DOCS/
```

### 3. Check Results

```bash
# View processed documents (markdown)
ls library/MARKDOWN_LIBRARY/

# View original files (organized by category)
ls library/ORGANIZED/

# Check library health
./rdf health --format json
```

---

## Ingest Command

### Basic Usage

```bash
./rdf ingest <path> [options]
```

### Options

| Flag | Description |
|------|-------------|
| `--ocr-profile <name>` | OCR profile for poor scans (`standard` or `academic`) |
| `--force` | Re-process existing documents |
| `--format json` | JSON output |

### Examples

```bash
# Ingest all files from a directory
./rdf ingest NEW_DOCS/

# Ingest a single file
./rdf ingest path/to/document.pdf

# Force re-processing of existing document
./rdf ingest document.pdf --force

# Use OCR for scanned documents
./rdf ingest old_scan.pdf --ocr-profile standard

# JSON output for automation
./rdf ingest NEW_DOCS/ --format json
```

---

## Supported Formats

### Document Formats

| Format | Extension | Extraction Method | Notes |
|--------|-----------|-------------------|-------|
| PDF | `.pdf` | pypdf | Best for born-digital PDFs |
| Word (Modern) | `.docx` | python-docx | Preserves formatting metadata |
| Word (Legacy) | `.doc` | pandoc/libreoffice/antiword | Tries pandoc first, then libreoffice, then antiword |
| OpenDocument | `.odt` | odfpy | LibreOffice/OpenOffice documents |
| Rich Text | `.rtf` | striprtf | Microsoft Rich Text Format |
| Plain Text | `.txt` | Direct read | UTF-8 encoding expected |
| Markdown | `.md` | Direct read | Preserves YAML front matter |

### Web/Data Formats

| Format | Extension | Extraction Method | Notes |
|--------|-----------|-------------------|-------|
| HTML | `.html`, `.htm` | BeautifulSoup | Strips scripts/styles, extracts text |
| XML | `.xml` | BeautifulSoup | Extracts text content from XML |
| CSV | `.csv` | Built-in csv | Converts tabular data to readable text |
| JSON | `.json` | Built-in json | Extracts text values from JSON structures |
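
Before copying a large batch into the import folder, it can help to check which files the pipeline is likely to recognize. The sketch below only encodes the extensions from the two tables above. `SUPPORTED_EXTENSIONS` and `split_supported` are illustrative names rather than part of the framework, and the case-insensitive extension match is an assumption.

```python
from pathlib import Path

# Extensions copied from the format tables above, mapped to the listed extraction method.
SUPPORTED_EXTENSIONS = {
    ".pdf": "pypdf",
    ".docx": "python-docx",
    ".doc": "pandoc/libreoffice/antiword",
    ".odt": "odfpy",
    ".rtf": "striprtf",
    ".txt": "direct read",
    ".md": "direct read",
    ".html": "BeautifulSoup",
    ".htm": "BeautifulSoup",
    ".xml": "BeautifulSoup",
    ".csv": "csv",
    ".json": "json",
}


def split_supported(paths):
    """Partition paths into (supported, unsupported) lists by file extension."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        # Assumption: the pipeline matches extensions case-insensitively.
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    ok, skipped = split_supported(["notes.md", "scan.PDF", "slides.pptx"])
    print("supported:", [p.name for p in ok])
    print("unsupported:", [p.name for p in skipped])
```

Files reported as unsupported can be converted to one of the listed formats before ingestion.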

### Installing Optional Dependencies

```bash
# For RTF support
pip install striprtf

# For ODT (OpenDocument) support
pip install odfpy

# For better HTML/XML parsing
pip install beautifulsoup4 lxml

# For legacy .doc files (system packages, tried in order of preference)
# Ubuntu/Debian:
sudo apt install pandoc              # Recommended: best quality
sudo apt install libreoffice-writer  # Alternative: good quality
sudo apt install antiword            # Fallback: basic extraction

# macOS:
brew install pandoc                  # Recommended
brew install --cask libreoffice      # Alternative
brew install antiword                # Fallback
```
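
To see which of these optional dependencies are already present, a small standalone check along the following lines may be useful. It is a sketch and not part of the framework: the function name is hypothetical, and the executable names (in particular `soffice` as an alternative name for LibreOffice) are assumptions about a typical installation.

```python
import importlib.util
import shutil

# pip package name -> import name (these differ for odfpy and beautifulsoup4).
PYTHON_MODULES = {
    "striprtf": "striprtf",   # RTF support
    "odfpy": "odf",           # ODT support
    "beautifulsoup4": "bs4",  # HTML/XML parsing
    "lxml": "lxml",           # parser backend for BeautifulSoup
}

# Assumption: these are the executable names found on PATH.
SYSTEM_TOOLS = {
    "pandoc": ["pandoc"],
    "libreoffice": ["libreoffice", "soffice"],
    "antiword": ["antiword"],
}


def check_optional_dependencies():
    """Return {name: bool} for each optional Python package and system tool."""
    status = {}
    for package, module in PYTHON_MODULES.items():
        status[package] = importlib.util.find_spec(module) is not None
    for tool, executables in SYSTEM_TOOLS.items():
        status[tool] = any(shutil.which(exe) is not None for exe in executables)
    return status


if __name__ == "__main__":
    for name, available in check_optional_dependencies().items():
        print(f"{name:16} {'ok' if available else 'missing'}")
```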

---

## File Naming Conventions

The system extracts metadata from filenames using these patterns:

### Supported Patterns

```
Author_Name_Title_Year.pdf
→ Author: Author Name
→ Title: Title
→ Year: Year

Title_(Author_Name)_Year.pdf
→ Title: Title
→ Author: Author Name
→ Year: Year

Simple_Title.pdf
→ Title: Simple Title
→ Author: Unknown
→ Year: None
```

### Recommended Format

```
LastName_FirstName_Book_Title_YYYY.pdf
```

Examples:
- `Steiner_Rudolf_Philosophy_of_Freedom_1894.pdf`
- `Jung_Carl_Psychology_and_Alchemy_1944.pdf`
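
The patterns above can be approximated with a couple of regular expressions. The following is a minimal sketch, not the framework's actual parser: `parse_filename` is a hypothetical helper, and it assumes that a trailing four-digit token is the year and that, when a year is present, the first two tokens are the author's name.

```python
import re
from pathlib import Path

# Trailing "_YYYY" marks the year; "_(Author_Name)" marks a parenthesized author.
YEAR_RE = re.compile(r"^(?P<rest>.+)_(?P<year>\d{4})$")
PAREN_RE = re.compile(r"^(?P<title>.+)_\((?P<author>[^)]+)\)$")


def parse_filename(filename: str) -> dict:
    """Guess title, author, and year from a filename (heuristic sketch)."""
    stem = Path(filename).stem
    year = None

    match = YEAR_RE.match(stem)
    if match:
        stem, year = match.group("rest"), int(match.group("year"))

    match = PAREN_RE.match(stem)
    if match:
        # Title_(Author_Name)_Year
        title = match.group("title").replace("_", " ")
        author = match.group("author").replace("_", " ")
    else:
        tokens = stem.split("_")
        if year is not None and len(tokens) >= 3:
            # Assumption: Author_Name_Title_Year, with a two-token author.
            author = " ".join(tokens[:2])
            title = " ".join(tokens[2:])
        else:
            # Simple_Title: no author information available.
            author = "Unknown"
            title = " ".join(tokens)

    return {"title": title, "author": author, "year": year}


if __name__ == "__main__":
    for name in (
        "Steiner_Rudolf_Philosophy_of_Freedom_1894.pdf",
        "Psychology_and_Alchemy_(Carl_Jung)_1944.pdf",
        "Simple_Title.pdf",
    ):
        print(name, "->", parse_filename(name))
```

The two-token author assumption is the weak point of any such heuristic, which is one reason to stick to the recommended format and to correct metadata after ingestion where needed.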

---

## Quality Assessment

Each document is graded for OCR quality. Documents graded `poor` or `very_poor` are flagged as low quality and moved to `library/NEW_DOCS/low_quality/`.

### Quality Grades

| Grade | Description | Action |
|-------|-------------|--------|
| `excellent` | Clean text, proper formatting | Processed normally |
| `good` | Minor issues, readable | Processed normally |
| `fair` | Some OCR errors, usable | Processed with warning |
| `poor` | Significant issues | Moved to low_quality/ |
| `very_poor` | Unreadable | Moved to low_quality/ |
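
For scripts that post-process ingestion results, the table above reduces to a small lookup. This is only an illustration of the mapping: the action labels and the `action_for_grade` helper are hypothetical names, not values the framework emits.

```python
# Grade -> action, copied from the table above. The action labels are illustrative.
GRADE_ACTIONS = {
    "excellent": "process",
    "good": "process",
    "fair": "process_with_warning",
    "poor": "move_to_low_quality",
    "very_poor": "move_to_low_quality",
}


def action_for_grade(grade: str) -> str:
    """Return the action for a quality grade, rejecting unknown grades."""
    try:
        return GRADE_ACTIONS[grade]
    except KeyError:
        raise ValueError(f"unknown quality grade: {grade!r}") from None


if __name__ == "__main__":
    for grade in GRADE_ACTIONS:
        print(f"{grade:10} -> {action_for_grade(grade)}")
```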

### Document Assessment

Check document quality after ingestion:

```bash
./rdf assess DOC_023 --format json
```

Returns:
- OCR quality score
- Language detection
- Page count
- Has TOC
- Chunk statistics
- Extraction warnings
- Duplicate likelihood

### Handling Low Quality Documents

```bash
# Re-ingest with OCR profile
./rdf ingest library/NEW_DOCS/low_quality/document.pdf --ocr-profile standard
```

---

## Folder Structure After Ingestion

```
library/
├── NEW_DOCS/
│   ├── incoming/        # Place files here for ingestion
│   ├── processing/      # Currently processing (temp)
│   ├── completed/       # Successfully processed originals (backup)
│   ├── failed/          # Processing errors
│   └── low_quality/     # Poor OCR quality (needs re-processing)
│
├── MARKDOWN_LIBRARY/    # Converted markdown files (searchable)
│   ├── Philosophy_of_Freedom.md
│   ├── Psychology_and_Alchemy.md
│   └── ...
│
└── ORGANIZED/           # Originals organized by category/author
    ├── Philosophy/
    │   └── Steiner_Rudolf_Philosophy_of_Freedom_1894.pdf
    ├── History/
    │   └── Author_Name/
    │       └── Book_Title.pdf
    ├── Religion_Spirituality/
    │   └── ...
    └── [Other categories based on content]
```

**Note:** The `ORGANIZED/` folder structure is created automatically based on document metadata (category and author). This helps you browse your research library by topic.

---

## Markdown Output Format

Ingested documents are converted to markdown with YAML front matter:

```markdown
---
title: "Philosophy of Freedom"
author: "Rudolf Steiner"
publication_year: 1894
language: en
source_file: "Steiner_Rudolf_Philosophy_of_Freedom_1894.pdf"
quality_grade: excellent
word_count: 45000
processed_date: "2025-12-10T12:00:00"
---

# Philosophy of Freedom

**Author:** Rudolf Steiner

[Document content follows...]
```
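
Because the metadata lives in plain front matter, it can be read back without touching the database. The sketch below assumes the flat `key: value` layout shown above and uses only the standard library. `read_front_matter` is a hypothetical helper, and every value is returned as a string. For nested or multi-line YAML, a real YAML parser would be the safer choice.

```python
from pathlib import Path


def read_front_matter(markdown_text: str) -> dict:
    """Parse flat `key: value` pairs between the leading '---' delimiters."""
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the file

    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}  # missing closing delimiter: treat as no front matter


if __name__ == "__main__":
    # List quality grade and word count for every converted document.
    for path in sorted(Path("library/MARKDOWN_LIBRARY").glob("*.md")):
        meta = read_front_matter(path.read_text(encoding="utf-8"))
        print(path.name, meta.get("quality_grade"), meta.get("word_count"))
```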

---

## Post-Ingestion Pipeline

After a document is ingested, the rest of the pipeline runs automatically. Individual steps can also be re-run manually.

### Automatic Processing

The `rdf ingest` command runs the full pipeline:
1. Text extraction
2. Chunking
3. Embedding generation (if API key configured)
4. Concept extraction

### Manual Processing Steps

```bash
# If you need to re-run specific steps:

# Chunk documents (chunking runs as part of ingest)
./rdf ingest NEW_DOCS/

# Generate embeddings (requires OpenAI API key)
python3 pipeline/generate_embeddings.py

# Extract concepts
python3 pipeline/extract_concepts.py --active-linking
```

---

## Library Health

Check your library after ingestion:

```bash
./rdf health --format json
```

Returns:
- Total documents
- Documents by status
- OCR quality distribution
- Missing embeddings
- Orphaned chunks
- Recommended actions

---

## Troubleshooting

### Document Stuck in Processing

```bash
# Check processing folder
ls library/NEW_DOCS/processing/

# Move back to incoming
mv library/NEW_DOCS/processing/* library/NEW_DOCS/incoming/

# Re-run ingestion
./rdf ingest NEW_DOCS/
```

### PDF Extraction Fails

1. Check whether the PDF is password-protected.
2. Confirm that the file opens in a PDF viewer.
3. Check whether the pages are scanned images. Image-only scans have no text layer to extract, so re-ingest them with an OCR profile:
   ```bash
   ./rdf ingest document.pdf --ocr-profile standard
   ```

### DOCX Extraction Issues

1. Ensure file isn't corrupted
2. Try opening in Word/LibreOffice first
3. Check for embedded objects

### Duplicate Detection

Files with identical content (SHA-256 hash) are automatically skipped:

```bash
# Check library health for duplicate info
./rdf health --format json
```
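
To find duplicates before ingestion, the same idea can be applied locally: hash each file's bytes with SHA-256 and group files that share a digest. This is a minimal sketch using Python's `hashlib`. The helper names are hypothetical, and whether the framework hashes raw file bytes in exactly this way is an assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(directory):
    """Map each digest shared by two or more files to the list of those files."""
    by_hash = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            by_hash[sha256_of_file(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


if __name__ == "__main__":
    for digest, paths in find_duplicates("NEW_DOCS").items():
        print(digest[:12], *(str(p) for p in paths))
```

Content hashing catches byte-identical copies under different filenames. It does not catch the same work saved in two formats or re-exported with different metadata.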

---

## Bulk Ingestion Tips

### Large Document Sets

1. Copy files in batches (100-500 at a time)
2. Run ingestion between batches
3. Monitor disk space
4. Check for failures before continuing

```bash
# Ingest in batches
cp batch1/*.pdf NEW_DOCS/
./rdf ingest NEW_DOCS/

cp batch2/*.pdf NEW_DOCS/
./rdf ingest NEW_DOCS/
```
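
For larger sets, the copy-then-ingest cycle above can be scripted. The sketch below is one possible approach under stated assumptions: the helper names are hypothetical, the import folder defaults to `NEW_DOCS`, and it assumes `./rdf ingest` exits with a non-zero status on failure so that `check=True` halts the loop.

```python
import shutil
import subprocess
from pathlib import Path


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest_in_batches(source_dir, import_dir="NEW_DOCS", size=100, run=subprocess.run):
    """Copy PDFs into the import folder one batch at a time, ingesting between batches."""
    files = sorted(Path(source_dir).glob("*.pdf"))
    for number, batch in enumerate(batches(files, size), start=1):
        for path in batch:
            shutil.copy2(path, import_dir)
        # Assumption: a failed ingestion returns a non-zero exit code,
        # in which case check=True raises and stops before the next batch.
        run(["./rdf", "ingest", f"{import_dir}/"], check=True)
        print(f"batch {number}: {len(batch)} files ingested")
```

A call such as `ingest_in_batches("batch_source/", size=100)` would then process a folder in steps of 100 files. Checking `library/NEW_DOCS/failed/` between runs remains worthwhile, since per-file failures may not stop the command.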

### Consistent File Naming

Use a bulk rename tool to standardize filenames:

```bash
# Example: Replace spaces with underscores
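for f in NEW_DOCS/*; do
  new="${f// /_}"
  # Skip names without spaces; -n avoids overwriting an existing file
  if [ "$f" != "$new" ]; then
    mv -n -- "$f" "$new"
  fi
done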
```

### Progress Monitoring

```bash
# Check workflow status
./rdf status

# View library health
./rdf health --format json
```

---

## Editing Metadata

Fix metadata after ingestion:

```bash
# Edit document metadata
./rdf edit-meta DOC_023 --title "Correct Title" --author "Real Author"
```

---

## Database Records

After ingestion, documents are registered in the database:

```sql
-- View ingested documents
SELECT document_id, title, word_count, processing_status
FROM documents
ORDER BY created_at DESC
LIMIT 10;

-- Check processing status counts
SELECT processing_status, COUNT(*)
FROM documents
GROUP BY processing_status;
```

---

## Next Steps

After ingestion, continue with:

- [CLI User Guide](CLI_USER_GUIDE.md) - Search and research commands
- [Canonical Workflows](CANONICAL_WORKFLOWS.md) - Standard workflows
- [Developer Guide](DEVELOPER_GUIDE.md) - Extending the framework
