# Database Schema Reference

Complete reference for the PostgreSQL database schema used in the Research Development Framework v3.2.

## Migrations

| Migration | Description |
|-----------|-------------|
| `schema.sql` | Core tables (documents, chunks, concepts, topics, authors, book_projects) |
| `001_writer_workbench.sql` | Saved snippets, pinned items, writing preferences |
| `002_system_settings.sql` | System configuration, user preferences |
| `003_knowledge_rules.sql` | Knowledge rules engine, living glossaries |
| `004_v3_features.sql` | Citation freezing, pinning, gaps, sessions |

**Note:** The base `schema.sql` contains all core tables. Migration files add incremental features.

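Apply a migration (the example applies `004_v3_features.sql`; repeat the command for each migration file, in numbered order):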
```bash
cd /var/www/html/research/Research_development

# Load credentials from .env.db
source .env.db
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME" -f database/migrations/004_v3_features.sql
```

## Connection Information

Credentials are stored in `.env.db` at the project root:

```bash
# .env.db
DB_HOST=localhost
DB_PORT=5432
DB_NAME=research_dev_db
DB_USER=research_dev_user
DB_PASS=your_password_here
```

### Connection String

```
postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$DB_NAME
```
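The `$DB_*` placeholders in the connection string are only expanded by a shell. The sketch below shows one way to assemble the same URL in Python, assuming `.env.db` contains only simple `KEY=value` lines as shown above. The helper names `parse_env` and `build_dsn` are hypothetical and not part of the `pipeline` package. The user and password are percent-encoded because characters such as `@` or `/` would otherwise break the URL.

```python
from urllib.parse import quote


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def build_dsn(env: dict[str, str]) -> str:
    """Assemble a postgresql:// URL, percent-encoding user and password."""
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASS"], safe="")
    port = env.get("DB_PORT", "5432")  # PostgreSQL's default port
    return f"postgresql://{user}:{password}@{env['DB_HOST']}:{port}/{env['DB_NAME']}"


if __name__ == "__main__":
    # Made-up sample values; the password is chosen to show the encoding.
    sample = """
    # .env.db
    DB_HOST=localhost
    DB_PORT=5432
    DB_NAME=research_dev_db
    DB_USER=research_dev_user
    DB_PASS=p@ss/word
    """
    print(build_dsn(parse_env(sample)))
```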

### Connect via CLI

```bash
# Load credentials and connect
source .env.db
PGPASSWORD="$DB_PASS" psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -d "$DB_NAME"
```

### Connect via Python

```python
from pipeline.db_utils import get_db_connection

with get_db_connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM documents")
        count = cur.fetchone()[0]
        print(f"Total documents: {count}")
```

---

## Extensions

The database uses these PostgreSQL extensions:

| Extension | Purpose |
|-----------|---------|
| `vector` (pgvector) | Vector similarity search for semantic search |
| `pg_trgm` | Fuzzy text matching and similarity |

---

## Core Tables

### authors

Stores author information.

| Column | Type | Description |
|--------|------|-------------|
| `author_id` | SERIAL PK | Auto-increment ID |
| `name` | VARCHAR(255) | Full name |
| `name_normalized` | VARCHAR(255) | Lowercase for matching |
| `birth_year` | INTEGER | Birth year |
| `death_year` | INTEGER | Death year |
| `biography` | TEXT | Author biography |
| `nationality` | VARCHAR(100) | Country/nationality |
| `created_at` | TIMESTAMP | Record creation time |
| `updated_at` | TIMESTAMP | Last update time |
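As an illustration of how `name_normalized` might be derived from `name`, here is a minimal sketch. The schema only states that the value is lowercase; collapsing repeated whitespace is an added assumption, and `normalize_author_name` is a hypothetical helper rather than a function from the pipeline.

```python
def normalize_author_name(name: str) -> str:
    """Lowercase a name and collapse runs of whitespace to single spaces.

    Whitespace collapsing is an assumption; the schema only says "lowercase".
    """
    return " ".join(name.lower().split())


if __name__ == "__main__":
    print(normalize_author_name("  Carl  Gustav JUNG "))
```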

### documents

Main document metadata table.

| Column | Type | Description |
|--------|------|-------------|
| `document_id` | VARCHAR(100) PK | Unique identifier (e.g., DOC_001_EN) |
| `title` | VARCHAR(500) | Document title |
| `subtitle` | VARCHAR(500) | Optional subtitle |
| `author_id` | INTEGER FK | Reference to authors |
| `publication_year` | INTEGER | Year published |
| `language_code` | VARCHAR(10) | ISO 639-1 code (en, de, etc.) |
| `edition_type` | VARCHAR(50) | original, translation, revised, etc. |
| `publisher` | VARCHAR(255) | Publisher name |
| `source_file` | VARCHAR(500) | Original filename |
| `file_path` | VARCHAR(1000) | Path to markdown file |
| `content_hash` | VARCHAR(64) | SHA-256 for duplicate detection |
| `word_count` | INTEGER | Total word count |
| `page_count` | INTEGER | Page count (if known) |
| `chapter_count` | INTEGER | Number of chapters |
| `has_index` | BOOLEAN | Has index section |
| `has_bibliography` | BOOLEAN | Has bibliography |
| `processing_status` | VARCHAR(50) | pending, processing, completed, failed |
| `pipeline_version` | VARCHAR(20) | Version of processing pipeline |
| `quality_status` | ENUM | excellent, good, fair, poor, unusable, archived |
| `quality_score` | SMALLINT | Quality score (0-100) |
| `quality_notes` | TEXT | Quality assessment notes |
| `reocr_attempts` | INTEGER | Number of re-OCR attempts |
| `reocr_last_attempt` | TIMESTAMP | Last re-OCR attempt time |
| `reocr_last_method` | VARCHAR(50) | Last OCR method used |
| `ai_generated` | BOOLEAN | Was content AI-generated |
| `ai_model` | VARCHAR(100) | AI model used (if applicable) |
| `ai_prompt` | TEXT | Prompt used (if applicable) |
| `notes` | TEXT | Processing notes |
| `created_at` | TIMESTAMP | Record creation time |
| `updated_at` | TIMESTAMP | Last update time |

> **Note on `reocr_` columns:** The prefix is `reocr_` (not `re_ocr_`) to match the `reocr_document.py` script naming convention.
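The `content_hash` column holds a SHA-256 digest, which is 64 hexadecimal characters and therefore fits `VARCHAR(64)` exactly. The sketch below computes such a digest with the standard library. Whether the pipeline hashes the raw file bytes or the extracted text is not stated in this reference, so hashing raw bytes is an assumption, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of in-memory bytes as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()


def file_content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB blocks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Two files with the same digest can then be treated as duplicates before a new `documents` row is inserted.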

### files

Tracks all file versions.

| Column | Type | Description |
|--------|------|-------------|
| `file_id` | SERIAL PK | Auto-increment ID |
| `document_id` | VARCHAR(100) FK | Reference to documents |
| `filename` | VARCHAR(500) | File name |
| `file_path` | VARCHAR(1000) | Full file path |
| `file_type` | VARCHAR(20) | pdf, docx, md, txt, epub |
| `file_size_bytes` | BIGINT | File size |
| `content_hash` | VARCHAR(64) | SHA-256 hash |
| `is_primary` | BOOLEAN | Primary file for document |
| `version` | INTEGER | File version number |
| `upload_date` | TIMESTAMP | When uploaded |
| `last_modified` | TIMESTAMP | Last modification |
| `needs_reprocessing` | BOOLEAN | Flag for reprocessing |

---

## Chunking and Embedding Tables

### chunks

Searchable text units with embeddings.

| Column | Type | Description |
|--------|------|-------------|
| `chunk_id` | VARCHAR(100) PK | Unique ID (e.g., DOC_001_C0001) |
| `document_id` | VARCHAR(100) FK | Reference to documents |
| `chapter_number` | INTEGER | Chapter number |
| `chunk_sequence` | INTEGER | Order within document |
| `chunk_text` | TEXT | The chunk content |
| `chunk_tokens` | INTEGER | Token count |
| `embedding` | vector(1536) | OpenAI embedding vector |
| `chunk_text_tsv` | tsvector | Full-text search vector |
| `pipeline_version` | VARCHAR(20) | Pipeline version |
| `created_at` | TIMESTAMP | Creation time |

**Indexes:**
- `idx_chunks_embedding` - IVFFlat index for vector search
- `idx_chunks_fts` - GIN index for full-text search

---

## Classification Tables

### categories

Document categories.

| Column | Type | Description |
|--------|------|-------------|
| `category_id` | SERIAL PK | Auto-increment ID |
| `name` | VARCHAR(100) UNIQUE | Category name |
| `description` | TEXT | Description |
| `parent_category_id` | INTEGER FK | Parent category (hierarchical) |
| `sort_order` | INTEGER | Display order |
| `created_at` | TIMESTAMP | Creation time |

### topics

Research topics.

| Column | Type | Description |
|--------|------|-------------|
| `topic_id` | SERIAL PK | Auto-increment ID |
| `name` | VARCHAR(100) UNIQUE | Topic name |
| `description` | TEXT | Description |
| `keywords` | TEXT[] | Related keywords array |
| `created_at` | TIMESTAMP | Creation time |

### concepts

Fine-grained concept taxonomy.

| Column | Type | Description |
|--------|------|-------------|
| `concept_id` | SERIAL PK | Auto-increment ID |
| `name` | VARCHAR(100) UNIQUE | Concept name |
| `category` | VARCHAR(50) | Concept category |
| `description` | TEXT | Description |
| `aliases` | TEXT[] | Alternative names |
| `created_at` | TIMESTAMP | Creation time |

### concept_relationships

Triple-based knowledge graph relationships (Subject-Predicate-Object).

| Column | Type | Description |
|--------|------|-------------|
| `relationship_id` | SERIAL PK | Auto-increment ID |
| `source_concept_id` | INTEGER FK | Source concept (subject) |
| `target_concept_id` | INTEGER FK | Target concept (object) |
| `relationship_type` | VARCHAR(50) | Predicate (supports, influences, derived_from, contradicts, etc.) |
| `weight` | FLOAT | Relationship strength (default 1.0) |
| `document_id` | VARCHAR(50) FK | Source document |
| `chunk_id` | VARCHAR(50) FK | Source chunk |
| `extraction_method` | VARCHAR(50) | How extracted (openai, gliner, hybrid, manual) |
| `confidence` | FLOAT | Extraction confidence (0.0-1.0) |
| `bidirectional` | BOOLEAN | True if relationship applies both ways |
| `verified` | BOOLEAN | Manually verified |
| `created_at` | TIMESTAMP | Creation time |

**Indexes:**
- `idx_concept_rel_source` - Source concept lookup
- `idx_concept_rel_target` - Target concept lookup
- `idx_concept_rel_type` - Relationship type filtering
- `idx_concept_rel_semantic` - Composite index for graph traversal

**Recommended Additional Index for Semantic Graph Queries:**
```sql
-- Optimizes semantic_graph.py traversal queries that filter by relationship_type
CREATE INDEX IF NOT EXISTS idx_concept_rel_traversal
ON concept_relationships(source_concept_id, relationship_type);
```

**Relationship Types:**
- `related` - General association
- `supports` - Evidence for a claim
- `influences` - Causal influence
- `derived_from` - Source/origin relationship
- `contradicts` - Opposing viewpoints
- `is_part_of` - Hierarchical containment
- `is_type_of` - Classification hierarchy

**Usage with Semantic Graph:**
```python
# Query relationships by type
from pipeline.semantic_graph import SemanticGraphTraverser

traverser = SemanticGraphTraverser()
results = traverser.traverse(
    start_concept="Consciousness",
    relationship_types=["influences", "supports"],
    max_depth=2
)
```
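To make the traversal idea concrete without a database, the following sketch walks an in-memory list of subject-predicate-object triples breadth-first, following only the requested relationship types up to a depth limit. It is not the implementation of `SemanticGraphTraverser`; the `traverse` function and the sample concepts are illustrative assumptions, and the sketch ignores the `weight`, `confidence`, and `bidirectional` columns.

```python
from collections import deque

# (source, relationship_type, target) triples, mirroring the
# source_concept_id / relationship_type / target_concept_id columns.
# The concept names are made-up sample data.
TRIPLES = [
    ("Consciousness", "influences", "Attention"),
    ("Attention", "supports", "Memory"),
    ("Memory", "influences", "Learning"),
    ("Consciousness", "contradicts", "Behaviorism"),
]


def traverse(triples, start, relationship_types, max_depth):
    """Breadth-first walk that follows only the allowed relationship types."""
    allowed = set(relationship_types)
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes that sit at the depth limit
        for source, rel, target in triples:
            if source == node and rel in allowed and target not in seen:
                seen.add(target)
                found.append((source, rel, target, depth + 1))
                queue.append((target, depth + 1))
    return found


if __name__ == "__main__":
    for edge in traverse(TRIPLES, "Consciousness", ["influences", "supports"], 2):
        print(edge)
```

With `max_depth=2` the walk stops at `Memory`, so the `Memory` to `Learning` edge is not reported, and the `contradicts` edge is skipped because its type is not in the filter.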

---

## Junction Tables

### document_categories

Links documents to categories.

| Column | Type |
|--------|------|
| `document_id` | VARCHAR(100) FK, PK |
| `category_id` | INTEGER FK, PK |

### document_topics

Links documents to topics.

| Column | Type |
|--------|------|
| `document_id` | VARCHAR(100) FK, PK |
| `topic_id` | INTEGER FK, PK |
| `relevance_score` | FLOAT |

### document_concepts

Links documents to concepts with mention counts.

| Column | Type |
|--------|------|
| `document_id` | VARCHAR(100) FK, PK |
| `concept_id` | INTEGER FK, PK |
| `mention_count` | INTEGER |

### chunk_concepts

Links chunks to concepts for precise location.

| Column | Type |
|--------|------|
| `chunk_id` | VARCHAR(100) FK, PK |
| `concept_id` | INTEGER FK, PK |
| `mention_count` | INTEGER |

---

## Processing Tables

### processing_queue

Task queue for background processing.

| Column | Type | Description |
|--------|------|-------------|
| `queue_id` | SERIAL PK | Auto-increment ID |
| `document_id` | VARCHAR(100) FK | Document reference |
| `file_id` | INTEGER FK | File reference |
| `process_type` | VARCHAR(50) | ingest, chunk, embed, etc. |
| `priority` | INTEGER | 1 (highest) to 10 (lowest) |
| `status` | VARCHAR(20) | pending, processing, completed, failed |
| `error_message` | TEXT | Error details if failed |
| `attempts` | INTEGER | Number of attempts |
| `max_attempts` | INTEGER | Maximum retry attempts |
| `created_at` | TIMESTAMP | Queue time |
| `started_at` | TIMESTAMP | Processing start |
| `completed_at` | TIMESTAMP | Processing complete |

**Used by:** `pipeline/task_worker.py`
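The `priority`, `status`, `attempts`, and `max_attempts` columns suggest a selection rule along the following lines. This is a sketch of one plausible rule, not the logic in `pipeline/task_worker.py`, which this reference does not show. `QueueItem` and `next_task` are hypothetical names, and `queue_id` stands in for `created_at` as the tie-breaker.

```python
from dataclasses import dataclass


@dataclass
class QueueItem:
    queue_id: int
    priority: int      # 1 (highest) to 10 (lowest)
    status: str        # pending, processing, completed, failed
    attempts: int
    max_attempts: int


def next_task(items):
    """Return the runnable item with the best priority, or None."""
    runnable = [
        item for item in items
        if item.status == "pending" and item.attempts < item.max_attempts
    ]
    # Lower priority number wins; queue_id breaks ties in insertion order.
    return min(runnable, key=lambda item: (item.priority, item.queue_id), default=None)


if __name__ == "__main__":
    queue = [
        QueueItem(1, 5, "pending", 0, 3),
        QueueItem(2, 1, "pending", 3, 3),    # retries exhausted
        QueueItem(3, 2, "pending", 0, 3),
        QueueItem(4, 1, "completed", 1, 3),  # already done
    ]
    print(next_task(queue))
```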

### text_quality

OCR quality assessment.

| Column | Type | Description |
|--------|------|-------------|
| `quality_id` | SERIAL PK | Auto-increment ID |
| `document_id` | VARCHAR(100) FK | Document reference |
| `file_id` | INTEGER FK | File reference |
| `quality_grade` | VARCHAR(20) | excellent, good, fair, poor, very_poor |
| `gibberish_ratio` | FLOAT | Ratio of gibberish text |
| `avg_word_length` | FLOAT | Average word length |
| `punctuation_ratio` | FLOAT | Punctuation density |
| `sentence_quality_score` | FLOAT | Sentence structure quality |
| `do_not_process` | BOOLEAN | Flag for unfixable documents |
| `process_notes` | TEXT | Quality notes |
| `assessed_at` | TIMESTAMP | Assessment time |
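Two of the metrics above can be approximated with a few lines of standard-library Python. The exact definitions used by the pipeline are not given in this reference, so the formulas below are assumptions: average word length ignores leading and trailing punctuation, and the punctuation ratio is the share of punctuation among non-whitespace characters. `gibberish_ratio` and `sentence_quality_score` are not attempted here.

```python
import string


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated words, edge punctuation stripped."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w.strip(string.punctuation)) for w in words) / len(words)


def punctuation_ratio(text: str) -> float:
    """Share of ASCII punctuation among non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in string.punctuation for c in chars) / len(chars)
```

Badly OCR'd text tends to show unusual values for both measures, which is why they are stored alongside `quality_grade`.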

### change_log

Audit trail for changes.

| Column | Type | Description |
|--------|------|-------------|
| `log_id` | SERIAL PK | Auto-increment ID |
| `document_id` | VARCHAR(100) | Document reference |
| `file_id` | INTEGER | File reference |
| `action` | VARCHAR(50) | create, update, delete, reprocess |
| `old_value` | JSONB | Previous value |
| `new_value` | JSONB | New value |
| `changed_by` | VARCHAR(100) | User/system |
| `changed_at` | TIMESTAMP | Change time |

---

## Analytics Tables

### search_history

Search query analytics.

| Column | Type | Description |
|--------|------|-------------|
| `search_id` | SERIAL PK | Auto-increment ID |
| `query_text` | TEXT | Search query |
| `search_type` | VARCHAR(20) | semantic, keyword, hybrid |
| `filters_applied` | JSONB | Applied filters |
| `results_count` | INTEGER | Number of results |
| `response_time_ms` | INTEGER | Query time in ms |
| `user_session` | VARCHAR(100) | Session identifier |
| `created_at` | TIMESTAMP | Search time |

---

## Book Compilation Tables

> **Architecture Note:** The book workflow uses two complementary storage systems:
> - **Active Projects:** JSON files in `projects/books/BOOK_xxx/` (workflow state, outline, drafts)
> - **Published/Archived:** Database records in `book_projects` and `book_chapters` tables
>
> **Archival Process:** Currently, archiving a completed project to the database is a manual process. After completing Phase 9 (Compilation), you can optionally insert project metadata into the database tables below for long-term cataloging. The file-based project remains the authoritative source during active work.
>
> **Future Enhancement:** An `rdf archive` command could automate moving completed projects to database storage.

### book_projects

Tracks book compilation projects (for published/archived books).

| Column | Type | Description |
|--------|------|-------------|
| `project_id` | SERIAL PK | Auto-increment ID |
| `project_name` | VARCHAR(255) | Project name |
| `description` | TEXT | Project description |
| `status` | VARCHAR(50) | draft, in_progress, review, published |
| `output_formats` | TEXT[] | Target formats |
| `created_at` | TIMESTAMP | Creation time |
| `updated_at` | TIMESTAMP | Last update |

### book_chapters

Individual chapters within projects.

| Column | Type | Description |
|--------|------|-------------|
| `chapter_id` | SERIAL PK | Auto-increment ID |
| `project_id` | INTEGER FK | Book project reference |
| `chapter_number` | INTEGER | Chapter number |
| `title` | VARCHAR(255) | Chapter title |
| `source_document_id` | VARCHAR(100) FK | Source document reference |
| `source_file_path` | VARCHAR(1000) | Source file path |
| `content_markdown` | TEXT | Chapter markdown content |
| `word_count` | INTEGER | Word count |
| `status` | VARCHAR(50) | draft, written, edited, final |
| `sort_order` | INTEGER | Display order |
| `created_at` | TIMESTAMP | Creation time |
| `updated_at` | TIMESTAMP | Last update |

---

## V3 Tables

### research_gaps

Tracks research gaps identified during analysis.

| Column | Type | Description |
|--------|------|-------------|
| `gap_id` | SERIAL PK | Auto-increment ID |
| `description` | TEXT | Gap description |
| `suggested_query` | TEXT | Search query to fill gap |
| `source_session_id` | VARCHAR(50) | Originating session |
| `source_project_id` | VARCHAR(50) | Originating project |
| `source_subject` | TEXT | Subject being researched |
| `status` | VARCHAR(20) | pending, pinned, ignored, filled |
| `priority` | INTEGER | Priority level |
| `pinned_at` | TIMESTAMP | When pinned |
| `filled_at` | TIMESTAMP | When filled |
| `search_results_count` | INTEGER | Results found |
| `created_at` | TIMESTAMP | Creation time |
| `updated_at` | TIMESTAMP | Last update |

### research_sessions

Tracks interactive research sessions with state machine.

| Column | Type | Description |
|--------|------|-------------|
| `session_id` | VARCHAR(50) PK | Unique session identifier |
| `original_question` | TEXT | User's research question |
| `sub_queries` | JSONB | Planned sub-queries |
| `state` | VARCHAR(30) | planning, searching, synthesizing, complete |
| `current_iteration` | INTEGER | Current iteration count |
| `max_iterations` | INTEGER | Max iterations allowed |
| `results_collected` | JSONB | Chunks found |
| `gaps_identified` | JSONB | Gaps found |
| `synthesis` | TEXT | Final synthesis |
| `total_input_tokens` | INTEGER | Total input tokens used |
| `total_output_tokens` | INTEGER | Total output tokens used |
| `estimated_cost_usd` | NUMERIC(10,6) | Estimated cost |
| `budget_limit_usd` | NUMERIC(10,2) | Budget limit |
| `requires_approval` | BOOLEAN | Interactive mode flag |
| `last_checkpoint` | TEXT | Most recent checkpoint pending approval |
| `approved_steps` | JSONB | Approved steps |
| `created_at` | TIMESTAMP | Creation time |
| `updated_at` | TIMESTAMP | Last update |
| `completed_at` | TIMESTAMP | Completion time |
| `project_id` | VARCHAR(50) | Associated project |

### V3 Columns on Documents Table

| Column | Type | Description |
|--------|------|-------------|
| `bibtex_key` | VARCHAR(100) | Citation key |
| `bibtex_key_frozen` | BOOLEAN | Key is frozen |
| `bibtex_key_frozen_at` | TIMESTAMP | When frozen |
| `is_pinned` | BOOLEAN | Document is pinned |
| `pin_priority` | INTEGER | Pin priority (higher = more important) |
| `pinned_at` | TIMESTAMP | When pinned |
| `pin_notes` | TEXT | Notes about pinning |

---

## Views

### v_documents_full

Document overview with author information.

```sql
SELECT * FROM v_documents_full;
```

### v_queue_status

Processing queue status summary.

```sql
SELECT * FROM v_queue_status;
```

### v_document_stats

Database statistics.

```sql
SELECT * FROM v_document_stats;
```

---

## Helper Functions

### generate_document_id

Generates a unique document ID from a prefix, a sequence number, and a language code.

```sql
SELECT generate_document_id('DOC', 1, 'EN');
-- Returns: DOC_001_EN
```
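When IDs need to be built or validated on the client side, a Python equivalent can be handy. The sketch below reproduces the documented example only. The three-digit zero padding is inferred from `DOC_001_EN`, and the SQL function may treat larger numbers or lowercase language codes differently.

```python
def generate_document_id(prefix: str, number: int, language: str) -> str:
    """Build an ID such as DOC_001_EN (padding width inferred from the example)."""
    return f"{prefix}_{number:03d}_{language}"


if __name__ == "__main__":
    print(generate_document_id("DOC", 1, "EN"))
```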

### get_chunk_context

Gets the chunks surrounding a given chunk, for context.

```sql
SELECT * FROM get_chunk_context('DOC_001_C0045', 2);
```

Returns the two preceding chunks, the target chunk, and the two following chunks.
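The windowing behaviour can be pictured with a plain Python list of chunk IDs ordered by `chunk_sequence`. This is a sketch under assumptions: how the SQL function behaves at the start or end of a document is not documented here, so the version below simply truncates the window, and `chunk_context` is a hypothetical name.

```python
def chunk_context(chunk_ids, target_id, window):
    """Return up to `window` chunks on each side of target_id, in order."""
    index = chunk_ids.index(target_id)
    start = max(0, index - window)  # truncate at the start of the document
    return chunk_ids[start:index + window + 1]


if __name__ == "__main__":
    ids = [f"DOC_001_C{n:04d}" for n in range(43, 49)]
    print(chunk_context(ids, "DOC_001_C0045", 2))
```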

---

## Common Queries

### List all documents

```sql
SELECT document_id, title, word_count, processing_status
FROM documents
ORDER BY created_at DESC;
```

### Search by keyword

```sql
SELECT c.chunk_id, c.chunk_text, d.title
FROM chunks c
JOIN documents d ON c.document_id = d.document_id
WHERE c.chunk_text_tsv @@ plainto_tsquery('english', 'search term')
ORDER BY ts_rank(c.chunk_text_tsv, plainto_tsquery('english', 'search term')) DESC
LIMIT 10;
```

### Semantic search

```sql
SELECT c.chunk_id, c.chunk_text, d.title,
       1 - (c.embedding <=> '[query_embedding]'::vector) AS similarity
FROM chunks c
JOIN documents d ON c.document_id = d.document_id
WHERE c.embedding IS NOT NULL
ORDER BY c.embedding <=> '[query_embedding]'::vector
LIMIT 10;
```
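In the query above, `'[query_embedding]'` is a placeholder for a vector literal such as `'[0.1,0.2,...]'` with the same dimension as the `embedding` column, and `<=>` is pgvector's cosine distance operator, which is why the similarity is computed as `1 - distance`. The sketch below formats a Python list as such a literal and reproduces the distance calculation in pure Python. Both helper names are hypothetical. In real code the literal would normally be passed as a bind parameter rather than interpolated into the SQL string.

```python
import math


def to_vector_literal(embedding):
    """Format floats as the '[x,y,...]' text form that pgvector parses."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), as computed by <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    query = [1.0, 0.0]
    stored = [0.0, 1.0]
    print(to_vector_literal(query))
    print("similarity:", 1 - cosine_distance(query, stored))
```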

### Get document concepts

```sql
SELECT c.name, c.category, dc.mention_count
FROM document_concepts dc
JOIN concepts c ON dc.concept_id = c.concept_id
WHERE dc.document_id = 'DOC_001_EN'
ORDER BY dc.mention_count DESC;
```

### Processing status summary

```sql
SELECT processing_status, COUNT(*) as count
FROM documents
GROUP BY processing_status
ORDER BY count DESC;
```

### Quality status summary

```sql
SELECT quality_status, COUNT(*) as count
FROM documents
WHERE quality_status IS NOT NULL
GROUP BY quality_status
ORDER BY count DESC;
```

---

## Maintenance

### Vacuum and analyze

```sql
VACUUM ANALYZE;
```

### Rebuild indexes

```sql
REINDEX DATABASE research_dev_db;
```

### Check table sizes

```sql
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
```
