# Voynich Harmonic-Topological Analysis — Reproducibility Repository

**Paper:** *Graph-Hodge Analysis of the Voynich Manuscript as Generator-Class Discrimination*
**Author:** Derek Earnhart (OriginStoryModel)
**Date:** April 2026

---

## Reproducibility Scope — Read This First

This repository has **two reproducibility tiers**. It is critical to understand which tier you are using:

### Tier A — Processed CSVs only (fully self-contained)

> **The inferential BEH table, null-control verification, and all reported statistics can be reproduced from the processed CSVs alone. The raw IVTFF file is NOT required for this tier.**

This is the primary reproducibility path. All six `data/processed/*.csv` files and `data/processed/summary.json` are included in the repository. Running the two commands below is sufficient to regenerate every number in the paper:

```bash
python src/entropy_tests.py --processed_dir data/processed --results_dir results \
    --permutations 10000 --seed 42

python src/null_controls.py --processed_dir data/processed --results_dir results
```
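A quick way to confirm the Tier A inputs are present and intact before running these (a minimal sketch, assuming the repository root as the working directory; the keys of `summary.json` are not enumerated here):

```python
import json

import pandas as pd

# The per-page feature table should cover all 225 pages (see layout below).
pages = pd.read_csv("data/processed/page_features.csv")
assert len(pages) == 225, f"expected 225 pages, found {len(pages)}"

# Print the machine-readable extraction summary.
with open("data/processed/summary.json") as fh:
    print(json.dumps(json.load(fh), indent=2))
```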

### Tier B — Full raw IVTFF re-extraction (requires raw file download)

> **If you want to re-run the IVTFF parsing and feature extraction from scratch, you must download `LSI_ivtff_0d.txt` separately.**

The raw file `data/raw/LSI_ivtff_0d.txt` is NOT included in this repository (it is a third-party file maintained by the open Voynich transcription community). See [Downloading the Raw IVTFF File](#downloading-the-raw-ivtff-file) below.

Once the raw file is in `data/raw/`, the full pipeline is:

```bash
python src/parse_ivtff.py  --input data/raw/LSI_ivtff_0d.txt --transcriber H
python src/clean_eva.py    --input data/processed/tokens_clean.csv \
                           --output data/processed/tokens_normalized.csv
python src/transition_graph.py --tokens_csv data/processed/tokens_clean.csv \
                                --output_dir results/
python src/hodge_core.py   --tokens_csv data/processed/tokens_clean.csv \
                           --output_dir results/
python src/entropy_tests.py --processed_dir data/processed --results_dir results \
                            --permutations 10000 --seed 42
python src/null_controls.py --processed_dir data/processed --results_dir results
```

---

## Downloading the Raw IVTFF File

The Landini-Stolfi IVTFF EVA archive is maintained by the Voynich transcription community.

**Primary source:** https://www.voynich.nu/data/beta/LSI_ivtff_0d.txt
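One way to fetch it into the expected location (assuming `curl` is available; `wget -O` works equally well):

```bash
mkdir -p data/raw
curl -L -o data/raw/LSI_ivtff_0d.txt \
    https://www.voynich.nu/data/beta/LSI_ivtff_0d.txt
```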

**Expected SHA256 checksum:**
```
3f3f2af18cde10efe75c582f49b07b651c3397022fcbfa5854fecc424c121afa  LSI_ivtff_0d.txt
```

Verify before use:
```bash
sha256sum data/raw/LSI_ivtff_0d.txt
```

The checksum must match exactly. If it does not, you have a different release of the transcription than the one used to generate the processed CSVs, and Tier B re-extraction may yield slightly different results.

---

## Repository Layout

```
voynich_repo/
├── README.md                          ← this file
├── LICENSE                            ← MIT license for repository code
├── requirements.txt                   ← pip dependencies
│
├── data/
│   ├── raw/
│   │   ├── SHA256SUMS.txt             ← checksums for all files
│   │   └── LSI_ivtff_0d.txt           ← NOT included; download separately
│   │
│   └── processed/                     ← INCLUDED; sufficient for Tier A
│       ├── aggregate_features.csv      ← full-manuscript and stratum-level features
│       ├── page_features.csv           ← per-page entropy and graph features (225 pages)
│       ├── paragraph_unit_features.csv ← per-paragraph-unit features
│       ├── matched_controls.csv        ← 20-trial frequency-shuffle + uniform-vocab controls
│       ├── matched_controls_summary.csv
│       ├── top_tokens.csv              ← most frequent EVA word tokens
│       └── summary.json                ← machine-readable extraction summary
│
├── src/
│   ├── parse_ivtff.py                 ← IVTFF parser (Tier B only)
│   ├── clean_eva.py                   ← EVA token cleaning utilities
│   ├── transition_graph.py            ← token-transition graph construction
│   ├── hodge_core.py                  ← Hodge 1-Laplacian decomposition
│   ├── entropy_tests.py               ← BEH inferential statistics (primary script)
│   └── null_controls.py               ← null-control verification
│
├── results/
│   ├── beh_inferential.csv            ← completed inferential BEH table
│   └── beh_tables_publication.md      ← Markdown + LaTeX ready for manuscript
│
└── preregistration/
    ├── analysis_plan.md               ← pre-specified hypotheses and thresholds
    └── thresholds.yaml                ← machine-readable pass/fail criteria
```

---

## Quick Start (Tier A)

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Regenerate inferential table
python src/entropy_tests.py \
    --processed_dir data/processed \
    --results_dir results \
    --permutations 10000 \
    --seed 42

# 3. Verify null controls
python src/null_controls.py \
    --processed_dir data/processed \
    --results_dir results
```

Expected output of step 2 (which also regenerates `results/beh_tables_publication.md`):
```
Computing: Currier A vs B / page      n=114/82
Computing: Currier A vs B / paragraph n=141/111
Computing: Hand 1 vs 2 / page         n=86/45
Computing: Hand 1 vs 2 / paragraph    n=92/64

Two-sided BEH supported (all q < 0.05): True
Original one-sided direction (A/1 > B/2) failed: True
```

---

## Key Results (pre-computed in `results/beh_inferential.csv`)

| Comparison | Level | n₁/n₂ | ΔH | 95% CI | p (two-sided) | q (BH) | g | δ |
|---|---|---|---|---|---|---|---|---|
| Currier A vs B | page | 114/82 | −0.943 | [−1.123, −0.762] | < 0.001 | < 0.001 | −1.544 | −0.707 |
| Currier A vs B | paragraph | 141/111 | −0.695 | [−0.901, −0.484] | < 0.001 | < 0.001 | −0.870 | −0.457 |
| Hand 1 vs 2 | page | 86/45 | −0.750 | [−0.908, −0.595] | < 0.001 | < 0.001 | −1.810 | −0.780 |
| Hand 1 vs 2 | paragraph | 92/64 | −0.351 | [−0.557, −0.133] | < 0.001 | < 0.001 | −0.578 | −0.381 |

**Two-sided BEH: SUPPORTED** (all BH-adjusted q < 0.001)
**Original one-sided BEH (A/1 > B/2): REFUTED** (all observed ΔH < 0)
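The q column is the Benjamini–Hochberg step-up adjustment across the four tests. For reference, a minimal numpy sketch of that adjustment (illustrative only; `entropy_tests.py` performs the actual computation):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini–Hochberg step-up adjustment of a vector of p-values."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)                          # ascending p-values
    m = len(p)
    ratios = p[order] * m / np.arange(1, m + 1)    # p_(k) * m / k
    q_sorted = np.minimum.accumulate(ratios[::-1])[::-1]  # enforce monotone q
    q = np.empty_like(q_sorted)
    q[order] = np.clip(q_sorted, 0.0, 1.0)         # restore input order
    return q
```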

---

## Checksum Verification

To verify all included files:
```bash
# From repo root: verifies every included file. --ignore-missing skips
# data/raw/LSI_ivtff_0d.txt if it has not been downloaded (expected for Tier A)
sha256sum --ignore-missing -c data/raw/SHA256SUMS.txt
```

All included files should report `OK`. With `--ignore-missing`, the entry for the raw
IVTFF file is silently skipped when the file is absent; this is expected for Tier A users.
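On systems without GNU coreutils (macOS ships `shasum -a 256` rather than `sha256sum`), a minimal Python equivalent for spot-checking a single file:

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare against the matching line in data/raw/SHA256SUMS.txt.
print(sha256_of("data/raw/LSI_ivtff_0d.txt"))
```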

---

## Randomness and Reproducibility

All stochastic operations in this repository use deterministic NumPy generators seeded with 42.
The primary inferential scripts use `numpy.random.default_rng(42)`; there are no calls to the
legacy `numpy.random.seed()` interface. Output files record the run parameters:
- Random seed: 42
- Permutations: 10,000
- Bootstrap resamples: 10,000

Because the `default_rng` bit stream is stable across platforms, repeated runs with the same
seed on NumPy ≥ 1.24 should reproduce every reported statistic to the precision shown.
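
For reference, the shape of a seeded two-sided permutation test (an illustrative sketch only; `entropy_tests.py` implements the actual tests, and `perm_test_mean_diff` is a hypothetical name):

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=10_000, seed=42):
    """Two-sided permutation p-value for a difference in group means."""
    rng = np.random.default_rng(seed)       # deterministic generator, seed 42
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)      # random relabelling of the groups
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        hits += abs(diff) >= abs(observed)
    return (hits + 1) / (n_perm + 1)        # add-one-smoothed p-value
```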

---

## Manuscript Update Guide

After regenerating `results/beh_inferential.csv`:

1. **Must:** Replace Table 7 (Section 9.4) with the LaTeX table in `results/beh_tables_publication.md`.
2. **Must:** Add footnote disclosing paragraph-level discrepancy:
   *"Paragraph-level Currier B mean = 6.407 (computed) vs. 6.453 (earlier draft);
   Hand 2 paragraph mean = 6.077 (computed) vs. 6.115 (earlier draft). Counts differ
   by 1–2 units due to paragraph-boundary definitions. Directional finding unchanged."*
3. **Must:** Update Reproducibility Statement with public repo URL and SHA256SUMS reference.
4. **Recommended:** Add to Abstract: two-sided BEH confirmed with q < 0.001 and
   effect sizes g = −0.578 to −1.810.
5. **Recommended:** Add to Section 9.2: biological section has the highest successor
   entropy (3.734) of any section, supporting structural heterogeneity across domains.

---

## Dependencies

```
numpy>=1.24
pandas>=2.0
scipy>=1.11
```

`scipy` is required only for `hodge_core.py` (sparse linear algebra). The primary
inferential script `entropy_tests.py` uses only `numpy` and `pandas`.

---

## License

Repository code is released under the MIT License; see `LICENSE`. The IVTFF transcription data is maintained by the
Voynich transcription community and is not covered by this license.

---

## Citation

Earnhart, D. (2026). *Graph-Hodge Analysis of the Voynich Manuscript as
Generator-Class Discrimination: Lossless Glyph Encodings, Substitution-Invariant
Transition Flows, and Falsifiable Structural Tests.* [Preprint].
Repository: [INSERT OSF/ZENODO/GITHUB URL BEFORE SUBMISSION]
