Dataset Overview
Note: This is a summary of communication with AI labs. It represents idea #4 from the batch.
Training the next generation of AI models requires data that shows human reasoning. We curated a sample of published scientific papers together with their edit histories, which show how each paper progressed and improved over time, and we have preliminary results suggesting that training on this data improves model reasoning capabilities.
The dataset contains the edit histories of 10 scientific papers, 602 versions in total. The collection includes:
- 9 published journal papers
- 1 MIT MEng thesis
- All papers formatted in Markdown with inline LaTeX
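Because every version is stored as Markdown text, the change between two consecutive versions can be recovered with a plain text diff. The following is a minimal sketch using Python's standard `difflib`; the function name `version_diffs` and the three-line `history` are hypothetical illustrations, not part of the dataset or its tooling.

```python
import difflib


def version_diffs(versions):
    """Yield (index, unified diff text) for each consecutive pair of versions."""
    for i in range(len(versions) - 1):
        before = versions[i].splitlines(keepends=True)
        after = versions[i + 1].splitlines(keepends=True)
        diff = difflib.unified_diff(
            before, after, fromfile=f"v{i}", tofile=f"v{i + 1}"
        )
        yield i, "".join(diff)


# Hypothetical three-version history of one Markdown sentence with inline LaTeX.
history = [
    "The controller minimizes cost $J$.\n",
    "The controller minimizes the landing cost $J$.\n",
    "The controller minimizes the landing cost $J(x, u)$ online.\n",
]

diffs = list(version_diffs(history))
for index, text in diffs:
    print(f"--- change {index} -> {index + 1} ---")
    print(text)
```

A paper with N versions yields N - 1 such diffs, so the edit history can be read either as full snapshots or as a sequence of changes.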
Paper Distribution and Versions
Version counts per paper are listed below. The Fetal Brain Studies and Natural Killer B Cells entries are combined counts over related papers.
- Online Optimal Landing Control of the MIT Mini Cheetah
  - 147 versions
  - Published in ICRA 2022
  - DOI: 10.1109/ICRA46639.2022.9811796
- Federated Learning for Resource Constrained Devices
  - 228 versions
  - Notable for extensive tabular content
- Fetal Brain Studies Series
  - Multiple papers investigating brain development
  - Combined 142 versions across related papers
  - Published in various journals including IJMS and FASEB Journal
- Natural Killer B Cells Research
  - 18 versions total (9 versions each across 2 papers)
  - Notable for substantial changes per version
  - Published in Frontiers in Immunology and Critical Reviews in Immunology
- Hetero-Multimeric Chitinase-Containing Plasmodium Complex
  - 46 versions
  - Research on mosquito midgut invasion
  - DOI: 10.3389/fcimb.2020.615343
- NKB Cells: A Double-Edged Sword Against Inflammatory Diseases
  - 9 versions (with substantial changes per version)
  - Published in Frontiers in Immunology
  - DOI: 10.3389/fimmu.2022.972435
- Regulation of Brain-Placental Axis in Nocturnal Mammals
  - 38 versions
  - Published in Placenta
  - DOI: 10.1016/j.placenta.2024.08.001
- Role of Caveolin-1 in Metabolic Programming of Fetal Brain
  - 39 versions
  - Published in iScience
  - DOI: 10.1016/j.isci.2023.107710
- Role of Natural Killer and B Cell Interaction
  - 9 versions (with substantial changes per version)
  - Published in Critical Reviews in Immunology
  - DOI: 10.1080/08830185.2023.2172406
- Sexually Dimorphic Transcriptomic Changes in Domestic Pigs
  - 34 versions
  - Study of developing fetal brain
  - Published in Cells
  - DOI: 10.3390/cells10092439
Preliminary Testing Results
Model Performance
The team ran initial fine-tuning experiments using:
- Gemma-2b
- Gemma-7b
- Training set: 500 originals and edits
- Synthetic prompts generated via GPT-4
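One way a training set of originals and edits could be assembled from a version history is to pair each version with its successor and attach an instruction prompt. The sketch below is an assumption about the data layout, not the team's actual pipeline: the record fields (`prompt`, `original`, `edit`), the helper names, and the fixed placeholder prompt are all hypothetical, and no GPT-4 call is made.

```python
import json


def build_examples(versions, make_prompt):
    """Pair each version with its successor as one (original, edit) example."""
    examples = []
    for original, edit in zip(versions, versions[1:]):
        if original == edit:
            continue  # skip no-op versions: there is no edit to learn from
        examples.append(
            {
                "prompt": make_prompt(original, edit),
                "original": original,
                "edit": edit,
            }
        )
    return examples


def placeholder_prompt(original, edit):
    # Hypothetical stand-in for the synthetic prompts; no model is called here.
    return "Revise the following passage to improve clarity and precision."


# Hypothetical version history; the second version is an unchanged save.
versions = [
    "Results was good.",
    "Results was good.",
    "Results were strong across all trials.",
]

examples = build_examples(versions, placeholder_prompt)
jsonl = "\n".join(json.dumps(example) for example in examples)
print(jsonl)
```

Writing one JSON object per line keeps the records easy to stream into most fine-tuning scripts, though the format actually used in these experiments is not stated in the source.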
Key Findings
Human Evaluation Results
- Gemma-2b Performance
  - 81% preference rate from human evaluators
  - Tested across 81 edits
- Gemma-7b Performance
  - 74% preference rate from human evaluators
  - A clear majority preference at the larger model size, though below the Gemma-2b rate; the team considers this promising for scaling
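With test sets of this size, a preference rate carries noticeable sampling uncertainty. The sketch below computes a preference rate and a 95% Wilson score interval. The count of 66 preferred edits out of 81 is a hypothetical value chosen to be consistent with the reported 81%; the source gives only the percentage, and the function names are illustrative.

```python
import math


def preference_rate(preferred, total):
    """Fraction of comparisons in which evaluators preferred the model's edit."""
    return preferred / total


def wilson_interval(preferred, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = preferred / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical count: 66 of 81 judgments is roughly 81%.
rate = preference_rate(66, 81)
low, high = wilson_interval(66, 81)
print(f"preference rate: {rate:.3f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

Under these assumed counts the interval spans roughly 0.72 to 0.88, which is wide enough that the gap between the 2b and 7b rates should be interpreted with care.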
Secondary Evaluations
- Conducted on Gemma-2b
- 60 edits tested
- 18% improvement in identifying correct edits
- Multiple choice format with synthetic incorrect options
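A multiple-choice evaluation of this kind can be sketched as follows: each item mixes the real edit with synthetic incorrect options, and accuracy is the fraction of items where the chosen option is the real edit. The item layout, function names, and example sentences are hypothetical; the source does not describe how its items were built or scored.

```python
import random


def make_item(correct_edit, distractors, rng):
    """Shuffle the real edit in with synthetic incorrect options."""
    options = [correct_edit] + list(distractors)
    rng.shuffle(options)
    return {"options": options, "answer": options.index(correct_edit)}


def accuracy(items, choose):
    """Fraction of items where choose(options) returns the correct index."""
    hits = sum(1 for item in items if choose(item["options"]) == item["answer"])
    return hits / len(items)


rng = random.Random(0)  # fixed seed so the option order is reproducible

# Hypothetical items: one real edit plus two synthetic incorrect options each.
items = [
    make_item(
        "The results were significant.",
        ["The results was significant.", "Results significant were."],
        rng,
    ),
    make_item(
        "We measured latency in ms.",
        ["We measure latency on ms.", "Latency ms measured we."],
        rng,
    ),
]


def first_option(options):
    # Trivial baseline chooser; a model under test would score each option.
    return 0


baseline = accuracy(items, first_option)
print(f"baseline accuracy: {baseline:.2f}")
```

Comparing a fine-tuned model's accuracy against the base model's on the same items is one way an improvement figure could be derived.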
Additional Dataset Collections
Mathematics Notes
- Complete edit history available
- Eventually formatted into a research paper
- arXiv:2207.04831
Academic Test Sets
The current collection includes problems and solutions from:
- Advanced Algorithms (TS1)
- Discrete Math (TS2)
- Probability (TS3)
- Integral Calculus (TS4-6)
Future Implications
The preliminary results suggest significant potential for:
- Training larger models on complete paper histories
- Improving edit quality even with smaller models
- Expanding into new types of versioned datasets
- Creating high-quality solutions for diverse problem types
The team believes these improvements will scale effectively with:
- Larger datasets
- Increased model sizes
- More diverse document types
Looking Forward
This dataset is a first step in working with versioned content. The team envisions expanding into:
- More academic disciplines
- Different document types
- Varied problem-solving domains