Aryan Jain

Dataset Overview

Note: Summary of communication with AI labs. Represents idea #4 during the batch.

To train the next generation of AI models, we need data that shows human reasoning. We've curated a sample of published scientific papers together with their edit histories, which show how each paper progressed and improved over time, and our preliminary results suggest that training on these histories improves model reasoning capabilities.

We compiled a comprehensive dataset of scientific paper edit histories, featuring 602 versions across 10 papers. This collection includes:

- 9 published journal papers
- 1 MIT MEng thesis
- All papers formatted in Markdown with inline LaTeX
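
As a rough illustration of how an edit history stored as a sequence of Markdown versions could be turned into individual edits, the sketch below pairs consecutive versions and diffs them with Python's standard `difflib`. The function name `consecutive_edits` and the in-memory list of version strings are assumptions made for illustration; the dataset's actual storage layout is not described here.

```python
import difflib


def consecutive_edits(versions):
    """Yield (index, unified_diff_text) for each pair of adjacent versions.

    `versions` is a list of Markdown strings ordered oldest to newest
    (a hypothetical in-memory layout, not the dataset's real format).
    """
    for i in range(len(versions) - 1):
        before = versions[i].splitlines(keepends=True)
        after = versions[i + 1].splitlines(keepends=True)
        diff = difflib.unified_diff(
            before, after, fromfile=f"v{i}", tofile=f"v{i + 1}"
        )
        yield i, "".join(diff)


# Toy history with inline LaTeX; the text is invented for the example.
versions = [
    "The robot lands.\n",
    "The robot lands safely.\n",
    "The robot lands safely using the control input $u^*$.\n",
]
edits = list(consecutive_edits(versions))
for index, diff_text in edits:
    print(f"edit {index} -> {index + 1}")
    print(diff_text)
```

A paper with N versions yields N - 1 such edits, so longer histories contribute proportionally more examples.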

Paper Distribution and Versions

Here's a breakdown of the papers and their version counts:

Online Optimal Landing Control of the MIT Mini Cheetah
- 147 versions
- Published in ICRA 2022
- DOI: 10.1109/ICRA46639.2022.9811796

Federated Learning for Resource Constrained Devices
- 228 versions
- Notable for extensive tabular content

Fetal Brain Studies Series
- Multiple papers investigating brain development
- Combined 142 versions across related papers
- Published in various journals including IJMS and FASEB Journal

Natural Killer B Cells Research
- 18 versions total (9 versions each across 2 papers)
- Notable for substantial changes per version
- Published in Frontiers in Immunology and Critical Reviews in Immunology

Hetero-Multimeric Chitinase-Containing Plasmodium Complex
- 46 versions
- Research on mosquito midgut invasion
- DOI: 10.3389/fcimb.2020.615343

NKB Cells: A Double-Edged Sword Against Inflammatory Diseases
- 9 versions (with substantial changes per version)
- Published in Frontiers in Immunology
- DOI: 10.3389/fimmu.2022.972435

Regulation of Brain-Placental Axis in Nocturnal Mammals
- 38 versions
- Published in Placenta
- DOI: 10.1016/j.placenta.2024.08.001

Role of Caveolin-1 in Metabolic Programming of Fetal Brain
- 39 versions
- Published in iScience
- DOI: 10.1016/j.isci.2023.107710

Role of Natural Killer and B Cell Interaction
- 9 versions (with substantial changes per version)
- Published in Critical Reviews in Immunology
- DOI: 10.1080/08830185.2023.2172406

Sexually Dimorphic Transcriptomic Changes in Domestic Pigs
- 34 versions
- Study of developing fetal brain
- Published in Cells
- DOI: 10.3390/cells10092439

Preliminary Testing Results

Model Performance

The team conducted initial finetuning experiments using:

- Gemma-2b
- Gemma-7b
- Training set: 500 originals and edits
- Synthetic prompts generated via GPT-4
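
One plausible way to lay out a training set of originals, edits, and prompts is one JSON record per pair, written as JSON Lines. The sketch below is only an assumption about the format: the field names `prompt`, `original`, and `edited` and the helpers `make_record` and `to_jsonl` are hypothetical, and the example text is invented.

```python
import json


def make_record(prompt, original, edited):
    """Bundle one (original, edit) pair with the prompt that requests the edit."""
    return {"prompt": prompt, "original": original, "edited": edited}


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Toy record; in the described setup the prompt would be synthetically generated.
records = [
    make_record(
        prompt="Fix the grammar in this sentence.",
        original="The results was good.",
        edited="The results were good.",
    )
]
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Keeping the original and the edit as separate fields leaves the choice of input template to the finetuning code.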

Key Findings

Human Evaluation Results

Gemma-2b Performance
- 81% preference rate from human evaluators
- Tested across 81 edits
Gemma-7b Performance
- 74% preference rate from human evaluators
- Demonstrates promising scaling potential
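
A preference rate of this kind is the fraction of comparisons in which evaluators chose the finetuned model's edit. The sketch below shows that calculation together with a Wilson score interval, which gives a sense of the uncertainty at small sample sizes. The counts are toy values, not the study's raw data, and the function names are chosen for illustration.

```python
import math


def preference_rate(preferred, total):
    """Fraction of comparisons in which the model's edit was preferred."""
    if total <= 0:
        raise ValueError("total must be positive")
    return preferred / total


def wilson_interval(preferred, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = preference_rate(preferred, total)
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Toy counts (hypothetical): 8 of 10 comparisons preferred the model's edit.
rate = preference_rate(8, 10)
low, high = wilson_interval(8, 10)
print(f"rate={rate:.2f}, interval=({low:.2f}, {high:.2f})")
```

With only tens of comparisons the interval is wide, so differences of a few percentage points between two models should be read with caution.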

Secondary Evaluations

- Conducted on Gemma-2b
- 60 edits tested
- 18% improvement in identifying correct edits
- Multiple choice format with synthetic incorrect options
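
A multiple-choice evaluation like this can be sketched as follows: the real edit is mixed with synthetic incorrect options, the model picks one, and accuracy is the share of questions where it picks the real edit. The helper names and toy options are hypothetical. The summary does not say what the 18% improvement is measured against or whether it is absolute or relative, so the sketch covers only question construction and scoring.

```python
import random


def build_question(correct_edit, distractors, rng):
    """Shuffle the real edit in with synthetic incorrect options.

    Returns (options, index_of_correct). Assumes the options are distinct.
    """
    options = [correct_edit] + list(distractors)
    rng.shuffle(options)
    return options, options.index(correct_edit)


def accuracy(predictions, answers):
    """Share of questions where the predicted index matches the answer index."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and aligned")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


rng = random.Random(0)  # fixed seed so the option order is reproducible
options, answer_index = build_question(
    "real edit", ["wrong edit A", "wrong edit B", "wrong edit C"], rng
)
print(options, answer_index)

# Toy predictions for four questions: three match the answer key.
score = accuracy([0, 1, 2, 3], [0, 1, 0, 3])
print(score)
```

Running the same question set through the base and finetuned models and comparing the two accuracies gives the kind of improvement figure reported above.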

Additional Dataset Collections

Mathematics Notes

- Complete edit history available
- Eventually formatted into a research paper
- arXiv:2207.04831

Academic Test Sets

Current collection includes problems and solutions from:

- Advanced Algorithms (TS1)
- Discrete Math (TS2)
- Probability (TS3)
- Integral Calculus (TS4-6)

Future Implications

The preliminary results suggest significant potential for:

- Training larger models on complete paper histories
- Improving edit quality even with smaller models
- Expanding into new types of versioned datasets
- Creating high-quality solutions for diverse problem types

The team believes these improvements will scale effectively with:

- Larger datasets
- Increased model sizes
- More diverse document types

Looking Forward

This dataset represents just the beginning of what's possible with versioned content. The team envisions expanding into:

- More academic disciplines
- Different document types
- Varied problem-solving domains

Aryan Jain, aryanj {at} mit {dot} edu

Leveraging Scientific Papers as an Advanced Data Format