Aryan Jain
aryanj {at} mit {dot} edu

Leveraging Scientific Papers as an Advanced Data Format

8/13/2024

Dataset Overview

Note: Summary of communication with AI labs; this was idea #4 during the batch.

To train the next generation of AI models, we need data that captures human reasoning. We've curated a sample of published scientific papers and their edit histories, showing how they progressed and improved over time, and our preliminary results indicate that this data improves model reasoning capabilities.

We compiled a comprehensive dataset of scientific paper edit histories, featuring 602 versions across 10 papers. This collection includes:

  • 9 published journal papers
  • 1 MIT MEng thesis
  • All papers formatted in Markdown with inline LaTeX
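To make the structure concrete, each paper version in a collection like this could be stored as a simple record, with consecutive versions paired into (before, after) edit examples. This is a hypothetical schema for illustration, not the dataset's actual format:

```python
from dataclasses import dataclass

@dataclass
class PaperVersion:
    """One snapshot in a paper's edit history (hypothetical schema)."""
    paper_id: str        # e.g. "mini-cheetah-landing"
    version: int         # 1-based index within the paper's history
    body_markdown: str   # full text in Markdown with inline LaTeX

def to_edit_pairs(history):
    """Pair consecutive versions as (before, after) edit examples."""
    ordered = sorted(history, key=lambda v: v.version)
    return [(a.body_markdown, b.body_markdown)
            for a, b in zip(ordered, ordered[1:])]

# A two-version toy history yields one (before, after) pair.
history = [
    PaperVersion("demo", 1, "Draft with $E = mc^2$."),
    PaperVersion("demo", 2, "Revised draft with $E = mc^2$ and more context."),
]
pairs = to_edit_pairs(history)
```

Pairing consecutive versions this way is what lets 602 raw versions become hundreds of edit examples.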

Paper Distribution and Versions

Here's a breakdown of the papers and their version counts:

  1. Online Optimal Landing Control of the MIT Mini Cheetah

  2. Federated Learning for Resource Constrained Devices

    • 228 versions
    • Notable for extensive tabular content
  3. Fetal Brain Studies Series

    • Multiple papers investigating brain development
    • Combined 142 versions across related papers
    • Published in various journals including IJMS and FASEB Journal
  4. Natural Killer B Cells Research

    • 18 versions total (9 versions each across 2 papers)
    • Notable for substantial changes per version
    • Published in Frontiers in Immunology and Critical Reviews in Immunology
  5. Hetero-Multimeric Chitinase-Containing Plasmodium Complex

  6. NKB Cells: A Double-Edged Sword Against Inflammatory Diseases

  7. Regulation of Brain-Placental Axis in Nocturnal Mammals

  8. Role of Caveolin-1 in Metabolic Programming of Fetal Brain

  9. Role of Natural Killer and B Cell Interaction

  10. Sexually Dimorphic Transcriptomic Changes in Domestic Pigs

Preliminary Testing Results

Model Performance

The team conducted initial finetuning experiments using:

  • Gemma-2b
  • Gemma-7b
  • Training set: 500 originals and edits
  • Synthetic prompts generated via GPT-4
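The setup above can be sketched as follows: each (original, edited) pair is combined with a synthetic instruction into one supervised finetuning example. The prompt template and field names here are illustrative assumptions, not the team's actual pipeline:

```python
def build_finetune_example(original, edited, synthetic_prompt):
    """Combine a synthetic instruction with a before/after edit pair
    into a single supervised example (illustrative format)."""
    return {
        "prompt": f"{synthetic_prompt}\n\n{original}",
        "completion": edited,
    }

example = build_finetune_example(
    original="The result are significant.",
    edited="The results are significant.",
    # Stand-in for a GPT-4-generated instruction:
    synthetic_prompt="Edit the following passage for grammar.",
)
```

The model is then trained to produce the edited version given the instruction plus the original text.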

Key Findings

Human Evaluation Results

  • Gemma-2b Performance

    • 81% preference rate from human evaluators
    • Tested across 81 edits
  • Gemma-7b Performance

    • 74% preference rate from human evaluators
    • Demonstrates promising scaling potential

Secondary Evaluations

  • Conducted on Gemma-2b
  • 60 edits tested
  • 18% improvement in identifying correct edits
  • Multiple choice format with synthetic incorrect options
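The multiple-choice format can be scored as below: the model picks among one real edit and several synthetic distractors, and accuracy is the fraction of items where it selects the real one. This is a minimal scoring sketch with a stand-in for the model's choice function, not the team's evaluation harness:

```python
import random

def evaluate_multiple_choice(items, choose):
    """Score edit identification.

    items: list of (correct_edit, [distractor_edits]) tuples.
    choose: callable mapping a list of options to a chosen index
            (stand-in for a model's selection).
    Returns accuracy in [0, 1].
    """
    correct = 0
    for true_edit, distractors in items:
        options = [true_edit] + list(distractors)
        rng = random.Random(len(true_edit))  # deterministic shuffle for the demo
        rng.shuffle(options)
        if options[choose(options)] == true_edit:
            correct += 1
    return correct / len(items)

# Toy chooser that picks the longest option; in these toy items the
# true edit happens to be the longest, so it scores perfectly.
items = [("a much improved sentence", ["bad", "worse"]),
         ("another carefully revised passage", ["x", "y"])]
acc = evaluate_multiple_choice(
    items, lambda opts: max(range(len(opts)), key=lambda i: len(opts[i])))
```

A "percentage-point improvement" on this metric would be measured against the same chooser before and after finetuning.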

Additional Dataset Collections

Mathematics Notes

  • Complete edit history available
  • Eventually formatted into a research paper
  • arXiv:2207.04831

Academic Test Sets

Current collection includes problems and solutions from:

  • Advanced Algorithms (TS1)
  • Discrete Math (TS2)
  • Probability (TS3)
  • Integral Calculus (TS4-6)

Future Implications

The preliminary results suggest significant potential for:

  1. Training larger models on complete paper histories
  2. Improving edit quality even with smaller models
  3. Expanding into new types of versioned datasets
  4. Creating high-quality solutions for diverse problem types

The team believes these improvements will scale effectively with:

  • Larger datasets
  • Increased model sizes
  • More diverse document types

Looking Forward

This dataset represents just the beginning of what's possible with versioned content. The team envisions expanding into:

  • More academic disciplines
  • Different document types
  • Varied problem-solving domains