
Auto-claude-code-research-in-sleep

ARIS ⚔️ (Auto-Research-In-Sleep) — Claude Code skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation via Codex MCP

GitHub · 294 stars · 29 forks · Updated Mar 13, 2026 · Validated Mar 14, 2026

Auto-claude-code-research-in-sleep (ARIS ⚔️)

中文版 README (Chinese) | English

Hero

Score Progression

🌙 Let Claude Code do research while you sleep. Wake up to find your paper scored, weaknesses identified, experiments run, and narrative rewritten — autonomously.

Custom Claude Code skills for autonomous ML research workflows. These skills orchestrate cross-model collaboration — Claude Code drives the research while an external LLM (via Codex MCP) acts as a critical reviewer. 🔀 Also supports alternative model combinations (e.g., GLM + GPT, GLM + MiniMax) — no Claude API required.

💭 Why not self-play with a single model? Using Claude Code subagents or agent teams for both execution and review is technically possible, but tends to fall into local minima — the same model reviewing its own patterns creates blind spots. Claude Code's strength is fast, fluid execution; Codex (GPT-5.4 xhigh) is slower but more deliberate and rigorous in critique. These complementary styles — speed × rigor — produce better outcomes than either model talking to itself.

📈 Score Progression (Real Run)

A real overnight 4-round run on an ML research project, from borderline reject to submission-ready:

| Round | Score | What Happened |
|-------|-------|---------------|
| Initial | 5.0/10 | Borderline reject |
| Round 1 | 6.5/10 | Added standard metrics, discovered metric decoupling |
| Round 2 | 6.8/10 | Key claim failed to reproduce, pivoted narrative |
| Round 3 | 7.0/10 | Large seed study killed main improvement claim |
| Round 4 | 7.5/10 | Diagnostic evidence solidified, submission ready |

The loop autonomously ran 20+ GPU experiments, rewrote the paper's narrative framing, and killed claims that didn't hold up — all without human intervention.

💡 Idea Discovery (New)

Don't have a concrete idea yet? Just give a research direction — /idea-creator handles the rest:

  1. 📚 Survey the landscape (recent papers, open problems, recurring limitations)
  2. 🧠 Brainstorm 8-12 concrete ideas via GPT-5.4 xhigh
  3. 🔍 Filter by feasibility, compute cost, and quick novelty search
  4. 🛡️ Validate top ideas with deep novelty check + devil's advocate review
  5. 🧪 Pilot top 2-3 ideas in parallel on different GPUs (30 min - 2 hr each)
  6. 🏆 Rank by empirical signal — ideas with positive pilot results rise to the top

The output is a ranked IDEA_REPORT.md with hypotheses, pilot results, reviewer objections, and a suggested execution order. Ideas that fail are documented too, saving future dead-end exploration.
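The pilot-driven ranking can be sketched in a few lines of Python (a hypothetical illustration; `novelty` and `pilot_delta` are made-up fields, not the skill's actual schema):

```python
# Hypothetical sketch of the final ranking step: ideas with a positive
# pilot signal rise to the top; failed pilots sink but stay documented.
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    novelty: float      # 0-1, from the novelty check (illustrative)
    pilot_delta: float  # pilot improvement over baseline; negative = failed

def rank_ideas(ideas):
    # Primary key: did the pilot show positive signal?
    # Tie-breakers: size of the signal, then novelty.
    return sorted(
        ideas,
        key=lambda i: (i.pilot_delta > 0, i.pilot_delta, i.novelty),
        reverse=True,
    )

ideas = [
    Idea("reframe-loss", novelty=0.6, pilot_delta=0.8),
    Idea("new-sampler", novelty=0.9, pilot_delta=-0.2),  # failed pilot, kept for the record
    Idea("curriculum", novelty=0.5, pilot_delta=1.4),
]
ranked = rank_ideas(ideas)
```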


🔄 Workflows

These skills compose into a full research lifecycle. The two workflows can be used independently or chained together:

  • Exploring a new area (e.g., writing a survey)? Start with Workflow 1 → /idea-discovery
  • Already have an idea + initial plan? Jump straight to Workflow 2 → /auto-review-loop
  • Full pipeline? Workflow 1 → Workflow 2 → /research-pipeline — from literature survey all the way to submission

⚠️ Important: These tools accelerate research, but they don't replace your own critical thinking. Always review generated ideas with your domain expertise, question the assumptions, and make the final call yourself. The best research comes from human insight + AI execution, not full autopilot.

Full Pipeline 🚀

/research-lit → /idea-creator → /novelty-check → implement → /run-experiment → /auto-review-loop → submit
  (survey)      (brainstorm)    (verify novel)    (code)      (deploy & run)    (review & fix)     (done!)
  ├──── Workflow 1: Idea Discovery ────┤              ├──────── Workflow 2: Auto Loop ────────┤

📝 Blog post: Open-sourcing the full research-in-your-sleep pipeline (in Chinese)

Workflow 1: Literature & Idea Discovery 🔍

"What's the state of the art? Where are the gaps?"

┌─────────────────────────────────────────────────────────────┐
│                  Idea Discovery                              │
│                                                              │
│   /research-lit     /idea-creator     /novelty-check         │
│   (find papers)     (brainstorm)      (verify novelty)       │
│         │                │                  │                │
│         ▼                ▼                  ▼                │
│   ┌──────────┐     ┌──────────┐       ┌──────────┐         │
│   │ Scan     │────▶│ Generate │──────▶│ Check if │         │
│   │ local    │     │ 8-12     │       │ idea is  │         │
│   │ papers + │     │ ideas    │       │ novel    │         │
│   │ search   │     │ + rank   │       │          │         │
│   └──────────┘     └──────────┘       └──────────┘         │
│                          │                  │                │
│                          ▼                  ▼                │
│                    ┌──────────┐       ┌──────────┐         │
│                    │ Filter   │──────▶│ External │         │
│                    │ by cost, │       │ LLM      │         │
│                    │ novelty  │       │ evaluates│         │
│                    └──────────┘       └──────────┘         │
│                                                              │
│   Typical flow:                                              │
│   1. /research-lit "discrete diffusion models"  (local → online) │
│   2. /idea-creator "DLLMs post training"               │
│   3. Review ranked ideas, pick top 2-3                       │
│   4. /novelty-check "top idea" (deep verification)           │
│   5. /research-review "top idea" (critical feedback)         │
│   6. Implement → /run-experiment → /auto-review-loop         │
└─────────────────────────────────────────────────────────────┘

Skills involved: research-lit + idea-creator + novelty-check + research-review

💡 One-command shortcut: /idea-discovery "your research direction" runs this entire workflow automatically.

🔄 Human-in-the-loop: Each phase presents results and waits for your feedback. Not happy? Tell it what's missing — it refines the prompt and regenerates. Trust the defaults? It auto-proceeds with the top-ranked option. You decide how hands-on to be.

⚙️ Pilot experiment budgets (max hours, timeout, GPU budget) are configurable — see Customization.

📝 Blog post: A two-month NeurIPS guide with Claude Code (in Chinese)

Workflow 2: Auto Research Loop 🔁 (sleep & wake up to results)

"Review my paper, fix what's wrong, repeat until it's good."

┌─────────────────────────────────────────────────────────────┐
│                    Auto Review Loop                          │
│                                                              │
│   /research-review          /auto-review-loop                │
│   (single deep review)      (autonomous loop)                │
│         │                         │                          │
│         ▼                         ▼                          │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│   │ External │──▶│ Implement│──▶│ Monitor  │──▶ repeat     │
│   │ LLM      │   │ fixes &  │   │ results  │    until      │
│   │ reviews  │   │ run exps │   │          │    score ≥ 6  │
│   └──────────┘   └──────────┘   └──────────┘               │
│                                                              │
│   When reviewer suggests a new method direction:             │
│   /novelty-check — verify idea isn't already published       │
│                                                              │
│   Supporting skills:                                         │
│   /run-experiment    — deploy to local/remote GPU            │
│   /analyze-results   — interpret experiment outputs          │
│   /monitor-experiment — check progress, collect results      │
└─────────────────────────────────────────────────────────────┘

Skills involved: auto-review-loop + research-review + novelty-check + run-experiment + analyze-results + monitor-experiment

💡 One-command shortcut: /auto-review-loop "your paper topic" runs this entire workflow automatically.

🛡️ Key safety features:

  • 🔒 MAX_ROUNDS = 4 — prevents infinite loops; stops early if score threshold is met
  • ⏱️ > 4 GPU-hour experiments skipped — won't launch massive jobs; flags them for manual follow-up
  • 🧠 Prefer reframing over new experiments — when both can address a weakness, chooses the cheaper path
  • 🪞 No hiding weaknesses — explicit rule: "Do NOT hide weaknesses to game a positive score"
  • 🔧 Fix before re-review — must actually implement fixes before resubmitting; no empty promises

⚙️ MAX_ROUNDS, score threshold, and GPU limits are configurable — see Customization.
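Under the defaults, the loop's control flow amounts to something like this sketch (hypothetical Python; `review` and `implement_fixes` stand in for the real Codex MCP call and Claude Code's fix-and-experiment step):

```python
# Hypothetical sketch of the auto-review-loop safety logic with the
# default constants; not the skill's actual implementation.
MAX_ROUNDS = 4          # prevents infinite loops
POSITIVE_THRESHOLD = 6.0  # stop early once submission-ready
GPU_HOUR_LIMIT = 4      # experiments above this are flagged, not run

def auto_review_loop(review, implement_fixes):
    history = []
    for _ in range(MAX_ROUNDS):
        score, fixes = review()          # external LLM scores the paper
        history.append(score)
        if score >= POSITIVE_THRESHOLD:
            break                        # threshold met: stop early
        # Skip experiments over the GPU-hour budget (manual follow-up).
        cheap = [f for f in fixes if f.get("gpu_hours", 0) <= GPU_HOUR_LIMIT]
        implement_fixes(cheap)           # fix before re-review
    return history
```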

📝 Blog post: Open source: Claude runs experiments and revises your paper while you sleep (in Chinese)


🧰 All Skills

| Skill | Description | Needs Codex MCP? |
|-------|-------------|------------------|
| 💡 idea-creator | Generate and rank research ideas given a broad direction (brainstorm + filter + validate) | Yes |
| 🔬 research-review | Single-round deep review from external LLM (xhigh reasoning) | Yes |
| 🔁 auto-review-loop | Autonomous multi-round review→fix→re-review loop (max 4 rounds) | Yes |
| 📚 research-lit | Scan local paper library + search online, analyze related work, find gaps | No |
| 📊 analyze-results | Analyze experiment results, compute statistics, generate insights | No |
| 👀 monitor-experiment | Monitor running experiments, check progress, collect results | No |
| 🔍 novelty-check | Verify research idea novelty against recent literature before implementing | Yes |
| 🚀 run-experiment | Deploy experiments to local (MPS/CUDA) or remote GPU servers | No |
| 🎨 pixel-art | Generate pixel art SVG illustrations for READMEs, docs, or slides | No |
| 🔭 idea-discovery | Workflow 1 pipeline: research-lit → idea-creator → novelty-check → research-review | Yes |
| 🏗️ research-pipeline | Full pipeline: Workflow 1 → implement → Workflow 2, from direction to submission | Yes |

⚙️ Setup

Prerequisites

  1. Claude Code installed
  2. (For review skills) Codex CLI installed and configured as MCP server:
    npm install -g @openai/codex
    claude mcp add codex -s user -- codex mcp-server
    

Install Skills

git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
cd Auto-claude-code-research-in-sleep

# Install all skills globally
cp -r skills/* ~/.claude/skills/

# Or install specific skills
cp -r skills/auto-review-loop ~/.claude/skills/
cp -r skills/research-lit ~/.claude/skills/

Usage

> /idea-creator DLLMs post training
> /research-lit discrete diffusion language models
> /research-review my paper on training dynamics in D-LLMs
> /auto-review-loop ML paper on factorized gap diagnosis
> /run-experiment train.py --lr 1e-4 --epochs 100
> /analyze-results figures/*.json
> /monitor-experiment server5
> /idea-discovery discrete diffusion language models
> /research-pipeline DLLMs post training

🌙 Auto-Allow for Overnight Runs (Optional)

To run the auto-review loop without clicking permission prompts, add to .claude/settings.local.json:

{
  "permissions": {
    "allow": [
      "mcp__codex__codex",
      "mcp__codex__codex-reply",
      "Write",
      "Edit",
      "Skill(auto-review-loop)"
    ]
  }
}

🖥️ GPU Server Setup (For Auto-Experiments)

When GPT-5.4 says "run an ablation study" or "add a baseline comparison", Claude Code automatically writes the experiment script and deploys it to your GPU server. For this to work, Claude Code needs to know your server environment.

Add your server info to your project's CLAUDE.md:

## Remote Server

- SSH: `ssh my-gpu-server` (key-based auth, no password)
- GPU: 4x A100
- Conda env: `research` (Python 3.10 + PyTorch)
- Activate: `eval "$(/opt/conda/bin/conda shell.bash hook)" && conda activate research`
- Code directory: `/home/user/experiments/`
- Use `screen` for background jobs: `screen -dmS exp0 bash -c '...'`

Claude Code reads this and knows how to SSH in, activate the environment, and launch experiments. GPT-5.4 (the reviewer) only decides what experiments to run — Claude Code figures out how based on your CLAUDE.md.
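Putting those facts together, the launch command Claude Code composes might look like this sketch (hypothetical Python; the helper name is illustrative and the shell quoting is deliberately simplified):

```python
# Hypothetical sketch of composing a remote launch command from the
# CLAUDE.md facts above (host alias, conda env, code directory).
# Real shell quoting of nested commands needs more care than shown here.
def remote_launch(session, script, host="my-gpu-server",
                  env="research", workdir="/home/user/experiments"):
    activate = ('eval "$(/opt/conda/bin/conda shell.bash hook)" '
                '&& conda activate ' + env)
    inner = f"cd {workdir} && {activate} && python {script}"
    # screen -dmS starts a detached, named background session
    return f"ssh {host} \"screen -dmS {session} bash -c '{inner}'\""

cmd = remote_launch("exp0", "train.py --lr 1e-4 --epochs 100")
```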

No server? The review and rewriting skills still work without GPU access. Only experiment-related fixes will be skipped (flagged for manual follow-up).

🏗️ How It Works

┌─────────────────────────────────────────────────┐
│                 Claude Code                      │
│                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │  Read     │    │  Write   │    │  SSH to  │   │
│  │  project  │───▶│  code &  │───▶│  GPU     │   │
│  │  context  │    │  scripts │    │  server  │   │
│  └──────────┘    └──────────┘    └──────────┘   │
│       │                               │          │
│       ▼                               ▼          │
│  ┌──────────────────────────────────────────┐    │
│  │         Codex MCP (External LLM)         │    │
│  │                                          │    │
│  │  Round 1: "Score 5/10. Weaknesses: ..."  │    │
│  │  Round 2: "Score 6.5. Better, but ..."   │    │
│  │  Round 3: "Score 7.0. Almost there..."   │    │
│  │  Round 4: "Score 7.5. Ready." ✅         │    │
│  └──────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘

The key insight: Claude Code handles execution (reading files, writing code, running experiments, collecting results) while the external LLM handles evaluation (scoring, identifying weaknesses, suggesting fixes). This separation creates a genuine feedback loop — neither model is grading its own work.

🎛️ Customization

Skills are plain Markdown files. Fork and customize:

Auto Review Loop (auto-review-loop)

| Constant | Default | Description |
|----------|---------|-------------|
| MAX_ROUNDS | 4 | Maximum review→fix→re-review iterations |
| POSITIVE_THRESHOLD | 6/10 | Score at which the loop stops (submission-ready) |
| GPU-hour skip | 4h | Experiments exceeding this are flagged for manual follow-up |

Idea Discovery (idea-discovery / idea-creator)

| Constant | Default | Description |
|----------|---------|-------------|
| PILOT_MAX_HOURS | 2h | Skip any pilot estimated to take longer per GPU |
| PILOT_TIMEOUT_HOURS | 3h | Hard timeout — kill runaway pilots, collect partial results |
| MAX_PILOT_IDEAS | 3 | Maximum number of ideas to pilot in parallel |
| MAX_TOTAL_GPU_HOURS | 8h | Total GPU budget across all pilots |
| AUTO_PROCEED | true | Auto-continue with top-ranked option if user doesn't respond. Set false to always wait for explicit approval |

Override inline: /idea-discovery "topic" — pilot budget: 4h per idea, wait for my approval at each step
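A minimal sketch of how these budgets could gate pilot selection (hypothetical Python; `estimated_hours` is an illustrative field, not the skill's actual interface):

```python
# Hypothetical sketch of pilot selection under the budget constants above.
PILOT_MAX_HOURS = 2
MAX_PILOT_IDEAS = 3
MAX_TOTAL_GPU_HOURS = 8

def select_pilots(candidates):
    """candidates: list of (name, estimated_hours), ranked best-first."""
    selected, total = [], 0.0
    for name, hours in candidates:
        if len(selected) == MAX_PILOT_IDEAS:
            break                                 # parallel-pilot cap reached
        if hours > PILOT_MAX_HOURS:
            continue                              # too long per GPU: skip
        if total + hours > MAX_TOTAL_GPU_HOURS:
            continue                              # would blow the total budget
        selected.append(name)
        total += hours
    return selected
```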

Literature Search (research-lit)

| Constant | Default | Description |
|----------|---------|-------------|
| PAPER_LIBRARY | papers/, literature/ | Local directories to scan for PDFs before searching online |
| MAX_LOCAL_PAPERS | 20 | Max local PDFs to scan (first 3 pages each) |

Override inline: /research-lit "topic" — paper library: ~/Zotero/storage/
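The local scan itself reduces to something like this (hypothetical Python sketch of the defaults above; the function name is illustrative):

```python
# Hypothetical sketch: gather up to MAX_LOCAL_PAPERS PDFs from the
# configured library directories before falling back to online search.
from pathlib import Path

PAPER_LIBRARY = ["papers/", "literature/"]
MAX_LOCAL_PAPERS = 20

def collect_local_pdfs(root="."):
    pdfs = []
    for lib in PAPER_LIBRARY:
        d = Path(root) / lib
        if d.is_dir():
            pdfs.extend(sorted(d.glob("*.pdf")))  # deterministic order
    return pdfs[:MAX_LOCAL_PAPERS]                 # cap the scan
```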

General (all skills using Codex MCP)

| Constant | Default | Description |
|----------|---------|-------------|
| REVIEWER_MODEL | gpt-5.4 | OpenAI model used via Codex MCP. Options: gpt-5.4, o3, gpt-4o, etc. |
  • Prompt templates — tailor the review persona and evaluation criteria
  • allowed-tools — restrict or expand what each skill can do

🔀 Alternative Model Combinations

Don't have Claude / OpenAI API access? You can swap in other models — same cross-model architecture, different providers.

| Role | Default | Alt A: GLM + GPT | Alt B: GLM + MiniMax |
|------|---------|------------------|----------------------|
| Executor (Claude Code) | Claude Opus/Sonnet | GLM-5 (ZhiPu API) | GLM-5 (ZhiPu API) |
| Reviewer (Codex MCP) | GPT-5.4 | GPT-5.4 (OpenAI API) | MiniMax-M2.5 (MiniMax API) |
| Need OpenAI API? | Yes | Yes | No |

Step 1: Install Claude Code & Codex CLI

npm install -g @anthropic-ai/claude-code
npm install -g @openai/codex

Step 2: Configure ~/.claude/settings.json

Open with: nano ~/.claude/settings.json

Alt A: GLM (executor) + GPT (reviewer) — Only replace Claude, keep GPT-5.4 as reviewer
{
    "env": {
        "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key",
        "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
        "API_TIMEOUT_MS": "3000000",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.5-air",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7",
        "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5"
    },
    "mcpServers": {
        "codex": {
            "command": "/opt/homebrew/bin/codex",
            "args": [
                "mcp-server"
            ]
        }
    }
}

Codex CLI uses your existing OPENAI_API_KEY (from ~/.codex/config.toml or environment) — no extra config needed for the reviewer side.

Alt B: GLM (executor) + MiniMax (reviewer) — No Claude or OpenAI API needed
{
    "env": {
        "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key",
        "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
        "API_TIMEOUT_MS": "3000000",
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.5-air",
        "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7",
        "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5",
        "CODEX_API_KEY": "your_minimax_api_key",
        "CODEX_API_BASE": "https://api.minimax.chat/v1/",
        "CODEX_MODEL": "MiniMax-M2.5"
    },
    "mcpServers": {
        "codex": {
            "command": "/opt/homebrew/bin/codex",
            "args": [
                "mcp-server"
            ]
        }
    }
}

Save: Ctrl+O → Enter → Ctrl+X

Step 3: Install Skills & Run

git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git
cd Auto-claude-code-research-in-sleep
cp -r skills/* ~/.claude/skills/

# Launch Claude Code (now powered by GLM)
claude

Step 4: Let GLM Read the Project ⚠️ IMPORTANT

🔴 Do NOT skip this step. GLM's prompt handling differs from Claude's. You must let GLM read through the project once to ensure skills are correctly parsed.

After launching claude, run in the conversation:

Read through this project and verify all skills are working:
/idea-creator, /research-review, /auto-review-loop, /novelty-check,
/idea-discovery, /research-pipeline, /research-lit, /run-experiment,
/analyze-results, /monitor-experiment, /pixel-art

For each skill, confirm: (1) it loads without errors, (2) the frontmatter is parsed correctly.

This lets GLM (acting as Claude Code) familiarize itself with the skill files and catch any compatibility issues upfront — rather than discovering them mid-workflow when it's expensive to fail.

⚠️ Note: Alternative models may behave differently from Claude and GPT-5.4. You may need to adjust REVIEWER_MODEL in the skills and tune prompt templates for best results. The core cross-model architecture remains the same.

📋 Roadmap

  • Zotero MCP integration — read papers, tags, and annotations directly from Zotero library
  • More executor × reviewer combinations (Gemini, DeepSeek, etc.)

💬 Community

Join the WeChat group for discussion on Claude Code + AI-driven research workflows:

WeChat Group QR Code

⭐ Star History

GitHub stars

Star History Chart

License

MIT
