Datacurve

Active

Frontier coding data for training and evaluating LLMs

Winter 2024 · Founded 2024 · 4 people · San Francisco, CA, USA

datacurve.ai/

About

We generate expert-quality coding data at scale for fine-tuning LLMs.

From their website

as of Jun 7, 2026 · datacurve.ai

Services

Datacurve positions itself as a data engine for frontier AI, focusing on custom data, evaluation, and research infrastructure to advance long-horizon reasoning, software engineering, and data science. It emphasizes data collection, rigorous annotations, and benchmarking to improve model capabilities.

Datacurve provides research and data collection infrastructure for frontier AI, including datasets and benchmarks (e.g., DeepSWE long-horizon coding benchmark) used to evaluate and iterate AI models. The platform centers on curated data, annotations, and evaluation to guide model learning and capability development across long-horizon reasoning tasks and software/data science workflows.

Who it’s for: AI research labs, academic research groups, and enterprises building or evaluating frontier AI models that require long-horizon reasoning, coding benchmarks, and rigorous data pipelines for model training and evaluation.

Features

long-horizon coding benchmark
custom data for AI training
data collection infrastructure
evaluation and reinforcement workflows
datasets for reasoning and software engineering

The focus is on research infrastructure and benchmarking. The site's product, research, and careers pages suggest active development and hiring.

Founders · 2

Serena Ge · Founder

Waterloo

Started building software in high school, including a climbing training app built with Team Canada athletes. Studied CS at Waterloo for a year, then dropped out. Worked with the Cohere CTO on LLM reasoning and coding capabilities through synthetic data. Went through YC W24 and pivoted three times before landing on Datacurve. Now scaling high-quality coding data production pipelines at Datacurve to enable next-generation coding models.

Charley Lee · Founder

Google

Waterloo

Hacking on things since middle school. Went to Waterloo CS, interned at Google, then dove into AI research on multi-modal RL and training browser-use agents. Went through YC W24, pivoted a few times, and landed on Datacurve – now providing the data infrastructure for frontier LLMs.

Launch

Launched on Y Combinator · Mar 2024

Providing code data by the best engineers, so you can build the most capable model

Datacurve provides expert-quality code data at scale generated by skilled software engineers. It serves AI dev-tool startups and foundation model labs with data for tasks like code completion, debugging, refactoring, and code generation to improve model capabilities.

From the original launch (Mar 2024) — may be outdated.