Explore Neptune Scale: the experiment tracker for foundation models → Tour a live project 📈

The experiment tracker for foundation model training

If it’s not responsive,
it’s not working

It’s hard to iterate fast enough on massive model training when your experiment tracker lags behind you. Neptune makes it easy to monitor months-long jobs and visualize massive amounts of data in near real-time — with 100% accuracy. Without crashing the UI. So you can find failing runs in less time — and eliminate wasted spend.

Other experiment trackers can’t handle
the scale of your training:

Poor responsiveness:
With other experiment trackers, you wait hours for run data to load, charts to render, or search to respond. Staying focused is a challenge. You think fast; your tools should, too.
Poor accuracy:
Other tools take shortcuts like downsampling your data. And when you can’t see all your metrics, missing errors in your models is easy. Can you really be confident in your work if you can’t catch every single spike?
Poor architecture:
Other experiment trackers can’t ingest all your data. Your data warehouse wasn’t built to analyze training metrics. That puts you between a rock and a hard place – with your models caught in the middle.

To train at hyperscale without the headwind – you need Neptune

Responsive & accurate UI at scale

View and analyze thousands of metrics
in milliseconds

With Neptune’s web app, you can render tables with 100k+ runs. Or compare thousands of metrics on a single chart — minus the screen freeze you get with other tools.

And, because we don’t downsample data, your visualizations are 100% accurate – down to a single metric spike.

Real-time experiment tracking with confidence is now a reality.

COMING SOON: Forking of runs

Track months-long model training
with more confidence


Forking new runs from any step of your experiment makes it possible to:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step. No more wasting millions on training experiments that won’t converge.
  • Restart failed training sessions from any previous step. Your training history is inherited. And you can see your entire experiment on a single chart. No more wasting time on workarounds that give you inconsistent results. 

Forking of runs is only available in Neptune Scale at the moment. Request early access to this version.
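
To give a rough sense of how forking could look in code, here is a minimal sketch written against the neptune-scale client. Treat the run IDs, metric names, and the fork_run_id / fork_step parameters as illustrative assumptions rather than the final API; the early-access documentation is the source of truth.

from neptune_scale import Run

# Sketch only: IDs, metric names, and the fork_* parameters below are
# assumptions for illustration, not the confirmed Neptune Scale API.

# A long-running base training job that logs the loss at every step.
base = Run(run_id="pretrain-base")
for step in range(1, 150_001):
    loss = ...  # computed by your training step
    base.log_metrics(data={"train/loss": loss}, step=step)

# Fork a new run from step 100,000 of the base run to try a different
# learning rate. The fork inherits the base run's history up to that
# step, so both runs show up on a single chart.
fork = Run(
    run_id="pretrain-lr-3e-4",
    fork_run_id="pretrain-base",  # assumed parameter name
    fork_step=100_000,            # assumed parameter name
)
fork.log_configs({"optimizer/learning_rate": 3e-4})
for step in range(100_001, 150_001):
    loss = ...
    fork.log_metrics(data={"train/loss": loss}, step=step)

fork.close()
base.close()

Runs that stop improving can simply be closed, while the most promising fork continues from the last good step.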

Self-hosted deployment

Deploy on-prem or in your private cloud
— from day one

Unlike other tools, we built Neptune’s architecture, data model, and algorithms for maximum scalability.

For example, Neptune can ingest 100k data points per second — asynchronously (based on Kafka).

So you can track all the metrics, results, and metadata you generate — while keeping your data safe.
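
As a client-side illustration of that throughput, the sketch below logs metrics at every step of a long training loop using the Neptune Python client shown elsewhere on this page. Logged values are buffered locally and synced in the background, so logging doesn't block the training loop; the field names and loop length are made up for the example.

import neptune

# Illustrative only: field names and the number of steps are made up.
run = neptune.init_run()

for step in range(1_000_000):
    loss, grad_norm = ...  # computed by your training step
    # append() hands the value to a local buffer; synchronization with
    # the Neptune server happens asynchronously in the background.
    run["train/loss"].append(loss, step=step)
    run["train/grad_norm"].append(grad_norm, step=step)

# stop() flushes whatever is still waiting in the local buffer.
run.stop()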

30+ integrations

Speaks fluently with your stack

  • Any code
  • Training frameworks
  • HPO frameworks
  • Automation frameworks
Any code
import neptune

# Connect to Neptune and create a run
run = neptune.init_run()

# Log hyperparameters
run["parameters"] = {
    "batch_size": 64,
    "dropout": 0.5,
    "optimizer": {"type": "SGD", "learning_rate": 0.001},
}
# Log dataset versions
run["data/train_version"].track_files("train/images")

# Log the training process
for epoch in range(100):
    accuracy = ...
    run["train/accuracy"].append(accuracy)

# Log test metrics and charts
run["test/f1_score"] = test_score
run["test/confusion_matrix"].upload(fig)

# Log model weights and versions
run["model/weights"].upload("my_model.pkl")

# Stop logging to your run
run.stop()
Training frameworks
import neptune
from neptune_pytorch import NeptuneLogger

run = neptune.init_run()
neptune_logger = NeptuneLogger(
    run=run,
    model=model,
)
from lightning.pytorch.loggers import NeptuneLogger

neptune_logger = NeptuneLogger()

trainer = Trainer(
    ...,
    logger=neptune_logger,
)

trainer.fit(...)
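
# Hugging Face Transformers: send training metrics to Neptune
# by setting report_to="neptune" in TrainingArguments.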
training_args = TrainingArguments(
    ...
    report_to="neptune",
)

trainer = Trainer(
    model,
    training_args,
    ...,
)
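
# Manual logging with the Neptune client inside any training loop
# (this example tracks data files, parameters, metrics, and a Keras model).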
# Track and version data files used for training
run["datasets/version"].track_files("s3://path/to/object")

# Log training parameters
params = {
    "num_epochs": 10,
    ...,
}
run["training/model/params"] = params

# Log metrics in the training loop
for epoch in range(params["num_epochs"]):
    ...

    # Log metrics for the epoch
    run["training/train/loss"].append(loss)
    run["training/train/accuracy"].append(accuracy)
  
# Upload trained model to Neptune
model.save("my_model.keras")
run["model"].upload("my_model.keras")
import neptune
from neptune_tensorflow_keras import NeptuneCallback

run = neptune.init_run()
neptune_callback = NeptuneCallback(run=run)

model.fit(
    ...,
    callbacks=[neptune_callback],
)
from composer.loggers import NeptuneLogger

trainer = Trainer(
    ...,
    loggers=NeptuneLogger(),
)
import neptune
from neptune_sklearn import create_classifier_summary

run = neptune.init_run()

run["cls_summary"] = create_classifier_summary(
    classifier,
    X_train,
    X_test,
    y_train,
    y_test,
)
import lightgbm as lgbm
import neptune
from neptune_lightgbm import NeptuneCallback, create_booster_summary

run = neptune.init_run()
neptune_callback = NeptuneCallback(run=run)

# Log training metrics live
gbm = lgbm.train(
    ...,
    callbacks=[neptune_callback],
)

# Log model summary after training
run["lgbm_summary"] = create_booster_summary(booster=gbm)
import neptune
import xgboost as xgb
from neptune_xgboost import NeptuneCallback

run = neptune.init_run()
neptune_callback = NeptuneCallback(run=run)

xgb.train(
    ...,
    callbacks=[neptune_callback],
)
HPO frameworks
import neptune
from neptune_optuna import NeptuneCallback

run = neptune.init_run()
neptune_callback = NeptuneCallback(run)

...
study.optimize(
    ...,
    callbacks=[neptune_callback],
)
Automation frameworks
from neptune_airflow import NeptuneLogger

with DAG(
    ...
) as dag:

    def your_task(**context):
        logger = NeptuneLogger()
        return task_results(logger, **context)
import neptune

# In a pipeline node that receives a Neptune run handler
# (for example, via the Kedro-Neptune plugin)
def step_function(
    # ... other node inputs ...
    neptune_run: neptune.handler.Handler,
):
    ...
    neptune_run["field"] = value
    ...
from zenml.integrations.neptune.experiment_trackers.run_state import (
    get_neptune_run
)

@step(experiment_tracker="neptune_tracker", ...)
def my_step():
    neptune_run = get_neptune_run()
    neptune_run["sys/name"] = "My custom run name" 
    neptune_run["params/lr"] = params.lr 
    ...

Loved by 60,000+ researchers. Trusted by enterprises.

Ronen Ben-David Algorithms Team Lead, HP
We primarily use Neptune for training monitoring, particularly for loss tracking, which is crucial to decide whether to stop training if it’s not converging properly. It’s also invaluable for comparing experiments and presenting key insights through an intuitive dashboard to our managers and product owners. And we really appreciate the excellent chat support.
Vadim Markovtsev Founding Engineer at poolside
I really appreciate that I’ve never seen any outage in Neptune. And since we’re training an LLM, it’s super critical to not have any outages in our loss curve. Other than that, there are things you often take for granted in a product: reliability, flexibility, quality of support. Neptune nails those and gives us the confidence.

The largest models require
the most scalable experiment tracker

Interested in how Neptune can help you with that?