Launching jobs on Lakeflow

Lakeflow is Databricks' way to manage jobs on a cluster. Databricks lets you edit scripts in their UI (they call these "notebooks"). That's too simple for some of our more complex experiments. Thankfully, Databricks supports lots of other ways to submit jobs.

This tool is an opinionated way to spawn jobs on Databricks: it asks you to author your code as a Python package, which forces you to specify its dependencies. It then uploads that package (as a Python wheel) for Databricks to run. This is heavier-weight than Databricks' notebook approach of shipping scripts, but it lets you capture large package dependencies across repos via git submodules. It's lighter-weight than other job submission systems that operate on Docker containers. For most of our work, wheels provide all the containerization we need.

It has one more opinion: that uv is a good way to capture those Python dependencies, with a pyproject.toml.

Once you have your package defined in a pyproject.toml, you can use this tool to build the wheel, upload it to Databricks, and spawn copies of it with different command-line arguments. Databricks gives you a UI to check the state of your jobs.

The tool provides several interfaces:

A CLI you can use from the shell.
An MCP server you can connect to from an AI agent.
A set of Python functions you can call from a Python program.

Getting access to Databricks

Check if you have access to Databricks by visiting this url. If you get stuck in an infinite loop where Databricks sends you a code that doesn't work, it means you don't have an account. Ask for one in #help-data-platform.

Your package's structure

This tool assumes the package you want to run on the cluster has a structure like this, and that it can be run with uv run:

my_project/
├── pyproject.toml
├── src/
│   └── my_package/
│       ├── __init__.py
│       └── my_package_py.py

It also assumes you've added an entry point called "lakeflow-task" to your pyproject.toml. If your package is called my_package, its driver script is my_package_py.py, and the main function in that script is called main, you would define the "lakeflow-task" entry point like this:

[project.scripts]
lakeflow-task = "my_package.my_package_py:main"
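
As a rough illustration of what that entry point could point to, here is a minimal sketch of my_package_py.py. The helper name describe_run is hypothetical, and the sketch assumes the run's arguments arrive in sys.argv from index 1 onward, as they would for an ordinary console script. It also reads the DATABRICKS_RUN_ID environment variable, which this tool populates for each run.

```python
# src/my_package/my_package_py.py (hypothetical contents)
import os
import sys


def describe_run(argv, environ):
    """Summarize what a run received: its arguments and its run ID."""
    # Fall back to "unknown" when running locally, outside Databricks.
    run_id = environ.get("DATABRICKS_RUN_ID", "unknown")
    return f"run {run_id} got arguments {list(argv)}"


def main():
    # Assumption: sys.argv[0] is the program name and the rest are the
    # arguments passed when the run was triggered.
    print(describe_run(sys.argv[1:], os.environ))


if __name__ == "__main__":
    main()
```

Keeping the logic in a function that takes argv and the environment as parameters makes it easy to exercise locally before building the wheel.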

The package lakeflow_demo under this directory gives you a concrete example of this.

Building and launching your package with the CLI

To run the package on the cluster, first build the wheel, then upload it, then tell Databricks to run it:

Build the wheel:

uv run lakeflow.py build-wheel ~/my_project
# Output: /path/to/dist/my_package-0.1.0-py3-none-any.whl

This outputs the local wheel path, which we'll use in the next step.

Upload the wheel:

uv run lakeflow.py upload-wheel /path/to/dist/my_package-0.1.0-py3-none-any.whl
# Output: /Users/me/wheels/my_package-0.1.0-py3-none-any.whl

This outputs the remote wheel path, which we'll use in the next step.

Create the Job:
```
python lakeflow.py create-job \
  "my-lakeflow-job" \
  "my-package" \
  "/Users/me/wheels/my_package-0.1.0-py3-none-any.whl" \
  --max-workers 4
# Output: 123456 (Job ID)
```
This returns the job ID, which we'll use in the next step. This doesn't yet run any jobs. It just starts a cluster that can run them. The --max-workers argument sets the maximum number of workers for autoscaling.

Trigger a Run:
```
python lakeflow.py trigger-run 123456 argv1 argv2
```
This starts one instance of the job with the given arguments. If you have shards of data, you can call this operation multiple times with different arguments to kick off a bunch of jobs in parallel. argv will be populated with the arguments, and the environment variable DATABRICKS_RUN_ID will be populated with the run ID.

Monitor the Runs:
```
python lakeflow.py list-job-runs 123456
```
This lists the runs for the given job ID.
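
If you have many shards, typing one trigger-run command per shard gets tedious. The sketch below only builds the command lines, one per shard, using the trigger-run syntax shown above. The function name trigger_run_commands and the shard names are hypothetical, and actually launching the commands (for example with subprocess.run) is left to you.

```python
import shlex


def trigger_run_commands(job_id, shard_args):
    """Build one `trigger-run` command line per shard.

    shard_args is a list of argument lists, one per run.
    """
    return [
        ["python", "lakeflow.py", "trigger-run", str(job_id), *args]
        for args in shard_args
    ]


if __name__ == "__main__":
    # Hypothetical job ID and shard names.
    for cmd in trigger_run_commands(123456, [["shard-0"], ["shard-1"]]):
        # To launch for real, pass cmd to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```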

Using the Python programmatic interface

The steps above use the CLI. You might find it easier to call the package's Python functions directly instead. See run_lakeflow_demo.py for an example.

Using the MCP server

You can install this package as an MCP server. To do that, add this to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "lakeflow": {
      "command": "/Users/arahimi/.local/bin/uv",
      "args": [
        "run",
        "--quiet",
        "--directory",
        "/Users/arahimi/lakeflow-mcp",
        "python",
        "lakeflow.py"
      ],
      "env": {
        "DATABRICKS_HOST": "https://hims-machine-learning-staging-workspace.cloud.databricks.com",
        "DATABRICKS_TOKEN": "<your token>"
      }
    },
    ...
  }
}
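
If your mcp.json already lists other servers, you may prefer to merge the "lakeflow" entry in rather than edit the file by hand. The following is a minimal sketch of that merge. The function name add_lakeflow_server and all paths and values in the usage example are placeholders, and it only transforms a dictionary, so reading and writing the file is left to you.

```python
import json


def add_lakeflow_server(config, uv_path, project_dir, host, token):
    """Return a copy of an mcp.json config with a "lakeflow" server added."""
    # Copy the existing servers so the caller's dictionary is not mutated.
    servers = dict(config.get("mcpServers", {}))
    servers["lakeflow"] = {
        "command": uv_path,
        "args": ["run", "--quiet", "--directory", project_dir,
                 "python", "lakeflow.py"],
        "env": {"DATABRICKS_HOST": host, "DATABRICKS_TOKEN": token},
    }
    return {**config, "mcpServers": servers}


if __name__ == "__main__":
    # Placeholder values; substitute your own paths, host, and token.
    existing = {"mcpServers": {"other": {"command": "other-server"}}}
    updated = add_lakeflow_server(
        existing,
        uv_path="/path/to/uv",
        project_dir="/path/to/lakeflow-mcp",
        host="https://example.cloud.databricks.com",
        token="<your token>",
    )
    print(json.dumps(updated, indent=2))
```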

Then you can ask the agent to do things like this:

let's launch 4 copies of this job on lakeflow, and pass them the arguments "fi", "fie", "fo", and "fum" respectively.
