hub / github.com/databricks/spark-deep-learning

github.com/databricks/spark-deep-learning @v1.6.0 sqlite

repository ↗ · DeepWiki ↗ · release v1.6.0 ↗

484 symbols 1,615 edges 72 files 221 documented · 46%

README

Deep Learning Pipelines for Apache Spark

Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python with Apache Spark.

Overview
Building and running unit tests
Spark version compatibility
Support
Releases
Quick user guide
Working with images in Spark
Transfer learning
Distributed hyperparameter tuning
Applying deep learning models at scale
Deploying models as SQL functions
License

Overview

Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python with Apache Spark.

The library comes from Databricks and leverages Spark for its two strongest facets:

In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
It uses Spark's powerful distributed engine to scale out deep learning on massive datasets.

Currently, TensorFlow and TensorFlow-backed Keras workflows are supported, with a focus on: * large-scale inference / scoring * transfer learning and hyperparameter tuning on image data

Furthermore, it provides tools for data scientists and machine learning experts to turn deep learning models into SQL functions that can be used by a much wider group of users. It does not perform single-model distributed training - this is an area of active research, and here we aim to provide the most practical solutions for the majority of deep learning use cases.

For an overview of the library, see the Databricks blog post introducing Deep Learning Pipelines. For the various use cases the package serves, see the Quick user guide section below.

The library is in its early days, and we welcome everyone's feedback and contribution.

Maintainers: Bago Amirbekian, Joseph Bradley, Yogesh Garg, Sue Ann Hong, Tim Hunter, Siddharth Murching, Tomas Nykodym, Lu Wang

Deprecation

The following submodules and classes are deprecated and will be removed in the next release of sparkdl. Please use Pandas UDF instead. * sparkdl.graph * sparkdl.udf * sparkdl.KerasImageFileTransformer * sparkdl.KerasTransformer * sparkdl.DeepImagePredictor * sparkdl.DeepImageFeaturizer * sparkdl.TFImageTransformer * sparkdl.TFTransformer

The class sparkdl.KerasImageFileEstimator is deprecated and will be removed in the next release of sparkdl. To replace a KerasImageFileEstimator workflow, please use Distributed Hyperopt with SparkTrials to distribute model tuning.

Building and running unit tests

To compile this project, run build/sbt assembly from the project home directory. This will also run the Scala unit tests.

To run the Python unit tests, run the run-tests.sh script from the python/ directory (after compiling). You will need to set a few environment variables, e.g.

# Be sure to run build/sbt assembly before running the Python tests
sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.3.0-bin-hadoop2.7 PYSPARK_PYTHON=python3 SCALA_VERSION=2.11.8 SPARK_VERSION=2.3.0 ./python/run-tests.sh

Spark version compatibility

To work with the latest code, Spark 2.3.0 is required and Python 3.6 & Scala 2.11 are recommended . See the travis config for the regularly-tested combinations.

Compatibility requirements for each release are listed in the Releases section.

Support

You can ask questions and join the development discussion on the DL Pipelines Google group.

You can also post bug reports and feature requests in Github issues.

Releases

Visit Github Release Page to check the release notes.

Downloads and installation

Deep Learning Pipelines is published as a Spark Package. Visit the Spark Package page to download releases and find instructions for use with spark-shell, SBT, and Maven.

Quick user guide

Deep Learning Pipelines provides a suite of tools around working with and processing images using deep learning. The tools can be categorized as

Working with images in Spark : natively in Spark DataFrames
Transfer learning : a super quick way to leverage deep learning
Distributed hyperparameter tuning : via Spark MLlib Pipelines
Applying deep learning models at scale - to images : apply your own or known popular models to make predictions or transform them into features
Applying deep learning models at scale - to tensors : of up to 2 dimensions
Deploying models as SQL functions : empower everyone by making deep learning available in SQL.

To try running the examples below, check out the Databricks notebook in the Databricks docs for Deep Learning Pipelines, which works with the latest release of Deep Learning Pipelines. Here are some Databricks notebooks compatible with earlier releases: 0.1.0, 0.2.0, 0.3.0, 1.0.0, 1.1.0, 1.2.0.

Working with images in Spark

The first step to apply deep learning on images is the ability to load the images. Spark and Deep Learning Pipelines include utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.

Using Spark's ImageSchema

from pyspark.ml.image import ImageSchema
image_df = ImageSchema.readImages("/data/myimages")

or if custom image library is needed:

from sparkdl.image import imageIO as imageIO
image_df = imageIO.readImagesWithCustomFn("/data/myimages",decode_f=<your image library, see imageIO.PIL_decode>)

The resulting DataFrame contains a string column named "image" containing an image struct with schema == ImageSchema.

image_df.show()

Why images? Deep learning has shown to be powerful for tasks involving images, so we have added native Spark support for images. The goal is to support more data types, such as text and time series, based on community interest.

Transfer learning

Deep Learning Pipelines provides utilities to perform transfer learning on images, which is one of the fastest (code and run-time-wise) ways to start using deep learning. Using Deep Learning Pipelines, it can be done in just several lines of code.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])

model = p.fit(train_images_df)    # train_images_df is a dataset of images and labels

# Inspect training error
df = model.transform(train_images_df.limit(10)).select("image", "probability",  "uri", "label")
predictionAndLabels = df.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Training set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))

DeepImageFeaturizer supports the following models from Keras:

InceptionV3
Xception
ResNet50
VGG16
VGG19

Distributed hyperparameter tuning

Getting the best results in deep learning requires experimenting with different values for training parameters, an important step called hyperparameter tuning. Since Deep Learning Pipelines enables exposing deep learning training as a step in Spark’s machine learning pipelines, users can rely on the hyperparameter tuning infrastructure already built into Spark MLlib.

For Keras users

To perform hyperparameter tuning with a Keras Model, KerasImageFileEstimator can be used to build an Estimator and use MLlib’s tooling for tuning the hyperparameters (e.g. CrossValidator). KerasImageFileEstimator works with image URI columns (not ImageSchema columns) in order to allow for custom image loading and processing functions often used with keras.

To build the estimator with KerasImageFileEstimator, we need to have a Keras model stored as a file. The model could be Keras built-in model or user trained model.

from keras.applications import InceptionV3

model = InceptionV3(weights="imagenet")
model.save('/tmp/model-full.h5')

We also need to create an image loading function that reads the image data from a URI, preprocesses them, and returns the numerical tensor in the keras Model input format. Then, we can create a KerasImageFileEstimator that takes our saved model file.

import PIL.Image
import numpy as np
from keras.applications.imagenet_utils import preprocess_input
from sparkdl.estimators.keras_image_file_estimator import KerasImageFileEstimator

def load_image_from_uri(local_uri):
  img = (PIL.Image.open(local_uri).convert('RGB').resize((299, 299), PIL.Image.ANTIALIAS))
  img_arr = np.array(img).astype(np.float32)
  img_tnsr = preprocess_input(img_arr[np.newaxis, :])
  return img_tnsr

estimator = KerasImageFileEstimator( inputCol="uri",
                                     outputCol="prediction",
                                     labelCol="one_hot_label",
                                     imageLoader=load_image_from_uri,
                                     kerasOptimizer='adam',
                                     kerasLoss='categorical_crossentropy',
                                     modelFile='/tmp/model-full-tmp.h5' # local file path for model
                                   )

We can use it for hyperparameter tuning by doing a grid search using CrossValidataor.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = (
  ParamGridBuilder()
  .addGrid(estimator.kerasFitParams, [{"batch_size": 32, "verbose": 0},
                                      {"batch_size": 64, "verbose": 0}])
  .build()
)
bc = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label" )
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=bc, numFolds=2)

cvModel = cv.fit(train_df)

Applying deep learning models at scale

Spark DataFrames are a natural construct for applying deep learning models to a large-scale dataset. Deep Learning Pipelines provides a set of Spark MLlib Transformers for applying TensorFlow Graphs and TensorFlow-backed Keras Models at scale. The Transformers, backed by the Tensorframes library, efficiently handle the distribution of models and data to Spark workers.

Applying deep learning models at scale to images

Deep Learning Pipelines provides several ways to apply models to images at scale: * Popular images models can be applied out of the box, without requiring any TensorFlow or Keras code * TensorFlow graphs that work on images * Keras models that work on images

Applying popular image models

There are many well-known deep learning models for images. If the task at hand is very similar to what the models provide (e.g. object recognition with ImageNet classes), or for pure exploration, one can use the Transformer DeepImagePredictor by simply specifying the model name.

from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

image_df = ImageSchema.readImages(sample_img_dir)

predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)

DeepImagePredictor supports the same set of models from Ke

Core symbols most depended-on inside this repo

run

called by 16

python/sparkdl/graph/builder.py

importGraphFunction

called by 10

python/sparkdl/graph/builder.py

_reverseChannels

called by 9

python/sparkdl/image/imageIO.py

asGraphFunction

called by 9

python/sparkdl/graph/builder.py

imageArrayToStruct

called by 8

python/sparkdl/image/imageIO.py

getInputCol

called by 8

python/sparkdl/transformers/named_image.py

getOutputCol

called by 8

python/sparkdl/transformers/named_image.py

_getSampleJPEGDir

called by 7

python/tests/transformers/image_utils.py

Shape

Method 280

Function 132

Class 66

Route 6

Languages

Python98%

TypeScript2%

Modules by API surface

python/sparkdl/param/shared_params.py38 symbols

python/sparkdl/transformers/keras_applications.py34 symbols

python/sparkdl/transformers/named_image.py32 symbols

python/tests/graph/test_import.py24 symbols

python/sparkdl/transformers/tf_image.py19 symbols

python/tests/graph/test_utils.py18 symbols

python/tests/transformers/named_image_test.py17 symbols

python/sparkdl/image/imageIO.py17 symbols

python/sparkdl/estimators/keras_image_file_estimator.py15 symbols

dev/run.py15 symbols

python/tests/tests.py14 symbols

python/tests/image/test_imageIO.py14 symbols

Dependencies from manifests, versioned

Deprecated1.2.7 · 1×

PyNaCl1.2.1 · 1×

cloudpickle0.5.2 · 1×

coverage4.4.1 · 1×

h5py2.7.0 · 1×

horovod0.16.0 · 1×

keras2.2.4 · 1×

nose1.3.7 · 1×

pandas0.19.1 · 1×

parameterized0.6.1 · 1×

paramiko2.4.0 · 1×

pillow4.1.1 · 1×

For agents

$ claude mcp add spark-deep-learning \
  -- python -m otcore.mcp_server <graph>

⬇ download graph artifact