github.com/databricks/spark-deep-learning @v1.6.0 sqlite

484 symbols · 1,615 edges · 72 files · 221 documented (46%)
README

Deep Learning Pipelines for Apache Spark

Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python with Apache Spark.

Overview

The library comes from Databricks and leverages Spark for its two strongest facets:

  1. In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
  2. It uses Spark's powerful distributed engine to scale out deep learning on massive datasets.

Currently, TensorFlow and TensorFlow-backed Keras workflows are supported, with a focus on:

  • large-scale inference / scoring
  • transfer learning and hyperparameter tuning on image data

Furthermore, it provides tools for data scientists and machine learning experts to turn deep learning models into SQL functions that can be used by a much wider group of users. It does not perform single-model distributed training - this is an area of active research, and here we aim to provide the most practical solutions for the majority of deep learning use cases.

For an overview of the library, see the Databricks blog post introducing Deep Learning Pipelines. For the various use cases the package serves, see the Quick user guide section below.

The library is in its early days, and we welcome everyone's feedback and contribution.

Maintainers: Bago Amirbekian, Joseph Bradley, Yogesh Garg, Sue Ann Hong, Tim Hunter, Siddharth Murching, Tomas Nykodym, Lu Wang

Deprecation

The following submodules and classes are deprecated and will be removed in the next release of sparkdl. Please use Pandas UDF instead.

  • sparkdl.graph
  • sparkdl.udf
  • sparkdl.KerasImageFileTransformer
  • sparkdl.KerasTransformer
  • sparkdl.DeepImagePredictor
  • sparkdl.DeepImageFeaturizer
  • sparkdl.TFImageTransformer
  • sparkdl.TFTransformer

The class sparkdl.KerasImageFileEstimator is deprecated and will be removed in the next release of sparkdl. To replace a KerasImageFileEstimator workflow, please use Distributed Hyperopt with SparkTrials to distribute model tuning.
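
As a rough illustration of the Pandas UDF route, the sketch below scores rows in batches. The function `score_features`, the helper `make_scoring_udf`, the column name "features" and the "double" return type are hypothetical placeholders, not part of sparkdl; a real workflow would load a model and call its prediction method on each batch. The UDF wrapper assumes pyspark is installed together with pandas and pyarrow.

```python
def score_features(features):
    """Hypothetical stand-in for a model: returns the mean of one feature vector."""
    n = len(features)
    return float(sum(features)) / n if n else 0.0


def make_scoring_udf():
    """Wrap score_features in a scalar Pandas UDF (requires pyspark, pandas, pyarrow)."""
    # Imported lazily so score_features stays usable without Spark installed.
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    def score_batch(batch):
        # `batch` is a pandas.Series holding one batch of rows from the input column.
        return batch.map(score_features)

    return pandas_udf(score_batch, returnType="double",
                      functionType=PandasUDFType.SCALAR)


# Usage on a hypothetical DataFrame `df` with an array<double> column "features":
#   scored_df = df.withColumn("score", make_scoring_udf()(df["features"]))
```

Because the UDF receives whole batches rather than single rows, a model can be loaded once per batch (or cached per worker) instead of once per row.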

Building and running unit tests

To compile this project, run build/sbt assembly from the project home directory. This will also run the Scala unit tests.

To run the Python unit tests, run the run-tests.sh script from the python/ directory (after compiling). You will need to set a few environment variables, e.g.

# Be sure to run build/sbt assembly before running the Python tests
sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.3.0-bin-hadoop2.7 PYSPARK_PYTHON=python3 SCALA_VERSION=2.11.8 SPARK_VERSION=2.3.0 ./python/run-tests.sh
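
The same invocation can be scripted. The sketch below only assembles the environment from the example above; the helper name `build_test_env` is hypothetical, and the paths and versions are the ones shown in the command, which will likely differ on your machine.

```python
import os


def build_test_env(spark_home, spark_version, scala_version, python_bin="python3"):
    """Return a copy of the current environment plus the variables from the example."""
    env = dict(os.environ)
    env.update({
        "SPARK_HOME": spark_home,
        "PYSPARK_PYTHON": python_bin,
        "SCALA_VERSION": scala_version,
        "SPARK_VERSION": spark_version,
    })
    return env


env = build_test_env("/usr/local/lib/spark-2.3.0-bin-hadoop2.7", "2.3.0", "2.11.8")
command = ["./python/run-tests.sh"]

# From the project home directory, after running build/sbt assembly:
#   import subprocess
#   subprocess.run(command, env=env, check=True)
```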

Spark version compatibility

To work with the latest code, Spark 2.3.0 is required; Python 3.6 and Scala 2.11 are recommended. See the Travis config for the regularly tested combinations.

Compatibility requirements for each release are listed in the Releases section.

Support

You can ask questions and join the development discussion on the DL Pipelines Google group.

You can also post bug reports and feature requests in GitHub issues.

Releases

Visit the GitHub Releases page for the release notes.

Downloads and installation

Deep Learning Pipelines is published as a Spark Package. Visit the Spark Package page to download releases and find instructions for use with spark-shell, SBT, and Maven.

Quick user guide

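Deep Learning Pipelines provides a suite of tools for working with and processing images using deep learning. The tools fall into the categories covered in the sections below, including:

  • Working with images in Spark
  • Transfer learning
  • Distributed hyperparameter tuning
  • Applying deep learning models at scale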

To try running the examples below, check out the Databricks notebook in the Databricks docs for Deep Learning Pipelines, which works with the latest release of Deep Learning Pipelines. Here are some Databricks notebooks compatible with earlier releases: 0.1.0, 0.2.0, 0.3.0, 1.0.0, 1.1.0, 1.2.0.

Working with images in Spark

The first step in applying deep learning to images is loading them. Spark and Deep Learning Pipelines include utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.

Using Spark's ImageSchema

from pyspark.ml.image import ImageSchema
image_df = ImageSchema.readImages("/data/myimages")

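or, if a custom image library is needed:

from sparkdl.image import imageIO
# decode_f turns raw image bytes into an image array; imageIO.PIL_decode is the
# reference implementation, so pass your own function here to use another library
image_df = imageIO.readImagesWithCustomFn("/data/myimages", decode_f=imageIO.PIL_decode)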

The resulting DataFrame contains a column named "image" holding an image struct whose schema matches ImageSchema.

image_df.show()
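
To make the image struct more concrete, the sketch below mimics its fields in plain Python. The field names (origin, height, width, nChannels, mode, data) follow Spark's ImageSchema; the `ImageRow` class, the `pixel_at` helper and the file path are hypothetical, and the row-major, channel-interleaved byte layout is an assumption based on Spark's OpenCV-compatible convention (typically BGR order for 3-channel images).

```python
from typing import NamedTuple


class ImageRow(NamedTuple):
    """Plain-Python stand-in for one image struct."""
    origin: str     # where the image was read from
    height: int     # number of pixel rows
    width: int      # number of pixel columns
    nChannels: int  # e.g. 3 for a colour image
    mode: int       # OpenCV type code describing the pixel format
    data: bytes     # flat pixel buffer


def pixel_at(img, row, col):
    """Return the channel values of one pixel, assuming a row-major interleaved buffer."""
    offset = (row * img.width + col) * img.nChannels
    return tuple(img.data[offset:offset + img.nChannels])


# A tiny 2x2, 3-channel image; 16 is OpenCV's CV_8UC3 type code.
img = ImageRow("file:/data/myimages/example.jpg", 2, 2, 3, 16, bytes(range(12)))
print(pixel_at(img, 1, 0))  # channel values of the pixel in row 1, column 0
```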

Why images? Deep learning has proven to be powerful for tasks involving images, so we have added native Spark support for images. The goal is to support more data types, such as text and time series, based on community interest.

Transfer learning

Deep Learning Pipelines provides utilities to perform transfer learning on images, which is one of the fastest ways (in both code and run time) to start using deep learning. With Deep Learning Pipelines, it takes just a few lines of code.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])

model = p.fit(train_images_df)    # train_images_df is a dataset of images and labels

# Inspect training error
# "prediction" must be kept in the selection, since the evaluator reads it below
df = model.transform(train_images_df.limit(10)).select("image", "probability", "prediction", "uri", "label")
predictionAndLabels = df.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Training set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
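
For intuition, the accuracy metric reported above is the fraction of rows whose prediction equals the label. The plain-Python sketch below reproduces that calculation on made-up (prediction, label) pairs; the `accuracy` helper and the sample rows are illustrative only and are not how MLlib computes the metric internally.

```python
def accuracy(prediction_and_labels):
    """Fraction of (prediction, label) pairs where the two values match."""
    pairs = list(prediction_and_labels)
    if not pairs:
        raise ValueError("no rows to evaluate")
    correct = sum(1 for prediction, label in pairs if prediction == label)
    return correct / len(pairs)


# Made-up rows: three of the four predictions match their labels.
rows = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
print("Training set accuracy = " + str(accuracy(rows)))
```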

DeepImageFeaturizer supports the following models from Keras:

  • InceptionV3
  • Xception
  • ResNet50
  • VGG16
  • VGG19

Distributed hyperparameter tuning

Getting the best results in deep learning requires experimenting with different values for training parameters, an important step called hyperparameter tuning. Since Deep Learning Pipelines enables exposing deep learning training as a step in Spark’s machine learning pipelines, users can rely on the hyperparameter tuning infrastructure already built into Spark MLlib.

For Keras users

To tune hyperparameters for a Keras model, use KerasImageFileEstimator to build an Estimator, then apply MLlib’s tuning tools (e.g. CrossValidator). KerasImageFileEstimator works with image URI columns (not ImageSchema columns) to allow for the custom image loading and processing functions often used with Keras.

To build the estimator with KerasImageFileEstimator, we need a Keras model stored as a file. The model can be a Keras built-in model or a user-trained model.

from keras.applications import InceptionV3

model = InceptionV3(weights="imagenet")
model.save('/tmp/model-full.h5')

We also need an image loading function that reads the image data from a URI, preprocesses it, and returns a numerical tensor in the Keras model's input format. Then we can create a KerasImageFileEstimator that takes our saved model file.

import PIL.Image
import numpy as np
from keras.applications.imagenet_utils import preprocess_input
from sparkdl.estimators.keras_image_file_estimator import KerasImageFileEstimator

def load_image_from_uri(local_uri):
  img = (PIL.Image.open(local_uri).convert('RGB').resize((299, 299), PIL.Image.ANTIALIAS))
  img_arr = np.array(img).astype(np.float32)
  img_tnsr = preprocess_input(img_arr[np.newaxis, :])
  return img_tnsr

estimator = KerasImageFileEstimator(
    inputCol="uri",
    outputCol="prediction",
    labelCol="one_hot_label",
    imageLoader=load_image_from_uri,
    kerasOptimizer='adam',
    kerasLoss='categorical_crossentropy',
    modelFile='/tmp/model-full.h5',  # local path of the model file saved earlier
)
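
The estimator above reads its labels from a column named "one_hot_label" and trains with categorical cross-entropy, which expects each label as a one-hot vector. A minimal sketch of that encoding follows, assuming the raw labels are integer class indices; the `one_hot` helper is hypothetical and not part of sparkdl.

```python
def one_hot(class_index, num_classes):
    """Encode an integer class index as a one-hot list of floats."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index must be in [0, num_classes)")
    vector = [0.0] * num_classes
    vector[class_index] = 1.0
    return vector


# Class 2 out of 4 hypothetical classes.
print(one_hot(2, 4))
```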

We can then use it for hyperparameter tuning by running a grid search with CrossValidator.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = (
  ParamGridBuilder()
  .addGrid(estimator.kerasFitParams, [{"batch_size": 32, "verbose": 0},
                                      {"batch_size": 64, "verbose": 0}])
  .build()
)
bc = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=bc, numFolds=2)

cvModel = cv.fit(train_df)
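
To get a feel for the cost of a grid search, the sketch below expands a grid into parameter maps and counts model fits in plain Python. The helpers `build_param_grid` and `count_fits` are hypothetical, the 'sgd' optimizer value is added purely for illustration, and the count assumes one fit per parameter map per fold plus a final refit of the best map on the full dataset.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} parameter maps."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def count_fits(param_maps, num_folds):
    """One fit per (parameter map, fold) pair, plus one final refit."""
    return len(param_maps) * num_folds + 1


param_maps = build_param_grid({
    "kerasFitParams": [{"batch_size": 32, "verbose": 0},
                       {"batch_size": 64, "verbose": 0}],
    "kerasOptimizer": ["adam", "sgd"],  # 'sgd' is an illustrative extra value
})
print(len(param_maps), count_fits(param_maps, num_folds=2))
```

The number of fits grows multiplicatively with every parameter added to the grid, so keeping grids small matters when each fit trains a deep network.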

Applying deep learning models at scale

Spark DataFrames are a natural construct for applying deep learning models to a large-scale dataset. Deep Learning Pipelines provides a set of Spark MLlib Transformers for applying TensorFlow Graphs and TensorFlow-backed Keras Models at scale. The Transformers, backed by the Tensorframes library, efficiently handle the distribution of models and data to Spark workers.

Applying deep learning models at scale to images

Deep Learning Pipelines provides several ways to apply models to images at scale:

  • Popular image models can be applied out of the box, without requiring any TensorFlow or Keras code
  • TensorFlow graphs that work on images
  • Keras models that work on images

Applying popular image models

There are many well-known deep learning models for images. If the task at hand is very similar to what the models provide (e.g. object recognition with ImageNet classes), or for pure exploration, one can use the Transformer DeepImagePredictor by simply specifying the model name.

from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

image_df = ImageSchema.readImages(sample_img_dir)

predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)
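
The decodePredictions and topK options suggest that each image ends up with its K most probable classes. The plain-Python sketch below shows that selection step on a made-up probability vector; the `top_k` helper, the class names and the (name, probability) output shape are assumptions for illustration and may differ from what DeepImagePredictor actually emits.

```python
import heapq


def top_k(probabilities, class_names, k=10):
    """Return the k most probable (class_name, probability) pairs, best first."""
    if len(probabilities) != len(class_names):
        raise ValueError("probabilities and class_names must have the same length")
    return heapq.nlargest(k, zip(class_names, probabilities), key=lambda pair: pair[1])


# Made-up scores for four hypothetical classes.
probs = [0.05, 0.70, 0.20, 0.05]
names = ["tabby", "daisy", "tulip", "beagle"]
print(top_k(probs, names, k=2))
```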

DeepImagePredictor supports the same set of models from Ke

Core symbols most depended-on inside this repo

run · called by 16 · python/sparkdl/graph/builder.py
importGraphFunction · called by 10 · python/sparkdl/graph/builder.py
_reverseChannels · called by 9 · python/sparkdl/image/imageIO.py
asGraphFunction · called by 9 · python/sparkdl/graph/builder.py
imageArrayToStruct · called by 8 · python/sparkdl/image/imageIO.py
getInputCol · called by 8 · python/sparkdl/transformers/named_image.py
getOutputCol · called by 8 · python/sparkdl/transformers/named_image.py
_getSampleJPEGDir · called by 7 · python/tests/transformers/image_utils.py

Shape

Method 280
Function 132
Class 66
Route 6

Languages

Python 98%
TypeScript 2%

Modules by API surface

python/sparkdl/param/shared_params.py · 38 symbols
python/sparkdl/transformers/keras_applications.py · 34 symbols
python/sparkdl/transformers/named_image.py · 32 symbols
python/tests/graph/test_import.py · 24 symbols
python/sparkdl/transformers/tf_image.py · 19 symbols
python/tests/graph/test_utils.py · 18 symbols
python/tests/transformers/named_image_test.py · 17 symbols
python/sparkdl/image/imageIO.py · 17 symbols
python/sparkdl/estimators/keras_image_file_estimator.py · 15 symbols
dev/run.py · 15 symbols
python/tests/tests.py · 14 symbols
python/tests/image/test_imageIO.py · 14 symbols

Dependencies from manifests, versioned

Deprecated 1.2.7 · 1×
PyNaCl 1.2.1 · 1×
cloudpickle 0.5.2 · 1×
coverage 4.4.1 · 1×
h5py 2.7.0 · 1×
horovod 0.16.0 · 1×
keras 2.2.4 · 1×
nose 1.3.7 · 1×
pandas 0.19.1 · 1×
parameterized 0.6.1 · 1×
paramiko 2.4.0 · 1×
pillow 4.1.1 · 1×

For agents

$ claude mcp add spark-deep-learning \
  -- python -m otcore.mcp_server <graph>