
github.com/sql-machine-learning/sqlflow @v0.4.2 sqlite

2,093 symbols 8,126 edges 375 files 688 documented · 33%
README

SQLFlow


What is SQLFlow

SQLFlow is a bridge that connects a SQL engine, e.g. MySQL, Hive or MaxCompute, with TensorFlow, XGBoost and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training, prediction and model explanation.

Motivation

Developing ML-based applications today requires a team of data engineers, data scientists, and business analysts, as well as a proliferation of advanced languages and programming tools such as Python, SQL, SAS, SASS, Julia, and R. This fragmentation of tooling and development environments adds engineering difficulty to model training and tuning. What if we marry SQL, the most widely used data management and processing language, with ML/system capabilities and let engineers with SQL skills develop advanced ML-based applications?

There is already some work in progress in the industry. We can write simple machine learning prediction (or scoring) algorithms in SQL using operators like DOT_PRODUCT. However, this requires copying and pasting model parameters from the training program into SQL statements. In the commercial world, some proprietary SQL engines provide extensions to support machine learning capabilities.

  • Microsoft SQL Server: Microsoft SQL Server has a machine learning service that runs machine learning programs in R or Python as external scripts.
  • Teradata SQL for DL: Teradata also provides a RESTful service, which is callable from the extended SQL SELECT syntax.
  • Google BigQuery: Google BigQuery enables machine learning in SQL by introducing the CREATE MODEL statement.

None of the existing solutions solves our pain points. Instead, we want a solution that is fully extensible.

  1. This solution should be compatible with many SQL engines, instead of a specific version or type.
  2. It should support sophisticated machine learning models, including TensorFlow for deep learning and XGBoost for trees.
  3. We also want the flexibility to configure and run cutting-edge ML algorithms, including specifying feature crosses, with no Python or R code embedded in the SQL statements, and with full integration with hyperparameter estimation.

Quick Overview

Here are examples of training a TensorFlow DNNClassifier model using the sample data iris.train, and of running prediction using the trained model. You can see how cool it is to write some elegant ML code using SQL:

sqlflow> SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

...
Training set accuracy: 0.96721
Done training
sqlflow> SELECT *
FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_dnn_model;

...
Done predicting. Predict table : iris.predict
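
The two statements above follow a fixed clause order: a standard SELECT, then TO TRAIN with WITH, COLUMN, LABEL, and INTO, or TO PREDICT with USING. As a minimal sketch, assuming you want to generate such statements from a script, the helpers below assemble them as plain strings. The function names `train_statement` and `predict_statement` are hypothetical and are not part of SQLFlow; only the clause keywords come from the examples above.

```python
def train_statement(select, estimator, attrs, columns, label, into):
    """Compose a TO TRAIN statement from its clauses (hypothetical helper)."""
    # Attributes are rendered as "key = value", e.g. "model.n_classes = 3".
    with_clause = ", ".join(f"{key} = {value}" for key, value in attrs.items())
    return "\n".join([
        select,
        f"TO TRAIN {estimator}",
        f"WITH {with_clause}",
        f"COLUMN {', '.join(columns)}",
        f"LABEL {label}",
        f"INTO {into};",
    ])


def predict_statement(select, result_column, model):
    """Compose a TO PREDICT statement (hypothetical helper)."""
    return "\n".join([select, f"TO PREDICT {result_column}", f"USING {model};"])


train_sql = train_statement(
    select="SELECT *\nFROM iris.train",
    estimator="DNNClassifier",
    attrs={"model.n_classes": 3, "model.hidden_units": [10, 20]},
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
    label="class",
    into="sqlflow_models.my_dnn_model",
)
predict_sql = predict_statement(
    select="SELECT *\nFROM iris.test",
    result_column="iris.predict.class",
    model="sqlflow_models.my_dnn_model",
)
print(train_sql)
print(predict_sql)
```

The sketch only builds text. It does not validate identifiers or escape values, so it is suitable for trusted inputs only.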

How to use SQLFlow

Contributing Guidelines

Roadmap

SQLFlow would love to support as many mainstream ML frameworks and data sources as possible, but this expansion would be hard to achieve on our own, so we would love to hear your opinions on which ML frameworks and data sources you currently use and build upon. Please refer to our roadmap for specific timelines, and let us know your current scenarios and interests around the SQLFlow project so that we can prioritize based on feedback from the community.

Feedback

Your feedback is our motivation to move on. Please let us know your questions, concerns, and issues by filing GitHub Issues.

License

Apache License 2.0

Published

  • An arXiv paper at https://arxiv.org/abs/2001.06846
  • Demo Videos
      • 01/19/2020: https://www.youtube.com/watch?v=qUjQn7ePbto
      • 10/04/2019: https://www.youtube.com/watch?v=zIkwOQ_davw
      • 04/01/2019: https://www.youtube.com/watch?v=zIkwOQ_davw

Extension points
Exported contracts: how you extend this code

SQLFlowStmt (Interface)
SQLFlowStmt has multiple implementations: TrainStmt, PredictStmt, ExplainStmt and standard SQL. [7 implementers]
go/ir/ir.go
Parser (Interface)
Parser abstracts the parser of a SQL engine, for example, Hive, MySQL, TiDB, MaxCompute. [2 implementers]
go/parser/external/parser.go
TableWriter (Interface)
TableWriter writes the Table in a special format. Example code for the ASCII formatter: table := NewTableWriter("ascii", 102 [2 …
go/step/tablewriter/ascii.go
Executor (Interface)
Executor calls the code generator to generate the submitter program and executes it. [1 implementer]
go/executor/executor.go
Codegen (Interface)
Codegen generates workflow YAML. [1 implementer]
go/workflow/workflow.go
ParseInterface (Interface)
(no doc) [1 implementer]
java/parse-interface/src/main/java/org/sqlflow/parser/parse/ParseInterface.java
FeatureColumn (Interface)
FeatureColumn corresponds to the COLUMN clause in TO TRAIN. [8 implementers]
go/ir/feature_column.go
Workflow (Interface)
Workflow submits a workflow task and traces step status. [1 implementer]
go/workflow/workflow.go

Core symbols
Most depended-on inside this repo

Errorf
called by 376
go/log/log.go
String
called by 120
go/parser/ast/expr.go
Run
called by 112
go/sqlflowserver/sqlflowserver.go
join
called by 99
python/runtime/xgboost/tracker.py
Close
called by 87
go/pipe/pipe.go
connectAndRunSQL
called by 85
go/cmd/sqlflowserver/testing.go
Parse
called by 62
go/parser/external/parser.go
get_field_desc
called by 58
python/runtime/feature/column.py

Shape

Function 1,160
Method 687
Class 120
Struct 94
TypeAlias 12
Route 11
Interface 9

Languages

Go 55%
Python 42%
Java 3%
TypeScript 1%

Modules by API surface

python/runtime/feature/column.py · 67 symbols
python/runtime/xgboost/feature_column.py · 59 symbols
go/ir/feature_column.go · 45 symbols
go/ir/ir.go · 38 symbols
python/runtime/xgboost/tracker.py · 36 symbols
python/couler/couler/argo.py · 36 symbols
go/ir/ir_generator.go · 34 symbols
go/executor/executor.go · 33 symbols
go/cmd/sqlflow/main_test.go · 33 symbols
go/executor/pai.go · 28 symbols
python/runtime/db_test.py · 24 symbols
go/ir/derivation.go · 22 symbols

Dependencies
From manifests, versioned

github.com/BurntSushi/graphics-go v0.0.0-2016012921570 · 1×
github.com/alecthomas/template v0.0.0-2019071801265 · 1×
github.com/aliyun/aliyun-oss-go-sdk v2.0.5+incompatible · 1×
github.com/apache/thrift v0.13.0 · 1×
github.com/argoproj/argo v2.4.3+incompatible · 1×
github.com/argoproj/pkg v0.0.0-2020062421511 · 1×
github.com/asaskevich/govalidator v0.0.0-2019042411103 · 1×
github.com/baiyubin/aliyun-sts-go-sdk v0.0.0-2018032606232 · 1×
github.com/beltran/gohive v1.0.0 · 1×
github.com/bmizerany/assert v0.0.0-2016061122193 · 1×

Datastores touched

(mysql) · Database · 1 repo
iris · Database · 1 repo

For agents

$ claude mcp add sqlflow \
  -- python -m otcore.mcp_server <graph>
