In a post the other day, I described how to test if machine learning with R and/or Python was set up correctly within SQL Server 2017.
One of the comments on that post, said that the info was useful but they were still to be convinced why you'd want to have machine learning in the database in the first place.
I see several reasons for this. SQL Server Machine Learning Services is the result of embedding a predictive analytics and data science engine within SQL Server. Consider what happens in most data science groups today, where this type of approach isn't used.
I routinely see data scientists working with large amounts of data in generic data stores. This might mean that they have data in stores like Hadoop/HDInsight or Azure Data Lake Store but in many cases, I just see operating system files, often even just CSV files. Both the R and Python languages make it really easy to create data frames from these types of files. But where did this data come from? In some cases, it will have come from the generic data store, but in most cases that I see, it has come from within a database somewhere.
And that raises a number of questions:
- What effort is required to extract that data (particularly for large volumes)?
- How up to date is the data?
- What is the security context for that data?
Often the answers to these questions aren't great. What I see is data science people extracting data from existing databases into CSV files, and then loading them up and processing them in tools like RStudio. And mostly, I see that data being processed single-threaded in those tools.
The outcome of this work though, is either analytics or (more commonly), trained predictive models.
Having Machine Learning in SQL Server helps here in several ways. First, you can utilize the same security model that you're using for any other access to that same data. Second, as the data volumes grow, you aren't needing to move (and then refresh) the data. You can process it right where it is. Third, you can take advantage of the multi-threaded architecture of SQL Server.
With Operational Analytics in SQL Server 2016 and later (basically non-clustered columnstore indexes with delayed aggregation, built over transactional data), you might even be able to have the outcomes really up to date.
While being able to train and retrain predictive models is really important, and is hard work, it's when you use those models to create predictions that the real value becomes apparent. Trained models are quite lightweight execution-wise. You can add predictions right into your queries along with your other returned data, and very efficiently. This is where having Machine Learning within the database engine truly shines.
And you don't necessarily even need to create the predictive models. The SQL Server team have provided a series of world-class pretrained models that you can load directly into and bind to, an instance of SQL Server.