DevOps: What is a managed service?

Some time back, I decided to use a WordPress host for my blog because my time at sqlblog.com was coming to an end, the Community Server platform it ran on was on its last legs, and WordPress seemed the obvious platform for a blog. Fellow MVP Adam Machanic made it really easy for me to migrate to a WordPress site with a tool that he had created.

I put it, along with another site, onto BlueHost and had reasonable hopes for both. However, a few weeks back I found that my blog had simply stopped working. I couldn't even log into the control panel to work out what was wrong.

The support chat that they offer is beyond slow. Every time I've used it, it has taken ages for anyone to connect, and then while you're talking to them, it's like talking to someone on Valium: there are really long delays between responses. I can only assume that they're trying to talk to lots of people at once.

In the end, they told me that I'd run out of disk space. For the life of me, I couldn't work out how that was possible, given the amount of space I'd purchased. Eventually it became obvious that what was occupying all the space was a set of backups that I hadn't configured, that were being written into a folder I couldn't see unless I could log on, and that had no reasonable cleanup policy to remove old backups. You could argue that it was designed to fail.

The amazing part came when I asked them to just delete the old backups (because I was unable to): they told me they'd arrange it, but that it could take up to another day or so. No-one who had permission to delete them was at work.

That's when I decided that I needed to move away from this service, even though I'd prepaid it for quite a while into the future. We'd already moved the other website anyway, so I thought I'd look for a managed hosting service for WordPress.

Many people mentioned they'd had good outcomes with InMotion Hosting. When I checked it out, I saw they had "WordPress-Optimized Managed Web Hosting". I looked at the list of features, was very, very impressed with what they offered, and eventually signed up.

One of the first things the guy activating my account explained, though, was that I needed to get involved in the backups. It turns out they do a backup every 24 to 36 hours, written over the top of the previous backup. They keep only a single backup for me.

I pointed out that a backup that's done over the top of a previous backup really isn't a backup at all. 

The guy told me that to use their service, I needed to make sure I kept local copies regularly, in case something went wrong. I asked him: if I need to do my own backups, what exactly about their service is managed? He couldn't answer me and said he'd need to ask someone else in the company.

Surely keeping reliable backups of the site (without me having to do it) is the #1 thing that a managed service should provide.

Sadly, I've had to cancel the service before I even got started with it. It's a real pity because there's much to like about their offerings, and they are responsive.

I've never been a fan of GoDaddy (had some beyond-horrid experiences dealing with them in the past), but as a comparison, it's informative to look at what they advertise for backups in their managed service. They say they keep daily backups for 30 days, and have a one-click restore process to go back to any of them.

Microsoft does similar things with Azure SQL Database. I don't worry about the backups; they keep many, and at any point I can just roll a database back to a previous point within the last few weeks.

That's more what I'd expect in a "managed" service.

While it's easy to call a service "managed", here's a hint: if I need to be involved with the core admin functions like backup, it's not a managed service.

DevOps: Human testing is still critical – and make sure they can spell

I've spent a lot of time lately working with automated testing systems.

Unit tests are great for checking blocks of code, but integration tests and functional tests are critical to projects. One area that's still tricky, though, is UI testing, even though there are some really good tools for it.

Selenium has been very popular and works well if you access it programmatically. There are good framework libraries for .NET, and we've been using those. Selenium also had a pretty good IDE that could be used for recording, but it's pretty much adrift now: it was based on an older version of Firefox and doesn't work on the latest versions. You could install it on an older version of Firefox for testing, but those versions aren't safe to use, so the recommendation is to avoid it. Working with Selenium programmatically is not that hard though.
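
For example, driving a browser from .NET code isn't much work at all. Here's a minimal C# sketch (the URL and element IDs are hypothetical):

    // Uses the Selenium.WebDriver NuGet package.
    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class LoginSmokeTest
    {
        static void Main()
        {
            using var driver = new ChromeDriver();
            driver.Navigate().GoToUrl("https://example.com/login");
            driver.FindElement(By.Id("username")).SendKeys("testuser");
            driver.FindElement(By.Id("password")).SendKeys("dummy-password");
            driver.FindElement(By.Id("login-button")).Click();

            // A real test would use a WebDriverWait here rather than checking immediately.
            if (!driver.Title.Contains("Dashboard"))
            {
                throw new Exception("Login did not land on the dashboard page.");
            }
        }
    }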

This article shows a number of alternatives: https://www.guru99.com/selenium-alternatives.html

I've tried quite a few now but have had pretty mixed results, so this area is still evolving. I had fairly good results with Katalon Studio but ended up quite frustrated trying to use its recorder. So I ended up back at coding Selenium directly.

One thing that's been very clear to me, though, is that no matter how good all this automated UI testing is, you still need humans doing testing as well. You can see examples above of basic UI issues.

Here's an example of a UI logic error:

It's also very clear that the humans who do this need to be able to spell.

No matter how good-looking your application is, if it's full of spelling errors, typos, or graphical errors, it won't look professional.

DevOps: Keep moving forward not backwards when deploying to production

I was asked a curious question the other day: “Are all changes made to production systems actually deployments?”

I wish the answer was yes, but unfortunately patches do get applied directly to production systems without going through standard deployment processes.

Everyone understands the scenario. Code has been deployed using the normal processes, then someone finds something that simply must be fixed urgently in the production systems. At that point, you have two choices:

  • The traditional choice is to urgently patch the production server, then reapply those changes back into the development process.
  • The ideal DevOps approach is that an urgent patch is made in the earlier environments and then deployed through the normal processes.

The reason that people opt for the traditional choice above is that going back through the processes simply takes too long in most organizations. The problem with this option is that there’s no guarantee that the change that was made in production will be reapplied the same way when it goes back into the normal processes. Worse, it might not be applied at all. Even worse, if there are multiple production servers, it might be applied differently to each of those as well.

Image by Alice Achterhof

Then you have a real mess to deal with. Deployments start to break, and because they are then painful, they become less frequent and more risky, and then they break even more. It’s a horrible cycle that you can easily get into.

The aim when working with DevOps is to streamline the deployment process to the point where you won't hesitate to send even an urgent fix through the normal deployment processes: you make the patch in the earlier environments and deploy it, urgently.

DevOps: SQL Server and Unit Test Challenges

I had a previous life as a developer and ran a software-development house. Even then, I was very focussed on data. I don't think I've ever had a BD (before data) period; I see almost everything I've ever worked on in data-related terms, so perhaps my BD period was just the time before I focussed on data.

But what this does mean is that whenever I get together with other data-related people, I'm one of the people asking why things that are taken for granted in the developer community aren't present in SQL Server and its T-SQL language.

Testing is one of these challenges. I've had wise old colleagues in the past who see this simply:

It’s pointless to write code that you can’t test

And yet, T-SQL development is still one of the areas where that’s a challenge. Over the years, I’ve had many discussions with SQL Server product team members about this, and still haven’t been persuasive enough to convince them to do something about it.

I should note at this point though, that if you're someone who sees the database as just storage for your objects, believes that no code should exist in the database at all, and thinks the performance you're currently seeing is good enough, then read no further.

But I currently live in a world where performance matters and many operations need to be done as close to the data as possible, with as little data movement as possible.

If you are developing in T-SQL today though, writing unit tests and test harnesses is just way more challenging than it should be. In many cases, you simply can't do it in any practical way, at least not within T-SQL. There have been some great attempts, like TSQLUnit (https://sourceforge.net/projects/tsqlunit/) and tSQLt (http://tsqlt.org/), but these aren't really where I want to be. That's not for lack of trying on their part; it's for lack of support within the product itself.

A simple example might help.

I can write code in T-SQL to trap and handle errors, including system errors (i.e., those with error numbers below 50,000). But if I want to test that error handling, I'm fresh out of luck. The "normal" way to do that in high-level languages is to write code that just throws the error and checks what happens. But the product group have decided that we're not allowed to raise system errors.
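
For example, the trapping side is straightforward (a sketch, using a hypothetical dbo.Customers table):

    -- You can trap a system error, such as a primary key violation (error 2627):
    BEGIN TRY
        INSERT INTO dbo.Customers (CustomerID) VALUES (1);   -- assumes this key already exists
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 2627
            PRINT 'Duplicate key violation handled';
    END CATCH;

    -- But you can't raise that system error yourself to test the CATCH block:
    THROW 2627, N'Simulated duplicate key violation', 1;
    -- fails: THROW only accepts error numbers between 50000 and 2147483647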

When I’ve discussed this with them, I get responses like “ah but then we’d get errors reported and we’d have no idea where they came from”. I get that they’re thinking from a conservative product support point of view, but that sort of thing can be handled. There would be a way to handle this if there was a desire to do so. And what I’ve never understood is the lack of interest in doing so. Product group members are, by and large, developers whose lives must take testing as part of their own routine.

Error handling is just one example issue though.

So we’re left with two options really:

  • Spend our time asking the product group for changes to support better quality development in T-SQL
  • Wrap all our unit tests in other languages, accept that we just can’t test some things, and deal with language limitations.

As we move into a DevOps-centric world, this is going to be more of an issue. I’d love to see us pushing for the first option and having the expectation that people developing with T-SQL should have access to similar developer-focused tooling to those building SQL Server itself.

DevOps: Avoiding SQL Server Clone databases growing way too large

I’ve recently been making use of SQL Clone from Redgate, at various client sites. I have to say that I really like it.

The basic concept is that you take an existing SQL Server database, you create an “image” from it, then you create database clones from that image.

Under the covers, it's basically a differencing technology. The initial image is like an initial set of database files, and each clone is then a set of files differenced from that. That makes it very easy to spin up clones, and to break them down again. While this isn't really suitable for much in the way of performance or load testing, as everyone is sharing the same base, it's perfect for general development.

It’s also a really good fit for use with source control systems like Git that developers endlessly create branches in. You can just create a clone for use with each branch, and avoid developers trampling on top of each other all the time. Each developer gets to have separate databases (or even multiple databases) and gets to test on full data volumes, without all the storage space that would otherwise be required.

What I have run into lately though, is a site where all of the clone databases were constantly getting bigger and bigger. Instead of being the 40 or 50 MB they started out as, each was now taking hundreds of gigabytes.

That was pretty easy to track down, and it was caused by routine database maintenance tasks that the customer was scheduling. In particular, they were regularly rebuilding the indexes in the database. In a clone database that's based on differencing technology, that's not going to make sense at all. Rebuilding 200 GB of tables or indexes could easily end up with a 200 GB clone file. And if you do that to all your clones, well, you can see the problem.

What's needed is to avoid performing index maintenance on the clone databases. Given that, to SQL Server, they just look like any other database, I'd suggest having a naming scheme for the clones, to make it obvious which databases they are, so you can exclude them from these operations.
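
For example, if your maintenance jobs build their list of target databases from sys.databases, a filter like this keeps the clones out (a sketch, assuming a hypothetical Clone_ naming prefix):

    SELECT name
    FROM sys.databases
    WHERE database_id > 4                   -- skip the system databases
    AND name NOT LIKE N'Clone[_]%';         -- skip the SQL Clone databases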

DevOps: Should migration-based deployments use idempotent scripts?

In my last DevOps-related post, I discussed some of the issues that arise when using migration-based deployments.

Migration-based deployments are based on a series of T-SQL scripts that apply changes to your database from its current state to a desired final state.

One of the questions that arises, though, is whether or not these T-SQL scripts should be idempotent, i.e.:

Should the script be written so that you get the same outcome if you run the scripts multiple times?

Creating T-SQL scripts that you can reliably run more than once isn't easy. Idempotence isn't strictly required for migration-based deployments when all goes well, but it might make them more reliable, and more useful when errors occur during deployment.

In essence, creating scripts like this means that instead of writing code like this (say, for a hypothetical Accounts table):
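
    CREATE TABLE dbo.Accounts
    (
        AccountID int NOT NULL PRIMARY KEY,
        AccountName nvarchar(100) NOT NULL
    );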

you write code like this:
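
    IF OBJECT_ID(N'dbo.Accounts', N'U') IS NULL
    BEGIN
        CREATE TABLE dbo.Accounts
        (
            AccountID int NOT NULL PRIMARY KEY,
            AccountName nvarchar(100) NOT NULL
        );
    END;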

But what would happen if the table is there but different?

Sometimes it's also messier than you'd hope. For example, CREATE SCHEMA has to be the first statement in a batch. You can't write:
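
    IF SCHEMA_ID(N'Reporting') IS NULL          -- hypothetical schema name
        CREATE SCHEMA Reporting;                -- fails: CREATE SCHEMA must start its own batch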

Instead, you need to write something like:
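
    IF SCHEMA_ID(N'Reporting') IS NULL
        EXEC (N'CREATE SCHEMA Reporting;');     -- EXEC runs the statement alone in its own batch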

And apart from losing help from IntelliSense, etc. for code that's now inside a string, this sort of thing just gets harder and harder.

My experience is that, apart from fairly trivial situations, creating seriously idempotent scripts is much, much harder than it looks, and it's far too easy to create scripts that can still end up with different end states.

I think you are far better off validating the schema before running the migration, and if a failure occurs, recovering and re-running whenever possible.


DevOps: Dealing with issues in migration-based deployment of databases

In an earlier post, I described the difference between state-based deployments and migration-based deployments. What I want to talk about today are two of the main issues that can arise in a migration-based deployment and what to do about one of them.

When you are using a migration-based deployment technique, you are storing all the scripts that take a database from its current state to the desired final state. While tools like ReadyRoll make this a relatively easy way to perform migrations, it suffers from a few key problems:

Performance

When you are applying a series of steps to make all the required changes to a database, you may end up taking much longer than a direct move from the current state to the final state would take. For example, you might add a new index, then later decide to add another column to it. Clearly it would be faster to just create the final index, rather than to create the index and then modify it.
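
In T-SQL terms, that pair of migrations might look like this sketch (the table and index names are hypothetical):

    -- An early migration creates the index:
    CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);

    -- A later migration rebuilds the whole index just to add a column:
    CREATE INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID, OrderDate)
    WITH (DROP_EXISTING = ON);

    -- A state-based deployment would have created the final two-column index once.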

I often see deployments where indexes are applied, removed, and reapplied, or tables added and then removed again, etc. This isn't a big problem for small databases, but it can be a major issue for larger ones.

While this is a real issue, it's one that doesn't have a simple answer when you are using migration-based deployments. It's not one that I'm going to discuss further today. Let's talk about another issue that you can do something about.

Drift

If you have multiple production servers, plus servers in other environments (like UAT, development, and test), then in a perfect world the schemas of the databases in each of these environments would be identical. Often though, that's not what happens.

If an urgent production fix is required, someone might make a change directly on a production server. Worse, they might apply the same change differently on each of the production servers. Then they might forget to apply the same change (or apply it differently) in other environments. No matter what the cause, the schemas have drifted.

Now if you create migration scripts in the development environment, and even test them in a test environment, those same scripts might break when applied to the production servers because the scripts are dependent on the schemas being the same.

To work around this issue, your deployment process needs to start with schema comparisons, using either Visual Studio's schema compare or a tool like Redgate's SQL Compare.

You might not be able to get tools like this into the production environment, but generally, you should be able to script the database(s) in the production environment (and other environments), load the scripts in a test environment, and run the comparisons there.

If they aren't in sync, you need to fix that first.

Once you know that your schemas are in sync, you can proceed with a migration-based deployment with more confidence.


DevOps: Declarative is where we want to be

If you have ever tried to write scripts to configure operating systems, you'd realize just how hard that is to get correct.

For example, if you need to ensure that your virtual machine has two network adapters, and they need to have a specific set of IP addresses, how do you do that?

Image by Markus Spiske

The traditional approach for PowerShell (and similar tools) was to try to write a step-by-step script to configure the network adapters the way you want. But where do you start? Do you write a script to check for any existing adapters and loop through them to try to remove them? Do you try to add the ones you want, and then remove the others?

You'll quickly realize that you get into very, very complex coding because you might not be able to be sure what your starting point is. Everything has to stay on the path that you prescribed.

And worse, what happens if you run this script more than once?

That's the problem with imperative code. You are telling the operating system the steps required for configuration.

We don't want to be doing this.

With a declarative approach, we tell the system how we want it to end up, i.e., the desired state, and let it worry about how to get configured that way. This is what we want to be doing instead of writing imperative code.

PowerShell offers DSC (Desired State Configuration), where you describe, in a configuration document (PowerShell code that's compiled to a MOF file), the way you want the system to end up. A bonus of this approach is that it's idempotent, i.e., no matter how many times you run it, you end up with the same outcome.
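
As a minimal sketch (the node, feature, and output path are just illustrative), a DSC configuration and its application look like this:

    Configuration WebServer
    {
        Import-DscResource -ModuleName PSDesiredStateConfiguration

        Node 'localhost'
        {
            # Declare the state we want: IIS installed. How to get there is DSC's problem.
            WindowsFeature IIS
            {
                Ensure = 'Present'
                Name   = 'Web-Server'
            }
        }
    }

    WebServer -OutputPath 'C:\DscOutput'                          # compile to a MOF file
    Start-DscConfiguration -Path 'C:\DscOutput' -Wait -Verbose    # apply the desired state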

It's important, wherever possible, to be doing declarative configuration, not imperative configuration. In later posts, I'll talk more about how.

Opinion: There's a plague we need to stop

I've concluded that many software vendors (particularly large ones) don't understand how much support users of their software provide to each other, and how critical that support is.

The SQL and data communities are a good example of this. When someone has a problem and is wondering how to solve it, they don't call Microsoft or Google or Oracle (or whichever vendor) first. If they're lucky, they ask a colleague for help. But most will simply do a Google search (or yes, a Bing search) to try to find an answer.

No matter how obscure an error message might be, if someone else has struggled with it before, at least there's a chance that on an online forum, someone will have spelled out what caused it for them.

Even cryptic values like app IDs in Windows that look like this:

{ADA41B3C-C6FD-4A08-8CC1-D6EFDE67BE7D}

can be matched to an error or an application that's causing the error.

Most of this happens without the vendor even being involved.

So one of my pet hates (which Microsoft have heard loud and clear on internal mailing lists) is applications that break this pattern.

Every time I have an error that says something like "Oops. Something went wrong!" and nothing else, I want to scream. Even that's enough to get an answer sometimes: "Every time I click on XXX and drag the YYY, I get an error saying Oops. Something went wrong!" might lead to a posting that solves the issue, but it's so much tougher when there's no other info.

A plea to developers:

At the time the error occurs, even if you don't know exactly what happened, you must know something about what you expected and what happened. Tell us something. No matter how cryptic.

Another related trend is where there is an error message but it's a GUID:

C3479B1C-B1F3-4404-8F3C-AFA00AA0FEAF

And we think, "Great. We have something to work with", only to find that the GUID changes every time the error occurs and is only meaningful to the support team at the vendor organization.

Please don't do this either.

Give us something repeatable that we can use to help each other.

SQL: Code for errors and be pleased when they don't occur

I spend a lot of time in large organizations that have spent an absolute fortune on highly-available systems, yet when those systems fail over (just as they were designed to do), most of the applications in the building break.

Why?

Because the developers have assumed that nothing ever breaks and have written their code in a far too optimistic manner. Did they do their jobs?

No

Is it possible for their next layer of code to deal with, say, a server disappearing for a few seconds? Of course it is. But it's not going to happen by accident. It's even more important in a cloud-based world.

There was a question about deadlocks again recently on one of our local mailing lists. Can you deal with deadlocks?

Yes

Again though, none of this is automatic. But allowing for (and perhaps even expecting) failure is one of the differences between building enterprise-level code and building toy code.

Image by Ryan Fields

Plan for failure and be pleasantly surprised when it doesn't happen often. But don't plan for perfection or you'll be disappointed.

While it is possible to handle deadlocks within T-SQL code, I prefer to catch them in the next layer of code (let's call it client code here), as there are other types of errors that should be retried at that level anyway.

Applications should have retry logic to cope with things like:

  • Deadlock (error 1205)
  • Snapshot concurrency violations (error 3960)
  • Server disconnection (can be due to network issues, fail-over of HA-based systems, etc.)
  • Various resource issues on the server

It's important to get into the habit of assuming that a transaction you need to apply to the database might work, rather than assuming that it will work. Always apply it via logic like the following (there's a code sketch after this list):

  • While we haven't applied the transaction to the server, and while the retry time/count hasn't expired, let's try to make it happen.
  • If an error occurs, depending upon the error, we might back off for a while and try again.
  • For things like deadlocks, it's good to have some sort of exponential back-off with a random component.
  • Some errors are pointless to retry (i.e., a primary key violation probably isn't ever going to work).
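
For example, here's a minimal C# sketch of that pattern (the table, connection string, and retry limits are all hypothetical):

    // Retries deadlocks (1205) and snapshot concurrency violations (3960)
    // with exponential back-off plus a random component.
    using System;
    using System.Threading;
    using Microsoft.Data.SqlClient;

    static class OrderWriter
    {
        private static readonly Random Jitter = new Random();

        public static void SaveOrder(string connectionString, int orderId)
        {
            const int maxAttempts = 5;
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using var connection = new SqlConnection(connectionString);
                    connection.Open();
                    using var command = new SqlCommand(
                        "INSERT INTO dbo.Orders (OrderID) VALUES (@OrderID);", connection);
                    command.Parameters.AddWithValue("@OrderID", orderId);
                    command.ExecuteNonQuery();
                    return;   // the transaction applied; we're done
                }
                catch (SqlException ex) when (
                    (ex.Number == 1205 || ex.Number == 3960) && attempt < maxAttempts)
                {
                    // Back off for a while (exponentially, with jitter), then try again.
                    int delayMs = (int)(Math.Pow(2, attempt) * 100) + Jitter.Next(0, 100);
                    Thread.Sleep(delayMs);
                }
                // Anything else (e.g. a primary key violation) is pointless to retry,
                // so it simply bubbles up to the caller.
            }
        }
    }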

Once you build it that way, things become much more robust and resilient. The user should generally be unaware of these issues, apart from a slight processing delay.