Opinion: Designing Databases to Minimize Damage During Application Intrusions

Intrusions into computer systems are happening all the time now. We need to address this issue as an industry, but it’s important to understand that the way we design databases plays a big role in the impacts that occur during intrusions.

If you don’t accept that you could have an intrusion, you are living in La La Land. (See https://en.wikipedia.org/wiki/Fantasy_prone_personality)

A bug in any one of the frameworks that you use, the code that you write, the protocols that you use, the operating system or hosting services that you use can potentially expose you to an intrusion.

So do we just give up?

No, what you need to ensure is that when an intrusion occurs, the damage or impact is minimized. We do this in all other industries. For example, people working in high locations don’t expect to fall but they (generally) make sure that if they do, while something nasty might happen, it won’t be disastrous.

I routinely see web applications and middleware that can access any part of a database that it wants. The developers love this as it’s easy to do. But it exposes you to major risks. If the application is trampled on, you’ve opened up everything.

I always want to put mitigation in place and to limit the damage.

If your plan is to have your application connect to the database as one user, and you make that user a database owner (db0), or a combination of db_datareader and db_datawriter, or worse, a system administrator; then you don't have a plan.

A better plan is this:

  • Create a schema for the application – let’s call it WebApp
  • In the WebApp schema, create only the views and procedures that define what you want the application to be able to do (ie: it’s basically a contract between the database and the application)
  • Create a new user (from a SQL login or, better-still, a domain service account) for the application to connect through.
  • Grant that user EXECUTE and SELECT permission on the WebApp schema (and nothing else)

Then if the application is trampled on, the most that it can do is the list of things that you’ve defined in that schema and nothing else.

We need to start building systems more defensively, and this is reason #82938429 for why I just don’t like most ORMs as they tend to encourage entirely the wrong behavior in this area. (Some let you do it better begrudgingly).

SQL: Concatenating Column Values as Strings in T-SQL (Using CONCAT and CONCAT_WS)

There were a number of new T-SQL functions introduced in SQL Server 2012. As I've mentioned before, I get excited when there are new T-SQL functions.

Some, I'm not so excited about. EOMONTH was in that category, not because of the functionality, but because of the name (wish it was ENDOFMONTH), and lack of symmetry (lack of a STARTOFMONTH or BEGINNINGOFMONTH).

One that I thought was curious was CONCAT. I thought "why on earth do we need a function to concatenate strings. I can already do that. But when I got into using it, I realized how wonderful it was.

The problem with concatenating values is that you first need to convert them to strings. CONCAT does that automatically with all parameters that you pass to it. (It takes a list of values as parameters and you must have at least two parameters). You can see here how I can mix data types in the parameter list:

All the values are implicitly cast to a string. OK, so that's a little bit useful, but still no big deal?

The really powerful aspect is that it ignores NULL parameters in the list. (The documentation says that it implicitly casts them to an empty string but based on discussions I've had with the product group lately, my guess is that it simply ignores any parameter that's NULL).

Now that's something that's much messier with normal T-SQL. The problem is that when you concatenate anything that's NULL with the + operator in T-SQL, the answer is NULL, no matter what the other values are:

But this handles it nicely:

But notice that we're still not quite there. It's a pain to need to specify the separator each time (I've used N' ' as a single unicode space). More of a pain though, is notice that I still have two separators between Tim and Taylor in the example above.

CONCAT_WS in SQL Server 2017 comes to the rescue for that. It lets you specify the separator once, and ignores NULLs in the parameter list, and importantly, doesn't add the separator when the value is NULL. That's a pain if you want to use the function to create a string like a CSV as output (and I wish it had been a parameter to this function) but for this use case, it's perfect.

 

 

 

 

 

 

Shortcut: Using Database Snapshots to Provide Large Unit Testing Data with Quick Restores

SQL Server databases have a reputation for being hard to test, or at least hard to test appropriately.

For good testing, and particularly for unit tests, you really want the following:

  • Database in a known state before each test
  • Database containing large amounts of (preferably masked) data (production-sized)
  • Quick restore after each test before the next test

For most databases, this is hard to achieve. The restore after each test means that a normal database restore can't be used. What I often see instead, is people using transactions to try to achieve this i.e. the process becomes:

  • Start a transaction
  • Run the test
  • Check the results
  • Roll back the transaction

In some situations, that works well but the minute that you start trying to test transactional code, things fall apart quickly. SQL Server doesn't support truly nested transactions. When you execute a ROLLBACK, it doesn't matter how deep this occurs, the outer transaction is being rolled back too.

One option that I think many people should consider is database snapshots.

Your testing mechanism becomes:

  • Create a snapshot of the database to be tested
  • Test against the original database
  • Check the results
  • Revert the database to the snapshot

Both creating a snapshot database, and restoring from the snapshot are very quick operations. The creation is always quick and the revert time depends upon how many pages were changed during the test. That's often not many.

Creating a snapshot is described here: https://msdn.microsoft.com/en-us/library/ms175876(v=sql.105).aspx

Reverting a database from a snapshot is described here: https://msdn.microsoft.com/en-us/library/ms189281(v=sql.105).aspx

 

 

 

SDU Tools: Find String Within a SQL Server Database (Did you mask your database properly?)

I've mentioned that I work on a lot of financial systems. Masking and trimming the production databases so they can be used in development is important, yet I see many issues with masking.

A few months back, I was working on a SQL Server database from a large financial client, and the database was supposed to have been masked. However, what they missed was that the XML metadata associated with images stored in the database still had the actual client's details. Any developer in the organization could have retrieved private details of all the members of the fund.

This just isn't acceptable.

I decided to add a tool to our free SDU Tools collection that would help with this. It's useful way beyond just checking masking though. It lets you look everywhere in a database for a given string. It checks a wide variety of data types that might contain the value and it automatically checks every column in every table.

So even if you are just wondering where a value is stored in a database, this might help. Here's an example:

The procedure is called FindStringWithinADatabase. It takes 3 parameters. The first is the name of the database (in this case WideWorldImporters), the second is the string to search for (in this case Kayla), and the third is a flag to indicate whether or not the full contents of the actual rows should be displayed as well (in this case no).

For checking masking, you can enter a name that you know is real and find everywhere that it exists within the database. Or if you are just running an application, and see a value and wonder where it's stored, you can use this too.

Here's what happens if you set that last parameter to 1 (to indicate that you want to see the rows):

After the same initial output, all relevant rows from each table in the first list are displayed.

You can see the tool in action here:

You can find more information on our SDU Tools here:

https://sqldownunder.com/sdu-tools

We hope you find it useful.

SDU Tools: New T-SQL Statement: STRING_SPLIT plus Split Delimited String Tool

I've mentioned that I love it when SQL Server gets new T-SQL functionality.

A useful function that was added in SQL Server 2016 was STRING_SPLIT. You can see it in action in the main image for this post.

It's another function that is great as far as it goes, but when I was a kid, this would be described as "nice try but no cigar".

It works, and it's fast, so what's missing. I think two things:

  • You need to be able to know the position of each value retrieved from the string.
  • You often need to trim the values as well.

So, what do we do if we need that, or if we aren't yet on SQL Server 2016 anyway?

We have a tool for that. The SDU Tool SplitDelimitedString does these things. It's implemented in boring T-SQL but it just works. I've seen attempts to do this via XML queries, and they are fast, but they also screw up with certain characters in the strings (like <, >, %, etc.). So I went for slow and reliable. Here's what it does:

The 3rd parameter indicates whether values should be trimmed or not. In this case, I asked for trimming.

You can see it in action here:

You can find out more about our free SDU Tools here:

http://sdutools.sqldownunder.com

Opinion: Don't Design Databases for One Version of One App

I've pointed out in previous blog posts that I'm not a fan of ORMs. What I'm even less of a fan of is code-first design, particularly in combination with an ORM like the Entity Framework.

It might seem cool and shiny and if you are just whipping up a small proof of concept or test app, please feel free to do it, but the minute you think of doing it for enterprise systems, just don't. And if you have colleagues wanting to do this, please just say no.

For most organizations, the data that they own is one of the most (if not the most) valuable asset the company has. The data will generally outlive generations of applications and just be morphed from shape to shape over time.

The data will often be accessed by many different applications, often created with different technology stacks. You might think you'll be able to corral all access to the data via your app; and again you'll be wrong.

So designing the data storage to suit the needs of a single version of a single application at a single point in time, is what we baseball umpires know as "a big call".

Umpires know to make calls like this emphatically and confidently.

But this is not a call that you should be making. It's the quickest way to start building disconnected silos of information that don't represent the business or how the information in the business is inter-related.

 

SQL: Columns – how big is too big?

When designing databases, one question that comes up all the time is how large columns should be.

For numbers, the answer is always "big enough but not too big". This week I've been working at a site where the client numbers were stored in int columns. Given the clients are Australians and the Australian Bureau of Statistics Population Clock says there are just under 25 million of us, an integer seems a pretty safe bet, given it can store positive numbers up over two billion. It's hard to imagine that number being exceeded, but I've seen people deciding that it needs to be a bigint. I doubt that. Even if we count all our furry friends, we aren't going to get to that requirement.

I was recently at a site where they were changing all their bigint columns to uniqueidentifier columns (ie: GUID columns) because they were worried about running out of bigint values. In a word, that's ridiculous. While it's easy to say "64 bit integer", I can assure you that understanding the size of one is out of our abilities. In 1992, I saw an article that said if you cleared the register of a 64 bit computer and put it in a loop just incrementing it (adding one), on the fastest machine available that day, you'd hit the top value in 350 years. Now machines are much faster now than back then, but that's a crazy big number.

For dates, again you need to consider some time into the future. It's likely that smalldatetime just isn't going to cut it. Most retirement fund and insurance companies are already working with dates past the end of its range. What you do need to consider is the precision of the time if you're storing time values as well.

The real challenge comes with strings. I've seen developer groups that just say "make them all varchar(max)" (or nvarchar(max) if they are after multi-byte strings). Let's just say that's not a great idea.

But if they aren't all going to be max data types, what size should they be? One approach is to investigate the existing data. If you haven't used it, SQL Server Integration Services has a Data Profiling Task that's actually pretty nice at showing you what the data looks like. If you haven't tried it, it's worth a look. It can show you lots of characteristics of your data.

One thing that I see people miss all the time though, are standard data items. I was at a site yesterday where sometimes email addresses were 70 characters, sometimes 100 characters, other times 1000 characters, and all in the same database. This is a mess and means that when data is copied from one place in the database to another, there might be a truncation issue or failure.

Clearly you could make all the email addresses 1000 characters but is that sensible? Prior to SQL Server 2016, that made them too big to be in an index. I'm guessing you might want to index the email addresses.

So what is the correct size for an email address? The correct answer is to use standards when they exist.

To quote RFC3696:

In addition to restrictions on syntax, there is a length limit on
email addresses. That limit is a maximum of 64 characters (octets)
in the "local part" (before the "@") and a maximum of 255 characters
(octets) in the domain part (after the "@") for a total length of 320
characters.

So my #1 recommendation is that if there is a standard, use it.

SDU Podcast: Show 72 with guest Power BI General Manager Kamal Hathi

I had the great pleasure to record another SQL Down Under podcast last week with the Power BI general manager Kamal Hathi.

In the show, Kamal and I discuss the current state and potential futures for Power BI, its relationship to SQL Server Reporting Services, and its development and extensibility models.

You'll find the show here: http://www.sqldownunder.com/Podcasts

I hope you enjoy it.

Note: We had a few unexpected audio issues with the recording. Sorry about that. We'll do better next time 🙂 It's still pretty good and I'll still think you'll find it interesting.

Opinion: Mature consultants don't always want to tear down the house

I work with data. I understand that for most organizations, that the data they own is the most valuable asset the company owns.

One thing I've learned from working with data is that unlike application software, data generally outlives generations of applications, is often used by many different applications, and typically just morphs from shape to shape over time. It almost never gets totally refreshed.

This is a good thing.

I've been in the industry long enough to see many types of consultants. One type that I have the least time for, is the type that always wants to tear down or replace whatever is already there at an organization. It's far easier to just say "let's replace it all" than to try to work out what to do.

Many of these consultants don't really understand what's already in place, but because it looks different to what they are used to, it must be wrong, and it must be replaced.

A mature consultant might not like what they see but they take the time to consider what's already there.

The mature consultant is skilled enough to work out how to take the organization from a place that they don't want to be, to a place which is better than where they are now.

That's the real skill.

Opinion: ORMs: Are they a good idea when developing for SQL Server?

Many people know that I have an issue with ORMs (object relational mappers). It’s worth spending a blog post to describe why.

Unfortunately, I spend my life on the back end of trying to deal with the messes involved. The following are the key issues that I see:

Potentially horrid performance

image

I’ve been on the back end of this all the time. There are several reasons. One is that the frameworks generate horrid code to start with, the second is that they are typically quite resistant to improvement, the third is that they tend to encourage processing with far too much data movement.

I keep hearing that “the databases are getting so fast now that this loss of performance doesn’t matter”. I can tell you that I don’t live in a world where that’s true.

In the world that I live in, I regularly end up in software houses with major issues that they don’t know how to solve.

I was in a large financial in Sydney a while back. They were in the middle of removing the ORM that they’d chosen out of their app because try as they might, they couldn’t get anywhere near the required performance numbers.

Why had they called me in?

Because before they wrote off 8 months’ work for 240 developers, the management wanted another opinion. They were making the right decision but they wanted someone else to agree with them.

As another example, I was at a start-up software house last year. They had had a team of 10 developers building an application for the last four years. The business owner said that if it would support 1000 concurrent users, they would have a viable business. 5000 would make a good business. 500 they might survive. They had their first serious performance test two weeks before they had to show the investors. It fell over with 9 concurrent users. The management (and in this case the devs too) were panic-stricken.

For yet another recent example, I was in a  software house that had to deliver an app to a government department. They were already 4 weeks overdue and couldn’t get it out of UAT. They wanted a silver bullet. That’s not the point to then be discussing their architectural decisions, yet the architectural decisions that they’d made were the issue.

I’ve seen background processing jobs in a large UK-based financial service organisation. On a system with 48 processors, and 1.2 TB of memory, and 7 x 1 million UK pound 20TB flash drive arrays, one task. During that time, it issued 550 million SQL batches to be processed and almost nothing else would work well on the machine at the same time. The replacement job that we wrote in T-SQL issued 40,000 SQL batches and ran in just over 2 minutes. I think I can get that to 1/10 of that with further work. Guess which version is the right answer? What had caused this? Re-use of entity-level code through a custom ORM.

Minimal savings yet long term pain

image

Many of the ORMs give you an initial boost to “getting something done”. But at what cost? At best, on most projects that I see, it might save 10% of the original development time, on the first project. But as David Connor pointed out in his excellent TechEd talk with Joel Pobar (and as I’ve seen from so many other places), the initial development cost of a project is usually only around 10% of the overall development cost.

So what are we talking about? Perhaps 1% of the whole project? Putting yourself into a long-term restrictive straightjacket situation for the sake of a 1% saving is a big, big call. The problem is that it’s being decided by someone who isn’t looking at the lifetime cost, and often 90% of that lifetime cost comes out of someone else’s bucket.

Getting stuck in how it works

image

For years, code generated by tools like Linq to SQL was very poor. And this was a tool designed fully for SQL Server. Now imagine what the code that’s generated by a tool that doesn’t even know which backend database it’s using looks like.

That’s where Entity Framework (EF) started. Very poor choices are often made in the design of these tools. The whole reason that “optimize for ad hoc workloads” was added to SQL Server was to deal with the mess from the plan cache pollution caused by these tools.

A simple example is that on the SqlCommand object, they called AddWithValue() to add parameters to the parameters collection. When I used to teach ADO.NET, I’d explain why using that method call was a really bad idea. It provides the name of the parameter and the value, but no data type. So it tried to derive the data type from the data. SQL Server would quickly end up with a separate query plan for every combination of every length of string for every parameter. This is plan cache pollution.

And what could the developers do to fix it?

image

Nothing. Nada. Nil. NULL.

Because it was baked into how the framework worked. The framework eventually got changed a bit to have more generic sizes but still never addressed the actual issues.

Getting stuck with old code

image

SQL Server 2008 added a bunch of new data types. Spatial was one of my favourites. When did that get added to EF? Many, many, many years later. What could the developers do about it? Almost nothing except very hacky workarounds.

When you use a framework like this, you are totally at the mercy of what its developers feel is important, and that’s if they haven’t already lost interest in it yet.

When you code with a tool that was targeted cross-DB, you usually end up with a very poor choice of data types. You end up with lowest common denominator of everything, or you end up with something that doesn’t fit well with the framework.

Many of the authors of the frameworks quickly lose interest. Do you want to be stuck with it?

Boilerplate code

image

“Oh I don’t want to write all that boilerplate code” I hear people say.

Sure. I get that but are you incapable of code-generation? At least give yourself control over how the code works. And that’s mostly a one-time cost. You get to use it again, project after project. But at least if you don’t like it, you can fix it.

Summary

Look, I could go on for days about this stuff and the astonishing messes that I’ve seen.

You need to decide what you’re building. But at least consider where you are taking your advice from.

image

If you’re taking architectural advice from someone who hasn’t built real systems at scale, at least get further opinions.

If you are building toy apps ie: a replacement for CardFile, none of this matters but I live in a world where it really, really matters. And so do the vast majority of software houses that I spend time in.

Future

Also worth noting that if you want real performance from upcoming DB engines, you might want to rethink the idea of just using the database an an object repository. Even NOSQL options like CosmosDB (was DocumentDB) have a concept of a stored procedure within the engine. You write them in Javascript. But there’s a reason they are there.

In SQL Server 2016, this is even more stark. I have simple examples on SQL Server 2016 where a batch of code sent from an app executes in 26 seconds. If I move the table to in-memory, it runs in 25.5 seconds. If I move the code to a natively-compiled stored procedure, it runs in 60 milliseconds.