Snowflake for SQL Server Users – Part 12 – Parallelism when loading data from files into tables

When you are loading data into Snowflake, it's really important to achieve the maximum parallelism that you can. You want as many files loading in parallel as you have processor threads available. I mentioned in earlier posts that the number of servers that you have, and the size of each of those servers, determine the number of processor threads that are available to you from your virtual warehouses.

It would be pointless to have 32 processor threads waiting to load your data if you then give Snowflake only one large file to load.

Instead, you should consider having at least as many smaller files to load as you have available processor threads. This means that you should favour many small files, rather than a smaller number of larger files.
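To make that reasoning concrete, here is a rough sketch of how much of the available parallelism is used for a given number of files. It assumes a deliberately simplified model that is mine, not Snowflake's: all files are the same size and each thread loads one file at a time. The function name thread_utilisation is hypothetical.

```python
import math


def thread_utilisation(file_count: int, threads: int) -> float:
    """Fraction of thread slots doing useful work.

    Simplified model (an assumption): equal-sized files, one file per
    thread at a time, so the load proceeds in whole "rounds".
    """
    if file_count <= 0 or threads <= 0:
        raise ValueError("file_count and threads must be positive")
    rounds = math.ceil(file_count / threads)
    return file_count / (rounds * threads)


print(thread_utilisation(1, 32))   # 0.03125  (one large file, 31 threads idle)
print(thread_utilisation(32, 32))  # 1.0      (one file per thread)
print(thread_utilisation(33, 32))  # 0.515625 (one extra file forces a second round)
```

Under this model, a file count that is a multiple of the thread count keeps every thread busy, while a single file leaves almost all of them idle.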

Breaking up large files

As a result, it can be desirable to break up existing large files. When I was doing Snowflake training, a number of people asked about good options for breaking up large text files. We couldn't find a good option, so I wrote one.

You can find details of SDU_FileSplit here. As well as breaking up the files, it can transfer the header rows into the split files and more.

On Linux, you could just use the split command line utility.
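If neither tool is to hand, the basic idea of splitting a text file while repeating its header rows can be sketched in a few lines of Python. This is only an illustration, not how SDU_FileSplit works internally. The function name split_text_file and its parameters are hypothetical, and it assumes UTF-8 text with one record per line (so no quoted fields containing line breaks).

```python
from pathlib import Path


def split_text_file(source, out_dir, lines_per_file, header_rows=1):
    """Split a text file into parts of at most lines_per_file data lines.

    The first header_rows lines of the source are copied to the top of
    every part. Assumes one record per line (no embedded newlines).
    Returns the list of part paths that were created.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    # newline="" keeps the original line endings untouched.
    with source.open("r", encoding="utf-8", newline="") as src:
        header = [src.readline() for _ in range(header_rows)]
        out = None
        count = 0
        try:
            for line in src:
                if out is None or count >= lines_per_file:
                    if out is not None:
                        out.close()
                    name = f"{source.stem}_{len(created) + 1:04d}{source.suffix}"
                    part = out_dir / name
                    out = part.open("w", encoding="utf-8", newline="")
                    out.writelines(header)
                    created.append(part)
                    count = 0
                out.write(line)
                count += 1
        finally:
            if out is not None:
                out.close()
    return created
```

For example, split_text_file("sales.csv", "parts", lines_per_file=500000) would write parts/sales_0001.csv, parts/sales_0002.csv and so on, each starting with the header row.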

File Sizes

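In terms of how large the files should be, the Snowflake documentation recommended 10MB to 100MB of compressed data per file when this was written. That guidance has since been revised upward (to roughly 100MB to 250MB compressed), so check the current documentation. (Most people load compressed files, typically gzip. Snowflake does not read .zip archives directly.)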

That also means that if your files are much smaller than this, you should aggregate them before loading, to bring them up to the target file size.
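As a rough sketch of that aggregation step, the Python below groups small files into batches near a target size and then concatenates each batch into a single gzip file. The names plan_batches and combine_to_gzip are hypothetical. It assumes the small files have no header rows and that each ends with a line break. Note also that the grouping works on uncompressed sizes, so the target you pass in needs to allow for your expected compression ratio.

```python
import gzip
import shutil


def plan_batches(files, target_bytes):
    """Greedily group (name, size) pairs, in order, into batches.

    A batch is closed as soon as adding the next file would take it over
    target_bytes. A single file larger than the target gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


def combine_to_gzip(paths, dest):
    """Concatenate the given files, in order, into one gzip file."""
    with gzip.open(dest, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, plan_batches([("a.csv", 40), ("b.csv", 40), ("c.csv", 40)], 100) returns [["a.csv", "b.csv"], ["c.csv"]], and each inner list can then be passed to combine_to_gzip.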

 

For an index to all posts in this series, see the first post here.

 
