Befriending Dragons

Turn Scary Into Attainable

Leave a comment

Hadoop Likes Big Files

One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

Too small is smaller than HDFS block size (chunk size), or realistically small is something less than several times larger than chunk size. A very, very rough rule of thumb is files should be at least 1GB each and no more than maybe around 10,000-ish files per table. These numbers, especially the maximum total number of files per table, vary depending on many factors. However, it gives you a reference point. The 1GB is based on multiples of the chunk size while the 2nd is honestly a bit of a guess based on a typical small cluster.

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keep track of all the files and where the internal chunks of the individual files are. The more files it has to track the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affects how memory is used on each node.

smallpiebigpieLet’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track the logical file plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.

The other problem comes at the time that the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data it results in more work and slower execution time for each task within each job. Sometimes you will see hard errors when operating system limits are hit related to the number of open files. There is also more internal work involved in reading the larger number of files and combining the data.


There are several options for stitching files together.

  • Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
  • INSERT into new Hive tables (directories) which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
  • Use a combiner in Pig to load the many small files into bigger splits.
  • Use the HDFS FileSystem Concat API
  • Write custom stitching code and make it a JAR.
  • Enable the Hadoop Archive (HAR). This is not very efficient for this scenario but I am including it for completeness.

There are several writeups out there that address the details of each of these methods so I won’t repeat them.

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  image
@SQLCindy | @NealAnalytics | |




Understanding WASB and Hadoop Storage in Azure

Yesterday we learned Why WASB Makes Hadoop on Azure So Very Cool. Now let’s dive deeper into Windows Azure storage and WASB. I’ll answer some of the common questions I get when people first try to understand how WASB is the same as and different from HDFS.

What is HDFS?

The Hadoop Distributed File System (HDFS) is one of the core Hadoop components, it is how Hadoop manages data and storage. At a high level, when you load a file into Hadoop the “name node” uses HDFS to chunk the file into blocks and it spreads those blocks of data across the worker nodes within the cluster. Each chunk of data is stored on multiple nodes (assuming the replication factor is set to > 1) for higher availability. The name node knows where each chunk of data is stored and that information is used by the job manager to allocate tasks and resources appropriately across nodes.

What is WASB?

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

What is an Azure blob store, an Azure storage account, and an Azure container? For that matter, what is Azure again?

Azure is Microsoft’s cloud solution. A cloud is essentially a collection of host data centers that you don’t have to directly manage. You can request services from that cloud. For example, you can request virtual machines and storage, data services such as SQL Azure Database or HDInsight, or services such as Websites or Service Bus. In Azure you store blobs on containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:


How do I manage and configure block/chunk size and the replication factor with WASB?

You don’t. It’s not generally necessary. The data is stored in the Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account you also get 3 copies of the data in another data center within the same region. The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory related performance at run time that is still an option. You can pass in any Hadoop configuration parameter setting when you create the cluster or you can use the SET command for a given job.

Isn’t one of the selling points of Hadoop that the data sits with the compute? How does that work with WASB?

Just like with any Hadoop system the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from the storage accounts instead of from local disks. Given the way Azure data center backbones are built the performance is generally the same or better than if you used disks locally attached to the VMs.

How do I load data to Hadoop on Azure?

You use any of the many Azure data loading methods. There isn’t really anything special about loading data that will be used for Hadoop. As with data used by any other application there are some guidelines around directory structures, optimal numbers of files, and internal format but that is independent of data loading. Some common examples are AZCopy, CloudXplorer and other storage explorers, and SQL Server Integration Services (SSIS).

And yes, I will blog about those guidelines but not here. 🙂

Can I have multiple Hadoop clusters pointing to one storage account?


Can I have one Hadoop cluster pointing to multiple storage accounts?

Yes. Check!

See: Use Additional Storage Accounts with HDInsight Hive.

Can I have many Hadoop clusters pointing to multiple storage accounts?

Why, yes. Yes you can. Check!

Do I get to keep my data even if no Hadoop cluster currently exists?

What a fun day to say Yes. Check!

For a caveat see HDInsight: Hive Internal and External Tables Intro.

Is WASB available for any distribution of Hadoop other than HDInsight?

It is my pleasure to answer that with a resounding Yes. Check!

WASB is built into HDInsight (Microsoft’s Hadoop on Azure service) and is the default file system. WASB is also available in the Apache source code for Hadoop. Therefore when you install Hadoop, such as Hortonworks HDP or Cloudera EDH/CDH, on Azure VMs you can use WASB with some configuration changes to the cluster.

How do I manage files and directories?

Hive is the most common entry point for Hadoop jobs and with Hive you never point to a single file, you always point to a directory. If you are a stickler for details and want to point out that Azure doesn’t have directories, that’s technically true. However, Hadoop recognizes that a slash “/” is an indication of a directory. Therefore Hadoop treats the below Azure blob file as if it were AFile.txt in a directory structure of: SomeDirectory/ASubDirectory. But since you don’t access individual files in Hive you will reference either SomeDirectory or SomeDirectory/ASubDirectory.

Blob: wasb://

You can add, remove, and modify files in the Azure blob store without regard to whether a Hadoop cluster exists. Each time a job runs it reads the data that currently exists in the directory(s) it references. Hadoop itself can also write to files.

What about ORCFile, Parquet, and AVRO?

They are proprietary formats often used within Hadoop but rarely used outside of Hadoop. There are performance advantages to using those formats for “write once, read many” data inside Hadoop, but chances are high that you won’t then be able to access the data without going through one of your Hadoop clusters.

Should I have lots of small files?

NO! No!  

Why is too long to answer here. The short answer is to use files that are many multiples of the in-memory chunk size, in the GB or TB size range. Whenever possible use fewer, larger files instead of many small files. If necessary stitch the files together.

That’s your storage lesson for today – please put your additional Hadoop on Azure storage questions in the comments or send me a tweet! Thanks for stopping by!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  image
@SQLCindy | @NealAnalytics | |

!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?’http’:’https’;if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+”://”;fjs.parentNode.insertBefore(js,fjs);}}(document,”script”,”twitter-wjs”);


Why WASB Makes Hadoop on Azure So Very Cool

Rescue dogData. It’s all about the data. We want to make more data driven decisions. We want to keep more data so we can make better decisions. We want that data stored cheaply, easily accessible, and quickly ingested. Hadoop promises to help with all those things. However, when you deal with Hadoop on-premises you have a multi-step process to load the data. Azure and WASB to the rescue!

With a typical Hadoop installation you load your data to a staging location then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back to a non-HDFS system (a SAN, a file share, a website).

What’s different in the cloud? With Azure you have Azure Blob Storage Accounts. Data can be stored there as blobs in any format. That data can be accessed by various applications – including Hadoop without first doing a separate load into HDFS! This is made possible because Microsoft used the public extensions available with HDFS to create the Windows Azure Storage Blobs (WASB) interface between Hadoop and the Azure blob storage. This WASB code is available for any distributor of Hadoop in the Apache source code and it is the default storage system in HDInsight – Microsoft’s Hadoop on Azure PaaS offering. It is also available for Hortonworks HDP on Azure VMs or Cloudera EDH/CDH on Azure VMs with some manual configuration steps.

With WASB you load your data to Azure blobs at any time – whether Hadoop clusters currently exist or not. That way you aren’t paying for Hadoop compute time simply to load data. You spin up one or more clusters, point them at the data sets (yes, multiple clusters pointing to same data!), and run your Hadoop jobs. When you don’t need the system for a while you take down your Hadoop cluster(s) and the data is still there. At any point, whether one or more Hadoop clusters are accessing the data or not, other applications can still access and manipulate the data. For example, you could have data sitting on an Azure storage account that is being added to by a SQL Server Integration Services (SSIS) job. At the same time someone is using Power Query to load that data into PowerPivot while a website inserts new data to the same location. Meanwhile your R&D department can be running highly intensive jobs that require a large cluster up for many days or weeks at a time, and your sales team can have a separate, smaller cluster that’s up for a few hours a day – all pointing at the same data!

With this separation of storage and compute you have simplified your data accessibility, reduced data movement and copies, and reduced the time it takes to have your data available! That all adds up to lower costs and a faster, more data-driven time to insight.

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | |

Leave a comment

Windows storport enhancement to help troubleshoot IO issues

For Windows 2008 and Windows 2008 R2 you can download a Windows storport enhancement (packaged as a hotfix). This enhancement can lead to faster root cause analysis for slow IO issues. Once you apply this Windows hotfix you can use Event Tracing for Windows (ETW) via perfmon or xperf to capture more detailed IO information that you can share with your storage team. This could be very useful for your SQL Server installations or any system where you suspect IO issues.


Enhancement information and background topics:

·         Storport ETW Logging to Measure Requests Made to a Disk Unit

·         Diagnose Slow I/O on Windows With Xperf (from the MSFT guys who wrote it – based on a pre-release version but the basics still apply)

·         xPerf: Once I Was Blind and Now I See

·         Storport Driver

The hotfix downloads are available from the top of each article and instructions are included in the body of each KB:

·         KB 979764 (Windows Server 2008 hotfix)

·         KB 978000 (Windows Server 2008 R2 hotfix)

· (xperf downloads)


What do those “IO requests taking longer than 15 seconds” messages on my SQL box mean?

You may be sometimes seeing stuck/stalled IO messages on one or more of your SQL Server boxes. This is something it is important to understand so I am providing some background information on it.


Here is the message you may see in the SQL error log:

SQL Server has encountered xxx occurrence(s) of IO requests taking longer than 15 seconds to complete on file [mdf_or_ldf_file_path_name] in database [dbname] (dbid). The OS file handle is 0x…. The offset of the latest long IO is: 0x….”.


The message indicates that SQL Server has been waiting on at least one I/O for 15 seconds or longer. The exact number of times you have exceeded this time for the specified file since the last message is included in the message. The messages will not be written more than once every five minutes. Keep in mind that read IOs on an average system should take no more than 10-20ms and writes should take no more than 3-5ms (the exact acceptable values vary depending on your business needs and technical configuration). So anything measured in seconds indicates a serious performance problem. The problem is NOT within SQL Server, this message indicates SQL has sent off an IO request and has waited more than 15 seconds for a response. The problem is somewhere in the disk IO subsystem. For example, the disk IO subsystem may have more load than it is designed to handle, there is a “bad” hardware or firmware somewhere along the path, filter drivers such as anti-virus software are interfering, your file layout is not optimal, or some IO subsystem setting such as HBA queue depth is not set optimally.


Though the root cause is IO, you can see other symptoms that are a side effect and may lead you down the wrong troubleshooting path. For example, if enough IO is backed up behind the stalled IO then you may see blocking in SQL Server (because locks that are usually taken for very short periods of time are now held for seconds), new connections may not be allowed, and the CPU usage can increase (because many threads are waiting), and a clustered SQL Server can fail over (because the IsAlive checks which are just SQL queries fail to complete like all the other queued queries). You may see other errors returned to the user or in the various logs, such as timeouts.


There are two ways to approach this problem. You can either reduce the IO on the system (change indexes or queries or archive data for example) or you can make the underlying system able to handle the IO load (fix hardware/firmware problems, change configurations, add disks or controllers, change the file layout, etc.).



·         897284  Diagnostics in SQL Server 2000 SP4 and in later versions help detect stalled and stuck I/O operations;EN-US;897284

·         Detecting and Resolving Stalled and Stuck I/O Issues in SQL Server 2000 SP 4



·         Every Windows 2003 SP1 or SP2 system should have this storport fix: 941276  A Windows Server 2003-based computer stops responding when the system is under a heavy load and when the Storport driver is being used;EN-US;941276

·         Use PerfMon to look at the disk counters for sec/read, sec/write, bytes/sec, current disk queue length, reads/sec, writes/sec

·         Collect data from sys.dm_io_virtual_file_stats and sys.dm_io_pending_io_requests.

·         Ask your storage admins to monitor the entire IO subsystem from the Windows system all the way through to the underlying disks.


Compilation of SQL Server TempDB IO Best Practices

It is important to optimize TempDB for good performance. In particular, I am focusing on how to allocate files.


TempDB is a unique database in several ways. The ones most relevant to this discussion are:

·         It is often one of the busiest databases on an instance. This means the performance of TempDB is critical to your instance’s overall performance.

·         It is recreated as a copy of model each time SQL Server starts, taking all the properties of model except for the location, number, and size of its data and log files.

·         TempDB has a very high rate of create/drop object activity. This means the system metadata related to object creation/deletion is heavily used.

·         Slightly different logging and latching behavior.


General recommendations:

·         Pre-size TempDB appropriately. Leave autogrow on with instant file initialization enabled, but try to configure the database so that it never hits an autogrow event. Make sure the autogrow growth increment is appropriate.

·         Follow general IO recommendations for fast IO.

·         If your TempDB experiences metadata contention (waitresource = 2:1:1 or 2:1:3), you should split out your data onto multiple files. Generally you will want somewhere between 1/4 and 1 file per physical core. If you don’t want to wait to see if any metadata contention occurs you may want to start out with around 1/4 to 1/2 the number of data files as CPUs up to about 8 files. If you think you might need more than 8 files we should do some testing first to see what the impact is. For example, if you have 8 physical CPUs you may want to start with 2-4 data files and monitor for metadata contention.

·         All TempDB data files should be of equal size.

·         As with any database, your TempDB performance may improve if you spread it out over multiple drives. This only helps if each drive or mount point is truly a separate IO path. Whether each TempDB will have a measurable improvement from using multiple drives depends on the specific system.

·         In general you only need one log file. If you need to have multiple log files because you don’t have enough disk space on one drive that is fine, but there is no direct benefit from having the log on multiple files or drives.

·         On SQL Server 2000 and more rarely on SQL Server 2005 or later you may want to enable trace flag -T1118.

·         Avoid shrinking TempDB (or any database) files unless you are very certain you will never need the space again.



·         Working with tempdb in SQL Server 2005

o   “Divide tempdb into multiple data files of equal size. These multiple files don’t necessarily be on different disks/spindles unless you are also encountering I/O bottlenecks as well. The general recommendation is to have one file per CPU because only one thread is active per CPU at one time.”

o   “Having too many files increases the cost of file switching, requires more IAM pages, and increases the manageability overhead.”

·         How many files should a database have? – Part 1: OLAP workloads

o   If you have too many files you can end up with smaller IO block sizes and decreased performance under extremely heavy load.

o   If you have too few files you can end up with decreased performance to GAM/SGAM contention (generally the problem you see in TempDB) or PFS contention (extremely heavy inserts).

o   The more files you have per database the longer it takes to do database recovery (bringing a database online, such as during SQL Server startup). This can become a problem with hundreds of files.

·         SQL Server Urban Legends Discussed

o   ” SQL Server uses asynchronous I/O allowing any worker to issue an I/O requests regardless of the number and size of the database files or what scheduler is involved.”

o   ” Tempdb is the database with the highest level of create and drop actions and under high stress the allocation pages, syscolumns and sysobjects can become bottlenecks.   SQL Server 2005 reduces contention with the ‘cached temp table’ feature and allocation contention skip ahead actions.”

·         Concurrency enhancements for the tempdb database

o   Note that this was originally written for SQL Server 2000 (the applies to section only lists 2000) and there are some tweaks/considerations for later versions that are not covered completely in this article. For example, -T1118 is not only much less necessary on SQL Server 2005+, it can in some cases cause problems.

·         FIX: Blocking and performance problems may occur when you enable trace flag 1118 in SQL Server 2005 if the temporary table creation workload is high;EN-US;936185

o   If you have SP2 based CU2 or later you will not see the problems described in this article. Also, on SP2 based CU2 or higher you are much less likely to even need -T1118 on a heavily used TempDB.

o   ” This hotfix significantly reduces the need to force uniform allocations by using trace flag 1118. If you apply the fix and are still encountering TEMPDB contention, consider also turning on trace flag 1118.”

·         Misconceptions around TF 1118

o   ” turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent “

o   “Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there’s latch contention), now you don’t need so many – so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there’s usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization – your mileage may vary – don’t post a comment saying this is wrong because your system benefits from 12 data files. It’s a generalization, to which there are always exceptions.”

·         Storage Top 10 Best Practices  

o   “Make sure to move TEMPDB to adequate storage and pre-size after installing SQL Server. “

o   “Performance may benefit if TEMPDB is placed on RAID 1+0 (dependent on TEMPDB usage). “

o   “This is especially true for TEMPDB where the recommendation is 1 data file per CPU. “

o   “Dual core counts as 2 CPUs; logical procs (hyperthreading) do not. “

o   “Data files should be of equal size – SQL Server uses a proportional fill algorithm that favors allocations in files with more free space.

o   “Pre-size data and log files. “

o   “Do not rely on AUTOGROW, instead manage the growth of these files manually. You may leave AUTOGROW ON for safety reasons, but you should proactively manage the growth of the data files. “

Optimizing tempdb Performance

Leave a comment

SQL Server with NetApp SAN

If you are planning to use  NetApp as the SAN for your SQL Server instance(s), take a look at these documents in addition to the normal SQL Server IO planning best practices documents.

TR-3779 Sizing best practice guide.

TR-3696 This is for the storage layout best practices.

White Paper on 1 TB DSS systems

SMSQL 5.0 Best Practice Guide

Microsoft® SQL Server 2005 Performance and Scalability Testing Using NetApp FAS920 Storage Systems