Befriending Dragons

Turn Scary Into Attainable

Leave a comment

Big Data for the SQL Eye

SQL Server is a great technology – I’ve been using it since 1993 when the user interface consisted of a query window with the options to save and execute and not much else. With every release there’s something new and exciting and there’s always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.

So what is a SQL geek to do in the face of all the new options out there – many under the umbrella of Big Data (distributed processing)? Why just jump right on in and learn it! No one can know all the pieces because it’s a big, fluid, messy collection of “things”. But don’t worry about that, start with one thing and build from there. Even if you never plan to implement a production Big Data system you need to learn about it – because if you don’t have some hands-on experience with it then someone who does have that experience will be influencing the decision makers without you. For a SQL Pro I suggest Hive as that easy entry point. At some point maybe Spark SQL will jump into that gap, but for now Hive is the easiest entry point for most SQL pros.

For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below, the file is attached.

Look, it’s SQL!

SELECT score, fun
WHERE type = ‘they pay me for this?’;

Here’s how that code looks from Visual Studio along with the links to how you find the output and logs:


And yet it’s more!

(fun STRING,
rank INT COMMENT ‘rank the greatness’,
type STRING)
COMMENT ‘two tables walk into a bar….’
LOCATION ‘/data/demo/’;


A mix of old and new

— read some data
SELECT ‘you cannot make me ‘, score, fun, type
WHERE score <= 0
ORDER BY score;

SELECT ‘when can we ‘, score, fun, type
WHERE score > 0


That’s Hive folks!


on Hadoop
on HDInsight
on AzureBig Data in the cloud!

Hadoop Shines When….
(refer to

Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iterating
Unknown unknowns
Flexible scaling
Data driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival

Hadoop Anti-Patterns….

Replace system whose pain points don’t align with Hadoop’s strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system


Azure has so much more

Go straight to the business code
Scale storage and compute separately
Open Source
Managed and unmanaged services
On-demand and 24×7 options
SQL Server

It’s a Polyglot

Stream your data into a lake
Pick the best compute for each task

And it’s Fun!

I hope you enjoyed this small bite of big data!




Create HDInsight Cluster in Azure Portal

Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the "how to". I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of "Create HDInsight Cluster in Azure Portal" on my YouTube channel Small Bites of Big Data.

Playlist Getting Started with HDInsight: 

  1. Why HDInsight:
  2. Azure Subscription:
  3. Azure Storage – WASB:
  4. Metastore:
  5. Create HDInsight:
  6. Hive Query:
  7. Load Demo Data:
  8. Pricing, Automation, and Wrapup:

PowerPoint deck: 


Why HDInsight?

HDInsight is Hadoop on Azure as a service.

  • Easy, cost effective, changeable scale out data processing
  • Lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters
  • Hortonworks HDP is one of the 3 major Hadoop
    distributors, the most purely open source
  • HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
  • Metastore (Hcatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Hive is a Hadoop component that makes data look like rows/columns for data warehouse type activities

It offers the standard advantages of Hadoop:

  • Scale-out
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

In addition you have the advantages of Hadoop in the cloud:

  • Instantly access data born in the cloud
  • Easily, cheaply load, share, and merge public or private data
  • Data exists independently across clusters (separation of storage and compute) via WASB on Azure storage accounts

Recording of why HDInsight on YouTube

Azure Subscription

You have many options to obtain a Microsoft Azure subscription:


Login to Azure Subscription

1. Login on Azure Portal

2. Use a Microsoft Account
Note: Some companies have federated their accounts and can use company accounts.


Choose Subscription

Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.



Add Accounts

Option: Add more Microsoft Accounts as admins of the Azure Subscription.

1. Choose SETTINGS at the very bottom on the left.

2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.

3. Enter a Microsoft Account or federated enterprise account that will be an admin.


Recording of getting started with an Azure subscription on YouTube

Azure Storage – WASB

I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialogue gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.

Create a Storage Account

1. Click on STORAGE in the left menu then NEW.

2. URL: Choose a lower-case storage account name that is unique within *

3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.

4. REPLICATION: Locally redundant stores fewer copies and costs less.


Repeat if you need additional storage.

Create a Container

1. Click on your storage account in the left menu then CONTAINERS on the top.

2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.

3. Enter a lower-case NAME for the container, unique within that storage account.

4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.

This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.



Additional information about storage, including details on Windows Azure Storage Blobs (WASB) is on


Recording of creating an Azure storage account and container on YouTube.

Metastore (HCatalog)

In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.

Create a Metastore aka Azure SQL DB

Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances it is not officially tested or supported.

1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.

2. Choose a NAME unique to your server.

3. Click on the “?” to help you decide what TIER of database to create.

4. Use the default database COLLATION.

5. If you choose an existing SERVER you will share sysadmin access with other databases.


You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.

Firewall Rules

In order to refer to the metastore from automated cluster creation scripts such as PowerShell your workstation must be added to the firewall rules.

1. Click on MANAGE then choose YES.

2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.


Recording of creating the metastore on YouTube.

Create the HDInsight Cluster

Now that we have the pre-requisites done we can move on to creating the cluster.

  • Quick Create through the Azure portal is the fastest way to get started with all the default settings.
  • The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
  • You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
  • For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
  • A cluster will take several minutes to create, the type and size of the cluster have little impact on the time for creation.

Quick Create Option

For your first cluster choose a Quick Create.

1. Click on HDINSIGHT in the left menu, then NEW.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key value pairs (HBase) or alerting (Storm).

3. Choose a NAME unique in the domain.

4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.

5. Choose the admin PASSWORD.

6. The location of the STORAGE ACCOUNT determines the location of the cluster.


Custom Create Option

You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.


1. Click on HDINSIGHT in the left menu, then NEW in the lower left.



Basic Info

1. Choose a NAME unique in the domain.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).

3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.

4. In most cases you will want the default VERSION.


Size and Location

1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.

2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.

3. The storage account(s), metastore, and cluster will all be in the same REGION.


Cluster Admin

1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.

2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.


Metastore (HCatalog)

On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (Hcatalog).

1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.

2. Choose the SQL Azure database you created earlier as the METASTORE.

3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.


Default Storage Account

Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.

1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.

2. Specify the NAME of the existing storage account.

3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.

4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.


Additional Storage Accounts

If you specified there will be additional accounts you will see this screen.

1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.

2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.

image image

Script Actions

You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.

1. Click “add script action” to show the remaining parameters.

2. Enter a unique NAME for your action.

3. The SCRIPT URI points to code for your custom action.

4. Choose the NODE TYPE for deployment.


Create is Done!

Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.


Recording of HDInsight quick and custom create on YouTube

Query with Hive

For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as "just another data source". That includes Azure Machine Learning, Power BI, Excel, and Tableau.

Hive Console

The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL).

1.  Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens:

2.  Click on Hive Editor.



Query Hive

The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.

1. Optionally edit or replace the default query:
Select * from hivesampletable LIMIT 10;

2. Optionally name your query to make it easier to find in the job history.

3. Click Submit.

Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.


View Hive Results

1. Click on the Query you just submitted in the Job Session. This opens a new tab.


2. You can see the text of the Job Query that was submitted. You can Download it.

3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.

4. The Job Log has details including errors if there are any.

5. Additional information about the job is available in the upper right.


View Hive Data in Excel Workbook

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. From DATA choose From Other Sources and From Data Connection Wizard.


View Hive Data in PowerPivot

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.

4. Choose Get External Data then Others (OLEDB/ODBC).

Now you can combine the Hive data with other data inside the tabular PowerPivot data model.


Recording of querying Hive on YouTube

Load Demo Data

In the cloud you don’t have to load data to Hadoop, you can load data to an Azure Storage Account. Then you point your HDInsight or other WASB compliant Hadoop cluster to the existing data source. There many ways to load data, for the demo we’ll use CloudXplorer.

You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.

In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.

Windows Azure Storage Explorers (2014)


Recording of loading demo data on YouTube


Once you have created the HDInsight cluster you can use it and play with it and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can use that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for the usage of it. So make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it. Check the website for the most recent information.



Automate with PowerShell

With PowerShell, .NET, or the Cross-Platform cmd line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.

Automation allows you to standardize and with version control lets you track your configurations over time.

Sample PowerShell Script: HDInsight Custom Create If your HDInsight and/or Azure cmdlets don’t match the current documention or return unexpected errors run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone).”


Recording of Pricing, Automation, and Wrapup on YouTube


  • HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux
  • Easy, cost effective, changeable scale out data processing for a lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters via WASB
  • Metastore (Hcatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or private data
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

I hope you enjoyed this Small Bite of Big Data! Happy Hadooping!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | |


Understanding WASB and Hadoop Storage in Azure

Yesterday we learned Why WASB Makes Hadoop on Azure So Very Cool. Now let’s dive deeper into Windows Azure storage and WASB. I’ll answer some of the common questions I get when people first try to understand how WASB is the same as and different from HDFS.

What is HDFS?

The Hadoop Distributed File System (HDFS) is one of the core Hadoop components, it is how Hadoop manages data and storage. At a high level, when you load a file into Hadoop the "name node" uses HDFS to chunk the file into blocks and it spreads those blocks of data across the worker nodes within the cluster. Each chunk of data is stored on multiple nodes (assuming the replication factor is set to > 1) for higher availability. The name node knows where each chunk of data is stored and that information is used by the job manager to allocate tasks and resources appropriately across nodes.

What is WASB?

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways "is" HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

What is an Azure blob store, an Azure storage account, and an Azure container? For that matter, what is Azure again?

Azure is Microsoft’s cloud solution. A cloud is essentially a collection of host data centers that you don’t have to directly manage. You can request services from that cloud. For example, you can request virtual machines and storage, data services such as SQL Azure Database or HDInsight, or services such as Websites or Service Bus. In Azure you store blobs on containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:


How do I manage and configure block/chunk size and the replication factor with WASB?

You don’t. It’s not generally necessary. The data is stored in the Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account you also get 3 copies of the data in another data center within the same region. The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory related performance at run time that is still an option. You can pass in any Hadoop configuration parameter setting when you create the cluster or you can use the SET command for a given job.

Isn’t one of the selling points of Hadoop that the data sits with the compute? How does that work with WASB?

Just like with any Hadoop system the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from the storage accounts instead of from local disks. Given the way Azure data center backbones are built the performance is generally the same or better than if you used disks locally attached to the VMs.

How do I load data to Hadoop on Azure?

You use any of the many Azure data loading methods. There isn’t really anything special about loading data that will be used for Hadoop. As with data used by any other application there are some guidelines around directory structures, optimal numbers of files, and internal format but that is independent of data loading. Some common examples are AZCopy, CloudXplorer and other storage explorers, and SQL Server Integration Services (SSIS).

And yes, I will blog about those guidelines but not here. 🙂

Can I have multiple Hadoop clusters pointing to one storage account?


Can I have one Hadoop cluster pointing to multiple storage accounts?

Yes. Check!

See: Use Additional Storage Accounts with HDInsight Hive.

Can I have many Hadoop clusters pointing to multiple storage accounts?

Why, yes. Yes you can. Check!

Do I get to keep my data even if no Hadoop cluster currently exists?

What a fun day to say Yes. Check!

For a caveat see HDInsight: Hive Internal and External Tables Intro.

Is WASB available for any distribution of Hadoop other than HDInsight?

It is my pleasure to answer that with a resounding Yes. Check!

WASB is built into HDInsight (Microsoft’s Hadoop on Azure service) and is the default file system. WASB is also available in the Apache source code for Hadoop. Therefore when you install Hadoop, such as Hortonworks HDP or Cloudera EDH/CDH, on Azure VMs you can use WASB with some configuration changes to the cluster.

How do I manage files and directories?

Hive is the most common entry point for Hadoop jobs and with Hive you never point to a single file, you always point to a directory. If you are a stickler for details and want to point out that Azure doesn’t have directories, that’s technically true. However, Hadoop recognizes that a slash "/" is an indication of a directory. Therefore Hadoop treats the below Azure blob file as if it were AFile.txt in a directory structure of: SomeDirectory/ASubDirectory. But since you don’t access individual files in Hive you will reference either SomeDirectory or SomeDirectory/ASubDirectory.

Blob: wasb://

You can add, remove, and modify files in the Azure blob store without regard to whether a Hadoop cluster exists. Each time a job runs it reads the data that currently exists in the directory(s) it references. Hadoop itself can also write to files.

What about ORCFile, Parquet, and AVRO?

They are proprietary formats often used within Hadoop but rarely used outside of Hadoop. There are performance advantages to using those formats for "write once, read many" data inside Hadoop, but chances are high that you won’t then be able to access the data without going through one of your Hadoop clusters.

Should I have lots of small files?

NO! No!  

Why is too long to answer here. The short answer is to use files that are many multiples of the in-memory chunk size, in the GB or TB size range. Whenever possible use fewer, larger files instead of many small files. If necessary stitch the files together.

That’s your storage lesson for today – please put your additional Hadoop on Azure storage questions in the comments or send me a tweet! Thanks for stopping by!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  image
@SQLCindy | @NealAnalytics | |

!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?’http’:’https’;if(!d.getElementById(id)){js=d.createElement(s);;js.src=p+”://”;fjs.parentNode.insertBefore(js,fjs);}}(document,”script”,”twitter-wjs”);


Azure Maximums and Resource Usage from PowerShell

Technorati Tags: ,

Have you ever struggled to find out how many VM cores, HDInsight cores, storage accounts, or other Azure resources your subscription is set to allow or how many you actually use? Maybe you want to use this information in your automation scripts to avoid trying to create components for which you don’t have resources.

quizzical owl

PowerShell to the rescue!

First a couple of key points. There are various maximums in Azure. Today we are talking about finding the currently configured maximums allowed for a specified subscription. There are default maximums (default limit) which you can increase for a given subscription by opening a billing support ticket. There are also hard maximums (maximum limit). However, with some products, such as HDInsight (Hadoop), you can get past some per-subscription maximums for dependent services by combining resources (storage accounts) from multiple subscriptions for a single HDInsight cluster. All the samples below find the current billing quota limitation and actual usage for the current subscription.

Let’s take a look at the information available on the subscription level cmdlet.

Start by checking which subscription is in focus / current for the PowerShell session.

(Get-AzureSubscription -Current).SubscriptionName

(Get-AzureSubscription -Current).CurrentStorageAccountName

If you need information on a different subscription either pass the subscription name (as defined on your client) for the cmdlets that support this or change the focus to a different subscription.

$SubName = "sqlcatwoman"

Select-AzureSubscription -SubscriptionName $SubName

Now we will look at the cores available for Azure virtual machines (VMs / IaaS). Note that HDInsight cores are tracked separately. Be careful with unexpected line wraps that may paste into your PowerShell window (or ISE) incorrectly. The below snippet is 1 comment line and 4 lines of code.

# How many cores are available to create new VMs (or increase size of existing VMs) for the current subscription?

[int]$maxVMCores     = (Get-AzureSubscription -current -ExtendedDetails).maxcorecount

[int]$currentVMCores = (Get-AzureSubscription -current -ExtendedDetails).currentcorecount

[int]$availableCores = $maxVMCores $currentVMCores

Write-Host "Cores available for VMs:" $availableCores

We can get similar information about cloud services:

#how many cloud (hosted) services are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxHostedServices

[int]$currentUsed    = (Get-AzureSubscription -current -ExtendedDetails).CurrentHostedServices

[int]$availableNow   = $maxAvl $currentUsed

Write-Host "Cloud services available:" $availableNow

Some limits and usage are available on cmdlets specific to a particular technology. For example, the HDInsight usage and maximums are available from the Get-AzureHDInsightProperties cmdlet. You can find details and samples on Get HDInsight Properties with PowerShell.

Other times we have to look at different cmdlets for different pieces of the information, such as for storage accounts:

#how many storage accounts are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxStorageAccounts

[int]$currentUsed    = (Get-AzureStorageAccount).Count

[int]$availableNow   = $maxAvl $currentUsed

Write-Host "Storage Accounts available:" $availableNow

We can look at all the extended properties available for a subscription:happy owl

Get-AzureSubscription -currentExtendedDetails

If you know you have a particular component created and this cmdlet shows the “Current” value is zero, take a look at the Get-Azure… cmdlet for that particular type of resource and look for a “Current” value.

Another handy thing to look at is the overall information about what Azure regions exist and what services are available in each region:


And you can pull off specific information:

Get-AzureLocation  | Select DisplayName

I hope these small bites of PowerShell help save the day for you in some way!

Leave a comment

Get HDInsight Properties with PowerShell

Small Bites of Big Data from AzureCAT

You’ve created your HDInsight Hadoop clusters and now you want to know exactly what you have out there in Azure. Maybe you want to pull the key information into a repository periodically as a reference for future troubleshooting, comparisons, or billing. Maybe you just need to get a quick look at your overall HDInsight usage. This is something you can easily automate with PowerShell.


First, open Windows Azure PowerShell or powershell_ise.exe.

Set some values for your environment:

$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName = "HDInsightClusterName" #HDInsight cluster name

HDInsight Usage for the Subscription

Take a look at your overall HDInsight usage for this subscription:


Get-AzureHDInsightProperties returns the number of clusters for this subscription, the total HDInsight cores used and available (for head nodes and data nodes), the Azure regions where HDInsight clusters can be created, and the HDInsight versions available for new clusters:

ClusterCount    : 2
CoresAvailable  : 122
CoresUsed       : 48
Locations       : {East US, North Europe, Southeast Asia, West Europe...}
MaxCoresAllowed : 170
Versions        : {1.6, 2.1, 3.0}

You can also pick out specific pieces of information and write them to a file, store them as variables, or use them elsewhere. This example simply outputs the values to the screen.

write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed

HDInsight Cluster Information

Get-AzureHDInsightCluster provides information about all existing HDInsight clusters for this subscription:

As you can see this cmdlet tells you the size, connection information, and version.
ClusterSizeInNodes    : 4
ConnectionUrl         :
CreateDate            : 4/5/2014 3:37:23 PM
DefaultStorageAccount :
HttpUserName          : Admin
Location              : West US
Name                  : BigCAT30
State                 : Running
StorageAccounts       : {}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               :
VersionStatus         : Compatible

ClusterSizeInNodes    : 4
ConnectionUrl         :
CreateDate            : 5/5/2014 6:09:58 PM
DefaultStorageAccount :
HttpUserName          : Admin
Location              : West US
Name                  : cgrosstest
State                 : Running
StorageAccounts       : {}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               :
VersionStatus         : Compatible

You can also get information about just one HDInsight cluster at a time:

Get-AzureHDInsightCluster  -name $ClusterName

Or you can get very granular and look at specific properties, even some that aren’t in the default values:

write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName

This information will be a valuable source of information for tracking past configurations, current usage, and planning. Enjoy your Hadooping!

Sample Script

# Cindy Gross 2014
# Get HDInsight properties
$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName        = "YourHDInsightClusterName" #HDInsight cluster name

Get-AzureHDInsightCluster  -name $ClusterName
write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName
write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed


Getting Started with Azure PowerShell Cmdlets–Subscription Management

I’ve started using the Azure PowerShell cmdlets more often to manage virtual machines and HDInsight in Azure. Once you connect to a subscription everything just works. However, the initial steps to get one or more subscriptions configured to be used from your machine or understanding how to change subscription information on your machine can be confusing. Some of the docs are contradictory, outdated, or incomplete. Often they assume you are only a co-admin of one subscription. The below steps should get you going with Azure cmdlets whether you admin one or many subscriptions.

You need to enable your machine to talk to one or more Azure subscriptions. The first step is creating a certificate. Do NOT do this if you already used the PublishSettings commands unless you first use Remove-AzureSubscription (which removes the locally stored information about the specified subscription). Makecert is more secure than PublishSettings, especially if you (a given email address) have multiple co-administrators per subscription and/or you (a given email address) are a co-administrator of multiple subscriptions.

The steps to get going are documented in Shep’s blog “Cloud Spelunking, Managing Azure form your Desktop via PowerShell (the Setup)” I’ll go a bit deeper and fill in a few additional details on what Shep calls the “hard” option.

Create a Certificate

If you have IIS, Visual Studio, or the Windows SDK you will have some variation of a “Developer Command Prompt” (or VS201x or Visual Studio Command Prompt). Open that command prompt with the “run as administrator” option. Replace YourCertName with a meaningful name and run the below command. The cert always goes to the cert store on your local machine – the last parameter is an optional file based copy of that certificate that we will need for the next step. If you don’t specify the location it goes to %windir%system32. Be very protective of the .cer file – delete it once you have uploaded it. You can always generate another file if you need it.

makecert -sky exchange -r -n "CN=<YourCertName>" -pe -a sha1 -len 2048 -ss My "c:temp<YourCertName>.cer"

This certificate is yours – do not share it with others. If you want to reuse the certificate on other machines that you control, you can copy the .cer file to those machines and import them into the local certificate store on each machine. The .cer is just a copy, the actual certificate was loaded into your local certificate store (Manage Computer Certificates) by makecert.

Upload Certificate to Azure Subscription(s)

Generally you will not want to share certificates with others. Any certificate you use must be in your local certificate store (Manage Computer Certificates). The same certificate must also be uploaded to the portal and associated with each subscription you wish to manage from your machine.

From your local machine where you created the certificate in the above step:

  • Log in to the Azure Portal with an email address that is associated with the subscription you want to use from your own machine.
  • Scroll to the bottom of the left pane and choose “SETTINGS”




  • Click on the “UPLOAD” button in the middle of the bar at the bottom of the screen.


  • In the “Upload a management certificate” dialog navigate to the location specified in the last parameter above or %windir%system32 if you didn’t specify a location. Choose the .cer file you just created with makecert (or export a certificate from the local certificate store – just make sure it has the right properties). If you have multiple subscriptions there is a 2nd drop down box where you need to choose the subscription that the certificate will be associated with.


  • Repeat for any additional subscriptions that you want to manage with the same certificate (or create one certificate per subscription for additional security granularity).

Install and Configure the Azure PowerShell Cmdlets

Follow the steps here to install the Azure Cmdlets. Basically you are selecting “Azure PowerShell” from the Web Platform Installer. You can also check in the Web Platform Installer for updated versions of the cmdlets.

A very common setting that many admins set is the RemoteSigned Execution Policy. This is less secure than AllSigned or Restricted but allows you to use most downloaded scripts.

Open Windows Azure PowerShell with the “run as admin” option and run:

Set-ExecutionPolicy RemoteSigned –Force
Get-ExecutionPolicy –list

If you see errors when setting the execution policy, search on your specific error or start with this blog: Set-ExecutionPolicy : Windows PowerShell updated your execution policy successfully, but the setting is overridden by a policy defined at a more specific scope!!! You may need to open “Edit Group Policy” (in Windows 8 that opens the Local Group Policy Editor) and make a change.  Sometimes you may need to set each individual scope, but process scope settings go back to the default when the process is closed:

Set-ExecutionPolicy RemoteSigned -Scope Process -Force

Then import the Azure cmdlets:

Import-Module Azure

You can close the PowerShell window, you no longer need to “run as admin”.

Enable PowerShell to use a Subscription via a Certificate

Repeat this section on each machine that will be used to execute PowerShell code. Also repeat for additional subscriptions on each machine.

Open Windows Azure PowerShell. Optionally type ISE to open the Integrated Scripting Environment where you can edit, save, and run collections of cmdlets.

First, set some variables. You will need to copy some basic settings from the Azure Management Portal. On the far left side of the portal, scroll all the way to the bottom and choose “SETTINGS” and “MANAGEMENT CERTIFICATES” (see the “Upload Certificate to Azure Subscription(s)” section of this blog for more details – you are copying from the same place where you uploaded the certificate). Choose the certificate you just uploaded. Don’t worry if the numbers are cut off on the screen, if you highlight and copy it will get the whole value, even the part that doesn’t show on the screen. Replace the $subID and $thumbprint below – do not update $myCert as that is done based on your other variables. Execute the code in the PowerShell window.

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "11111111-2222-3333-4444-555555555555"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "1234567891234567891234567891234567891234"
$myCert = Get-Item cert:\CurrentUserMy$thumbprint  

Now set the subscription name you will use to refer to this subscription from this machine. In most cases you will choose the NAME of the subscription from the portal but that is not required. The matching between your machine’s knowledge of the subscription and the subscription on Azure is done via the SUBSCRIPTION ID. Update $localSubName below and execute the code in the PowerShell window. Note that the local subscription name is case-sensitive.

#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
$localSubName = "MyFavSub"

Now that you have set the values for your own environment, run the code to actually update your machine’s knowledge of the subscription. Note that I used the back tick “`” to specify that the command continues on a new line.

Set-AzureSubscription –SubscriptionName $localSubName `
–SubscriptionId $subID -Certificate $myCert

Some operations rely on a default storage account, you may want to set the default storage account you want to use for each subscription.

#optionally set "current" storage account for this sub
$defaultStorageAccount = 'MyFavStorageAccount'
Set-AzureSubscription -SubscriptionName $localSubName `
-CurrentStorageAccount $defaultStorageAccount

Next you can set the default subscription that you will start with when you open PowerShell on this machine (note that we’ve changed from the Set cmdlet to the Select one):

Select-AzureSubscription –Default $localSubName

You can change which of the configured subscriptions is the current one:

Select-AzureSubscription –Current $localSubName

Check to see which subscription you are currently using:

Get-AzureSubscription –Current
(Get-AzureSubscription -Current).SubscriptionName

Verify that you can connect and list the services associated with the current subscription:

Get-AzureService | select ServiceName

Look at the Local Configuration

Now let’s look at what got updated on the local machine.

Open File Explorer and go to %appdata%Windows Azure PowerShell. Open WindowsAzureProfile.xml in Notepad or your favorite editor. Here are a few of the key values for each subscription you have mapped on your machine:

IsDefault tells you which one is the default subscription for your machine


The thumbprint id is stored as the ManagementCertificate:


The local name you chose for the subscription is stored in Name (to avoid confusion chose the name used in the portal):


The subscription id is stored in SubscriptionId:


Remove Subscription

If you need to remove a subscription from your machine, whether because you no longer have access to it or because you want to change one of the properties such as the name or which certificate you use, you can use Remove-AzureSubscription. This updates your local %appdata%Windows Azure PowerShell.

#Remove my machine's knowledge of a subscription 
#Removes info from %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
Remove-AzureSubscription -SubscriptionName MyFavSub

Sample Script

Here is a handy dandy cut/paste version of the above PowerShell code to add a subscription and make it your default and current subscription:

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "YourOwnSubID"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "YourCertThumbprint"
$myCert = Get-Item cert:\CurrentUserMy$thumbprint  
#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
$localSubName = "YourSubcriptionName"
#optionally set "current" storage account for this sub
$defaultStorageAccount = 'OptionalDefaultStorage'
Set-AzureSubscription –SubscriptionName $localSubName `
    –SubscriptionId $subID -Certificate $myCert
Set-AzureSubscription -SubscriptionName $localSubName `
    -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription –Default $localSubName
Select-AzureSubscription –Current $localSubName
Get-AzureSubscription –Current
(Get-AzureSubscription -Current).SubscriptionName

You are Ready for PowerShell Gooey Goodness!

Woohoo! Now you can access your Azure subscriptions from your machine without entering ids and passwords. You can automate, simplify, and standardize any Azure activity that has an associated cmdlet! Happy PowerShelling!



Sample PowerShell Script: HDInsight Custom Create

This is a working script I use to create various HDInsight clusters. For a really reproducible, automated environment you would want to put this into a .ps1 script that accepts parameters (see here for an example). However, you may find the method below good for learning and experimenting. Replace all the “YOURxyz” sections with your actual information. Beware of oddities introduced by cut/paste such as spaces being replaced by line breaks or quotes being replaced by smart quotes. The # is a comment, some commands that you rarely run are commented out so remove the # to run them if you need them.

# This PowerShell script is meant to be a cut/paste of specific parts, it is NOT designed to be run as a whole.

# Do once after you install the cmdlets
#Import-AzurePublishSettingsFile C:UsersYOURDirectoryDownloadsYOURName-credentials.publishsettings

# Use if you admin more than one subscription
#Get-AzureAccount # This may be needed to log in to Azure
Select-AzureSubscription –SubscriptionName YOURSubscription
Get-AzureSubscription -Current

# Many things are easier in the ISE

            ### create clusters ###          

# Add your specific information here
# Previous failures may make a name unavailable for a while – check to see if previous cluster was partially created
$ClusterName = “YOURNewHDInsightClusterName” #the name you will give to your cluster
$Location = “YOURDataCenter” #cluster data center must be East US, West US, or North Europe (as of December 2013)
$NumOfNodes = 1 #start small
$StorageAcct1 = “YOURExistingStorageAccountName” #currently must be in same data center as the cluster
$DefaultContainer = “YOURExistingContainerName” #already exists on the storage account

# These variables are automatically set for you
$FullStorage1 = “${StorageAcct1}”
$Key1 = Get-AzureStorageKey $StorageAcct1 | %{ $_.Primary }
$SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
$Cert = Get-AzureSubscription -Current | %{ $_.Certificate }
$Creds = Get-Credential -Message “New admin account to be created for your HDInsight cluster” #this prompts you

# Sample quick create
# Equivalent of quick create
# The ` specifies that the cmd continues on the next line, beware of artifical line breaks added during cut/paste from the blog
New-AzureHDInsightCluster -Name $ClusterName -ClusterSizeInNodes $NumOfNodes -Subscription $SubID -Location “$Location” `
-DefaultStorageAccountName $FullStorage1 -DefaultStorageAccountKey $Key1 -DefaultStorageContainerName $DefaultContainer -Credential $Creds

# Sample custom create
# Most params are the same as quick create, use a new cluster name
# Pass in a 2nd storage account, a SQLAzure db for the metastore (assume same db for Oozie and Hive), add Avro library, some config values
# Execute all the variable settings from above

# This value is set for you, don’t change!
$configvalues = new-object ‘Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration’

# Add your specific information here
$ClusterName = “YOURNewHDInsightClusterName
$StorageAcct2 = “YOURExistingStorageAccountName2
$MetastoreAzureSQLDBName = “YOURExistingSQLAzureDBName
$MetastoreAzureServerName = “” #gives a DNS error if you don’t use the full name
$configvalues.Configuration = @{ “hive.exec.compress.output”=”true” }  #this is an example of a config value you may pass in

# These variables are automatically set for you
$FullStorage2 = “${StorageAcct2}”
$Key2 = Get-AzureStorageKey $StorageAcct2 | %{ $_.Primary }
$MetastoreCreds = Get-Credential -Message “existing id/password for your SQL Azure DB (metastore)” #This prompts for the existing id and password of your existing SQL Azure DB

# Add a config file value
# Add AVRO SerDe libraries for Hive (on storage 1)
$configvalues.AdditionalLibraries = new-object ‘Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount’
$configvalues.AdditionalLibraries.StorageAccountName = $FullStorage1
$configvalues.AdditionalLibraries.StorageAccountKey = $Key1
$configvalues.AdditionalLibraries.StorageContainerName = “hivelibs” #container called hivelibs must exist on specified storage account
# Create custom cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes $NumOfNodes `
    | Set-AzureHDInsightDefaultStorage -StorageAccountName $FullStorage1 -StorageAccountKey $Key1 -StorageContainerName $DefaultContainer `
    | Add-AzureHDInsightStorage -StorageAccountName $FullStorage2 -StorageAccountKey $Key2 `
    | Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType OozieMetastore `
    | Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType HiveMetastore `
    | Add-AzureHDInsightConfigValues -Hive $configvalues `
    | New-AzureHDInsightCluster -Subscription $SubID -Location “$Location” -Name $ClusterName -Credential $Creds

# get status, properties, etc.
#$SubName = $SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
Get-AzureHDInsightProperties -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName -name YOURClusterName

# remove cluster
#Remove-AzureHDInsightCluster -Name $ClusterName -Subscription $SubName