Befriending Dragons

Turn Scary Into Attainable



Create HDInsight Cluster in Azure Portal

Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the “how to”. I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of “Create HDInsight Cluster in Azure Portal” on my YouTube channel Small Bites of Big Data.

Playlist Getting Started with HDInsight: https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv

  1. Why HDInsight: https://youtu.be/J9KzIShLeD8
  2. Azure Subscription: https://youtu.be/lSxMtmRE114
  3. Azure Storage – WASB: https://youtu.be/6OdDDmdaVVE
  4. Metastore: https://youtu.be/1Og_eftYVpA
  5. Create HDInsight: https://youtu.be/SysIo3LwONk
  6. Hive Query: https://youtu.be/DRAuOXsuec0
  7. Load Demo Data: https://youtu.be/XyiOpRPjfUs
  8. Pricing, Automation, and Wrapup: https://youtu.be/78YowrOnNGM

PowerPoint deck: http://www.slideshare.net/cindygross1/create-hd-insightfeb2015

Why HDInsight?

HDInsight is Hadoop on Azure as a service.

  • Easy, cost effective, changeable scale out data processing
  • Lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters
  • Hortonworks HDP is one of the 3 major Hadoop distributions and the most purely open source
  • HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
  • Metastore (HCatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Hive is a Hadoop component that makes data look like rows/columns for data warehouse type activities

It offers the standard advantages of Hadoop:

  • Scale-out
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

In addition you have the advantages of Hadoop in the cloud:

  • Instantly access data born in the cloud
  • Easily, cheaply load, share, and merge public or private data
  • Data exists independently across clusters (separation of storage and compute) via WASB on Azure storage accounts

Recording of why HDInsight on YouTube

Azure Subscription

You have many options to obtain a Microsoft Azure subscription:

Log in to Azure Subscription

1. Log in on the Azure Portal https://manage.windowsazure.com

2. Use a Microsoft Account http://www.microsoft.com/en-us/account/default.aspx
Note: Some companies have federated their accounts and can use company accounts.

Choose Subscription

Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.

Add Accounts

Option: Add more Microsoft Accounts as admins of the Azure Subscription.

1. Choose SETTINGS at the very bottom on the left.

2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.

3. Enter a Microsoft Account or federated enterprise account that will be an admin.

Recording of getting started with an Azure subscription on YouTube

Azure Storage – WASB

I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialogue gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.

Create a Storage Account

1. Click on STORAGE in the left menu then NEW.

2. URL: Choose a lower-case storage account name that is unique within *.core.windows.net.

3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.

4. REPLICATION: Locally redundant stores fewer copies and costs less.

Repeat if you need additional storage.
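
Before opening the portal, a candidate name can be checked locally. The sketch below is a minimal pre-check, assuming the naming rule is 3 to 24 characters of lower-case letters and digits; the function names are hypothetical, and global uniqueness within *.core.windows.net can only be confirmed by Azure itself.

```python
import re

# Assumed rule: 3-24 characters, lower-case letters and digits only.
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Return True if the name satisfies the assumed local naming rule."""
    return _ACCOUNT_NAME.fullmatch(name) is not None


def blob_endpoint(name):
    """Build the blob endpoint URL that a valid account name would map to."""
    if not is_valid_storage_account_name(name):
        raise ValueError("not a valid storage account name: " + repr(name))
    return "https://" + name + ".blob.core.windows.net"


if __name__ == "__main__":
    for candidate in ("nealhadoop", "NealHadoop", "ab"):
        print(candidate, is_valid_storage_account_name(candidate))
```

A name that passes this check can still be rejected by the portal if someone else already owns it.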

Create a Container

1. Click on your storage account in the left menu then CONTAINERS on the top.

2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.

3. Enter a lower-case NAME for the container, unique within that storage account.

4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.

This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
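
Container names follow slightly different rules than storage account names. This is a small sketch, assuming the rule is 3 to 63 characters made of lower-case letters, digits, and single hyphens, starting and ending with a letter or digit; the helper name is made up for illustration.

```python
import re

# Assumed rule: lower-case letters, digits, and single hyphens;
# must start and end with a letter or digit; 3-63 characters overall.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])+")


def is_valid_container_name(name):
    """Return True if the name satisfies the assumed container naming rule."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("demo", "my-data", "Demo", "my--data"):
        print(candidate, is_valid_container_name(candidate))
```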

WASB

Additional information about storage, including details on Windows Azure Storage Blobs (WASB) is on http://SmallBitesOfBigData.com.

Recording of creating an Azure storage account and container on YouTube.

Metastore (HCatalog)

In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.

Create a Metastore aka Azure SQL DB

Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances it is not officially tested or supported.

1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.

2. Choose a NAME unique to your server.

3. Click on the “?” to help you decide what TIER of database to create.

4. Use the default database COLLATION.

5. If you choose an existing SERVER you will share sysadmin access with other databases.

You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.

Firewall Rules

In order to refer to the metastore from automated cluster creation scripts such as PowerShell your workstation must be added to the firewall rules.

1. Click on MANAGE then choose YES.

2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.

Recording of creating the metastore on YouTube.

Create the HDInsight Cluster

Now that we have the pre-requisites done we can move on to creating the cluster.

  • Quick Create through the Azure portal is the fastest way to get started with all the default settings.
  • The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
  • You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
  • For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
  • A cluster takes several minutes to create; the type and size of the cluster have little impact on the creation time.

Quick Create Option

For your first cluster choose a Quick Create.

1. Click on HDINSIGHT in the left menu, then NEW.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key value pairs (HBase) or alerting (Storm).

3. Choose a NAME unique in the azurehdinsight.net domain.

4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.

5. Choose the admin PASSWORD.

6. The location of the STORAGE ACCOUNT determines the location of the cluster.

Custom Create Option

You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.

New

1. Click on HDINSIGHT in the left menu, then NEW in the lower left.

2. Choose CUSTOM CREATE.

Basic Info

1. Choose a NAME unique in the azurehdinsight.net domain.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).

3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.

4. In most cases you will want the default VERSION.

Size and Location

1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.

2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.

3. The storage account(s), metastore, and cluster will all be in the same REGION.
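
Because the portal asks for nodes but the limit is counted in cores, it can help to do the arithmetic up front. The sketch below assumes a made-up layout (4 data nodes and 2 head nodes at 4 cores each) and a made-up subscription limit; the real cores-per-node values come from the “Pricing details” link, and the function names are hypothetical.

```python
def cores_required(node_groups):
    """Sum cores over node groups given as {group: (node_count, cores_per_node)}."""
    return sum(count * cores for count, cores in node_groups.values())


def quota_message(required, cores_in_use, core_limit):
    """Return the portal-style warning if the cluster would not fit, else None."""
    available = core_limit - cores_in_use
    if required > available:
        return ("This cluster requires {} cores, but only {} cores are "
                "available for this subscription".format(required, available))
    return None


if __name__ == "__main__":
    # Hypothetical layout: the counts and core sizes are illustrative only.
    layout = {"data": (4, 4), "head": (2, 4)}
    needed = cores_required(layout)
    print(needed)
    print(quota_message(needed, cores_in_use=150, core_limit=170))
```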

Cluster Admin

1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.

2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.
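
If you script cluster creation, you may want to check a password against these rules before submitting it. This is a minimal sketch of the rules as stated above; it treats any ASCII punctuation as a special character, which is an assumption, since the portal may reject some of them.

```python
import string


def password_problems(password):
    """Return a list of rule violations; an empty list means the password passes."""
    problems = []
    if len(password) < 10:
        problems.append("shorter than 10 characters")
    if not any(c.islower() for c in password):
        problems.append("no lower-case letter")
    if not any(c.isupper() for c in password):
        problems.append("no upper-case letter")
    if not any(c.isdigit() for c in password):
        problems.append("no number")
    # Assumption: any ASCII punctuation counts as a special character.
    if not any(c in string.punctuation for c in password):
        problems.append("no special character")
    return problems


if __name__ == "__main__":
    print(password_problems("short"))
    print(password_problems("Dragon#2015xy"))
```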

Metastore (HCatalog)

On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog).

1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.

2. Choose the SQL Azure database you created earlier as the METASTORE.

3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.

Default Storage Account

Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.

1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.

2. Specify the NAME of the existing storage account.

3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.

4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.

Additional Storage Accounts

If you specified there will be additional accounts you will see this screen.

1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.

2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.

Script Actions

You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.

1. Click “add script action” to show the remaining parameters.

2. Enter a unique NAME for your action.

3. The SCRIPT URI points to code for your custom action.

4. Choose the NODE TYPE for deployment.

Create is Done!

Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.

Recording of HDInsight quick and custom create on YouTube

Query with Hive

For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as “just another data source”. That includes Azure Machine Learning, Power BI, Excel, and Tableau.

Hive Console

The simplest, most relatable way for most people to use Hadoop is via the SQL-like, database-like Hive and HiveQL (HQL).

1. Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net/

2. Click on Hive Editor.

Query Hive

The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.

1. Optionally edit or replace the default query:
SELECT * FROM hivesampletable LIMIT 10;

2. Optionally name your query to make it easier to find in the job history.

3. Click Submit.

Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
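
The trade-off between fixed startup overhead and scale-out can be illustrated with a toy cost model. This is not a benchmark: the scan rate, startup time, and node count below are invented purely to show the shape of the curve, and the function names are hypothetical.

```python
def hive_seconds(rows, nodes, rows_per_node_per_second, startup_seconds):
    """Toy model: fixed job startup cost plus a scan spread evenly across nodes."""
    return startup_seconds + rows / (nodes * rows_per_node_per_second)


def single_server_scan_seconds(rows, rows_per_second):
    """Toy model: one server scanning every row, with no startup cost."""
    return rows / rows_per_second


if __name__ == "__main__":
    # All numbers are made up for illustration.
    for rows in (1_000, 10_000_000_000):
        hive = hive_seconds(rows, nodes=4,
                            rows_per_node_per_second=1_000_000,
                            startup_seconds=20)
        single = single_server_scan_seconds(rows, rows_per_second=1_000_000)
        print(rows, round(hive, 2), round(single, 2))
```

With these invented numbers the small query is dominated by the startup cost, while the large scan finishes sooner on the scaled-out cluster. An indexed lookup is not modeled at all; it avoids the scan entirely, which is why it tends to favor the relational system.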

View Hive Results

1. Click on the Query you just submitted in the Job Session. This opens a new tab.

2. You can see the text of the Job Query that was submitted. You can Download it.

3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.

4. The Job Log has details including errors if there are any.

5. Additional information about the job is available in the upper right.

View Hive Data in Excel Workbook

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. From DATA choose From Other Sources and From Data Connection Wizard.

View Hive Data in PowerPivot

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.

4. Choose Get External Data then Others (OLEDB/ODBC).

Now you can combine the Hive data with other data inside the tabular PowerPivot data model.

Recording of querying Hive on YouTube

Load Demo Data

In the cloud you don’t have to load data to Hadoop; you can load data to an Azure Storage Account. Then you point your HDInsight or other WASB-compliant Hadoop cluster to the existing data source. There are many ways to load data; for the demo we’ll use CloudXplorer.

You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.

In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
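
To make the account/container/directory layout concrete, here is a small sketch that builds the WASB location for the bacon directory and a Hive external table statement pointing at it. The table name, the single STRING column, and the tab delimiter are assumptions for illustration; only the account, container, and directory names come from the example above.

```python
def wasb_uri(container, account, path="", secure=False):
    """Build a wasb:// (or wasbs://) URI for a path inside a container."""
    scheme = "wasbs" if secure else "wasb"
    return "{}://{}@{}.blob.core.windows.net/{}".format(
        scheme, container, account, path.strip("/"))


def external_table_ddl(table, columns, location):
    """Build a HiveQL statement for an external table over a directory.

    columns is a list of (name, hive_type) pairs; the tab delimiter is assumed.
    """
    cols = ", ".join("{} {}".format(name, hive_type) for name, hive_type in columns)
    return ("CREATE EXTERNAL TABLE {} ({}) "
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
            "STORED AS TEXTFILE LOCATION '{}';".format(table, cols, location))


if __name__ == "__main__":
    location = wasb_uri("demo", "nealhadoop", "bacon")
    print(location)
    # Hypothetical schema: one STRING column per line of text.
    print(external_table_ddl("bacon", [("line", "STRING")], location))
```

Note that the LOCATION is the directory, so both bacon1.txt and bacon2.txt would be read by the same table.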

Windows Azure Storage Explorers (2014)

Recording of loading demo data on YouTube

WrapUp

Once you have created the HDInsight cluster you can use it and play with it and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can use that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for the usage of it. So make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it. Check the website for the most recent information.
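
Since the charge follows the cluster’s existence rather than its usage, a quick back-of-the-envelope comparison shows why dropping the cluster matters. The hourly rate and node count below are placeholders, not real prices; check the pricing page for current numbers.

```python
def cluster_cost(nodes, hourly_rate_per_node, hours_existing):
    """Cost is driven by how long the cluster exists, not by how busy it is."""
    return nodes * hourly_rate_per_node * hours_existing


if __name__ == "__main__":
    # Placeholder values: 6 nodes at a made-up rate of 0.50 per node-hour.
    always_on = cluster_cost(6, 0.50, hours_existing=24 * 30)
    work_hours_only = cluster_cost(6, 0.50, hours_existing=8 * 22)
    print(always_on, work_hours_only)
```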

Pricing

image

Automate with PowerShell

With PowerShell, .NET, or the cross-platform command-line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.

Automation allows you to standardize and, with version control, lets you track your configurations over time.

Sample PowerShell Script: HDInsight Custom Create http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx. If your HDInsight and/or Azure cmdlets don’t match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone).”

Recording of Pricing, Automation, and Wrapup on YouTube

Summary

  • HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux
  • Easy, cost effective, changeable scale out data processing for a lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters via WASB
  • Metastore (HCatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or private data
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

I hope you enjoyed this Small Bite of Big Data! Happy Hadooping!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com


Master Choosing the Right Project for Hadoop

Hadoop is the hot buzzword of the Big Data world, and many IT people are being told “go create a Hadoop cluster and do some magic”. It’s hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let’s look at this from a business perspective.

Data-Driven Insights

For the average business just getting into using Hadoop for the first time, you are most likely to be successful if you choose a project related to data exploration, analytics and reporting, and/or looking for new data-driven actionable insights. In many ways Hadoop is “just another data source.” Generally most businesses will not start with replacing existing, high-functioning OLTP implementations. Instead you will likely see the highest initial return on investment (ROI) from adding on to those existing systems. Pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. The goal should remain clearly focused on how to use the data to take action based on the new data-driven insights you will uncover.

Success

Below are some characteristics that are often present for a successful Hadoop implementation. You don’t need to have all of them to be successful; use the list to brainstorm new ideas.

  • Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of “what-if” questions cheaply, to “fail fast” so you can try out many potential hypotheses, and look for that one cool thing everyone else has missed that can really impact your business.
  • New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB option to make data ingestion even easier and faster.
  • You are looking for the “Unknown Unknowns”. There are always lurking things that haven’t come to your attention before but which may be sparks for new actions. You know you don’t know what you want or what to ask for and will use that to spur innovation.
  • Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise that it is infinitely scalable – you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
  • You are looking to gain some competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points: you are using Hadoop to look for something new that can change your business or help you be first to market with something.
  • There are a low number of direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop it can be sent to systems that are less-batch oriented and more available and familiar to the average power user or data steward.
  • Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.

Anti-Patterns

Quite often I hear people proposing Hadoop for projects that are not an ideal use for Hadoop, at least not as you are learning it and looking for quick successes to bolster confidence in the new technology. The characteristics below are generally indicators that you do NOT want to use Hadoop in a project.

  • You plan to replace an existing system whose pain points don’t align with Hadoop’s strengths.
  • There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, there are some components of Hadoop that can meet OLTP requirements and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast interactive response time it is possible Hadoop could be a fit but it’s usually not a good first project for you to learn Hadoop and truly use Hadoop’s strengths.
  • Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
  • A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
  • Your first project and learning is on a mission critical system or application. Learn on something new, something that makes Hadoop’s strengths really apparent and easy to see.

And in Conclusion

Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success criteria and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way and whether there are better ways to do certain things. Hadoop is becoming ubiquitous; avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!



AzureCopy to the Rescue for an S3 to Azure Blob Copy!

This week I helped a client move files from AWS S3 to Azure Storage blobs. Sounds simple, right? Here’s the tricky part… While there are both Azure and AWS cmdlets for PowerShell, they don’t cooperate. Neither has a cmdlet that accepts credentials from the other and neither accepts arbitrary URLs from outside their own cloud. And AzCopy also doesn’t accept S3 URLs. None of the S3 tools seem to recognize Azure. So what’s a girl to do?

The Search and The Discovery

After hours of trying to get creative with PowerShell or AzCopy I resorted to Bing searches. When what to my wondering eyes should appear, but a miniature sleigh…. uh, a fully fledged, well-written tool to move data between Azure and S3. But there’s more! This tool, known as Rudolph… I mean AzureCopy, can move data between Azure, S3, OneDrive, SharePoint online, Dropbox, and local file systems! Ken Faulkner has written a wonderful, holly jolly tool! After a few hiccups as I learned how to use the tool and learned about how S3 URLs are (and at first mostly are not) formed I quickly had all my data moved from S3 to Azure! Simple. Easy. It flew like the down of a thistle (whatever that means). So, what was required after installing the tool?

Open a dos-prompt and go to the directory where you installed AzureCopy. Instead of using a config file I set the values at the command line (use your own real values for the directory and after each equal sign):

cd C:\installs\azurecopy
set AzureAccountKey=MyAzureStorageAccountKey
set AWSAccessKeyID=MyS3AccessId
set AWSSecretAccessKeyID=MyS3SecretKey
set AWSRegion=us-west-2

Then I got a listing of my files on S3 – this took longer than it should because I had trouble getting the S3 URL correct. That was a problem with my newness to S3, not a problem with the tool. If you’re in the default region you use mybucket.s3.amazonaws.com. Otherwise you use mybucket.s3-region.amazonaws.com. See Amazon’s docs on S3 buckets for more details on the URL.
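
The URL rule described above can be captured in a few lines. This is a sketch of that rule only; treating us-east-1 as the name of the default (“US Standard”) region is my assumption, and the function name is hypothetical.

```python
def s3_bucket_url(bucket, region=None):
    """Build the bucket URL following the default-region versus other-region rule."""
    # Assumption: the default "US Standard" region is also known as us-east-1.
    if region is None or region == "us-east-1":
        host = "{}.s3.amazonaws.com".format(bucket)
    else:
        host = "{}.s3-{}.amazonaws.com".format(bucket, region)
    return "https://{}/".format(host)


if __name__ == "__main__":
    print(s3_bucket_url("mybucket"))
    print(s3_bucket_url("mybucket", "us-west-2"))
```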

Also, I didn’t need all the keys passed in on both commands, it was just easier to write and copy the code that way as I tried to get it all working.

azurecopy -list https://mybucket.s3-us-west-2.amazonaws.com/ -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID%

Next I listed out the files in Azure. At this point the container was empty but the command at least verified my access worked. I uploaded a small test file and verified I could see it with AzureCopy, then deleted the test file.

azurecopy -list https://mystorage.blob.core.windows.net/mycontainer -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID%

And now on to the secret sauce – the actual, magical file copy.

azurecopy -i https://mybucket.s3-us-west-2.amazonaws.com/ -o https://mystorage.blob.core.windows.net/mycontainer -azurekey %AzureAccountKey% -s3k %AWSAccessKeyID% -s3sk %AWSSecretAccessKeyID% -blobcopy -destblobtype block
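
If you want to script the copy rather than type it, one option is to assemble the same arguments in code. The sketch below only builds the argument list using the flags shown in the command above; it does not run anything, and the helper name is made up.

```python
def azurecopy_copy_args(source_url, dest_url, azure_key, s3_access_id, s3_secret_key):
    """Assemble the azurecopy arguments for an S3-to-Azure block blob copy."""
    return [
        "azurecopy",
        "-i", source_url,          # input: the S3 bucket URL
        "-o", dest_url,            # output: the Azure container URL
        "-azurekey", azure_key,
        "-s3k", s3_access_id,
        "-s3sk", s3_secret_key,
        "-blobcopy",
        "-destblobtype", "block",
    ]


if __name__ == "__main__":
    args = azurecopy_copy_args(
        "https://mybucket.s3-us-west-2.amazonaws.com/",
        "https://mystorage.blob.core.windows.net/mycontainer",
        "MyAzureStorageAccountKey", "MyS3AccessId", "MyS3SecretKey")
    print(" ".join(args))
```

Assuming azurecopy is on the PATH, the list could be handed to subprocess.run(args, check=True), which avoids shell quoting problems with keys that contain special characters.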

Success!

And just like that, within a couple of minutes, the list command for azurecopy showed all the files in Azure! I double-checked with my Azure and AWS PowerShell cmdlets that yes, this was really true! This tool saved me SO MUCH TIME! And now you know, the built in tools from the major cloud vendors lock you into their own cloud. But with AzureCopy you too can free your data!



PowerShell works for Amazon AWS S3 too!

More and more we have to work with data in many different locations. This week I got to work with S3 files that were moving to Azure blob storage. I was surprised to find that Amazon has published AWS cmdlets for PowerShell. It took me a little while to figure out the format and terminology so I’ll try to explain that and compare/contrast how we interact with storage in AWS and Azure. Today we will cover viewing the files.

Configure PowerShell

Well first, let’s get things set up. Install the Azure and AWS cmdlets for PowerShell. These examples will pass keys for everything so there’s no need to configure PowerShell with certificates to access the clouds.

The first time (depending on your PowerShell version) you use PowerShell after installing AWS cmdlets you may need to run these cmdlets:

Add-Type -Path "C:\Program Files (x86)\AWS SDK for .NET\bin\Net45\AWSSDK.dll"
Import-Module "C:\Program Files (x86)\AWS Tools\PowerShell\AWSPowerShell\AWSPowerShell.psd1"

Connecting to Storage

S3

We’ll start with AWS S3. Each connection to S3 storage requires an AWS region (unless you use the default “US Standard”), an access id (unique identifier), a secret key, and a bucket. You are storing data within a specific region on an access point in a managed grouping called a bucket. The access id in S3 is equivalent to a storage account name in Azure. A bucket in S3 is roughly equivalent to a container in Azure.

$S3Bucket = "MyBucket"
$S3Key = "SecretKeyValue"
$S3AccessID = "AccessKey"
$AWSregion = "us-west-2"

Next let’s use those values to make a new client connection to S3. You define a configuration object that points to the full URL for the region. Then you pass that configuration object, the access id, and the secret key to a function that creates a client connection to S3. This sets the context for the entire session and the context does not have to be passed to the individual commands. Note that the URL changes depending on the region, for example https://s3-us-west-2.amazonaws.com

Set-DefaultAWSRegion $AWSregion # auto-stored to $StoredAWSRegion
$AWSserviceURL = "https://s3-$AWSregion.amazonaws.com"
$config = New-Object Amazon.S3.AmazonS3Config
$config.ServiceURL = $AWSserviceURL
# Pass the access id and secret key defined above
$S3Client = [Amazon.AWSClientFactory]::CreateAmazonS3Client($S3AccessID, $S3Key, $config)

Azure

Let’s compare that to how we list files in Azure blob storage. First you specify the location and credentials. The region is implied because the storage account name is unique across all regions. The container and secret key value are similar in meaning.

$storageAccountName = "MyStorageAccountName"
$storageaccountkey = "SecretKeyValue"
$containerName = "MyBucket"

Then you define the storage context which is the location and credentials of an object. Alternatively you could set the default storage context for the session or for a particular profile’s connection to a given subscription.

$AzureContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountkey

View the Files

S3

Now you can get basic metadata about the S3 bucket:
Get-S3Bucket $S3Bucket
Get-S3BucketLocation $S3Bucket

Next let’s list the files in that bucket.

Get-S3Object -BucketName $S3Bucket

You can populate an array with the list. In this example I kept just the name (key) of each file:
$S3FileList = (Get-S3Object -BucketName $S3Bucket).key

And you can filter the result set:
$S3FileList = (Get-S3Object -BucketName $S3Bucket | Where-Object {$_.lastmodified -lt "2/17/2015"}).Key
$S3FileList = (Get-S3Object -BucketName $S3Bucket | Where-Object {$_.key -like "*42*"}).Key

Azure

For Azure we can do similar operations to view the files. This example lists all files in the container:

Get-AzureStorageBlob -Context $AzureContext -Container $containerName

You can also populate an array with the list:

$AzureList = Get-AzureStorageBlob -Context $AzureContext -Container $containerName

Or pull out just a single property:

(Get-AzureStorageBlob -Context $AzureContext -Container $containerName).Name

Or list just blobs that match a wildcard value:

Get-AzureStorageBlob -Context $AzureContext -Container $containerName -Blob *42*
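
The two filters used in the examples above (a last-modified cutoff and a wildcard on the name) are simple to reason about once the listing is in memory. Here is a sketch of the same logic over a made-up listing; the file names and dates are invented, and the dictionary shape is a stand-in for whatever objects your listing call returns.

```python
from datetime import datetime
from fnmatch import fnmatchcase


def keys_modified_before(objects, cutoff):
    """Keep the keys of objects last modified strictly before the cutoff."""
    return [o["Key"] for o in objects if o["LastModified"] < cutoff]


def keys_matching(objects, pattern):
    """Keep the keys that match a wildcard pattern, ignoring case."""
    return [o["Key"] for o in objects
            if fnmatchcase(o["Key"].lower(), pattern.lower())]


if __name__ == "__main__":
    # Invented listing for illustration only.
    listing = [
        {"Key": "bacon1.txt", "LastModified": datetime(2015, 2, 10)},
        {"Key": "log42.csv", "LastModified": datetime(2015, 2, 20)},
        {"Key": "data/part-0042.txt", "LastModified": datetime(2015, 1, 5)},
    ]
    print(keys_modified_before(listing, datetime(2015, 2, 17)))
    print(keys_matching(listing, "*42*"))
```

Lower-casing both sides mimics the case-insensitive behavior of the PowerShell -like operator.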

My Work Here is Done

This intro to PowerShell for S3 opens up the door to many possibilities – data migrations, multi-cloud hybrid solutions, and whatever your imagination can conjure up! Today we reviewed how to view files; I’ll cover more in future posts. Happy PowerShelling!

Tip

When you open “Microsoft Azure PowerShell” type ISE in the window to launch the PowerShell Integrated Scripting Environment (ISE). It has IntelliSense, multiple script windows, and a cmdlet viewer.



Understanding WASB and Hadoop Storage in Azure

Yesterday we learned Why WASB Makes Hadoop on Azure So Very Cool. Now let’s dive deeper into Windows Azure storage and WASB. I’ll answer some of the common questions I get when people first try to understand how WASB is the same as and different from HDFS.

What is HDFS?

The Hadoop Distributed File System (HDFS) is one of the core Hadoop components; it is how Hadoop manages data and storage. At a high level, when you load a file into Hadoop the “name node” uses HDFS to chunk the file into blocks and spreads those blocks of data across the worker nodes within the cluster. Each chunk of data is stored on multiple nodes (assuming the replication factor is set to > 1) for higher availability. The name node knows where each chunk of data is stored, and the job manager uses that information to allocate tasks and resources appropriately across nodes.

What is WASB?

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

What is an Azure blob store, an Azure storage account, and an Azure container? For that matter, what is Azure again?

Azure is Microsoft’s cloud solution. A cloud is essentially a collection of host data centers that you don’t have to directly manage. You can request services from that cloud. For example, you can request virtual machines and storage, data services such as SQL Azure Database or HDInsight, or services such as Websites or Service Bus. In Azure you store blobs in containers within Azure storage accounts. You grant access at the storage account level, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:

[Illustration: a storage account holds containers, and each container holds blobs]

How do I manage and configure block/chunk size and the replication factor with WASB?

You don’t. It’s not generally necessary. The data is stored in the Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account you also get 3 copies of the data in a secondary region, hundreds of miles away but within the same geography. The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory-related performance at run time, that is still an option. You can pass in any Hadoop configuration parameter setting when you create the cluster or you can use the SET command for a given job.

Isn’t one of the selling points of Hadoop that the data sits with the compute? How does that work with WASB?

Just like with any Hadoop system the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from the storage accounts instead of from local disks. Given the way Azure data center backbones are built the performance is generally the same or better than if you used disks locally attached to the VMs.

How do I load data to Hadoop on Azure?

You use any of the many Azure data loading methods. There isn’t really anything special about loading data that will be used for Hadoop. As with data used by any other application there are some guidelines around directory structures, optimal numbers of files, and internal format, but that is independent of data loading. Some common examples are AzCopy, CloudXplorer and other storage explorers, and SQL Server Integration Services (SSIS).

And yes, I will blog about those guidelines but not here. 🙂

Can I have multiple Hadoop clusters pointing to one storage account?

Yes.

Can I have one Hadoop cluster pointing to multiple storage accounts?

Yes. Check!

See: Use Additional Storage Accounts with HDInsight Hive.

Can I have many Hadoop clusters pointing to multiple storage accounts?

Why, yes. Yes you can. Check!

Do I get to keep my data even if no Hadoop cluster currently exists?

What a fun day to say Yes. Check!

For a caveat see HDInsight: Hive Internal and External Tables Intro.

Is WASB available for any distribution of Hadoop other than HDInsight?

It is my pleasure to answer that with a resounding Yes. Check!

WASB is built into HDInsight (Microsoft’s Hadoop on Azure service) and is the default file system. WASB is also available in the Apache source code for Hadoop. Therefore when you install Hadoop, such as Hortonworks HDP or Cloudera EDH/CDH, on Azure VMs you can use WASB with some configuration changes to the cluster.

How do I manage files and directories?

Hive is the most common entry point for Hadoop jobs, and with Hive you never point to a single file; you always point to a directory. If you are a stickler for details and want to point out that Azure doesn’t have directories, that’s technically true. However, Hadoop recognizes that a slash “/” is an indication of a directory. Therefore Hadoop treats the below Azure blob file as if it were AFile.txt in a directory structure of: SomeDirectory/ASubDirectory. But since you don’t access individual files in Hive you will reference either SomeDirectory or SomeDirectory/ASubDirectory.

Blob: wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt

You can add, remove, and modify files in the Azure blob store without regard to whether a Hadoop cluster exists. Each time a job runs it reads the data that currently exists in the directory(s) it references. Hadoop itself can also write to files.
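
To make the pieces of that wasb:// path more concrete, here is a small sketch that splits one into its container, storage account, directory, and file name. Split-WasbUri is a hypothetical helper of my own, not part of any Azure module, and the regular expression assumes the standard blob.core.windows.net endpoint shown in the example path.

```powershell
# Hypothetical helper: break a wasb:// or wasbs:// path into its parts.
# Assumes the standard <account>.blob.core.windows.net endpoint.
function Split-WasbUri {
    param([Parameter(Mandatory = $true)][string]$Uri)

    $pattern = '^(?<scheme>wasbs?)://(?<container>[^@/]+)@(?<account>[^./]+)\.blob\.core\.windows\.net/?(?<path>.*)$'
    if (-not ($Uri -match $pattern)) {
        throw "Not a WASB path: $Uri"
    }

    # Everything before the last slash is what Hadoop treats as the directory
    $path      = $Matches['path']
    $lastSlash = $path.LastIndexOf('/')
    $directory = ''
    if ($lastSlash -ge 0) { $directory = $path.Substring(0, $lastSlash) }

    [pscustomobject]@{
        Scheme         = $Matches['scheme']
        Container      = $Matches['container']
        StorageAccount = $Matches['account']
        Directory      = $directory
        FileName       = $path.Substring($lastSlash + 1)
    }
}

$parts = Split-WasbUri -Uri 'wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt'
$parts
```

The Directory value is the part you would reference from Hive; the FileName is just one of the blobs Hadoop finds under it.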

What about ORCFile, Parquet, and Avro?

They are specialized, open source formats often used within Hadoop but rarely used outside of Hadoop. There are performance advantages to using those formats for “write once, read many” data inside Hadoop, but chances are high that you won’t then be able to access the data without going through one of your Hadoop clusters.

Should I have lots of small files?

NO! No!  

Why is too long to answer here. The short answer is to use files that are many multiples of the in-memory chunk size, in the GB or TB size range. Whenever possible use fewer, larger files instead of many small files. If necessary stitch the files together.
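
If you want to see whether you already have a small-file problem, a sketch like this can flag the candidates for stitching together. Get-SmallFile is a hypothetical name, the 64MB default threshold is an arbitrary value I picked for illustration (not a recommendation), and the mock objects stand in for the output of Get-AzureStorageBlob, which I am assuming exposes Name and Length properties.

```powershell
# Hypothetical helper: list files smaller than a threshold, smallest first.
# Assumes each input object exposes Name and Length (size in bytes).
function Get-SmallFile {
    param(
        [object[]]$FileList,
        [long]$ThresholdBytes = 64MB   # arbitrary illustrative default
    )
    $FileList |
        Where-Object { $_.Length -lt $ThresholdBytes } |
        Sort-Object -Property Length
}

# Mock data standing in for: Get-AzureStorageBlob -Context $AzureContext -Container $containerName
$mockBlobs = @(
    [pscustomobject]@{ Name = 'SomeDirectory/medium.csv'; Length = 5MB }
    [pscustomobject]@{ Name = 'SomeDirectory/big.csv';    Length = 2GB }
    [pscustomobject]@{ Name = 'SomeDirectory/tiny.csv';   Length = 10KB }
)

$small = @(Get-SmallFile -FileList $mockBlobs)
$small | Select-Object Name, Length
```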

That’s your storage lesson for today – please put your additional Hadoop on Azure storage questions in the comments or send me a tweet! Thanks for stopping by!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com

http://blogs.msdn.com/b/cindygross/archive/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure.aspx

http://www.nealanalytics.com/understanding-wasb-and-hadoop-storage-in-azure/




Why WASB Makes Hadoop on Azure So Very Cool

Data. It’s all about the data. We want to make more data driven decisions. We want to keep more data so we can make better decisions. We want that data stored cheaply, easily accessible, and quickly ingested. Hadoop promises to help with all those things. However, when you deal with Hadoop on-premises you have a multi-step process to load the data. Azure and WASB to the rescue!

With a typical Hadoop installation you load your data to a staging location then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back to a non-HDFS system (a SAN, a file share, a website).

What’s different in the cloud? With Azure you have Azure Blob Storage Accounts. Data can be stored there as blobs in any format. That data can be accessed by various applications – including Hadoop without first doing a separate load into HDFS! This is made possible because Microsoft used the public extensions available with HDFS to create the Windows Azure Storage Blobs (WASB) interface between Hadoop and the Azure blob storage. This WASB code is available for any distributor of Hadoop in the Apache source code and it is the default storage system in HDInsight – Microsoft’s Hadoop on Azure PaaS offering. It is also available for Hortonworks HDP on Azure VMs or Cloudera EDH/CDH on Azure VMs with some manual configuration steps.

With WASB you load your data to Azure blobs at any time – whether Hadoop clusters currently exist or not. That way you aren’t paying for Hadoop compute time simply to load data. You spin up one or more clusters, point them at the data sets (yes, multiple clusters pointing to same data!), and run your Hadoop jobs. When you don’t need the system for a while you take down your Hadoop cluster(s) and the data is still there. At any point, whether one or more Hadoop clusters are accessing the data or not, other applications can still access and manipulate the data. For example, you could have data sitting on an Azure storage account that is being added to by a SQL Server Integration Services (SSIS) job. At the same time someone is using Power Query to load that data into PowerPivot while a website inserts new data to the same location. Meanwhile your R&D department can be running highly intensive jobs that require a large cluster up for many days or weeks at a time, and your sales team can have a separate, smaller cluster that’s up for a few hours a day – all pointing at the same data!

With this separation of storage and compute you have simplified your data accessibility, reduced data movement and copies, and reduced the time it takes to have your data available! That all adds up to lower costs and a faster, more data-driven time to insight.


http://www.nealanalytics.com/why-wasb-makes-hadoop-on-azure-so-very-cool/

http://blogs.msdn.com/b/cindygross/archive/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool.aspx



Azure Maximums and Resource Usage from PowerShell


Have you ever struggled to find out how many VM cores, HDInsight cores, storage accounts, or other Azure resources your subscription is set to allow or how many you actually use? Maybe you want to use this information in your automation scripts to avoid trying to create components for which you don’t have resources.


PowerShell to the rescue!

First a couple of key points. There are various maximums in Azure. Today we are talking about finding the currently configured maximums allowed for a specified subscription. There are default maximums (default limit) which you can increase for a given subscription by opening a billing support ticket. There are also hard maximums (maximum limit). However, with some products, such as HDInsight (Hadoop), you can get past some per-subscription maximums for dependent services by combining resources (storage accounts) from multiple subscriptions for a single HDInsight cluster. All the samples below find the current billing quota limitation and actual usage for the current subscription.

Let’s take a look at the information available on the subscription level cmdlet.

Start by checking which subscription is in focus / current for the PowerShell session.

(Get-AzureSubscription -Current).SubscriptionName

(Get-AzureSubscription -Current).CurrentStorageAccountName

If you need information on a different subscription either pass the subscription name (as defined on your client) for the cmdlets that support this or change the focus to a different subscription.

$SubName = "sqlcatwoman"

Select-AzureSubscription -SubscriptionName $SubName

Now we will look at the cores available for Azure virtual machines (VMs / IaaS). Note that HDInsight cores are tracked separately. Be careful with unexpected line wraps that may paste into your PowerShell window (or ISE) incorrectly. The below snippet is 1 comment line and 4 lines of code.

# How many cores are available to create new VMs (or increase size of existing VMs) for the current subscription?

[int]$maxVMCores     = (Get-AzureSubscription -current -ExtendedDetails).maxcorecount

[int]$currentVMCores = (Get-AzureSubscription -current -ExtendedDetails).currentcorecount

[int]$availableCores = $maxVMCores - $currentVMCores

Write-Host "Cores available for VMs:" $availableCores

We can get similar information about cloud services:

#how many cloud (hosted) services are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxHostedServices

[int]$currentUsed    = (Get-AzureSubscription -current -ExtendedDetails).CurrentHostedServices

[int]$availableNow   = $maxAvl - $currentUsed

Write-Host "Cloud services available:" $availableNow

Some limits and usage are available on cmdlets specific to a particular technology. For example, the HDInsight usage and maximums are available from the Get-AzureHDInsightProperties cmdlet. You can find details and samples on Get HDInsight Properties with PowerShell.

Other times we have to look at different cmdlets for different pieces of the information, such as for storage accounts:

#how many storage accounts are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxStorageAccounts

[int]$currentUsed    = (Get-AzureStorageAccount).Count

[int]$availableNow   = $maxAvl - $currentUsed

Write-Host "Storage Accounts available:" $availableNow
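
Since each of these checks is the same “maximum minus current” arithmetic, you could wrap it in a small function for your automation scripts. This is only a sketch: Get-AvailableQuota is a hypothetical name of mine, and the numbers passed in are mock values rather than anything read from a real subscription.

```powershell
# Hypothetical helper: work out how much quota is left and whether a request fits.
function Get-AvailableQuota {
    param(
        [int]$Max,
        [int]$Current,
        [int]$Requested = 0
    )
    # Never report a negative number, even if usage somehow exceeds the maximum
    $available = [Math]::Max(0, $Max - $Current)
    [pscustomobject]@{
        Available = $available
        Requested = $Requested
        CanCreate = ($Requested -le $available)
    }
}

# Mock values; in a real script they could come from the subscription, for example:
#   $sub = Get-AzureSubscription -Current -ExtendedDetails
#   Get-AvailableQuota -Max $sub.MaxCoreCount -Current $sub.CurrentCoreCount -Requested 8
$check = Get-AvailableQuota -Max 20 -Current 14 -Requested 8

if (-not $check.CanCreate) {
    Write-Host "Only $($check.Available) available; requested $($check.Requested)."
}
```

Checking CanCreate before calling a create cmdlet lets the script stop early with a clear message instead of failing partway through.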

We can look at all the extended properties available for a subscription:

Get-AzureSubscription -Current -ExtendedDetails

If you know you have a particular component created and this cmdlet shows the “Current” value is zero, take a look at the Get-Azure… cmdlet for that particular type of resource and look for a “Current” value.

Another handy thing to look at is the overall information about what Azure regions exist and what services are available in each region:

Get-AzureLocation 

And you can pull off specific information:

Get-AzureLocation  | Select DisplayName
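
You can also turn the question around and ask which regions offer a given service. The sketch below assumes the objects returned by Get-AzureLocation have DisplayName and AvailableServices properties. Select-LocationWithService is a hypothetical helper name, and the regions and service lists in the mock data are made-up sample values, not real availability information.

```powershell
# Hypothetical helper: return the display names of locations that offer a service.
# Assumes each input object exposes DisplayName and AvailableServices.
function Select-LocationWithService {
    param(
        [object[]]$Locations,
        [string]$Service
    )
    $Locations |
        Where-Object { $_.AvailableServices -contains $Service } |
        ForEach-Object { $_.DisplayName }
}

# Mock data standing in for: Get-AzureLocation (sample values only)
$mockLocations = @(
    [pscustomobject]@{ DisplayName = 'Region A'; AvailableServices = @('Compute', 'Storage') }
    [pscustomobject]@{ DisplayName = 'Region B'; AvailableServices = @('Storage') }
    [pscustomobject]@{ DisplayName = 'Region C'; AvailableServices = @('Compute', 'Storage', 'HighMemory') }
)

$computeRegions = @(Select-LocationWithService -Locations $mockLocations -Service 'Compute')
$computeRegions
```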

I hope these small bites of PowerShell help save the day for you in some way!