Befriending Dragons

Turn Scary Into Attainable



Hadoop Likes Big Files

One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

How small is too small? Anything below the HDFS block size (chunk size) is definitely too small, and realistically anything that isn’t at least several times the chunk size still counts as small. A very, very rough rule of thumb is that files should be at least 1GB each and a table should have no more than roughly 10,000 files. These numbers, especially the maximum total number of files per table, vary depending on many factors, but they give you a reference point. The 1GB is based on multiples of the chunk size, while the 10,000 is honestly a bit of a guess based on a typical small cluster.

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and of where each file’s internal chunks are stored. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affects how memory is used on each node.

Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes: the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of: the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track 1000 logical files plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data, the difference between small and large files adds up quickly.
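A quick sanity check of that arithmetic, as a PowerShell sketch (PowerShell understands the MB/GB size suffixes natively):

# Chunks the name node must track for the same ~1GB of data
$chunkSize = 256MB
[math]::Ceiling(1GB / $chunkSize)          # one 1GB file     -> 4 chunks
1000 * [math]::Ceiling(1MB / $chunkSize)   # 1000 x 1MB files -> 1000 chunks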

The other problem comes when the data is read by a Hadoop job. When the job runs, each node loads into local memory the files the task tracker identified for it to work with (in WASB the chunking is done at this point). When there are more files to read for the same amount of data, each task within each job does more work and runs more slowly. Sometimes you will see hard errors when operating system limits on the number of open files are hit. There is also more internal work involved in reading the larger number of files and combining the data.

Stitching

There are several options for stitching files together.

  • Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
  • INSERT into new Hive tables (directories), which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task (see the sketch after this list).
  • Use a combiner in Pig to load the many small files into bigger splits.
  • Use the HDFS FileSystem Concat API http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat.
  • Write custom stitching code and make it a JAR.
  • Enable the Hadoop Archive (HAR). This is not very efficient for this scenario but I am including it for completeness.
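To make the Hive INSERT option concrete, here is a minimal sketch using the Invoke-Hive cmdlet shown near the end of this blog series. It assumes you are already connected to an HDInsight cluster; small_events and big_events are placeholder table names, and the 1GB (1073741824 byte) targets just echo the rule of thumb above:

# Rewrite many small files into fewer, larger ones via a Hive INSERT
# (placeholder table names; tune the merge targets for your cluster)
Invoke-Hive -Query @"
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=1073741824;
SET hive.merge.size.per.task=1073741824;
INSERT OVERWRITE TABLE big_events SELECT * FROM small_events;
"@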

There are several writeups out there that address the details of each of these methods so I won’t repeat them.

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com


Azure Data Factory: Hub Not Found

You can use the new Azure portal to create or edit Azure Data Factory components. Once you are done you may automate the process of creating future Data Factory components from PowerShell. In that case you can use the JSON files you edited in the portal GUI as configuration files for the PowerShell cmdlets. For example, you may try to create a new linked service using settings from C:\Cooler\HDInsight.JSON as specified below:

New-AzureDataFactoryLinkedService -ResourceGroupName CoolerDemo -DataFactoryName $DataFactoryName -File C:\Cooler\HDInsight.JSON

In that case you may see something like this error:

New-AzureDataFactoryLinkedService : Hub: {SomeName_hub} not found.
CategoryInfo          : CloseError: (:) [New-AzureDataFactoryLinkedService], ProvisioningFailedException
FullyQualifiedErrorId : Microsoft.Azure.Commands.DataFactories.NewAzureDataFactoryLinkedServiceCommand


If you check the JSON file that you exported from the portal and referenced in the PowerShell script, you will see it ends with something like this:

        "isPaused": false,
        "hubName": "SomeName_hub"
    }
}

The hubName is currently automatically generated based on the name of the Data Factory and should not be present in the JSON files used by PowerShell. Remove the comma on the line above the hubName and the entire line starting with hubName.

        ,
        "hubName": "SomeName_hub"

That will leave the end of the file looking something like this:

        "isPaused": false
    }
}

Check all the other JSON files you are using for Data Factory components and make the same edit in any that contain a hubName.
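If you have many component files to clean up, a sketch like the following may help. It assumes the files live in one folder (C:\Cooler here is a placeholder) and that hubName sits under properties, as in the example above; note that ConvertTo-Json will reformat whitespace, so back up the files first.

# Hedged sketch: strip hubName from every Data Factory JSON file in a folder
Get-ChildItem C:\Cooler\*.json | ForEach-Object {
    $json = Get-Content $_.FullName -Raw | ConvertFrom-Json
    if ($json.properties.PSObject.Properties['hubName']) {
        $json.properties.PSObject.Properties.Remove('hubName')
        $json | ConvertTo-Json -Depth 10 | Set-Content $_.FullName
    }
}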

NOTE: This applies to Azure Data Factory as of April 2015. At some point the hubName should become a viable parameter usable by PowerShell.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com


Getting Started with Azure PowerShell Cmdlets–Subscription Management

I’ve started using the Azure PowerShell cmdlets more often to manage virtual machines and HDInsight in Azure. Once you connect to a subscription everything just works. However, the initial steps to get one or more subscriptions configured to be used from your machine or understanding how to change subscription information on your machine can be confusing. Some of the docs are contradictory, outdated, or incomplete. Often they assume you are only a co-admin of one subscription. The below steps should get you going with Azure cmdlets whether you admin one or many subscriptions.

You need to enable your machine to talk to one or more Azure subscriptions. The first step is creating a certificate. Do NOT do this if you already used the PublishSettings commands, unless you first use Remove-AzureSubscription (which removes the locally stored information about the specified subscription). Makecert is more secure than PublishSettings, especially when a subscription has multiple co-administrators and/or your email address is a co-administrator of multiple subscriptions.

The steps to get going are documented in Shep’s blog “Cloud Spelunking, Managing Azure form your Desktop via PowerShell (the Setup)” http://blogs.msdn.com/b/sql_shep/archive/2013/03/29/cloud-spelunking-managing-azure-form-your-desktop-via-powershell.aspx. I’ll go a bit deeper and fill in a few additional details on what Shep calls the “hard” option.

Create a Certificate

If you have IIS, Visual Studio, or the Windows SDK you will have some variation of a “Developer Command Prompt” (or VS201x or Visual Studio Command Prompt). Open that command prompt with the “run as administrator” option. Replace YourCertName with a meaningful name and run the command below. The cert always goes to the certificate store on your local machine; the last parameter is an optional file-based copy of that certificate that we will need for the next step. If you don’t specify the location it goes to %windir%\system32. Be very protective of the .cer file: delete it once you have uploaded it. You can always generate another file if you need it.

makecert -sky exchange -r -n "CN=<YourCertName>" -pe -a sha1 -len 2048 -ss My "c:\temp\<YourCertName>.cer"

This certificate is yours – do not share it with others. If you want to reuse the certificate on other machines that you control, you can copy the .cer file to those machines and import them into the local certificate store on each machine. The .cer is just a copy, the actual certificate was loaded into your local certificate store (Manage Computer Certificates) by makecert.
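Before moving on, you can confirm the certificate landed in your personal store. This is just a sanity check (replace YourCertName with the name you used in makecert); the Thumbprint shown is the same value you will copy into $thumbprint later.

# List the new certificate from the local store and note its thumbprint
Get-ChildItem cert:\CurrentUser\My |
    Where-Object { $_.Subject -eq "CN=YourCertName" } |
    Format-List Subject, Thumbprint, NotAfter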

Upload Certificate to Azure Subscription(s)

Generally you will not want to share certificates with others. Any certificate you use must be in your local certificate store (Manage Computer Certificates). The same certificate must also be uploaded to the portal and associated with each subscription you wish to manage from your machine.

From your local machine where you created the certificate in the above step:

  • Log in to the Azure Portal with an email address that is associated with the subscription you want to use from your own machine.
  • Scroll to the bottom of the left pane and choose “SETTINGS”


  • Choose “MANAGEMENT CERTIFICATES”


  • Click on the “UPLOAD” button in the middle of the bar at the bottom of the screen.


  • In the “Upload a management certificate” dialog navigate to the location specified in the last parameter above, or %windir%\system32 if you didn’t specify a location. Choose the .cer file you just created with makecert (or export a certificate from the local certificate store – just make sure it has the right properties). If you have multiple subscriptions there is a 2nd drop-down box where you need to choose the subscription that the certificate will be associated with.


  • Repeat for any additional subscriptions that you want to manage with the same certificate (or create one certificate per subscription for additional security granularity).

Install and Configure the Azure PowerShell Cmdlets

Follow the steps here to install the Azure Cmdlets. Basically you are selecting “Azure PowerShell” from the Web Platform Installer. You can also check in the Web Platform Installer for updated versions of the cmdlets.

A very common setting that many admins set is the RemoteSigned Execution Policy. This is less secure than AllSigned or Restricted but allows you to use most downloaded scripts.

Open Windows Azure PowerShell with the “run as admin” option and run:

Set-ExecutionPolicy RemoteSigned -Force
Get-ExecutionPolicy -List

If you see errors when setting the execution policy, search on your specific error or start with this blog: Set-ExecutionPolicy : Windows PowerShell updated your execution policy successfully, but the setting is overridden by a policy defined at a more specific scope!!! You may need to open “Edit Group Policy” (in Windows 8 that opens the Local Group Policy Editor) and make a change.  Sometimes you may need to set each individual scope, but process scope settings go back to the default when the process is closed:

Set-ExecutionPolicy RemoteSigned -Scope Process -Force

Then import the Azure cmdlets:

Import-Module Azure

You can close the PowerShell window; you no longer need to “run as admin”.

Enable PowerShell to use a Subscription via a Certificate

Repeat this section on each machine that will be used to execute PowerShell code. Also repeat for additional subscriptions on each machine.

Open Windows Azure PowerShell. Optionally type ISE to open the Integrated Scripting Environment where you can edit, save, and run collections of cmdlets.

First, set some variables. You will need to copy some basic settings from the Azure Management Portal. On the far left side of the portal, scroll all the way to the bottom and choose “SETTINGS” and “MANAGEMENT CERTIFICATES” (see the “Upload Certificate to Azure Subscription(s)” section of this blog for more details – you are copying from the same place where you uploaded the certificate). Choose the certificate you just uploaded. Don’t worry if the numbers are cut off on the screen; if you highlight and copy, you will get the whole value, even the part that doesn’t show on the screen. Replace the $subID and $thumbprint below – do not update $myCert as it is built from your other variables. Execute the code in the PowerShell window.

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "11111111-2222-3333-4444-555555555555"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "1234567891234567891234567891234567891234"
$myCert = Get-Item cert:\CurrentUser\My\$thumbprint

Now set the subscription name you will use to refer to this subscription from this machine. In most cases you will choose the NAME of the subscription from the portal but that is not required. The matching between your machine’s knowledge of the subscription and the subscription on Azure is done via the SUBSCRIPTION ID. Update $localSubName below and execute the code in the PowerShell window. Note that the local subscription name is case-sensitive.

#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
$localSubName = "MyFavSub"

Now that you have set the values for your own environment, run the code to actually update your machine’s knowledge of the subscription. Note that I used the back tick “`” to specify that the command continues on a new line.

Set-AzureSubscription -SubscriptionName $localSubName `
    -SubscriptionId $subID -Certificate $myCert

Some operations rely on a default storage account, so you may want to set the default storage account for each subscription.

#optionally set "current" storage account for this sub
$defaultStorageAccount = 'MyFavStorageAccount'
Set-AzureSubscription -SubscriptionName $localSubName `
-CurrentStorageAccount $defaultStorageAccount

Next you can set the default subscription that you will start with when you open PowerShell on this machine (note that we’ve changed from the Set cmdlet to the Select one):

Select-AzureSubscription -Default $localSubName

You can change which of the configured subscriptions is the current one:

Select-AzureSubscription -Current $localSubName

Check to see which subscription you are currently using:

Get-AzureSubscription -Current
(Get-AzureSubscription -Current).SubscriptionName

Verify that you can connect and list the services associated with the current subscription:

Get-AzureService | select ServiceName

Look at the Local Configuration

Now let’s look at what got updated on the local machine.

Open File Explorer and go to %appdata%\Windows Azure PowerShell. Open WindowsAzureProfile.xml in Notepad or your favorite editor. Here are a few of the key values for each subscription you have mapped on your machine:

IsDefault tells you which one is the default subscription for your machine:

<IsDefault>true</IsDefault>

The thumbprint id is stored as the ManagementCertificate:

<ManagementCertificate>1234567891234567891234567891234567891234</ManagementCertificate>

The local name you chose for the subscription is stored in Name (to avoid confusion, choose the name used in the portal):

<Name>MyFavSub</Name>

The subscription id is stored in SubscriptionId:

<SubscriptionId>11111111-2222-3333-4444-555555555555</SubscriptionId>
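You can also read those values with a few lines of PowerShell rather than opening an editor. This is a sketch: the element names match the snippets above, but the ProfileData/Subscriptions wrapper structure is an assumption that may vary across cmdlet versions.

# Sketch: list the subscriptions your machine knows about
$profilePath = Join-Path $env:APPDATA 'Windows Azure PowerShell\WindowsAzureProfile.xml'
[xml]$azureProfile = Get-Content $profilePath
$azureProfile.ProfileData.Subscriptions.AzureSubscriptionData |
    Select-Object Name, SubscriptionId, IsDefault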

Remove Subscription

If you need to remove a subscription from your machine, whether because you no longer have access to it or because you want to change one of the properties such as the name or which certificate you use, you can use Remove-AzureSubscription. This updates your local %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml.

#RemoveSub
#Remove my machine's knowledge of a subscription 
#Removes info from %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
Remove-AzureSubscription -SubscriptionName MyFavSub

Sample Script

Here is a handy dandy cut/paste version of the above PowerShell code to add a subscription and make it your default and current subscription:

#copy SUBSCRIPTION ID from portal
#lower left, settings, management certificates
$subID = "YourOwnSubID"
#copy THUMBPRINT from portal
#lower left, settings, management certificates
$thumbprint = "YourCertThumbprint"
$myCert = Get-Item cert:\CurrentUser\My\$thumbprint
#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
$localSubName = "YourSubscriptionName"
#optionally set "current" storage account for this sub
$defaultStorageAccount = 'OptionalDefaultStorage'
Set-AzureSubscription -SubscriptionName $localSubName `
    -SubscriptionId $subID -Certificate $myCert
Set-AzureSubscription -SubscriptionName $localSubName `
    -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription -Default $localSubName
Select-AzureSubscription -Current $localSubName
Get-AzureSubscription -Current
(Get-AzureSubscription -Current).SubscriptionName

You are Ready for PowerShell Gooey Goodness!

Woohoo! Now you can access your Azure subscriptions from your machine without entering ids and passwords. You can automate, simplify, and standardize any Azure activity that has an associated cmdlet! Happy PowerShelling!

Sample PowerShell Script: HDInsight Custom Create

This is a working script I use to create various HDInsight clusters. For a really reproducible, automated environment you would want to put this into a .ps1 script that accepts parameters (see here for an example). However, you may find the method below good for learning and experimenting. Replace all the “YOURxyz” sections with your actual information. Beware of oddities introduced by cut/paste such as spaces being replaced by line breaks or quotes being replaced by smart quotes. The # is a comment, some commands that you rarely run are commented out so remove the # to run them if you need them.

# This PowerShell script is meant to be a cut/paste of specific parts, it is NOT designed to be run as a whole.

# Do once after you install the cmdlets
#Get-AzurePublishSettingsFile
#Import-AzurePublishSettingsFile C:\Users\YOURDirectory\Downloads\YOURName-credentials.publishsettings

# Use if you admin more than one subscription
#Get-AzureAccount # This may be needed to log in to Azure
Select-AzureSubscription -SubscriptionName YOURSubscription
Get-AzureSubscription -Current

# Many things are easier in the ISE
ise

###############################################
### create clusters ###
###############################################

# Add your specific information here
# Previous failures may make a name unavailable for a while – check to see if previous cluster was partially created
$ClusterName = "YOURNewHDInsightClusterName" #the name you will give to your cluster
$Location = "YOURDataCenter" #cluster data center must be East US, West US, or North Europe (as of December 2013)
$NumOfNodes = 1 #start small
$StorageAcct1 = "YOURExistingStorageAccountName" #currently must be in same data center as the cluster
$DefaultContainer = "YOURExistingContainerName" #already exists on the storage account

# These variables are automatically set for you
$FullStorage1 = "${StorageAcct1}.blob.core.windows.net"
$Key1 = Get-AzureStorageKey $StorageAcct1 | %{ $_.Primary }
$SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
$Cert = Get-AzureSubscription -Current | %{ $_.Certificate }
$Creds = Get-Credential -Message "New admin account to be created for your HDInsight cluster" #this prompts you

###############################################
# Sample quick create
###############################################
# Equivalent of quick create
# The ` specifies that the cmd continues on the next line, beware of artificial line breaks added during cut/paste from the blog
New-AzureHDInsightCluster -Name $ClusterName -ClusterSizeInNodes $NumOfNodes -Subscription $SubID -Location "$Location" `
-DefaultStorageAccountName $FullStorage1 -DefaultStorageAccountKey $Key1 -DefaultStorageContainerName $DefaultContainer -Credential $Creds

###############################################
# Sample custom create
###############################################
#https://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management
# Most params are the same as quick create, use a new cluster name
# Pass in a 2nd storage account, a SQLAzure db for the metastore (assume same db for Oozie and Hive), add Avro library, some config values
# Execute all the variable settings from above

# This value is set for you, don't change!
$configvalues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration'

# Add your specific information here
$ClusterName = "YOURNewHDInsightClusterName"
$StorageAcct2 = "YOURExistingStorageAccountName2"
$MetastoreAzureSQLDBName = "YOURExistingSQLAzureDBName"
$MetastoreAzureServerName = "YOURExistingSQLAzureServer.database.windows.net" #gives a DNS error if you don't use the full name
$configvalues.Configuration = @{ "hive.exec.compress.output"="true" }  #this is an example of a config value you may pass in

# These variables are automatically set for you
$FullStorage2 = "${StorageAcct2}.blob.core.windows.net"
$Key2 = Get-AzureStorageKey $StorageAcct2 | %{ $_.Primary }
$MetastoreCreds = Get-Credential -Message "existing id/password for your SQL Azure DB (metastore)" #This prompts for the existing id and password of your existing SQL Azure DB

# Add a config file value
# Add AVRO SerDe libraries for Hive (on storage 1)
$configvalues.AdditionalLibraries = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount'
$configvalues.AdditionalLibraries.StorageAccountName = $FullStorage1
$configvalues.AdditionalLibraries.StorageAccountKey = $Key1
$configvalues.AdditionalLibraries.StorageContainerName = "hivelibs" #container called hivelibs must exist on specified storage account
# Create custom cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes $NumOfNodes `
| Set-AzureHDInsightDefaultStorage -StorageAccountName $FullStorage1 -StorageAccountKey $Key1 -StorageContainerName $DefaultContainer `
| Add-AzureHDInsightStorage -StorageAccountName $FullStorage2 -StorageAccountKey $Key2 `
| Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType OozieMetastore `
| Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType HiveMetastore `
| Add-AzureHDInsightConfigValues -Hive $configvalues `
| New-AzureHDInsightCluster -Subscription $SubID -Location "$Location" -Name $ClusterName -Credential $Creds

###############################################
# get status, properties, etc.
###############################################
#$SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
Get-AzureHDInsightProperties -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName -name YOURClusterName

###############################################
# remove cluster
###############################################
#Remove-AzureHDInsightCluster -Name $ClusterName -Subscription $SubName



Your First HDInsight Cluster–Step by Step

Small Bites of Big Data from AZURECAT
Big Data Tech Training Series #1
Cindy Gross | Murshed Zaman

Sometimes it is just hard to get started. Have you been putting off your first foray into Hadoop? Are you not sure where to begin? Let’s get really basic.

Prerequisites:

Log on to the Windows Azure Portal https://manage.windowsazure.com

Go to storage https://manage.windowsazure.com/#Workspaces/StorageExtension/storage. In the portal choose the STORAGE icon on the left, then click on +NEW at the bottom. That opens a QUICK CREATE window. Enter a unique name for your storage account, such as sqlcatwomanrules; only lower case letters and numbers are allowed. Create the storage account in a location that is available to HDInsight (as of November 2013 that’s East US, West US, and North Europe). Do NOT choose an affinity group. If you choose to “Enable Geo-Replication” there will be an extra charge; it’s probably not necessary for a demo/test account as you have a limited amount of credit in the trial subscription.
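If you would rather script this step, a hedged equivalent using the Azure cmdlets from the earlier posts might look like the line below; the account name and location are placeholders and must follow the same rules (lower case letters and numbers, an HDInsight-supported region):

# Sketch: create the storage account from PowerShell instead of the portal
New-AzureStorageAccount -StorageAccountName "sqlcatwomanrules" -Location "West US"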


Now click on the HDInsight icon just below the storage icon. Choose QUICK CREATE. Enter a unique name for your HDInsight cluster. For a demo choose 4 data nodes. Enter a password that contains upper and lower case letters, a number, and a special character. Choose the storage account you created above. Once you click on “CREATE HDINSIGHT CLUSTER” it will take several minutes for the cluster to be deployed.


Once it completes you are ready to use your cluster!


If you won’t be using the cluster right away, go ahead and delete it (look for the icon at the bottom of the portal) to save compute time and money. You can easily recreate it when you need it.


Look for more blogs soon on customizing your cluster with CUSTOM CREATE or PowerShell and on automating deployment and jobs with PowerShell. In the meantime see if you can get Invoke-Hive working from PowerShell for some simple Hive commands such as:

Invoke-Hive "select * from hivesampletable limit 10"
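If the bare command complains that no cluster is selected, point the session at your cluster first. A minimal sketch, assuming the HDInsight cmdlets from the earlier posts are installed; YOURClusterName is a placeholder:

# Select the cluster for this session, then run the Hive query
Use-AzureHDInsightCluster -Name "YOURClusterName"
Invoke-Hive -Query "select * from hivesampletable limit 10"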

Big Data Technical Series:

Your First HDInsight Cluster–Step by Step

Automating HDInsight cluster creation with PowerShell