Befriending Dragons

Turn Scary Into Attainable



Big Data for the SQL Eye

SQL Server is a great technology – I’ve been using it since 1993 when the user interface consisted of a query window with the options to save and execute and not much else. With every release there’s something new and exciting and there’s always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.

So what is a SQL geek to do in the face of all the new options out there – many under the umbrella of Big Data (distributed processing)? Why, just jump right on in and learn it! No one can know all the pieces because it’s a big, fluid, messy collection of “things”. But don’t worry about that; start with one thing and build from there. Even if you never plan to implement a production Big Data system you need to learn about it – because if you don’t have some hands-on experience with it, someone who does will be influencing the decision makers without you. For a SQL pro I suggest Hive as that easy entry point. Spark SQL may eventually fill that role, but for now Hive is the gentlest on-ramp for most SQL professionals.

For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below, the file is attached.

Look, it’s SQL!

SELECT score, fun
FROM toDo
WHERE type = 'they pay me for this?';

Here’s how that code looks in Visual Studio, along with links to where you find the output and logs:


And yet it’s more!

CREATE EXTERNAL TABLE IF NOT EXISTS toDo
(fun STRING,
score INT COMMENT 'rank the greatness',
type STRING)
COMMENT 'two tables walk into a bar....'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/demo/';


A mix of old and new

-- read some data
SELECT 'you cannot make me ', score, fun, type
FROM toDo
WHERE score <= 0
ORDER BY score;

SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
DISTRIBUTE BY score SORT BY score;


That’s Hive folks!

Hive

on Hadoop
on HDInsight
on Azure
Big Data in the cloud!

Hadoop Shines When….
(refer to http://blogs.msdn.com/b/cindygross/archive/2015/02/25/master-choosing-the-right-project-for-hadoop.aspx)

Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iteration
Unknown unknowns
Flexible scaling
Data driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival

Hadoop Anti-Patterns….

Replacing a system whose pain points don’t align with Hadoop’s strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system


Azure has so much more

Go straight to the business code
Scale storage and compute separately
Open Source
Linux
Managed and unmanaged services
Hybrid
On-demand and 24×7 options
SQL Server

It’s a Polyglot

Stream your data into a lake
Pick the best compute for each task

And it’s Fun!

I hope you enjoyed this small bite of big data!


BigDataForTheSQLEye.zip




Create HDInsight Cluster in Azure Portal

Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the "how to". I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of "Create HDInsight Cluster in Azure Portal" on my YouTube channel Small Bites of Big Data.

Playlist Getting Started with HDInsight: https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv 

  1. Why HDInsight: https://youtu.be/J9KzIShLeD8
  2. Azure Subscription: https://youtu.be/lSxMtmRE114
  3. Azure Storage – WASB: https://youtu.be/6OdDDmdaVVE
  4. Metastore: https://youtu.be/1Og_eftYVpA
  5. Create HDInsight: https://youtu.be/SysIo3LwONk
  6. Hive Query: https://youtu.be/DRAuOXsuec0
  7. Load Demo Data: https://youtu.be/XyiOpRPjfUs
  8. Pricing, Automation, and Wrapup: https://youtu.be/78YowrOnNGM

PowerPoint deck: http://www.slideshare.net/cindygross1/create-hd-insightfeb2015 


Why HDInsight?

HDInsight is Hadoop on Azure as a service.

  • Easy, cost effective, changeable scale out data processing
  • Lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters
  • Hortonworks HDP is one of the 3 major Hadoop distributors, the most purely open source
  • HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
  • Metastore (HCatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Hive is a Hadoop component that makes data look like rows/columns for data warehouse type activities

It offers the standard advantages of Hadoop:

  • Scale-out
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

In addition you have the advantages of Hadoop in the cloud:

  • Instantly access data born in the cloud
  • Easily, cheaply load, share, and merge public or private data
  • Data exists independently across clusters (separation of storage and compute) via WASB on Azure storage accounts

Recording of why HDInsight on YouTube

Azure Subscription

You have many options to obtain a Microsoft Azure subscription, including a free trial, MSDN subscriber benefits, or a pay-as-you-go account.

 

Login to Azure Subscription

1. Log in to the Azure portal: https://manage.windowsazure.com

2. Use a Microsoft Account http://www.microsoft.com/en-us/account/default.aspx
Note: Some companies have federated their accounts and can use company accounts.


Choose Subscription

Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.


Add Accounts

Option: Add more Microsoft Accounts as admins of the Azure Subscription.

1. Choose SETTINGS at the very bottom on the left.

2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.

3. Enter a Microsoft Account or federated enterprise account that will be an admin.


Recording of getting started with an Azure subscription on YouTube

Azure Storage – WASB

I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialog gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.

Create a Storage Account

1. Click on STORAGE in the left menu then NEW.

2. URL: Choose a lower-case storage account name that is unique within *.core.windows.net.

3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.

4. REPLICATION: Locally redundant stores fewer copies and costs less.


Repeat if you need additional storage.

Create a Container

1. Click on your storage account in the left menu then CONTAINERS on the top.

2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.

3. Enter a lower-case NAME for the container, unique within that storage account.

4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.

This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.


WASB

Additional information about storage, including details on Windows Azure Storage Blobs (WASB) is on http://SmallBitesOfBigData.com.
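For example, a Hive table can point directly at a WASB location so the data lives in the storage account rather than inside any one cluster. Here is a minimal sketch reusing the toDo demo table from the "Big Data for the SQL Eye" post above; the container and storage account names are placeholders:

CREATE EXTERNAL TABLE IF NOT EXISTS toDo
(fun STRING,
score INT COMMENT 'rank the greatness',
type STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/demo/';

Drop the cluster and the data under that location is still there, ready for the next cluster that knows the storage account.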


Recording of creating an Azure storage account and container on YouTube.

Metastore (HCatalog)

In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.

Create a Metastore aka Azure SQL DB

Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances it is not officially tested or supported.
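Once a cluster is pointed at this metastore, the Hive definitions it creates survive the cluster itself. A rough sanity check from a newly created cluster might look like this (the table name is just the demo table used earlier):

SHOW TABLES;
-- the definition is returned from the metastore database, not from the cluster
DESCRIBE FORMATTED toDo;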

1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.

2. Choose a NAME unique to your server.

3. Click on the “?” to help you decide what TIER of database to create.

4. Use the default database COLLATION.

5. If you choose an existing SERVER you will share sysadmin access with other databases.


You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.

Firewall Rules

To refer to the metastore from automated cluster creation scripts such as PowerShell, your workstation must be added to the firewall rules.

1. Click on MANAGE then choose YES.

2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.


Recording of creating the metastore on YouTube.

Create the HDInsight Cluster

Now that we have the pre-requisites done we can move on to creating the cluster.

  • Quick Create through the Azure portal is the fastest way to get started with all the default settings.
  • The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
  • You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
  • For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
  • A cluster will take several minutes to create; the type and size of the cluster have little impact on the creation time.

Quick Create Option

For your first cluster choose a Quick Create.

1. Click on HDINSIGHT in the left menu, then NEW.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key value pairs (HBase) or alerting (Storm).

3. Choose a NAME unique in the azurehdinsight.net domain.

4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.

5. Choose the admin PASSWORD.

6. The location of the STORAGE ACCOUNT determines the location of the cluster.


Custom Create Option

You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.

New

1. Click on HDINSIGHT in the left menu, then NEW in the lower left.

2. Choose CUSTOM CREATE.


Basic Info

1. Choose a NAME unique in the azurehdinsight.net domain.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).

3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.

4. In most cases you will want the default VERSION.


Size and Location

1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.

2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.

3. The storage account(s), metastore, and cluster will all be in the same REGION.


Cluster Admin

1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.

2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.


Metastore (HCatalog)

On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog).

1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.

2. Choose the SQL Azure database you created earlier as the METASTORE.

3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.


Default Storage Account

Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.

1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.

2. Specify the NAME of the existing storage account.

3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.

4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.


Additional Storage Accounts

If you specified additional storage accounts, you will see this screen.

1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.

2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.


Script Actions

You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.

1. Click “add script action” to show the remaining parameters.

2. Enter a unique NAME for your action.

3. The SCRIPT URI points to code for your custom action.

4. Choose the NODE TYPE for deployment.


Create is Done!

Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.


Recording of HDInsight quick and custom create on YouTube

Query with Hive

For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as "just another data source". That includes Azure Machine Learning, Power BI, Excel, and Tableau.

Hive Console

The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL).

1.  Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net//

2.  Click on Hive Editor.


Query Hive

The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.

1. Optionally edit or replace the default query:
Select * from hivesampletable LIMIT 10;

2. Optionally name your query to make it easier to find in the job history.

3. Click Submit.

Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
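If you want something a little more interesting than the default SELECT, a small aggregation shows the batch behavior nicely. This assumes the standard sample schema that ships with HDInsight (column names such as devicemake may differ between versions):

-- top 10 device makes in the sample data
SELECT devicemake, COUNT(*) AS hits
FROM hivesampletable
GROUP BY devicemake
ORDER BY hits DESC
LIMIT 10;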


View Hive Results

1. Click on the Query you just submitted in the Job Session. This opens a new tab.


2. You can see the text of the Job Query that was submitted. You can Download it.

3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.

4. The Job Log has details including errors if there are any.

5. Additional information about the job is available in the upper right.


View Hive Data in Excel Workbook

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. On the DATA tab choose From Other Sources, then From Data Connection Wizard.


View Hive Data in PowerPivot

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.

4. Choose Get External Data then Others (OLEDB/ODBC).

Now you can combine the Hive data with other data inside the tabular PowerPivot data model.


Recording of querying Hive on YouTube

Load Demo Data

In the cloud you don’t have to load data to Hadoop; you can load data to an Azure Storage Account. Then you point your HDInsight or other WASB-compliant Hadoop cluster to the existing data source. There are many ways to load data; for this demo we’ll use CloudXplorer.

You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.

In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
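A Hive table over that data would reference the bacon directory, not the individual files. A minimal sketch, assuming a simple comma-delimited layout (the column name is made up for illustration):

CREATE EXTERNAL TABLE IF NOT EXISTS bacon
(tastiness STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://demo@nealhadoop.blob.core.windows.net/bacon/';

Both bacon1.txt and bacon2.txt are picked up automatically because the LOCATION is the directory.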

Windows Azure Storage Explorers (2014)


Recording of loading demo data on YouTube

WrapUp

Once you have created the HDInsight cluster you can use it, play with it, and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can use that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for the usage of it, so make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it. Check the website for the most recent information.

Pricing


Automate with PowerShell

With PowerShell, .NET, or the cross-platform command-line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.

Automation allows you to standardize, and version control lets you track your configurations over time.

Sample PowerShell Script: HDInsight Custom Create http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx. If your HDInsight and/or Azure cmdlets don’t match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone).”


Recording of Pricing, Automation, and Wrapup on YouTube

Summary

  • HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux
  • Easy, cost effective, changeable scale out data processing for a lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters via WASB
  • Metastore (HCatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or private data
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

I hope you enjoyed this Small Bite of Big Data! Happy Hadooping!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com



HDInsight Big Data Talks from #SQLPASS

SQL PASS Summit 2013 was another great data geek week! I chatted with many of you about Big Data, Hadoop, HDInsight, architecting solutions, SQL Server, data, BI, analytics, and general geekiness – great fun! This time around I delivered two talks on Hadoop and HDInsight – the slides from both are attached.

Zero to 60 with HDInsight takes you from an overview of Big Data and why it matters (zero) all the way through an end to end solution (60). We discussed how to create an HDInsight cluster with the Azure portal or PowerShell and talked through the architecture of the data and analysis behind the release of Halo 4. We talked about how you could use the same architectural pattern for many projects and walked through Hive and Pig script examples. We finished up with how to use Power Map (codename GeoFlow) over that data to gain new insights and improve the game experience for the end user.

The next session I co-presented with HDInsight PM Dipti Sangani: CAT: From Question to Insight with HDInsight and BI. We went deeper this time. Not only did we present an end to end story of how our own internal Windows Azure SQL Database team uses telemetry to improve your experience with SQL Server in Azure PaaS, but we also dug into demos of Hive, Pig, and Oozie. We also gave another archetypical design scenario that will apply to many of your own scenarios and talked about how HDInsight fits with SQL Server and your other existing infrastructure. The deck covers your cloud and on-premises options for Hadoop on Windows including HDInsight Service, Hortonworks HDP for Windows, OneBox, and PDW with Polybase.

Please let me know if you have any questions from the talks or just general HDInsight questions!

 

PASSSummit2013BigData.zip



Self-Service BI Works!

When I talk to people about adding self-service BI to their company’s environment I generally get a list of reasons why it won’t work. Some things I commonly hear:

  • I can’t get anyone in IT or on the business side to even try it.
  • The business side doesn’t know how to use the technology.
  • This threatens my job.
  • I just don’t know where to start either politically/culturally or with the technology.
  • I have too many other things to do.
  • How can it possibly be secure, allow standardization, or result in quality data and decisions?
  • That’s not the way we do things.
  • I don’t really know what self-service BI means.

#PASSBAC 2013 Cindy and Eduardo 

So what is a forward-thinking BI implementer to do? Well, Intel just went out and did it, blowing through the supposed obstacles. Eduardo Gamez of Intel’s Technology Manufacturing Engineering (TME) group interviewed business folks to find those who were motivated for change, found a great pilot project with committed employees, and drove the process forward. They put a “sandbox” environment up for the business to use and came up with a plan for monitoring the sandbox activity to find models and reports worth adding to their priority queue for enterprise BI projects.

The business creates their own data models and their own reports for both high and low priority items. IT provides the infrastructure and training, including products like Analysis Services, PowerPivot, Power View, SharePoint, Excel, SQL Server, and various data sources. The self-service models and reports are useful to the business: they reduce manual efforts, deliver the reports the business wants much faster, and ultimately drive better, more agile business decisions. If a model isn’t quite right after the first try, they can quickly modify it.

The same models and reports are useful to IT. They serve as refined, complete requirements docs that shorten the time to higher quality enterprise models and reports, they free up IT resources to build a more robust infrastructure, and they allow IT to concentrate on projects that require specialized IT knowledge. Everyone wins with a shorter time to decision, higher quality decisions, and a significant impact on the bottom line.

Learn more about how Intel TME is implementing self-service BI:

Eduardo (eduardo.m.gamez@intel.com) and I (cgross@microsoft.com or @SQLCindy) are happy to talk to you about Self-Service BI – let us know what you need to know!


How_Intel__Integrates_Self-Service_BI_with_IT_for_Better_Business_Results_[DAV-208-M].zip



24HOP/SQLRally – Fitting Microsoft Hadoop Into Your Enterprise BI Strategy

24HOP/SQLRally – Fitting Microsoft Hadoop Into Your Enterprise BI Strategy

 

Small Bites of Big Data

Cindy Gross, SQLCAT PM

The world of #bigdata and in particular #Hadoop is going mainstream. At 24HOP 2012 I talked about how a SQL Server professional fits into this big data world.

Hadoop generally falls into the NoSQL realm. NoSQL uses basically available, soft-state, eventual consistency (BASE) instead of requiring immediate, 100% consistency via ACID properties. This allows certain types of queries/operations to be much faster since locking and consistency rules don’t have to be followed. This works well when any given program/query looks at the entire dataset and does not try to filter the data or query a subset of it. Basically with Hadoop you take data that may not have a known, consistent structure and impose a structure on it for the duration of your MapReduce program (query). This is all based on the concept of scaling out rather than up, with new commodity (low end enterprise server) hardware being added to an existing Hadoop cluster to meet dynamic demands. With Hadoop you are spreading hardware costs out over a longer time, possibly paying less overall for hardware, and shifting your IT costs to different departments and parts of the lifecycle.

Hive is a database which sits on top of Hadoop’s HDFS (Hadoop Distributed File System). The data stays in HDFS but Hive stores metadata about the imposed structure and may store intermediate files. HiveQL looks a lot like TSQL and like most SQL languages makes an effort to align with the SQL standard when it makes sense. HiveQL will generate (usually multiple) MapReduce jobs to take an entire large result set and allow easier use of filters, joins, and aggregates. It’s still going to be slower than a relational database for most filters and joins but it allows access to data that may be too expensive (for whatever reason) to put into a relational or multi-dimensional database at this point in the lifecycle.
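To a SQL Server professional the surface area feels familiar. A hypothetical filter-join-aggregate (the table and column names are made up) reads almost exactly like its T-SQL equivalent:

SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON o.customerid = c.customerid
WHERE o.orderdate >= '2012-01-01'
GROUP BY c.region
ORDER BY total_amount DESC;

Behind the scenes Hive compiles this into one or more MapReduce jobs rather than an index-driven query plan, so expect batch latencies rather than OLTP response times.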

You may keep your source data outside HDFS and bring it in only for the duration of a project. You can use Sqoop to move data to/from a relational database (there is a Sqoop driver for SQL Server and PDW), pull data from blob stores like Windows Azure storage and Amazon S3, use secure FTP, or query OData sources like the Windows Azure DataMarket. The Hive ODBC driver, often via the Excel Hive Add-in, and Sqoop can be used to pull data from Hadoop into other tools or systems.

So far I’ve been talking as if big data = Hadoop. However, that’s not really true. Big data is data that is too big, expensive, or complex to process using your existing environment (often traditional RDBMSs with expensive SANs and high end hardware). While MapReduce, most commonly via Hadoop, is a very common solution, it’s not the only one. There are streaming technologies like StreamInsight and HStreaming, machine learning like Mahout, massively parallel processing databases like PDW, and more. Generally big data at least starts out as unstructured or semi-structured, or perhaps of variable structure. It may flow very quickly with the need to process large amounts of data in a very small time window where decisions have to be made in real time.

At its core Hadoop has the file system HDFS which sits on top of the Windows or Linux file system and allows data to be mapped over many nodes in a Hadoop cluster. A head node maintains metadata about where each piece of data resides. MapReduce programs “map” the data to the many nodes then reduce the output based on the required logic, resulting in an output file that is the end result of the program/query. Other pieces of the Hadoop ecosystem may include things like Hive, HBase, HCatalog, Lucene, Mahout, Zookeeper and more.

So when would you use Hadoop? It’s good for exploring data, finding out what you don’t know. If you need to know your unknown unknowns or look for unexpected correlations, Hadoop may be what you need. It allows you to be very flexible and not lock yourself into a rigid schema until after you’ve fully explored your options. Some examples of common use cases are risk modeling, machine learning, cyber security, sentiment analysis, recommendation engines, log analysis, and sensor data. However, don’t think big data is the solution to all your problems. Often it is used to find and define the most useful bits of data or find the most useful algorithms. Then a more traditional design process may be used to create an RDBMS or multi-dimensional system for faster querying during day to day operations. Think of Hadoop as the foundation for a very specialized type of database that meets very specific needs, not as a replacement for relational. It’s important to note that not every VLDB is a fit for big data. “Big” is somewhat of a misnomer; size is only one of the characteristics and is relative to what your environment is prepared to handle.

Often you ask about the 4 “Vs” when deciding whether to use Hadoop – volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don’t know what I don’t know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, or unused data.

  • Volume = More data than the current environment can handle with vertical scaling; the need to make use of data that is currently too expensive to use
  • Velocity = Small decision window compared to data change rate, ask how quickly you need to analyze and how quickly data arrives
  • Variety = Many different formats that are expensive to integrate, probably from many data sources/feeds
  • Variability = Many possible interpretations of the data

Microsoft is taking the existing Apache Hadoop code and making sure it runs on Windows. We have contributed HDFS, MapReduce, and Hive code back to the Apache open source community. More is on the way. We are adding things like the Excel add-in for the Hive ODBC driver, JavaScript, cloud (http://HadoopOnAzure.com), C# samples, etc. Where Microsoft is often seen as adding value is in making Hadoop more available and allowing the reuse of existing skill sets. We offer self service, visualization, easier data movement, elasticity, and familiarity for faster completion of projects. This allows data scientists (people who know the business and statistics first and tech second), data teams, information workers, and anyone seeking insights through BI to do their job better and faster.

We offer visualization through PowerPivot, Power View, and the Excel Hive ODBC Add-in. For many used to Hadoop on Linux these things are new and exciting, even though they seem familiar to many in the SQL Server world.

In the end, big data is really all about discovering insights and then taking action. If you don’t act on what you find, why did you spend all that effort mining the data? Sometimes the action is to go back and refine your process or change the way you looked at the data, but the end goal should be to make decisions that impact your business. Big data should add value to the business proposition.

References

My full presentation from #24HOP can be downloaded below. There are notes on many of the slides with further details. The recording from 24 Hours of PASS will be available at http://www.sqlpass.org/LearningCenter/SessionRecordings/24HoursSpring2012.aspx. Download their player then choose the session by title – “Fitting Microsoft Hadoop into your Enterprise BI Strategy (Cindy Gross)”.

I updated the attachment to include the updated slides from my May 11, 2012 talk at #SQLRally. There are few changes other than the background.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

MSHadoopEntBIStrategy24HOPandSQLRally2012.zip