Befriending Dragons

Turn Scary Into Attainable



The Magic of Augmented Reality and HoloLens

July is Microsoft’s //oneweek celebration, when the full-time employees spend a week hacking apps that they would like to work on. I could not pass this opportunity, and decided to learn some…

Source: The Magic of Augmented Reality and HoloLens




Windows Hyper-V Dragon

After all these years soaring through the data world, from SQL Server 1.11 all the way through today’s modern Big Data technologies, I am making a flight adjustment. My next adventure will be in the land of the Windows Hypervisor: Hyper-V. Last week I started working with my new team and I am already learning to corral and wield a whole new world of acronyms, technologies, and scenarios. As a software engineer on the quality team I’ll help define and implement test scenarios that lead to better customer experiences across multiple products.

I won’t be leaving data behind! This new role has a lot of data aspects and of course the hypervisor underlies many of the world’s data systems! It’s been great working with the #SQLFamily over the years and I look forward to continuing to work with you all!

 



Moving Beyond Unconscious Bias – Good People Matter!

Presented at SQL Saturday Oregon on October 24, 2015

by Julie Koesmarno and Cindy Gross

Good People

We’re good people. As good people we don’t want to think we do things that have negative consequences for others. But sometimes our subconscious can fool us. What we intend isn’t always what happens. We think we’re making a totally rational decision based on our conscious values – but subtle, unconscious bias creeps in. Yes, even for good people. For 20+ years folks at Harvard have been using something called the Implicit Association Test (IAT) to help us identify our biases.

Take this IAT on gender and career – the results may surprise you: https://implicit.harvard.edu/implicit/user/agg/blindspot/tablet.htm

Watch Alan Alda take the test to get a feel for how it works: https://www.youtube.com/watch?v=2RSVz6VEybk


Continue reading at http://befriendingdragons.com/2015/10/26/moving-beyond-unconscious-bias-good-people-matter/ 

 

Slides are attached at the bottom of this post.

MovingBeyondUnconsciousBiasOct2015.pptx



Big Data for the SQL Eye

SQL Server is a great technology – I’ve been using it since 1993 when the user interface consisted of a query window with the options to save and execute and not much else. With every release there’s something new and exciting and there’s always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.

So what is a SQL geek to do in the face of all the new options out there – many under the umbrella of Big Data (distributed processing)? Why just jump right on in and learn it! No one can know all the pieces because it’s a big, fluid, messy collection of “things”. But don’t worry about that, start with one thing and build from there. Even if you never plan to implement a production Big Data system you need to learn about it – because if you don’t have some hands-on experience with it then someone who does have that experience will be influencing the decision makers without you. For a SQL Pro I suggest Hive as that easy entry point. At some point maybe Spark SQL will jump into that gap, but for now Hive is the easiest entry point for most SQL pros.

For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below, the file is attached.

Look, it’s SQL!

SELECT score, fun
FROM toDo
WHERE type = 'they pay me for this?';

Here’s how that code looks from Visual Studio along with the links to how you find the output and logs:

image

And yet it’s more!

CREATE EXTERNAL TABLE IF NOT EXISTS toDo
(fun STRING,
score INT COMMENT 'rank the greatness',
type STRING)
COMMENT 'two tables walk into a bar....'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/demo/';


A mix of old and new

-- read some data
SELECT 'you cannot make me ', score, fun, type
FROM toDo
WHERE score <= 0
ORDER BY score;

SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
DISTRIBUTE BY score SORT BY score;


That’s Hive folks!

Hive

on Hadoop
on HDInsight
on Azure

Big Data in the cloud!

Hadoop Shines When….
(refer to http://blogs.msdn.com/b/cindygross/archive/2015/02/25/master-choosing-the-right-project-for-hadoop.aspx)

Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iterating
Unknown unknowns
Flexible scaling
Data driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival

Hadoop Anti-Patterns….

Replace system whose pain points don’t align with Hadoop’s strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system


Azure has so much more

Go straight to the business code
Scale storage and compute separately
Open Source
Linux
Managed and unmanaged services
Hybrid
On-demand and 24×7 options
SQL Server

It’s a Polyglot

Stream your data into a lake
Pick the best compute for each task

And it’s Fun!

I hope you enjoyed this small bite of big data!


BigDataForTheSQLEye.zip



The Big Data Dragon flies on to Microsoft AzureCAT

“Always in motion is the future” – Yoda

On June 1 I will be moving into a new role on AzureCAT. I tried the small business consulting world with Neal Analytics and it just wasn’t a good fit for me and my passions. So here I go, on to new challenges at Microsoft! I’ll be making the world a better place with the help of Big Data.

And while I’m making changes, I’ll also be moving from Boise, ID to the Redmond, WA area. It’s new adventures all around for me. I’ll miss Boise – my friends, my political battles, the greenbelt and hiking trails, sitting on the patios downtown. And I’m also excited about all the new opportunities I’ll have in my new, blue state.

Bring it on world, I’m ready!

cindygross@outlook.com | @SQLCindy | http://www.linkedin.com/in/cindygross | http://smallbitesofbigdata.com

Cross-published on:

http://befriendingdragons.com/2015/05/07/the-big-data-dragon-flies-on-to-microsoft-azurecat
http://smallbitesofbigdata.com/archive/2015/05/08/the-big-data-dragon-flies-on-to-microsoft-azurecat.aspx



Hadoop Likes Big Files

One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files "big enough" for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

Too small means smaller than the HDFS block size (chunk size); realistically, a file is still small if it is less than several times the chunk size. A very, very rough rule of thumb is that files should be at least 1GB each, with no more than around 10,000 files per table. These numbers, especially the maximum number of files per table, vary depending on many factors, but they give you a reference point. The 1GB figure is based on multiples of the chunk size, while the file count is honestly a bit of a guess based on a typical small cluster.
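To make the rule of thumb concrete, here is a minimal Python sketch that compares a list of file sizes against the two rough thresholds. The function name check_table_files and its parameters are hypothetical, and the defaults simply restate the 1GB and 10,000-file reference points; treat them as starting values to adjust for your own cluster, not as hard limits.

```python
GB = 1024 ** 3

def check_table_files(file_sizes_bytes, min_file_bytes=1 * GB, max_files=10_000):
    """Compare one table's file sizes against the rough rule of thumb."""
    # Files below the minimum size are the ones worth stitching together.
    small = [size for size in file_sizes_bytes if size < min_file_bytes]
    return {
        "file_count": len(file_sizes_bytes),
        "small_file_count": len(small),
        "too_many_files": len(file_sizes_bytes) > max_files,
    }

# Example: one 2GB file and one 512MB file in the same table.
report = check_table_files([2 * GB, 512 * 1024 ** 2])
print(report)
```

The file sizes would come from whatever listing you already have for the table's directory; how you collect them is outside this sketch.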

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and where the internal chunks of the individual files are. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affect how memory is used on each node.

Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track 1000 logical files plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.
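The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration, assuming the name node tracks one entry per logical file plus one entry per chunk and ignoring replication, as the discussion here does. The function name name_node_objects is hypothetical.

```python
MB = 1024 ** 2

def name_node_objects(file_sizes_bytes, chunk_bytes=256 * MB):
    """Count logical files plus physical chunks, ignoring replication."""
    # Ceiling division: a file occupies one chunk per started chunk_bytes.
    chunks = sum((size + chunk_bytes - 1) // chunk_bytes for size in file_sizes_bytes)
    return len(file_sizes_bytes) + chunks

one_big_file = name_node_objects([1024 * MB])        # a single 1GB file
many_small_files = name_node_objects([1 * MB] * 1000)  # 1000 files of 1MB each
print(one_big_file, many_small_files)  # prints "5 2000"
```

The same ~1GB of data costs 5 tracked entries as one file and 2000 entries as 1000 small files, which is the gap that grows quickly at TB or PB scale.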

The other problem comes at the time that the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data it results in more work and slower execution time for each task within each job. Sometimes you will see hard errors when operating system limits are hit related to the number of open files. There is also more internal work involved in reading the larger number of files and combining the data.

Stitching

There are several options for stitching files together.

  • Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
  • INSERT into new Hive tables (directories) which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
  • Use a combiner in Pig to load the many small files into bigger splits.
  • Use the HDFS FileSystem Concat API http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat.
  • Write custom stitching code and make it a JAR.
  • Enable the Hadoop Archive (HAR). This is not very efficient for this scenario but I am including it for completeness.
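As an illustration of the first option, combining files as they land, here is a minimal Python sketch that concatenates small delimited text files from a landing folder into larger numbered files before they are uploaded. It is a sketch under assumptions, not a production mover: the function name stitch_files, the part-NNNNN.csv naming, and the folder layout are all hypothetical, and it assumes every input file ends with a newline and has no header row.

```python
import shutil
from pathlib import Path

def stitch_files(source_dir, target_dir, target_bytes, pattern="*.csv"):
    """Concatenate small files into numbered outputs of roughly target_bytes each."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    outputs = []   # paths of the stitched files written so far
    out = None     # currently open output file
    written = 0    # bytes written to the current output
    try:
        for src in sorted(Path(source_dir).glob(pattern)):
            # Start a new output once the current one has reached the target size.
            if out is None or written >= target_bytes:
                if out is not None:
                    out.close()
                out_path = target / f"part-{len(outputs):05d}.csv"
                out = out_path.open("wb")
                outputs.append(out_path)
                written = 0
            with src.open("rb") as f:
                shutil.copyfileobj(f, out)
            written += src.stat().st_size
    finally:
        if out is not None:
            out.close()
    return outputs
```

A call such as stitch_files("landing", "stitched", target_bytes=1024 ** 3) would aim for outputs of about 1GB each. Because a file is only closed after it passes the target, outputs can overshoot by up to one input file, which is harmless when the inputs are small.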

There are several writeups out there that address the details of each of these methods so I won’t repeat them.

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com




Azure Data Factory: Hub Not Found

You can use the new Azure portal to create or edit Azure Data Factory components. Once you are done you may automate the process of creating future Data Factory components from PowerShell. In that case you can use the JSON files you edited in the portal GUI as configuration files for the PowerShell cmdlets. For example, you may try to create a new linked service using settings from C:CoolerHDInsight.JSON as specified below:

New-AzureDataFactoryLinkedService -ResourceGroupName CoolerDemo -DataFactoryName $DataFactoryName -File C:CoolerHDInsight.JSON

In that case you may see something like this error:

New-AzureDataFactoryLinkedService : Hub: {SomeName_hub} not found.
CategoryInfo                : CloseError: (:) [New-AzureDataFactoryLinkedService], ProvisioningFailedException
FullyQualifiedErrorId   : Microsoft.Azure.Commands.DataFactories.NewAzureDataFactoryLinkedServiceCommand

image

If you check the JSON file that you exported from the portal and referenced in the PowerShell script, you will see it ends with something like this:

        "isPaused": false,
        "hubName": "SomeName_hub"
    }
}

The hubName is currently automatically generated based on the name of the Data Factory and should not be present in the JSON files used by PowerShell. Remove the comma on the line above the hubName and the entire line starting with hubName.

                        ,
        "hubName": "SomeName_hub"

That will leave the end of the file looking something like this:

        "isPaused": false
     }
}

Check out all your other JSON files you are using for Data Factory components and do the same editing for any that have a hubName.
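If you have many JSON files to clean up, the edit can be scripted. The following is a minimal Python sketch, assuming the exported files are valid JSON and that hubName can be dropped wherever it appears. The function names strip_hub_name and clean_file are hypothetical, and note that the file is rewritten with Python's own indentation rather than the portal's original layout.

```python
import json
from pathlib import Path

def strip_hub_name(node):
    """Return a copy of parsed JSON with every "hubName" key removed."""
    if isinstance(node, dict):
        return {key: strip_hub_name(value)
                for key, value in node.items() if key != "hubName"}
    if isinstance(node, list):
        return [strip_hub_name(item) for item in node]
    return node

def clean_file(path):
    """Rewrite one JSON file without hubName; return True if it changed."""
    path = Path(path)
    # utf-8-sig tolerates a byte order mark if the export included one.
    original = json.loads(path.read_text(encoding="utf-8-sig"))
    cleaned = strip_hub_name(original)
    if cleaned == original:
        return False
    path.write_text(json.dumps(cleaned, indent=4) + "\n", encoding="utf-8")
    return True
```

Calling clean_file on each file in the folder that holds your Data Factory definitions removes the property and its preceding comma in one pass, since the JSON is re-serialized rather than edited as text. Keep a backup of the originals before running anything like this.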

NOTE: This applies to Azure Data Factory as of April 2015. At some point the hubName should become a viable parameter usable by PowerShell.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com

 