Befriending Dragons

Turn Scary Into Attainable



Hadoop Likes Big Files

One of the frequently overlooked yet essential best practices for Hadoop is to prefer a smaller number of bigger files over a larger number of small files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

How small is too small? Anything smaller than the HDFS block size (chunk size) is definitely too small; realistically, a file is still small if it is less than several times the chunk size. A very rough rule of thumb is that files should be at least 1GB each and a table should have no more than roughly 10,000 files. These numbers, especially the maximum number of files per table, vary depending on many factors, but they give you a reference point. The 1GB figure is based on multiples of the chunk size, while the 10,000-file figure is honestly a bit of a guess based on a typical small cluster.

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and of where the internal chunks of the individual files are. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affects how memory is used on each node.

Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file, it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk, and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track the logical file plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.

The other problem comes at the time the data is read by a Hadoop job. When the job runs on each node, it loads the files the task tracker identified for it into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data, each task within each job does more work and takes longer to execute. Sometimes you will see hard errors when operating system limits on the number of open files are hit. There is also more internal work involved in reading the larger number of files and combining the data.

Stitching

There are several options for stitching files together.

  • Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
  • INSERT into new Hive tables (directories), which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task (see the sketch after this list).
  • Use a combiner in Pig to load the many small files into bigger splits.
  • Use the HDFS FileSystem Concat API http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat.
  • Write custom stitching code and make it a JAR.
  • Create Hadoop Archive (HAR) files. This is not very efficient for this scenario but I am including it for completeness.
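
To make the Hive option above concrete, here is a minimal sketch. It assumes a source table made up of many small files (events_raw here, a hypothetical name); the hive.merge settings are standard Hive parameters, but the specific size values are illustrative only.

    -- Encourage Hive to merge its output into larger files (values are illustrative)
    SET hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
    SET hive.merge.mapredfiles=true;              -- merge small files produced by map-reduce jobs
    SET hive.merge.smallfiles.avgsize=256000000;  -- if the average output file is under ~256MB, trigger a merge pass
    SET hive.merge.size.per.task=1000000000;      -- aim for roughly 1GB files from the merge step

    -- Rewrite the many small files into a new table (directory) backed by larger files
    CREATE TABLE events_stitched LIKE events_raw;
    INSERT OVERWRITE TABLE events_stitched
    SELECT * FROM events_raw;

Once the stitched table is verified, the original small-file table can be dropped or archived.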

There are several writeups out there that address the details of each of these methods so I won’t repeat them.

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com





Master Choosing the Right Project for Hadoop

Hadoop is the hot buzzword of the Big Data world, and many IT people are being told “go create a Hadoop cluster and do some magic”. It’s hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let’s look at this from a business perspective.

Data-Driven Insights

For the average business just getting into Hadoop for the first time, you are most likely to be successful if you choose a project related to data exploration, analytics and reporting, and/or looking for new data-driven, actionable insights. In many ways Hadoop is “just another data source.” Most businesses will not start by replacing existing, high-functioning OLTP implementations. Instead you will likely see the highest initial return on investment (ROI) from adding on to those existing systems. Pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. The goal should remain clearly focused on how to use the data to take action based on the new data-driven insights you will uncover.

Success

Below are some characteristics that are often present for a successful Hadoop implementation. You don’t need to have all of them to be successful; use the list to brainstorm new ideas.

  • Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of “what-if” questions cheaply, to “fail fast” so you can try out many potential hypotheses, and look for that one cool thing everyone else has missed that can really impact your business.
  • New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB option to make data ingestion even easier and faster.
  • You are looking for the “Unknown Unknowns”. There are always lurking things that haven’t come to your attention before but which may be sparks for new actions. You know you don’t know what you want or what to ask for and will use that to spur innovation.
  • Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise that it is infinitely scalable – you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
  • You are looking to gain some competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points: you are using Hadoop to look for something new that can change your business or help you be first to market with something.
  • There are few direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop it can be sent to systems that are less batch-oriented and more available and familiar to the average power user or data steward.
  • Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.

Anti-Patterns

Quite often I hear people proposing Hadoop for projects that are not an ideal use for Hadoop, at least not as you are learning it and looking for quick successes to bolster confidence in the new technology. The below characteristics are generally indicators that you do NOT want to use Hadoop in a project.

  • You plan to replace an existing system whose pain points don’t align with Hadoop’s strengths.
  • There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, there are some components of Hadoop that can meet OLTP requirements and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast interactive response time it is possible Hadoop could be a fit but it’s usually not a good first project for you to learn Hadoop and truly use Hadoop’s strengths.
  • Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
  • A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
  • Your first project, and your learning, would be on a mission-critical system or application. Learn on something new, something that makes Hadoop’s strengths really apparent and easy to see.

And in Conclusion

Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success criteria, and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way and whether there are better ways to do certain things. Hadoop is becoming ubiquitous; avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!



x64 Windows – Upgrade from 32bit SQL Server to 64bit SQL Server

Many people are now upgrading from 32bit to 64bit SQL Servers. Most of you have a match between your operating system and your SQL Server platform. For example, most of you install a 32bit SQL Server on 32bit Windows, and if you have the x64 platform of Windows, you usually install the x64 SQL Server. But what happens when you have a 32bit SQL Server on an x64 system and you want to change it to be all x64? Note that you cannot install 32bit SQL Server on IA64, so this scenario does not apply to Itanium systems. In the example below both the platform and the version of SQL Server are changing.

You have an instance of SQL Server 2000 32bit installed on Windows 2003 SP2 x64. This means SQL Server is “running in the WOW”. WOW stands for Windows on Windows and means you have a 32bit application running inside a 64bit OS. This gives SQL Server a full 4GB of user addressable virtual memory space, which is more than any 32bit application can get on a 32bit OS without memory mapping (in SQL we do memory mapping of the buffer pool through “AWE”). However, running in the WOW doesn’t give you the full memory advantages you would get from running a true x64 application on an x64 OS. SQL Server 2000 was not released in an x64 “flavor”, but once you upgrade to SQL Server 2000 SP4 Microsoft will support running it in the WOW. SP4 was required for this particular configuration even before we discontinued support for SP3. See KB 898042, “Changes to SQL Server 2000 Service Pack 4 operating system support” (http://support.microsoft.com/default.aspx?scid=kb;EN-US;898042). Generally you should avoid installing 32bit applications on x64 systems whenever possible. Any recently purchased hardware will be x64 and putting a 32bit OS on it will throttle back its memory capabilities, so your best bet is going to be an x64 version of SQL Server on x64 Windows.
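
As a side note, this is roughly what turning on AWE looked like in T-SQL on a 32bit instance (a sketch only; on x64 SQL Server this setting is unnecessary because the 64bit address space is used directly):

    -- Enable AWE on a 32bit SQL Server instance (requires the advanced options to be visible)
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'awe enabled', 1;
    RECONFIGURE;
    -- A restart of the SQL Server service is required before AWE takes effect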

 

You want to upgrade this instance from SQL Server 2000 32bit to SQL Server 2005 x64 on the same box. You would like to keep the same instance name. However, we do not support an in-place upgrade from any 32bit SQL Server to any 64bit SQL Server. Additionally, you cannot restore system databases (master, model, tempdb, msdb) to a different version, not even a different service pack or hotfix level.

  • Version and Edition Upgrades: “Upgrading a 32-bit instance of SQL Server 2000 from the 32-bit subsystem (WOW64) of a 64-bit server to SQL Server 2005 (64-bit) on the X64 platform is not supported. However, you can upgrade a 32-bit instance of SQL Server to SQL Server 2005 on the WOW64 of a 64-bit server as noted in the table above. You can also backup or detach databases from a 32-bit instance of SQL Server 2000, and restore or attach them to an instance of SQL Server 2005 (64-bit) if the databases are not published in replication. In this case, you must also recreate any logins and other user objects in master, msdb, and model system databases.”

  • You cannot restore system database backups to a different build of SQL Server: “You cannot restore a backup of a system database (master, model, or msdb) on a server build that is different from the build on which the backup was originally performed.”

  • If the SQL Server versions are the same, even system databases can be restored between different platforms (x86/x64). However, you do sometimes have to make one update to the msdb database when you do this (because often the SQL Server install path has changed, such as using “program files (x86)” on an x64 system); see the KB article “Error message when you restore or attach an msdb database or when you change the syssubsystems table in SQL Server 2005: ‘Subsystem % could not be loaded’”. For non-system databases the version you restore to doesn’t have to be identical; generally you can restore a user database to a higher version, and the platform (x86/x64) is irrelevant (a sketch of this follows below).
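
As a rough sketch of that backup/restore path for a user database (the database name, logical file names, and paths below are illustrative, not from the original scenario):

    -- On the old 32bit SQL Server 2000 instance: back up the user database
    BACKUP DATABASE MyUserDb TO DISK = N'D:\Backups\MyUserDb.bak';

    -- On the new x64 SQL Server 2005 instance: restore it; a user database can move to a
    -- higher version and across platforms (x86/x64) this way
    RESTORE DATABASE MyUserDb
    FROM DISK = N'D:\Backups\MyUserDb.bak'
    WITH MOVE 'MyUserDb_Data' TO N'E:\SQLData\MyUserDb.mdf',
         MOVE 'MyUserDb_Log' TO N'F:\SQLLogs\MyUserDb_log.ldf';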

 

So in this case you have two basic options if you must keep the same server and instance name (a short T-SQL sketch of the key detach/attach and user-sync steps follows the second option):

1. Upgrade, reinstall, attach
   a. Make sure all users, applications, and services are totally off the system for the entire duration of the downtime.
   b. Upgrade SQL Server 2000 SP4 32bit to SQL Server 2005 (or 2008) 32bit (NOT x64 – that is not a viable upgrade path).
   c. Back up all databases.
   d. Detach the user databases (the detach does a checkpoint to ensure consistency).
   e. Make copies of the mdf/ldf files for the user and system databases.
   f. Uninstall SQL Server 2005 32bit (to make the instance name available).
   g. Install SQL Server 2005 x64 to the same instance name and at the EXACT same version as what was just uninstalled.
   h. Restore master, model, and msdb.
   i. Attach the user databases.
   j. If needed, run the update from “Error message when you restore or attach an msdb database or when you change the syssubsystems table in SQL Server 2005: ‘Subsystem % could not be loaded’”.
   k. Apply the appropriate Service Pack and/or Cumulative Update.
   l. Take full backups.
   m. Allow users back into the system.

2. Reinstall, attach, copy system db info
   a. Make sure all users, applications, and services are totally off the system for the entire duration of the downtime.
   b. Back up all databases.
   c. Extract all relevant information to allow re-creation of the system database information. This includes logins/passwords, configuration settings, replication settings, linked servers (including login mappings), custom error messages, extended stored procedures, MSDB jobs, DTS/SSIS packages stored in MSDB, proxies, and any objects manually created in any system database. If you go this route let me know and I’ll double check that this list is complete.
   d. Detach the user databases (the detach does a checkpoint to ensure consistency).
   e. Make copies of the mdf/ldf files for the user and system databases.
   f. Uninstall SQL Server 2000 32bit (to make the instance name available).
   g. Install SQL Server 2005 x64 to the same instance name.
   h. Attach the user databases.
   i. Apply all the system information you extracted above, including syncing the database users to the new logins.
   j. If needed, run the update from “Error message when you restore or attach an msdb database or when you change the syssubsystems table in SQL Server 2005: ‘Subsystem % could not be loaded’”.
   k. Apply the appropriate Service Pack and/or Cumulative Update.
   l. Take full backups.
   m. Allow users back into the system.
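
To make the key steps above more concrete, here is a minimal T-SQL sketch of the detach, attach, and user/login sync steps. The database, file, and user names are illustrative, and the sp_change_users_login call assumes a login of the same name has already been re-created on the new instance.

    -- On the old instance (after taking full backups): detach a user database
    EXEC sp_detach_db @dbname = N'MyUserDb';

    -- On the new x64 instance: attach the copied mdf/ldf files
    CREATE DATABASE MyUserDb
    ON (FILENAME = N'E:\SQLData\MyUserDb.mdf'),
       (FILENAME = N'F:\SQLLogs\MyUserDb_log.ldf')
    FOR ATTACH;

    -- Re-map an orphaned database user to the re-created login if the SIDs no longer match
    USE MyUserDb;
    EXEC sp_change_users_login @Action = 'Update_One',
        @UserNamePattern = 'AppUser', @LoginName = 'AppUser';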