By Srinivasan Sundara Rajan
March 14, 2013 11:15 AM EDT
Data Warehouse as a Service
Amazon recently announced the beta availability of Redshift, its data warehouse as a service. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Architecture Behind Redshift
Any data warehouse service meant to serve petabyte-scale data needs a robust architecture as its backbone. The following are the salient features of the Redshift service.
- Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing architectural pattern is the preferred choice for databases of this scale, and Redshift adheres to it. The core component of Redshift is a cluster. Each cluster consists of multiple compute nodes, and each node has its own dedicated storage, following the shared nothing principle.
- Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern, MPP lets large data warehouses scale out horizontally rather than scale up individual servers. It enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. With the concept of node slices, Redshift extends MPP down to the individual cores of a compute node. A compute node is partitioned into slices, one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
Refer to the data warehouse system architecture diagram in the AWS documentation.
- Columnar Data Storage: Storing database table information in a columnar fashion reduces both the number of disk I/O requests and the amount of data that has to be loaded from disk, which makes it an important factor in optimizing analytic query performance.
- Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses queries and develops execution plans to carry out database operations, in particular the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
- High Speed Network Connect: The nodes of a cluster are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute nodes.
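The division of labor between the leader node and the slices described above can be sketched in a few lines of Python. This is only an illustration, not Redshift code: the slice contents, the function names `partial_sum` and `run_query`, and the use of threads to stand in for cores are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(slice_rows):
    """Work done by one slice: aggregate only its own portion of the data."""
    return sum(amount for _, amount in slice_rows)


def run_query(slices):
    """Leader-node role: send the same step to every slice, then combine."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(partial_sum, slices))
    # Final result aggregation happens once, over the small partial results.
    return sum(partials)


# Hypothetical cluster: 2 nodes x 2 slices, each slice owning its own rows
# as (order_id, amount) pairs. Nothing is shared between slices.
slices = [
    [(1, 10), (5, 50)],
    [(2, 20), (6, 60)],
    [(3, 30), (7, 70)],
    [(4, 40), (8, 80)],
]
total = run_query(slices)
```

Every slice runs the same function on different rows, and only the per-slice totals travel back to the coordinating step, which is the reason the approach scales with the number of slices.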
Best Practices in Application Design on Redshift
Big Data analytics on Redshift has created a lot of excitement in the community. Alternatives to traditional data warehousing such as this one work best when their features are used according to established best practices. The following are some practices to consider when designing applications on Redshift.
1. Colocated Tables: It is good practice to avoid sending data between nodes to satisfy JOIN queries. Two joined tables are colocated when their matching rows are stored on the same compute nodes, so that no data needs to be sent between nodes.
When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:
- Even distribution
- Key distribution
Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.
If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
Colocation is better achieved by choosing appropriate distribution keys than by relying on even distribution.
If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
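The difference between the two distribution methods can be made concrete with a small simulation. This is a sketch under stated assumptions: the slice count, the table contents, and the helper names are hypothetical, and `zlib.crc32` merely stands in for whatever hash function Redshift uses internally.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical cluster size


def distribute_even(rows, num_slices=NUM_SLICES):
    """Round-robin placement: ignores column values entirely."""
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % num_slices].append(row)
    return placement


def distribute_by_key(rows, key, num_slices=NUM_SLICES):
    """Key placement: equal key values always land on the same slice."""
    placement = defaultdict(list)
    for row in rows:
        # crc32 is an assumed stand-in for the engine's real hash function.
        slot = zlib.crc32(str(row[key]).encode()) % num_slices
        placement[slot].append(row)
    return placement


def slices_holding(placement, key):
    """Map each key value to the set of slices that store it."""
    where = defaultdict(set)
    for slot, rows in placement.items():
        for row in rows:
            where[row[key]].add(slot)
    return where


def is_colocated(left, right, key):
    """True if every shared key value lives on exactly one common slice."""
    l, r = slices_holding(left, key), slices_holding(right, key)
    return all(len(l[v] | r[v]) == 1 for v in l.keys() & r.keys())


customers = [{"customer_id": c} for c in range(1, 7)]
orders = [{"order_id": o, "customer_id": o % 6 + 1} for o in range(12)]

even_ok = is_colocated(distribute_even(customers),
                       distribute_even(orders), "customer_id")
key_ok = is_colocated(distribute_by_key(customers, "customer_id"),
                      distribute_by_key(orders, "customer_id"), "customer_id")
```

With round-robin placement, orders for the same customer end up on slices other than the one holding the customer row, so `even_ok` is false and the join would need data movement. Distributing both tables on `customer_id` makes `key_ok` true by construction.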
2. De-Normalization: In a traditional RDBMS, database storage is optimized by applying normalization principles, so that a particular attribute (column) is associated with one and only one entity (table). In shared nothing scalable databases like Redshift, however, this technique does not yield the desired results. Instead, keeping certain columns redundant through de-normalization is very important.
For example, the Redshift documentation gives the following as an example of a high-performance query.
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';
Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
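Why the redundant predicate helps can be sketched as follows. The sketch assumes the engine keeps a minimum and maximum value for each storage block of a sorted column and consults them before reading the block. The `Block` class, the block size, and the sample dates are hypothetical, and ISO-formatted date strings are used so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_ts: str   # smallest timestamp stored in the block
    max_ts: str   # largest timestamp stored in the block
    rows: list


def build_blocks(sorted_rows, rows_per_block):
    """Cut rows that are already sorted on 'ts' into fixed-size blocks."""
    blocks = []
    for i in range(0, len(sorted_rows), rows_per_block):
        chunk = sorted_rows[i:i + rows_per_block]
        blocks.append(Block(chunk[0]["ts"], chunk[-1]["ts"], chunk))
    return blocks


def scan_after(blocks, cutoff):
    """Return the matching rows and the number of blocks actually read."""
    blocks_read, hits = 0, []
    for block in blocks:
        if block.max_ts <= cutoff:
            continue  # every row is too old: skip the block without reading it
        blocks_read += 1
        hits.extend(r for r in block.rows if r["ts"] > cutoff)
    return hits, blocks_read


dates = ["2012-09-01", "2012-10-01", "2012-11-01", "2012-12-01",
         "2013-01-01", "2013-02-01", "2013-03-01", "2013-04-01"]
tab2 = build_blocks([{"ts": d} for d in dates], rows_per_block=2)

hits, blocks_read = scan_after(tab2, "2013-01-01")
```

Here only two of the four blocks are read. Without a predicate on `tab2` there would be no cutoff to compare against, and the scan would have to read every block before the join discards the old rows.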
Carefully applied de-normalization introduces the redundancy that lets Amazon Redshift perform at its best.
3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is parallelism, which is achieved in multiple ways.
- Inter Node Parallelism: The ability of the database system to break up a query into multiple parts that run on multiple instances across the cluster.
- Intra Node Parallelism: The ability to break up a query into multiple parts within a single compute node.
Typically in MPP architectures, both Inter Node Parallelism and Intra Node Parallelism will be combined and used at the same time to provide dramatic performance gains.
Amazon Redshift provides a number of operations that exploit both intra node and inter node parallelism.
When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.
The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. Name each file with a common prefix. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices. If you have a cluster with two XL nodes, you might split your data into four files named customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
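The file-splitting advice can be sketched as a small helper. This is an illustration only: `split_for_copy` is a hypothetical function, it works on in-memory lists rather than real files, and uploading the parts to Amazon S3 is left out.

```python
def split_for_copy(lines, prefix, num_slices, files_per_slice=1):
    """Deal lines round-robin into num_slices * files_per_slice parts.

    The part count is a multiple of the slice count, the parts share a
    common name prefix, and their sizes differ by at most one line.
    """
    num_files = num_slices * files_per_slice
    parts = {f"{prefix}_{i + 1}": [] for i in range(num_files)}
    names = list(parts)
    for n, line in enumerate(lines):
        parts[names[n % num_files]].append(line)
    return parts


# Two XL nodes with two slices each give four slices, hence four files.
rows = [f"{i}|customer {i}" for i in range(10)]
parts = split_for_copy(rows, "customer", num_slices=2 * 2)
sizes = [len(p) for p in parts.values()]
```

Dealing the lines out round-robin keeps the parts within one line of each other, which matters because the workload is divided by file rather than by file size.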
4. Pre-Processing Data: Over the years, RDBMS engines have taken pride in physical data independence. Codd's 12 rules for relational databases state the following:
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
However, in columnar database services like Redshift, the physical ordering of data has a major impact on performance.
Sorting data is a mechanism for optimizing query performance.
When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
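The reason presorted data makes a merge join feasible can be seen in a minimal sketch. This is a textbook merge join over in-memory lists, not Redshift's implementation, and the table contents and the column name `k` are hypothetical.

```python
def merge_join(left, right, key):
    """Join two lists of dicts that are already sorted on `key`.

    Each input is walked once from start to end. No hash table is built,
    which is only possible because both sides arrive in key order.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"},
        {"k": 2, "a": "z"}, {"k": 4, "a": "w"}]
right = [{"k": 2, "b": 10}, {"k": 3, "b": 20},
         {"k": 4, "b": 30}, {"k": 4, "b": 40}]
joined = merge_join(left, right, "k")
```

If either input were unsorted, the two cursors could not advance in lockstep and the join would have to sort first or fall back to hashing.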
The VACUUM command makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain consistent query performance.
Platform as a Service (PaaS) is one of the greatest benefits the cloud delivery model has brought to the IT community. From its beginnings in pure-play programming models like Windows Azure and Elastic Beanstalk, it has moved on to high-end services such as the data warehouse as a service. Industry analysts expect good adoption of this service because of its large cost advantage over traditional data warehouse platforms, and the best practices described above will help achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.