|By Srinivasan Sundara Rajan||
|March 14, 2013 11:15 AM EDT||
Data Warehouse as a Service
Recently Amazon announced the availability of Redshift Data warehouse as a Service as a beta offering. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. It's optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year, a tenth the cost of most traditional data warehousing solutions.
Architecture Behind Redshift
Any data warehouse service meant to serve data of petabyte scale should have a robust architecture as its backbone. The following are the salient features of Redshift service.
- Shared Nothing Architecture: As indicated in one of my earlier articles, Cloud Database Scale Out Using Shared Nothing Architecture, the shared nothing architectural pattern is the most desired for databases of this scale and the same concept is adhered to in Redshift. The core component of Redshift is a cluster and each cluster consists of multiple compute nodes, each node has its dedicated storage following the shared nothing principle.
- Massively Parallel Processing (MPP): Hand in hand with the shared nothing pattern MPP provides horizontal scale out capabilities for large data warehouses rather than scaling up the individual servers. Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to the final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data. With the concept of NodeSlices Redshift has taken the MPP to the next level to the cores of a compute node. A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
Refer to the following diagram from AWS Documentation, about Data warehouse system architecture
- Columnar Data Storage: Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance.
- Leader Node: The leader node manages most communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node distributes compiled code to the compute nodes and assigns a portion of the data to each compute node.
- High Speed Network Connect: The clusters are connected internally by a 10 Gigabit Ethernet network, providing very fast communication between the leader node and the compute clusters.
Best Practices in Application Design on Redshift
The enablement of Big Data analytics through Redshift has created lot of excitement among the community. The usage of these kinds of alternate approaches to traditional data warehousing will be best in conjunction with the best practices for utilizing the features. The following are some of the best practices that can be considered for the design of applications on Redshift.
1. Collocated Tables: It is good practice to try to avoid sending data between the nodes to satisfy JOIN queries. Colocation between two joined tables occurs when the matching rows of the two tables are stored in the same compute nodes, so that the data need not be sent between nodes.
When you add data to a table, Amazon Redshift distributes the rows in the table to the cluster slices using one of two methods:
- Even distribution
- Key distribution
Even distribution is the default distribution method. With even distribution, the leader node spreads data rows across the slices in a round-robin fashion, regardless of the values that exist in any particular column. This approach is a good choice when you don't have a clear option for a distribution key.
If you specify a distribution key when you create a table, the leader node distributes the data rows to the slices based on the values in the distribution key column. Matching values from the distribution key column are stored together.
Colocation is best achieved by choosing the appropriate distribution keys than using the even distribution.
If you frequently join a table, specify the join column as the distribution key. If a table joins with multiple other tables, distribute on the foreign key of the largest dimension that the table joins with. If the dimension tables are filtered as part of the joins, compare the size of the data after filtering when you choose the largest dimension. This ensures that the rows involved with your largest joins will generally be distributed to the same physical nodes. Because local joins avoid data movement, they will perform better than network joins.
2. De-Normalization: In the traditional RDBMS, database storage is optimized by applying the normalization principles such that a particular attribute (column) is associated with one and only entity (Table). However in shared nothing scalable databases like Redshift this technique will not yield the desired results, rather keeping the redundancy of certain columns in the form of de-normalization is very important.
For example, the following query is one of the examples of a high performance query in the Redshift documentation.
SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > ‘1/1/2013'
AND tab2.timestamp > ‘1/1/2013';
Even if a predicate is already being applied on a table in a join query but transitively applies to another table in the query, it's useful to re-specify the redundant predicate if that other table is also sorted on the column in the predicate. That way, when scanning the other table, Redshift can efficiently skip blocks from that table as well.
By carefully applying de-normalization to bring the required redundancy, Amazon Redshift can perform at its best.
3. Native Parallelism: One of the biggest advantages of a shared nothing MPP architecture is about parallelism. Parallelism is achieved in multiple ways.
- Inter Node Parallelism: It refers the ability of the database system to break up a query into multiple parts across multiple instances across the cluster.
- Intra Node Parallelism: Intra node parallelism refers to the ability to break up query into multiple parts within a single compute node.
Typically in MPP architectures, both Inter Node Parallelism and Intra Node Parallelism will be combined and used at the same time to provide dramatic performance gains.
Amazon Redshift provides lot of operations to utilize both Intra Node and Inter Node parallelism.
When you use a COPY command to load data from Amazon S3, first split your data into multiple files instead of loading all the data from a single large file.
The COPY command then loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. Split your data into files so that the number of files is a multiple of the number of slices in your cluster. That way Amazon Redshift can divide the data evenly among the slices. Name each file with a common prefix. For example, each XL compute node has two slices, and each 8XL compute node has 16 slices. If you have a cluster with two XL nodes, you might split your data into four files named customer_1, customer_2, customer_3, and customer_4. Amazon Redshift does not take file size into account when dividing the workload, so make sure the files are roughly the same size.
Pre-Processing Data: Over the years RDBMS engines take pride of Location Independence. The Codd's 12 rules of the RDBMS states the following:
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
However, in the columnar database services like Redshift the physical ordering of data does make major impact to the performance.
Sorting data is a mechanism for optimizing query performance.
When you create a table, you can define one or more of its columns as the sort key. When data is loaded into the table, the values in the sort key column (or columns) are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted. For example, a merge join, which is often faster than a hash join, is feasible when the data is distributed and presorted on the joining columns.
The VACUUM command also makes sure that new data in tables is fully sorted on disk. Vacuum as often as you need to in order to maintain a consistent query performance.
Platform as a Service (PaaS) is one of the greatest benefits to the IT community due to the Cloud Delivery Model, and from the beginning of pure play programming models like Windows Azure and Elastic Beanstalk it has moved to high-end services like data warehouse Platform as a Service. As the industry analysts see good adoption of the above service due to the huge cost advantages when compared to the traditional data warehouse platform, the best practices mentioned above will help to achieve the desired level of performance. Detailed documentation is also available on the vendor site in the form of developer and administrator guides.
More and more companies are looking to microservices as an architectural pattern for breaking apart applications into more manageable pieces so that agile teams can deliver new features quicker and more effectively. What this pattern has done more than anything to date is spark organizational transformations, setting the foundation for future application development. In practice, however, there are a number of considerations to make that go beyond simply “build, ship, and run,” which changes ho...
Jun. 24, 2016 01:30 PM EDT Reads: 739
Gartner is now treating algorithms like they are some kind of innovative addition to the modern digital discussion. Presumably the brilliant minds there have some novel insight into algorithms and, yes, the Algorithm Economy that CIOs should sit up and take notice of. Not only are algorithms nothing new, but much of what Gartner is saying about them is obvious. The bigger picture here is that software continues to improve, and enterprises are becoming increasingly software-driven, in part bec...
Jun. 17, 2016 04:07 PM EDT Reads: 765
The Internet of Things is clearly many things: data collection and analytics, wearables, Smart Grids and Smart Cities, the Industrial Internet, and more. Cool platforms like Arduino, Raspberry Pi, Intel's Galileo and Edison, and a diverse world of sensors are making the IoT a great toy box for developers in all these areas. In this Power Panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists discussed what things are the most important, which will have the most profound...
Jun. 12, 2016 09:00 PM EDT Reads: 5,081
The Internet of Things (IoT) is growing rapidly by extending current technologies, products and networks. By 2020, Cisco estimates there will be 50 billion connected devices. Gartner has forecast revenues of over $300 billion, just to IoT suppliers. Now is the time to figure out how you’ll make money – not just create innovative products. With hundreds of new products and companies jumping into the IoT fray every month, there’s no shortage of innovation. Despite this, McKinsey/VisionMobile data...
Jun. 12, 2016 02:45 AM EDT Reads: 3,644
NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Val...
Jun. 8, 2016 10:30 PM EDT Reads: 4,582
SYS-CON Events announced today that Men & Mice, the leading global provider of DNS, DHCP and IP address management overlay solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. The Men & Mice Suite overlay solution is already known for its powerful application in heterogeneous operating environments, enabling enterprises to scale without fuss. Building on a solid range of diverse platform support,...
Jun. 8, 2016 06:45 PM EDT Reads: 4,095
Internet of @ThingsExpo, taking place June 7-9, 2016 at Javits Center, New York City and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo New York Call for Papers is now open.
Jun. 8, 2016 03:45 PM EDT Reads: 3,253
SYS-CON Events announced today that Catchpoint Systems, Inc., a provider of innovative web and infrastructure monitoring solutions, has been named “Silver Sponsor” of SYS-CON's DevOps Summit at 18th Cloud Expo New York, which will take place June 7-9, 2016, at the Javits Center in New York City, NY. Catchpoint is a leading Digital Performance Analytics company that provides unparalleled insight into customer-critical services to help consistently deliver an amazing customer experience. Designed...
Jun. 8, 2016 03:00 PM EDT Reads: 3,216
@DevOpsSummit taking place June 7-9, 2016 at Javits Center, New York City, and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
Jun. 8, 2016 01:00 PM EDT Reads: 4,384
Cloud Expo, Inc. has announced today that Andi Mann returns to 'DevOps at Cloud Expo 2016' as Conference Chair The @DevOpsSummit at Cloud Expo will take place on June 7-9, 2016, at the Javits Center in New York City, New York. "DevOps is set to be one of the most profound disruptions to hit IT in decades," said Andi Mann. "It is a natural extension of cloud computing, and I have seen both firsthand and in independent research the fantastic results DevOps delivers. So I am excited to help the g...
Jun. 8, 2016 11:00 AM EDT Reads: 3,668
Korean Broadcasting System (KBS) will feature the upcoming 18th Cloud Expo | @ThingsExpo in a New York news documentary about the "New IT for the Future." The documentary will cover how big companies are transmitting or adopting the new IT for the future and will be filmed on the expo floor between June 7-June 9, 2016, at the Javits Center in New York City, New York. KBS has long been a leader in the development of the broadcasting culture of Korea. As the key public service broadcaster of Korea...
Jun. 8, 2016 10:00 AM EDT Reads: 2,592
SYS-CON Events announced today that Addteq will exhibit at SYS-CON's @DevOpsSummit at Cloud Expo New York, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Addteq is one of the top 10 Platinum Atlassian Experts who specialize in DevOps, custom and continuous integration, automation, plugin development, and consulting for midsize and global firms. Addteq firmly believes that automation is essential for successful software releases. Addteq centers its products a...
Jun. 8, 2016 09:45 AM EDT Reads: 2,565
In the rush to compete in the digital age, a successful digital transformation is essential, but many organizations are setting themselves up for failure. There’s a common misconception that the process is just about technology, but it’s not. It’s about your business. It shouldn’t be treated as an isolated IT project; it should be driven by business needs with the committed involvement of a range of stakeholders.
Jun. 8, 2016 02:15 AM EDT Reads: 3,632
SYS-CON Events announced today that FalconStor Software® Inc., a 15-year innovator of software-defined storage solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. FalconStor Software®, Inc. (NASDAQ: FALC) is a leading software-defined storage company offering a converged, hardware-agnostic, software-defined storage and data services platform. Its flagship solution FreeStor®, utilizes a horizonta...
Jun. 7, 2016 07:00 PM EDT Reads: 4,205
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York and Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty ...
Jun. 7, 2016 04:30 PM EDT Reads: 5,958
SYS-CON Events announced today that Column Technologies will exhibit at SYS-CON's @DevOpsSummit at Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Established in 1998, Column Technologies is a global technology solutions provider with over 400 employees, headquartered in the United States with offices in Canada, India, and the United Kingdom. Column Technologies provides “Best of Breed” technology solutions that automate the key DevOps principal...
Jun. 7, 2016 04:15 PM EDT Reads: 3,377
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
Jun. 7, 2016 01:00 PM EDT Reads: 2,757
SYS-CON Events announced today that IBM Cloud Data Services has been named “Bronze Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. IBM Cloud Data Services offers a portfolio of integrated, best-of-breed cloud data services for developers focused on mobile computing and analytics use cases.
Jun. 7, 2016 12:30 PM EDT Reads: 2,802
SYS-CON Events announced today that Anexia will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Anexia offers high-quality customized managed hosting solutions for SaaS and IaaS companies. The company was founded in 2006 in Klagenfurt, Austria. Today, it has additional offices in Vienna, Graz, Munich, Cologne and New York City to serve numerous international customers.
Jun. 7, 2016 11:00 AM EDT Reads: 2,932
SYS-CON Events announced today that Stratoscale, the software company developing the next generation data center operating system, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Stratoscale is revolutionizing the data center with a zero-to-cloud-in-minutes solution. With Stratoscale’s hardware-agnostic, Software Defined Data Center (SDDC) solution to store everything, run anything and scale everywhere...
Jun. 7, 2016 08:00 AM EDT Reads: 2,932