By Jason Bloomberg
February 2, 2013 10:00 AM EST
The problem with Big Data is that, well, Big Data are big. Really big. We’re talking terabytes. Petabytes. Zettabytes. Whatever’s-even-bigger-bytes. And of course, we want to solve all our Big Data challenges in the Cloud. If only we could get those gigando-bytes into the Cloud in the first place. And there’s the rub.
Uploading Big Data from our internal network to the Cloud via an Internet connection is as practical as filling a swimming pool through a drinking straw. It doesn’t matter how sophisticated our Big Data analytics, how super-duper our Hadoopers. If we can’t efficiently get our data where we need them when we need them, we’re stuck.
Optimize the Pipe
Fortunately, the Big Data upload problem isn’t new. In fact, it’s been around for years, under the moniker Wide Area Network (WAN) Optimization. That’s fortunate for us, because vendors have been refining WAN Optimization techniques for a while, and several of them are now repurposing those techniques to help with the Cloud.
For example, Aryaka has been peddling WAN Optimization appliances for several years. Put one appliance in your local data center, a second in the remote data center, and proprietary technology moves data from one to the other at a rapid clip. Now that the Cloud has turned their world upside down, they are providing a distributed service at the remote end, a “mesh of network connections” better suited to the Cloud. In other words, Aryaka is building an offering similar to Content Delivery Networks (CDNs) like Akamai.
RainStor, in contrast, focuses primarily on a proprietary compression algorithm that promises to squeeze data into one fortieth of their original size. Furthermore, RainStor’s compressed data remain directly accessible using standard SQL or even MapReduce on Hadoop, with no storage-eating, time-consuming reinflation.
Then there’s Aspera, which has found a sophisticated way around the limitations of the Transmission Control Protocol (TCP) itself. After all, TCP’s tiny packets and penchant for resending them are a large part of the reason uploading Big Data over the Internet runs like such a dog in the first place. To teach this dog a new trick or two, Aspera’s FASP transfer protocol uses one TCP port for session initialization and control, and one User Datagram Protocol (UDP) port for data transfer.
UDP is a simpler, fire-and-forget protocol that doesn’t perform the retries that provide TCP’s reliability, but by combining the two protocols, FASP achieves nearly 100% error-free data throughput. According to Aspera, FASP reaches the maximum transfer speed possible given the hardware on which you deploy it, and maintains maximum available throughput independent of network delay and packet loss. FASP also aggregates hundreds of concurrent transfers on commodity hardware, addressing the drinking straw problem in part by supporting hundreds of straws at once.
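The split described above, bulk data over a fire-and-forget channel with reliability handled on a separate control channel, can be illustrated with a small simulation. This is only a sketch of the general idea and not Aspera’s proprietary algorithm: the names `split_blocks`, `lossy_send`, and `transfer`, the block size, and the loss model are all assumptions made for illustration, and no real sockets are involved.

```python
import random

BLOCK_SIZE = 1024  # bytes per simulated datagram payload (assumed value)


def split_blocks(payload: bytes, size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Cut the payload into sequence-numbered blocks."""
    return {seq: payload[off:off + size]
            for seq, off in enumerate(range(0, len(payload), size))}


def lossy_send(blocks, seqs, loss_rate, rng):
    """Simulated data channel: each block is dropped with probability loss_rate."""
    return {seq: blocks[seq] for seq in seqs if rng.random() >= loss_rate}


def transfer(payload: bytes, loss_rate: float = 0.2, seed: int = 0):
    """Send every block once, then resend only what the receiver reports missing."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    rng = random.Random(seed)
    blocks = split_blocks(payload)
    received: dict[int, bytes] = {}
    missing = sorted(blocks)
    rounds = 0
    while missing:
        # Data channel: no per-packet acknowledgment, no automatic retry.
        received.update(lossy_send(blocks, missing, loss_rate, rng))
        # Control channel: the receiver reports the sequence numbers it still lacks.
        missing = [seq for seq in blocks if seq not in received]
        rounds += 1
    data = b"".join(received[seq] for seq in sorted(received))
    return data, rounds


if __name__ == "__main__":
    original = bytes(range(256)) * 40
    data, rounds = transfer(original, loss_rate=0.3, seed=1)
    print(data == original, rounds)
```

The point of the design is that a lost block costs one targeted resend rather than stalling everything behind it, which is why loss and delay hurt such a scheme less than they hurt a single TCP stream.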
CloudOpt is also a player worth mentioning. Their JetStream technology takes a soup-to-nuts approach that combines compression and transmission protocol optimization with advanced data deduplication, SSL acceleration, and an ingenious approach to getting the most performance out of cached data. Another is Attunity Cloudbeam, which touts file-to-Cloud upload, file-to-Cloud replication, and Cloud-to-Cloud replication. Attunity’s Managed File Transfer (MFT) incorporates a secure DMZ architecture, security policy enforcement, guaranteed and accelerated transfers, process automation, and audit capabilities across each stage of the file transfer process.
Finally, there’s Amazon Web Services (AWS) itself. Yes, most if not all of the vendors discussed above can firehose data into AWS’s various storage services. But AWS also offers a simple, if decidedly low-tech, approach: AWS Import/Export. Simply ship your big hard drives to Amazon. They’ll hook them up, copy the data to your Simple Storage Service (S3) or other storage service, and ship the drives back when they’re done. This SneakerNet or “Forklifting” approach, believe it or not, can even be faster than some of the over-the-Internet optimizations for certain Big Data sets, even considering the time it takes to FedEx AWS your drives.
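A little arithmetic shows when shipping drives beats the wire. The sketch below is illustrative only: the 100 TB data set, the 100 Mbit/s link, the 80% usable-bandwidth figure, and the five-day door-to-door shipping time are all assumptions chosen for the example, not figures from AWS or any vendor.

```python
TB = 10**12          # decimal terabyte, in bytes
SECONDS_PER_DAY = 86_400


def upload_days(data_bytes: float, link_bits_per_sec: float,
                efficiency: float = 0.8) -> float:
    """Days needed to push data_bytes through a link at the given usable fraction."""
    seconds = (data_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / SECONDS_PER_DAY


def break_even_bytes(link_bits_per_sec: float, shipping_days: float,
                     efficiency: float = 0.8) -> float:
    """Data set size at which uploading takes exactly as long as shipping."""
    return shipping_days * SECONDS_PER_DAY * link_bits_per_sec * efficiency / 8


def shipping_is_faster(data_bytes: float, link_bits_per_sec: float,
                       shipping_days: float, efficiency: float = 0.8) -> bool:
    """True when the courier beats the network for this data set."""
    return upload_days(data_bytes, link_bits_per_sec, efficiency) > shipping_days


if __name__ == "__main__":
    print(upload_days(100 * TB, 100e6))          # days to upload 100 TB
    print(break_even_bytes(100e6, 5) / TB)       # break-even size in TB
    print(shipping_is_faster(100 * TB, 100e6, 5))
```

Under those assumptions, 100 TB needs roughly 116 days on the wire, and anything above about 4.3 TB arrives sooner by courier.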
On Beyond Drinking Straws
The problem with most of the approaches above (excepting only Aspera and Amazon’s forklift) is that they make the drinking straw we’re using to fill that swimming pool better, faster, and bigger – but we’re still filling that damn pool with a straw. So what’s better than a straw? How about many straws? If any optimization technique improves a single connection to the Internet, then it stands to reason that establishing many connections to your Cloud provider in parallel would multiply your upload speed dramatically.
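The many-straws idea can be sketched as a data set cut into parts and sent over several connections at once. This is a minimal sketch under stated assumptions: `upload_part` is a hypothetical callable standing in for whatever network call a given Cloud provider actually offers, and `fake_upload_part` simply writes to an in-memory dictionary so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def split_parts(data: bytes, part_size: int) -> list[tuple[int, bytes]]:
    """Cut the data into (index, chunk) pairs of at most part_size bytes."""
    return [(i, data[off:off + part_size])
            for i, off in enumerate(range(0, len(data), part_size))]


def parallel_upload(data: bytes, upload_part, part_size: int = 5 * 1024 * 1024,
                    connections: int = 8) -> int:
    """Upload all parts using up to `connections` concurrent workers."""
    parts = split_parts(data, part_size)
    with ThreadPoolExecutor(max_workers=connections) as pool:
        # list() waits for every part and re-raises any upload error.
        list(pool.map(lambda part: upload_part(*part), parts))
    return len(parts)


# Stand-in for a real network call (hypothetical; stores parts in memory).
store: dict[int, bytes] = {}


def fake_upload_part(index: int, chunk: bytes) -> None:
    store[index] = chunk


if __name__ == "__main__":
    payload = b"x" * 1000
    count = parallel_upload(payload, fake_upload_part, part_size=100, connections=4)
    reassembled = b"".join(store[i] for i in range(count))
    print(count, reassembled == payload)
```

One caveat: parallel connections that share the same physical link cannot exceed its capacity. The gain comes from filling bandwidth that a single latency-bound connection leaves idle, so the multiplier depends on how underused the pipe was to begin with.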
Fair enough, but let’s think outside the box here. A fundamental Big Data best practice is to bring your analytics to your data. The reasoning is that it’s hard to move your data but easy to move your software, so once your data are in the Cloud, you should also run your analytics there.
But this argument also works in reverse. If your data aren’t in the Cloud, then it may not make sense to move them to the Cloud simply to run your software there. Instead, bring your software to your data, even if they’re on premises.
Perish the thought, you say! We’re sold on Big Data in the Cloud. We’ve crunched the numbers and we know it’s going to save us money, provide more capabilities, and facilitate sharing information across our organization and the world. Fair enough. Here’s another twist for you.
Why are your Big Data sets outside the Cloud to begin with? Sure, you’re stuck with existing, legacy data sets wherever they happen to be today. But as a rule, those don’t constitute Big Data, or will cease to qualify as being large enough to warrant the Big Data label relatively soon. By definition, Big Data sets keep expanding exponentially, which means that you keep creating them with generations of newfangled tools.
In fact, there are already multitudinous sources for raw Big Data, as varied as the Big Data challenges organizations struggle with today. But many such sources are already in the Cloud, or could be moved to the Cloud simply. Take clickthrough data from your Web sites, for example. Such data come from your Web servers, which should be in the Cloud anyway. If your Big Data come from Web servers scattered here and there in the Cloud, then moving the clickthrough data to a Big Data repository for processing can be handled in the same Cloud. No need for uploading.
What about data sources that aren’t already in the Cloud? Many Big Data streams come from instrumentation or sensors of some sort, from seismographs underground to EKGs in hospitals to UPC scanners in supermarkets. There’s no reason why such instrumentation shouldn’t pour its raw data feeds directly into the Cloud. What good is storing a week’s worth of supermarket purchasing data on premises anyway? You’ll want to store, process, manage, and analyze those data in the Cloud, so the sooner you get them there, the better.
The ZapThink Take
The only reason we have to worry about uploading Big Data to the Cloud in the first place is that our Big Data aren’t already in the Cloud. And broadly speaking, the reason they’re not already in the Cloud is that the Cloud isn’t everywhere. Instead, we think of the Cloud as being locked away in data centers, those alien, air-conditioned facilities packed full of racks of high-tech equipment.
That may be true today, but as ZapThink has discussed before, there’s nothing in the definition of Cloud Computing that requires Cloud resources to live in data centers. You might have a bit of the Cloud in your pocket, or on your laptop, in your car, or in your refrigerator. For now, this vision of the Internet of Things meeting the Cloud is mostly the stuff of science fiction. We’re only now figuring out what it means to have a ubiquitous global network of sensors, from the aforementioned EKGs and UPC scanners to traffic cameras to home thermostats. But the writing is on the wall. Just as we now don’t think twice about carrying supercomputers in our pockets, it’s only a matter of time until the Cloud itself is fully distributed and ubiquitous. When that happens, the question of moving Big Data to the Cloud will be moot. They will already be there.
Are you one of the vendors mentioned in this article and have a correction, or a vendor who should have been mentioned but wasn’t? Please feel free to comment here.
Image Source: US Navy