|By Jason Bloomberg||
|February 2, 2013 10:00 AM EST||
The problem with Big Data is that, well, Big Data are big. Really big. We’re talking terabytes. Petabytes. Zettabytes. Whatever’s-even-bigger-bytes. And of course, we want to solve all our Big Data challenges in the Cloud. If only we could get those gigando-bytes into the Cloud in the first place. And there’s the rub.
Uploading Big Data from our internal network to the Cloud via an Internet connection is as practical as filling a swimming pool through a drinking straw. It doesn’t matter how sophisticated our Big Data analytics, how super-duper our Hadoopers. If we can’t efficiently get our data where we need them when we need them, we’re stuck.
Optimize the Pipe
Fortunately, the Big Data upload problem isn’t new. In fact, it’s been around for years, under the moniker Wide Area Network (WAN) Optimization. Fortunate for us because vendors have been working on WAN Optimization techniques for a while now, and now several of them are repurposing those techniques to help with the Cloud.
For example, Aryaka has been peddling WAN Optimization appliances for several years. Put one appliance in your local data center, a second in the remote data center, and proprietary technology moves data from one to the other at a rapid clip. Now that the Cloud has turned their world upside down, they are providing a distributed service at the remote end, a “mesh of network connections” better suited to the Cloud. In other words, Aryaka is building an offering similar to Content Delivery Networks (CDNs) like Akamai.
RainStor, in contrast, focuses primarily on a proprietary compression algorithm that promises to squeeze data into one fortieth their original size. Furthermore, RainStor’s compressed data remain directly accessible using standard SQL or even MapReduce on Hadoop with no storage-eating, time-consuming reinflation.
Then there’s Aspera, who’s found a sophisticated way around the limitations of the Transmission Control Protocol (TCP) itself. After all, TCP’s tiny packets and penchant for resending them are a large part of the reason uploading Big Data over the Internet runs like such a dog in the first place. To teach this dog a new trick or two, Aspera transfers use one TCP port for session initialization and control, and one User Datagram Protocol (UDP) port for data transfer.
UDP is an older, fire-and-forget protocol that doesn’t perform the retries that provide TCP’s reliability, but by combining the two protocols, FASP achieves nearly 100% error-free data throughput. In fact, FASP reaches the maximum transfer speed possible given the hardware on which you deploy it, and maintains maximum available throughput independent of network delay and packet loss. FASP also aggregates hundreds of concurrent transfers on commodity hardware, addressing the drinking straw problem in part by supporting hundreds of straws at once.
CloudOpt is also a player worth mentioning. Their JetStream technology takes a soup-to-nuts approach that combines compression and transmission protocol optimization with advanced data deduplication, SSL acceleration, and an ingenious approach to getting the most performance out of cached data. Or Attunity Cloudbeam, that touts file to Cloud upload, file to Cloud replication, and Cloud to Cloud replication. Attunity’s Managed File Transfer (MFT) incorporates a secure DMZ architecture, security policy enforcement, guaranteed and accelerated transfers, process automation, and audit capabilities across each stage of the file transfer process.
Finally, there’s Amazon Web Services (AWS) itself. Yes, most if not all of the vendors discussed above can firehose data into AWS’s various storage services. But AWS also offers a simple, if decidedly low-tech approach as well: AWS Import/Export. Simply ship your big hard drives to Amazon. They’ll hook them up, copy the data to your Simple Storage Service (S3) or other storage service, and ship the drive back when you’re done. This SneakerNet or “Forklifting” approach, believe it or not, can even be faster than some of the over-the-Internet optimizations for certain Big Data sets, even considering the time it takes to FedEx AWS your drives.
On Beyond Drinking Straws
The problem with most of the approaches above (excepting only Aspera and Amazon’s forklift) is that they make the drinking straw we’re using to fill that swimming pool better, faster, and bigger – but we’re still filling that damn pool with a straw. So what’s better than a straw? How about many straws? If any optimization technique improves a single connection to the Internet, then it stands to reason that establishing many connections to your Cloud provider in parallel would multiply your upload speed dramatically.
Fair enough, but let’s think out of the box here. A fundamental Big Data best practice is to bring your analytics to your data. The reasoning is that it’s hard to move your data but easy to move your software, so once your data are in the Cloud, you should also run your analytics there.
But this argument also works in reverse. If your data aren’t in the Cloud, then it may not make sense to move them to the Cloud simply to run your software there. Instead, bring your software to your data, even if they’re on premise.
Perish the thought, you say! We’re sold on Big Data in the Cloud. We’ve crunched the numbers and we know it’s going to save us money, provide more capabilities, and facilitate sharing information across our organization and the world. Fair enough. Here’s another twist for you.
Why are your Big Data sets outside the Cloud to begin with? Sure, you’re stuck with existing, legacy data sets wherever they happen to be today. But as a rule, those don’t constitute Big Data, or will cease to qualify as being large enough to warrant the Big Data label relatively soon. By definition, Big Data sets keep expanding exponentially, which means that you keep creating them with generations of newfangled tools.
In fact, there are already multitudinous sources for raw Big Data, as varied as the Big Data challenges organizations struggle with today. But many such sources are already in the Cloud, or could be moved to the Cloud simply. For example, clickthrough data from your Web sites. Such data come from your Web servers, which should be in the Cloud anyway. If your Big Data come from Web Servers scattered here and there in the Cloud, then moving the clickthrough data to a Big Data repository for processing can be handled in the same Cloud. No need for uploading.
What about data sources that aren’t already in the Cloud? Many Big Data streams come from instrumentation or sensors of some sort, from seismographs underground to EKGs in hospitals to UPC scanners in supermarkets. There’s no reason why such instrumentation shouldn’t pour their raw data feeds directly to the Cloud. What good is storing a week’s worth of supermarket purchasing data on premise anyway? You’ll want to store, process, manage, and analyze those data in the Cloud, so the sooner you get it there, the better.
The ZapThink Take
The only reason we have to worry about uploading Big Data to the Cloud in the first place is because our Big Data aren’t already in the Cloud. And broadly speaking, the reason they’re not already in the Cloud is because the Cloud isn’t everywhere. Instead, we think of the Cloud as being locked away in data centers, those alien, air conditioned facilities packed full of racks of high tech equipment.
That may be true today, but as ZapThink has discussed before, there’s nothing in the definition of Cloud Computing that requires Cloud resources to live in data centers. You might have a bit of the Cloud in your pocket, or on your laptop, in your car, or in your refrigerator. For now, this vision of the Internet of Things meeting the Cloud is mostly the stuff of science fiction. We’re only now figuring out what it means to have a ubiquitous global network of sensors, from the aforementioned EKGs and UPC scanners to traffic cameras to home thermostats. But the writing is on the wall. Just as we now don’t think twice about carrying supercomputers in our pockets, it’s only a matter of time until the Cloud itself is fully distributed and ubiquitous. When that happens, the question of moving Big Data to the Cloud will be moot. They will already be there.
Are you one of the vendors mentioned in this article and have a correction, or a vendor who should have been mentioned but wasn’t? Please feel free to comment here.
Image Source: US Navy
The Internet of Things. Cloud. Big Data. Real-Time Analytics. To those who do not quite understand what these phrases mean (and let’s be honest, that’s likely to be a large portion of the world), words like “IoT” and “Big Data” are just buzzwords. The truth is, the Internet of Things encompasses much more than jargon and predictions of connected devices. According to Parker Trewin, Senior Director of Content and Communications of Aria Systems, “IoT is big news because it ups the ante: Reach out ...
Jul. 30, 2015 06:00 AM EDT Reads: 387
[video] Logging and Monitoring with @Sematext Founder @OtisG | @DevOpsSummit #DevOps #Logging #Monitoring
"We got started as search consultants. On the services side of the business we have help organizations save time and save money when they hit issues that everyone more or less hits when their data grows," noted Otis Gospodnetić, Founder of Sematext, in this SYS-CON.tv interview at @DevOpsSummit, held June 9-11, 2015, at the Javits Center in New York City.
Jul. 29, 2015 11:45 PM EDT Reads: 1,022
Modern DevOps Tool Kit By @Logentries and @NewRelic | @DevOpsSummit #DevOps #Containers #Microservices
Auto-scaling environments, micro-service architectures and globally-distributed teams are just three common examples of why organizations today need automation and interoperability more than ever. But is interoperability something we simply start doing, or does it require a reexamination of our processes? And can we really improve our processes without first making interoperability a requirement for how we choose our tools?
Jul. 29, 2015 07:15 PM EDT Reads: 393
Where the Network Got Invited to the Party By @LMacVittie | @DevOpsSummit #DevOps #Docker #Containers #Microservices
At DevOps Summit NY there’s been a whole lot of talk about not just DevOps, but containers, IoT, and microservices. Sessions focused not just on the cultural shift needed to grow at scale with a DevOps approach, but also made sure to include the network ”plumbing” needed to ensure success as applications decompose into the microservice architectures enabling rapid growth and support for the Internet of (Every)Things.
Jul. 29, 2015 06:15 PM EDT Reads: 1,747
How do you securely enable access to your applications in AWS without exposing any attack surfaces? The answer is usually very complicated because application environments morph over time in response to growing requirements from your employee base, your partners and your customers. In his session at @DevOpsSummit, Haseeb Budhani, CEO and Co-founder of Soha, shared five common approaches that DevOps teams follow to secure access to applications deployed in AWS, Azure, etc., and the friction an...
Jul. 29, 2015 04:30 PM EDT Reads: 499
Take the Long View with Digital Transformation By @IoT2040 | @ThingsExpo #IoT #M2M #API #Microservices #InternetOfThings
Digital Transformation is the ultimate goal of cloud computing and related initiatives. The phrase is certainly not a precise one, and as subject to hand-waving and distortion as any high-falutin' terminology in the world of information technology. Yet it is an excellent choice of words to describe what enterprise IT—and by extension, organizations in general—should be working to achieve. Digital Transformation means: handling all the data types being found and created in the organizat...
Jul. 29, 2015 04:00 PM EDT Reads: 1,070
This week, I joined SOASTA as Senior Vice President of Performance Analytics. Given my background in cloud computing and distributed systems operations — you may have read my blogs on CNET or GigaOm — this may surprise you, but I want to explain why this is the perfect time to take on this opportunity with this team. In fact, that’s probably the best way to break this down. To explain why I’d leave the world of infrastructure and code for the world of data and analytics, let’s explore the timing...
Jul. 29, 2015 03:45 PM EDT Reads: 355
The Software Defined Data Center (SDDC), which enables organizations to seamlessly run in a hybrid cloud model (public + private cloud), is here to stay. IDC estimates that the software-defined networking market will be valued at $3.7 billion by 2016. Security is a key component and benefit of the SDDC, and offers an opportunity to build security 'from the ground up' and weave it into the environment from day one. In his session at 16th Cloud Expo, Reuven Harrison, CTO and Co-Founder of Tufin,...
Jul. 29, 2015 03:00 PM EDT Reads: 468
You often hear the two titles of "DevOps" and "Immutable Infrastructure" used independently. In his session at DevOps Summit, John Willis, Technical Evangelist for Docker, covered the union between the two topics and why this is important. He provided an overview of Immutable Infrastructure then showed how an Immutable Continuous Delivery pipeline can be applied as a best practice for "DevOps." He ended the session with some interesting case study examples.
Jul. 29, 2015 02:15 PM EDT Reads: 146
Jul. 29, 2015 02:00 PM EDT Reads: 281
Discussions about cloud computing are evolving into discussions about enterprise IT in general. As enterprises increasingly migrate toward their own unique clouds, new issues such as the use of containers and microservices emerge to keep things interesting. In this Power Panel at 16th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists addressed the state of cloud computing today, and what enterprise IT professionals need to know about how the latest topics and trends affect t...
Jul. 29, 2015 02:00 PM EDT Reads: 1,172
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.
Jul. 29, 2015 01:45 PM EDT Reads: 426
Microservices Total Cost of Ownership: Too Soon? By @Aruna13 | @DevOpsSummit #DevOps #Docker #Containers #Microservices
Microservices are hot. And for good reason. To compete in today’s fast-moving application economy, it makes sense to break large, monolithic applications down into discrete functional units. Such an approach makes it easier to update and add functionalities (text-messaging a customer, calculating sales tax for a specific geography, etc.) and get those updates / adds into production fast. In fact, some would argue that microservices are a prerequisite for true continuous delivery. But is it too...
Jul. 29, 2015 01:00 PM EDT Reads: 670
Summer is finally here and it’s time for a DevOps summer vacation. From San Francisco to New York City, our top summer conferences list is going to continuously deliver you to the summer destinations of your dreams. These DevOps parties are hitting all the hottest summer trends with Microservices, Agile, Continuous Delivery, DevSecOps, and even Continuous Testing. Move over Kanye. These are the top 5 Summer DevOps Conferences of 2015.
Jul. 29, 2015 01:00 PM EDT Reads: 666
[video] Infrastructure as a Toolbox By @SoftLayer at @CloudExpo New York | #IoT #API #Containers #Microservices
Countless business models have spawned from the IaaS industry. Resell Web hosting, blogs, public cloud, and on and on. With the overwhelming amount of tools available to us, it's sometimes easy to overlook that many of them are just new skins of resources we've had for a long time. In his General Session at 16th Cloud Expo, Phil Jackson, Lead Technology Evangelist at SoftLayer, broke down what we've got to work with and discuss the benefits and pitfalls to discover how we can best use them to d...
Jul. 29, 2015 01:00 PM EDT Reads: 1,949
Puppet Labs has published their annual State of DevOps report and it is loaded with interesting information as always. Last year’s report brought home the point that DevOps was becoming widely accepted in the enterprise. This year’s report further validates that point and provides us with some interesting insights from surveying a wide variety of companies in different phases of their DevOps journey.
Jul. 29, 2015 01:00 PM EDT Reads: 169
[session] The Container New World By @KeGilpin | @DevOpsSummit #DevOps #Docker #Containers #Microservices
Containers are changing the security landscape for software development and deployment. As with any security solutions, security approaches that work for developers, operations personnel and security professionals is a requirement. In his session at DevOps Summit, Kevin Gilpin, CTO and Co-Founder of Conjur, will discuss various security considerations for container-based infrastructure and related DevOps workflows.
Jul. 29, 2015 01:00 PM EDT Reads: 1,062
What we really mean to ask is whether microservices architecture is SOA done right. But then, of course, we’d have to figure out what microservices architecture was. And if you think defining SOA is difficult, pinning down microservices architecture is unquestionably frying pan into fire time. Given my years at ZapThink, fighting to help architects understand what Service-Oriented Architecture really was and how to get it right, it’s no surprise that many people ask me this question.
Jul. 29, 2015 10:30 AM EDT Reads: 378
[video] An Interview with @ProfitBricksUSA CEO @AchimWeiss | @CloudExpo #DevOps #Docker #Containers #Microservices
"ProfitBricks was founded in 2010 and we are the painless cloud - and we are also the Infrastructure as a Service 2.0 company," noted Achim Weiss, Chief Executive Officer and Co-Founder of ProfitBricks, in this SYS-CON.tv interview at 16th Cloud Expo, held June 9-11, 2015, at the Javits Center in New York City.
Jul. 29, 2015 10:15 AM EDT Reads: 1,110
One of the ways to increase scalability of services – and applications – is to go “stateless.” The reasons for this are many, but in general by eliminating the mapping between a single client and a single app or service instance you eliminate the need for resources to manage state in the app (overhead) and improve the distributability (I can make up words if I want) of requests across a pool of instances. The latter occurs because sessions don’t need to hang out and consume resources that could ...
Jul. 29, 2015 10:15 AM EDT Reads: 156