Data Mining in Streaming Data

Symbolic Representation, Dimension Reduction, Clustering...

Lately, I’ve been working on some interesting projects involving not just the usual suspects of stream processing, but data mining within high-velocity time series. In conjunction with that effort, I’ve been doing a lot of research in the areas of symbolic representation, dimension reduction, clustering, indexing, classification, and anomaly detection. A prolific researcher in this area is Dr. Eamonn Keogh. I’ll be applying some of his team’s ideas to some interesting customer problems and telling you all about it here. Let’s get started!


In dealing with real-time streaming numerical data, there is sometimes just too much of it to do anything meaningful with in real time. For example, in pattern recognition, computing nearest neighbors over continuous, high-dimensional data is a compute nightmare. Or, once you’ve identified a pattern of interest, finding similar patterns in either historical or streaming data is extremely compute intensive and, until recently, outside the scope of streaming engines. This is because if you need to go outside of main memory, even if you’re distributed like we are, you get to say “Hello!” to my friend, Latency!


There are several numerical techniques one can employ to summarize streaming numerical data. The problem with these representations is that they are all continuous, or real valued. Because the resulting values are not discrete, we can’t use algorithms like hashing or search. Symbolic representations have been tried too, but another large problem, according to Dr. Keogh, is that none of the popular ones allows a distance measure that lower bounds the corresponding distance measure on the underlying data, that is, one guaranteed never to exceed the true distance. This means that once you’ve conflated your data, any analysis on that representation might not be accurate, or representative of the underlying data stream. Well, that’s no good! So what to do?


Symbolic Aggregate approXimation (SAX) allows data to be conflated and discretized, and allows distance to be calculated between observations. That means we can use all of the wholesome goodness out there in the areas of clustering, indexing (search), classification, and anomaly detection while also dramatically reducing the amount of data we need to crunch. That gets us closer to integrating streaming events and historical data. Nirvana. SAX is the result of much work done, and still being done, by Dr. Keogh and his team at the University of California, Riverside, who have published lots of information about that work.
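
To make the discretization and the distance calculation concrete, here is a minimal sketch in Python. It assumes the series has already been normalized and reduced to a handful of segment averages, and it follows my reading of the SAX papers: breakpoints split the standard normal curve into equally likely regions, each region gets a letter, and the distance between two words only counts letters that are more than one region apart. The function names (`breakpoints`, `to_symbols`, `mindist`) are made up for illustration and are not from any library or from DarkStar.

```python
import bisect
import math
from statistics import NormalDist


def breakpoints(alphabet_size):
    """Cut points that split a standard normal into equiprobable regions."""
    return [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def to_symbols(segment_means, alphabet_size):
    """Map each (already normalized) segment average to a letter: a, b, c, ..."""
    cuts = breakpoints(alphabet_size)
    # bisect_right sends a value equal to a cut point into the higher region.
    return "".join(chr(ord("a") + bisect.bisect_right(cuts, v)) for v in segment_means)


def mindist(word_a, word_b, alphabet_size, original_length):
    """Distance between two SAX words of equal length.

    Letters in the same or adjacent regions contribute 0; otherwise the
    contribution is the gap between the nearest edges of the two regions.
    The result is scaled by sqrt(original_length / word_length).
    """
    if len(word_a) != len(word_b):
        raise ValueError("words must have the same length")
    cuts = breakpoints(alphabet_size)
    total = 0.0
    for sym_a, sym_b in zip(word_a, word_b):
        i, j = ord(sym_a) - ord("a"), ord(sym_b) - ord("a")
        if abs(i - j) > 1:
            gap = cuts[max(i, j) - 1] - cuts[min(i, j)]
            total += gap * gap
    return math.sqrt(original_length / len(word_a)) * math.sqrt(total)


# Illustrative segment averages (made-up values, not real market data).
word = to_symbols([-1.2, -0.3, 0.4, 1.1], alphabet_size=4)
print(word)
print(mindist(word, "aabb", alphabet_size=4, original_length=8))
```

Because the words are plain strings, they can be hashed, indexed, and searched like any other text, which is exactly what the real-valued representations don't allow.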


First, we need to do some prep work, and I recommend reading the papers; they’re informative and there’s really not too much math either. We’ll use Piecewise Aggregate Approximation (PAA) as an intermediate step before SAX encoding, and before applying PAA, we’ll normalize the data. In my next post, we’ll show some spiffy charts and graphs as we implement SAX within DarkStar (our distributed event processing system that incorporates streaming map/reduce and CEP functionality). Go read the papers and then come back for some fun.
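
The prep work described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not DarkStar code. The function names `z_normalize` and `paa` are hypothetical, normalization is assumed to mean rescaling to zero mean and unit standard deviation, and the series length is assumed to divide evenly by the number of segments.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        # A flat series has no shape to preserve; map it to all zeros.
        return [0.0] * n
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: average equal-width chunks."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


# Illustrative values only: eight observations reduced to four averages.
raw = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
normalized = z_normalize(raw)
reduced = paa(normalized, segments=4)
print(reduced)
```

Normalizing first means the segment averages describe the shape of the series rather than its absolute level, so two streams trading at very different prices can still be compared.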



More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud-based event processing using map/reduce, complex event processing, and event-driven pattern matching agents. You can also send topic suggestions or questions to [email protected]
