Big Data and why
Data is more important to the business model than ever before. The more data the better; in fact, store everything!
This has led to a new trend called "Big Data". Big Data focuses on systems that can be scaled (usually distributed on a network or over several networks) so that they are running at peak performance despite the load. Because of this, big data requires the appropriate infrastructure for optimal use. Your developers must know how to optimize your hardware or else your business will end up wasting a lot of very valuable hardware.
All of the largest and most successful companies jumped into big data years ago. Every one of our employees has helped develop big data systems and data procurement at other companies, in industries ranging from banking to health care. Far too many past problems could have been solved easily if "we simply had more data". But fear not! This is where our team excels: we have already worked through the specifics of what to get, from where, and how.
Within the inner workings of a company, big data pairs naturally with a lesser-known concept: the internet of things. The internet of things means, quite simply, hooking up anything and everything so that its data can be read. If an object has a processor in it, it should be sending you data. Luckily, thanks to advances in technology, wireless devices have become ridiculously cheap. Companies can use big data to protect themselves from lawsuits in a multitude of ways, such as hospitals tracking patients on premises, hotels keeping master keys out of the hands of random individuals, or (more detail of the story here) hospitals tracking down a patient's true diagnosis sooner. Yet big data can also work with data from outside the company.
We love working with Big Data; in fact, it is our passion. We have experience with MongoDB, CouchDB, Cassandra, and other NoSQL databases. We can help you build a distributed system, or help you maintain your existing one.
Why don't you think you need it?
Banks Love It
Banks are all crazy about big data collection. They collect every single value of every stock. With this data, complex algorithms are created and modified daily to try to predict future stock prices.
These banks are also able to look inward at their customers. Every cent you spend with them is collected and stored. They use this data to look for particular patterns and spending habits so they can recommend different account types or decide when to approach a customer about a home loan or a refinance. Banks use big data to gain statistical advantages every day.
Health Care Craves It
Health care has lagged behind the market on the IT side for some time, but some large improvements have been made. The industry is finally starting to realize the necessity of simply having more data.
What if a group of patients began contracting infections under one specific doctor? The hospital would have the data to discover this immediately. What if a group of patients were diagnosed with respiratory problems days after admission? With big data, it could be discovered that they all used the same room, which wasn't being cleaned well enough.
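The room scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: the `admissions` records, the room numbers, and the `min_cases` threshold are all hypothetical, not data or rules from any real hospital.

```python
from collections import defaultdict

# Hypothetical admission records: (patient_id, room, developed_respiratory_problem)
admissions = [
    ("p1", "204", True),
    ("p2", "204", True),
    ("p3", "204", True),
    ("p4", "310", False),
    ("p5", "310", True),
    ("p6", "118", False),
]


def rooms_with_clusters(records, min_cases=3):
    """Return the rooms where at least `min_cases` patients were affected."""
    cases_per_room = defaultdict(int)
    for _patient, room, affected in records:
        if affected:
            cases_per_room[room] += 1
    return sorted(room for room, count in cases_per_room.items() if count >= min_cases)


flagged_rooms = rooms_with_clusters(admissions)
print(flagged_rooms)  # ['204']
```

Without the room stored next to each admission, this question could not be asked at all, which is the point of collecting the extra data.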
Having all the data can be nice, but sometimes it's more about what you need. The most recent need I recognized at Health First was finding the most revenue generated by a specific hospital unit and then determining whether the unit needed an expansion.
Google Can't Get Enough
Google may be the number one company trying to collect big data. They are an ad revenue company and they are ridiculously good at it. Google revolutionized the ad market by using big data and is now one of the most recognizable companies in the world.
Under older methods of advertising, a company would purchase an ad to run at a certain time and just hope a potential client saw it. Google tried something different: it created a collection of programs to collect user data.
Google was then able to guarantee specifics about who would view the ad. If a company wanted to target male teenagers, Google used its data collection to show the ad only to those individuals. Companies would pay only for ads that were viewed by their target audience.
Big data made Google what it is today: a powerhouse.
Why not you?
Big data by definition simply means collecting more, any, and all data. Thus the right question is, "What additional data, that we are not currently collecting, could help us assist our customers better?" That's where this conversation started for every company that currently uses it.
If you cannot answer this question yourself, then ask other employees! I have tried this with every client, down to every last employee in the company, and I have never once heard a unanimous 'there is nothing more we need'. Not once.
- Can be as cheap as the cost of a single computer
- Collect data that matters for your company
- Could be literally anything, you just need to go get it!
- Very low-risk investment (not worth the cost? turn it off and sell the hardware)
Set expectations to achieve the highest return value
What is the goal?
It is paramount to remember that this is not just data for the sake of data. The company is going to reach out and collect specific data to achieve some goal and yield something in return.
Are you going to use this new data to streamline a process, collect customer reviews, catalog ad revenue, better manage inventory and logistics, or try some suggestive selling?
What potential value will the company gain by receiving this new data?
Where will the data come from?
This should be clearly defined before any budgets or timelines are set. It can be tricky, so it is better to plan ahead than to let it become a surprise later on.
Is your new data going to come from inside the company? Do you simply need to add a time stamp when packages reach a certain marker on an assembly line? Is a hospital going to start tracking its patients' locations around the hospital with wireless vital sign instruments?
Is the new piece of data external to the company? Are you going to gather customer reviews online? Will you try to gather a list of customer clicks on a website to see what your customers are interested in?
How often will data be collected?
This is a rather simple concept, but it is still important. Are you collecting daily averages that need only 5-10 readings per day? Or does every single inventory item need this piece of data logged, whether it is a time stamp or something entirely new?
Depending on this info, you will also be able to determine what type of data structure to pursue for this endeavor.
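A quick back-of-the-envelope calculation shows how far apart those two cases are. The numbers below are assumptions chosen only for illustration: 10 readings per day for the daily-average case, and a hypothetical 50,000 inventory items logging one 100-byte record each per day.

```python
def records_per_year(readings_per_day, sources=1, days=365):
    """Number of records collected in a year across all sources."""
    return readings_per_day * sources * days


def storage_bytes(records, bytes_per_record):
    """Raw storage needed, ignoring indexes and database overhead."""
    return records * bytes_per_record


# Case 1: a handful of daily-average readings from a single source
daily_averages = records_per_year(readings_per_day=10)

# Case 2: one reading per day for every item in a 50,000-item inventory (assumed size)
per_item = records_per_year(readings_per_day=1, sources=50_000)

print(daily_averages)                # 3650
print(per_item)                      # 18250000
print(storage_bytes(per_item, 100))  # 1825000000 (about 1.8 GB at an assumed 100 bytes per record)
```

A few thousand records a year fits comfortably anywhere; tens of millions a year is where the choice of data structure starts to matter.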
How will the new data be processed?
As stated many times, this is not about collecting data for its own sake. You need to perform some action with this new data, or else all the effort was wasted.
Is this new data going to run through a processor that simply looks for out-of-bounds values as a basic quality check? This is very common in factories.
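An out-of-bounds check of this kind can be very small. The sketch below assumes hypothetical fill weights in grams and an acceptable range of 495 to 505; the function name and the limits are illustrative, not taken from any particular factory system.

```python
def out_of_bounds(readings, low, high):
    """Return (index, value) pairs for readings outside the range [low, high]."""
    return [(i, value) for i, value in enumerate(readings) if not (low <= value <= high)]


# Hypothetical fill weights in grams coming off a production line
weights = [500.2, 499.8, 507.1, 500.0, 493.4]

violations = out_of_bounds(weights, low=495.0, high=505.0)
print(violations)  # [(2, 507.1), (4, 493.4)]
```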
Is the data going to be collected for later analytics to determine larger-scope changes for the entire product? These would generally help lead lean initiatives.
Is the data going to be a time stamp and a location that customers can use for package tracking? FedEx performs this type of data logging for nearly every single package to give customers greater transparency into the company's inner workings.
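A time stamp plus a location is all a basic tracking log needs. The following is a minimal sketch, not a description of how FedEx or any carrier actually does it; the `ScanEvent` and `TrackingLog` names, the package id, and the hub names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScanEvent:
    package_id: str
    location: str
    scanned_at: datetime


class TrackingLog:
    """Append-only log of scans; each scan is a time stamp and a location."""

    def __init__(self):
        self._events = []

    def record(self, package_id, location, scanned_at=None):
        # Default to the current UTC time when no time stamp is supplied
        scanned_at = scanned_at or datetime.now(timezone.utc)
        event = ScanEvent(package_id, location, scanned_at)
        self._events.append(event)
        return event

    def history(self, package_id):
        """All scans for one package, oldest first."""
        events = [e for e in self._events if e.package_id == package_id]
        return sorted(events, key=lambda e: e.scanned_at)

    def last_known_location(self, package_id):
        events = self.history(package_id)
        return events[-1].location if events else None


log = TrackingLog()
log.record("PKG-1", "Orlando hub", datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc))
log.record("PKG-1", "Atlanta hub", datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc))
print(log.last_known_location("PKG-1"))  # Atlanta hub
```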
Your New Data: Structured vs Non-Structured
Structured / Relational
Structured data is best described as a normal relational database: MySQL, Access, SQL Server, Oracle, Db2, etc. The tables and columns must be defined before any data can enter them. As the name suggests, the structure is tightly linked to the data being stored. This is generally the easiest model to learn, as the idea comes as second nature.
This works very well when your column types won't change, which is why relational databases are so popular in the out-of-the-box products that companies buy every day. The vendors know how they want their product to work, and with structured data they can enforce this better.
Structured data is generally slower to process but maintains much higher data integrity. The data model is mathematically grounded (records are linked through keys), so one appropriate modification to a table carries through to every linked database record.
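Both ideas, a predefined structure and one fix reaching every linked record, can be seen with Python's built-in `sqlite3` module. The `units` and `visits` tables, their columns, and the values are hypothetical examples, not a schema from any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The tables and columns must exist before any data goes in
conn.execute("CREATE TABLE units (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE visits ("
    " id INTEGER PRIMARY KEY,"
    " unit_id INTEGER NOT NULL REFERENCES units(id),"
    " revenue REAL NOT NULL)"
)

conn.execute("INSERT INTO units (id, name) VALUES (1, 'Cardiolgy')")  # typo on purpose
conn.executemany(
    "INSERT INTO visits (unit_id, revenue) VALUES (?, ?)",
    [(1, 1200.0), (1, 800.0), (1, 450.0)],
)

# One update to the units table corrects the name for every linked visit
conn.execute("UPDATE units SET name = 'Cardiology' WHERE id = 1")

rows = conn.execute(
    "SELECT u.name, v.revenue FROM visits v"
    " JOIN units u ON u.id = v.unit_id ORDER BY v.id"
).fetchall()
print(rows)  # [('Cardiology', 1200.0), ('Cardiology', 800.0), ('Cardiology', 450.0)]

# The predefined structure rejects data that does not fit it
try:
    conn.execute("INSERT INTO visits (unit_id, revenue) VALUES (1, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

The join in the `SELECT` is the price paid for that integrity: the name lives in one place, so every read has to look it up.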
In recent years, new methods have made structured data much more scalable. Database tables can apply horizontal and vertical partitioning to allow some form of parallel processing inside an individual table. Virtual hard drives can combine the storage of multiple disks so that they act as one larger volume.
It should be noted that these two methods do not provide truly unlimited scalability. They are simply implementations that leave room for future scaling.
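The difference between the two kinds of partitioning is easiest to see on a tiny table. Real databases do this inside the engine; the plain-Python sketch below only illustrates the idea, and the rows, column names, and the simple `id % partitions` rule are assumptions made for the example.

```python
# A hypothetical four-row table, held as a list of dicts for illustration
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com", "notes": "long free text"},
    {"id": 2, "name": "Ben", "email": "ben@example.com", "notes": "long free text"},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "notes": "long free text"},
    {"id": 4, "name": "Di", "email": "di@example.com", "notes": "long free text"},
]


def horizontal_partition(rows, partitions):
    """Split by row: each bucket holds whole rows, but fewer of them."""
    buckets = [[] for _ in range(partitions)]
    for row in rows:
        buckets[row["id"] % partitions].append(row)
    return buckets


def vertical_partition(rows, hot_columns):
    """Split by column: frequently read columns in one table, the rest in another."""
    hot = [{key: row[key] for key in ("id", *hot_columns)} for row in rows]
    cold = [
        {key: value for key, value in row.items() if key == "id" or key not in hot_columns}
        for row in rows
    ]
    return hot, cold


buckets = horizontal_partition(records, partitions=2)
print([[row["id"] for row in bucket] for bucket in buckets])  # [[2, 4], [1, 3]]

hot, cold = vertical_partition(records, hot_columns=("name",))
print(hot[0])           # {'id': 1, 'name': 'Ada'}
print(sorted(cold[0]))  # ['email', 'id', 'notes']
```

Each bucket or column group can then be scanned independently, which is where the parallel processing comes from.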
Pros: higher data integrity, easier, more common, strict predefined data, some scalability procedures
Cons: Must construct and format new table and/or columns to accept any new data
Unstructured / Non-Relational
Unstructured data should be thought of as a fluid. It will shape into whatever you want it to be. There is nothing at all holding it back from changing types of objects within the same model structure.
Generally, when you do not know what types of data you will get back, or when your data pool is simply too large, unstructured is the way to go.
There is no limit to the types of things you can store in unstructured databases. You can store everything from standard names and birthdays to JSON objects, pictures, videos, and practically anything else.
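To make "fluid" concrete, here is a small sketch that stores three differently shaped records side by side using only Python's standard `json` module. It stands in for a document store rather than using a real one, and the record contents are hypothetical.

```python
import json

# Three records with completely different shapes, stored together
documents = [
    {"type": "person", "name": "Ada", "birthday": "1990-05-17"},
    {"type": "review", "stars": 4, "text": "Arrived early", "tags": ["shipping"]},
    {"type": "sensor", "device": "pump-7", "readings": [71.2, 70.9], "meta": {"unit": "F"}},
]

# Write one JSON document per line; no table or column definition is needed first
stored = "\n".join(json.dumps(doc) for doc in documents)

# Read everything back and list every field that appeared anywhere
loaded = [json.loads(line) for line in stored.splitlines()]
fields_seen = sorted({key for doc in loaded for key in doc})
print(fields_seen)
# ['birthday', 'device', 'meta', 'name', 'readings', 'stars', 'tags', 'text', 'type']
```

Nothing stops a fourth record from introducing yet another field, which is both the strength and the risk of this approach.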
Unstructured data is often much faster to read because it does not rely on table joins. This can lead to some problems, however: if one value is mistyped across 20 million records, each individual record needs to be modified.
Unstructured data has practically no limit to its scalability. Because of the way non-relational data is stored and processed, you can easily add additional hard drives to expand your storage. Minimal configuration is necessary, and the product can just keep chugging along with no hindrance.
Pros: massive performance increase, can record any and all objects / data, easy and practically unlimited scalability
Cons: low data integrity, so any typos must be fixed individually; less well known
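The typo problem looks like this in practice. In the sketch below every order carries its own copy of the customer name, so reading an order needs no join, but a correction has to visit every record. The `orders` data and the `fix_customer_name` helper are hypothetical.

```python
# Each order document embeds the customer name directly (no join needed to read it)
orders = [
    {"order_id": 1, "customer": "Jon Smiht", "total": 20.0},
    {"order_id": 2, "customer": "Jon Smiht", "total": 35.5},
    {"order_id": 3, "customer": "Mary Jones", "total": 12.0},
]


def fix_customer_name(docs, wrong, right):
    """Correct a mistyped name; every document must be checked individually."""
    touched = 0
    for doc in docs:
        if doc["customer"] == wrong:
            doc["customer"] = right
            touched += 1
    return touched


touched = fix_customer_name(orders, "Jon Smiht", "Jon Smith")
print(touched)                # 2
print(orders[0]["customer"])  # Jon Smith
```

With three orders this is trivial; with 20 million it is a full pass over the data for a single correction.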
And I hate using this term, but it depends. If you need high data integrity because all of your data relates to each other, use structured data in a relational database. However, if you are simply collecting large amounts of data and trusting its integrity, or the amount of data you are logging is just too massive, use non-structured data in a non-relational database. It's in the name for a reason.