Five top tips for big data projects

Most businesses are in the earliest stages of big data adoption. Few have thought beyond the technology itself to how big data will affect their operational processes and information architectures.

Whether projects are past the pilot stage and being deployed in production, or still on the horizon, they require strategic thinking and adequate planning to avoid well-worn pitfalls that prevent them from achieving success. Here are five top tips to help businesses achieve optimum value from big data projects:

1. Don’t focus on volume
It might seem like a paradox, but big data can be both large and small: what defines it is as much its diversity of origin, style, consistency and quality as its sheer size. Some organisations are dealing with massive quantities of data, while others have much smaller data sets to exploit but a broader range of sources and formats to deal with. Make sure you go after the right data: identify all the sources that are relevant and don’t be embarrassed if you don’t need to scale your data computing cluster to hundreds of nodes right away.

2. Don’t leave data behind
Some of the data you need for your big data projects is clearly identified – such as transactional data used or generated by business applications. However, more of it is hidden on servers, or in log files, desktops or manufacturing systems. Much of this is neglected and tends to be referred to as ‘dark data’. Some of it is even going to waste in the ‘exhaust fumes’ of IT. This ‘exhaust data’ – generated by sensors and logs – is purged after a certain amount of time or never stored in the first place. All of it is potentially relevant. Don’t restrict your project to the first category: take an inventory of dark data and deploy collection mechanisms for exhaust data, so that it too contributes value to your business.
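
As a rough illustration of what a first dark data inventory might look like, the Python sketch below walks a directory tree and tallies files by extension, recording how many there are, how much space they take and how many have not been touched recently. The function name inventory_dark_data, the 90-day staleness threshold and the use of file extensions as a stand-in for data type are all assumptions made for this example, not a prescribed method.

```python
import os
import time
from collections import defaultdict


def inventory_dark_data(root, stale_after_days=90, now=None):
    """Summarise files under `root` by extension (hypothetical helper).

    Returns {extension: {"files": int, "bytes": int, "stale": int}}, where
    "stale" counts files not modified within `stale_after_days`.
    """
    now = time.time() if now is None else now
    cutoff = now - stale_after_days * 86400
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "stale": 0})

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            # Extension is lower-cased so ".LOG" and ".log" are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entry = summary[ext]
            entry["files"] += 1
            entry["bytes"] += info.st_size
            if info.st_mtime < cutoff:
                entry["stale"] += 1

    return dict(summary)
```

Pointing a helper like this at a shared drive or a log directory gives a first, coarse picture of which kinds of files are accumulating unused; a real inventory would also need to cover databases, desktops and device data that a file walk cannot see.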

3. Don’t move everything – distribute data logically
Too many organisations looking for ways to break down data silos focus on bringing all the data together in one central location. Hadoop is a great storage resource for large data volumes (and is itself distributed across clusters). However, you need to think ‘distribution’ beyond Hadoop. It’s not always necessary to duplicate and replicate everything. Some data is already readily available in the enterprise data warehouse, where fast, random access can be applied. Some of it might be better off staying in the location in which it was produced. The ‘logical data warehouse’ concept, in which data stays in its source systems and is presented through a single logical view, applies well in the non-big-data world. Make use of it for big data too.
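
The idea of leaving data where it lives and combining it only at query time can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a product design: the LogicalWarehouse class, its register and join methods, and the sample customers and clicks data are all hypothetical, with an in-memory SQLite database standing in for the enterprise data warehouse and a plain list standing in for data left at its source.

```python
import sqlite3


class LogicalWarehouse:
    """Toy catalogue that maps dataset names to where the data already lives."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        # `fetch` is any zero-argument callable returning an iterable of dict rows.
        self._sources[name] = fetch

    def rows(self, name):
        return list(self._sources[name]())

    def join(self, left, right, key):
        """Join two registered datasets on `key` at query time, without
        copying either one into a central store."""
        index = {}
        for row in self.rows(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in self.rows(left)
            for r_row in index.get(l_row[key], [])
        ]


# Source 1: customer records held in a relational warehouse (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])


def fetch_customers():
    cur = conn.execute("SELECT customer_id, name FROM customers")
    cols = [c[0] for c in cur.description]
    return (dict(zip(cols, r)) for r in cur)


# Source 2: click events that stay where they were produced (a plain list here).
clicks = [{"customer_id": 1, "page": "/pricing"}, {"customer_id": 1, "page": "/docs"}]

lw = LogicalWarehouse()
lw.register("customers", fetch_customers)
lw.register("clicks", lambda: clicks)
joined = lw.join("clicks", "customers", "customer_id")
```

Each source is read through its own access path only when a question is asked, so nothing is duplicated up front; the trade-off is that query performance then depends on the slowest source involved.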

4. It’s not just about storage – think processing platforms
Hadoop is not only a repository for big data with its distributed file system but also an engine that gives businesses the potential to process data and extract meaningful information from it. A broad ecosystem of tools and programming paradigms exists that covers a wide range of data manipulation use cases. From YARN to MapReduce or from HiveQL to Pig – complemented by Impala, Stinger or Drill – or through the merging of Hadoop and SQL engines such as HAWQ, there are processing resources available that make it unnecessary to move data out of the platform in order to work on it. All the resources are here, at your fingertips.
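
For readers unfamiliar with the MapReduce paradigm mentioned above, the sketch below imitates its three phases (map, shuffle, reduce) in plain Python within a single process. It is not Hadoop's API and does nothing distributed; the function names and the assumed log line layout of timestamp, level and message are illustrative choices only.

```python
from collections import defaultdict


def map_phase(records, mapper):
    # Apply the mapper to every record and stream out its (key, value) pairs.
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    # Group all values by key, as the framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    # Collapse each key's list of values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def log_level_mapper(line):
    # Emit (level, 1) for lines shaped like "<timestamp> <LEVEL> <message>".
    parts = line.split(maxsplit=2)
    if len(parts) >= 2:
        yield parts[1], 1


def count_reducer(_key, values):
    return sum(values)


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)
```

Calling run_job(lines, log_level_mapper, count_reducer) on a list of log lines returns a count per log level. On a real cluster the same mapper and reducer logic would run in parallel next to the data, which is the point of processing inside the platform rather than exporting first.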

5. Last but not least, don’t treat big data as an island
Sandboxes are fine for proofs of concept but, when big data projects go live, they need to be treated as an integral part of your overall IT infrastructure and information architecture rather than as siloed projects. You need to connect big data applications to other systems, both upstream and downstream. It is critical that big data becomes part of your overall IT and information governance policy.

Interest in big data is increasing rapidly, as reflected in the growing number of companies rolling out big data strategies. Unfortunately, many companies are still stuck on the starting blocks. The technologies are still in their infancy and, because the platforms and applications are new, big data projects are inevitably in the spotlight from the word go and quickly have to live up to a tremendously high level of expectation.

Leading-edge technologies lower the adoption barrier and make it easier to get started. Yet moving pilot projects into mainstream IT requires more than just technology. If businesses take note and follow the five tips above, they will stand a better chance of avoiding the pitfalls on their big data journey.