Having been involved in a variety of big data projects over the last few years, I’ve spotted some common problems that keep popping up and wreaking havoc. The simple truth is that these projects are far more complex than typical software projects – more complex than some teams plan for.
In this post, I’ll explore these repeating issues and get to the bottom of why they keep happening, as well as the best way to avoid them.
The proliferation of open source components
Big data projects are inherently more complex than typical software projects. One major reason is the number of open source components that are required. There is a huge amount of innovation happening in this space, with new and improved components being created all the time. When these offer benefits over older components, they are inevitably swapped in.
An example would be using Apache Spark as the core data processing engine in place of the older Hadoop MapReduce. Or it could be using Kafka for messaging in place of older technology such as JMS.
By experimenting with new combinations, you could end up facing specific problems which other companies have never seen before – especially when compared to an off-the-shelf stack, such as those offered by Oracle.
However, there are a number of vendors in the open source space that provide bundling and support of open source stacks. Notable options include Cloudera and MapR: if you’re experimenting with open source components, they are both worth talking to. Their expertise can provide peace of mind when deploying these complex systems, from early production to support if something goes wrong.
Lack of experience with greenfield projects
Building something from scratch requires a totally different set of skills from those required to simply manage suppliers. For greenfield projects, many organisations outsource to external companies or buy off-the-shelf products. If you move from that process to pursuing initiatives like building a data lake in-house, your organisation might lack the specific expertise required to do so successfully.
When starting from scratch with any big data project, it’s worth doing a full audit of your internal skill sets and experience to make sure you can successfully pull it off.
The challenge of data integration
One of the biggest challenges with big data projects is data integration. As organisations evolve, heterogeneous data sources inevitably spread throughout the company, each with its own quirks and difficulties. The challenge of extracting the valuable data from these sources, and understanding the entities within them, shouldn’t be underestimated.
What’s more, there’s the human element: some teams might provide compartmentalised data to the wider organisation, omitting certain information which might reflect poorly on them.
Committing time to doing upfront analysis into each data source – and then doing small proof of concept projects to ensure data can actually be extracted – can really help improve clarity around your data integration challenges.
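A lightweight profiling script is one way to make that upfront analysis concrete before committing to full integration work. The sketch below is a minimal example of the idea, using SQLite as a stand-in for whatever source system you face; the `customers` table and its columns are purely hypothetical.

```python
import sqlite3

def profile_table(conn, table, sample_size=1000):
    """Sample rows from a table and report per-column null counts and value types."""
    cur = conn.execute(f"SELECT * FROM {table} LIMIT {sample_size}")
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    report = {}
    for i, col in enumerate(columns):
        values = [row[i] for row in rows]
        nulls = sum(1 for v in values if v is None)
        types = {type(v).__name__ for v in values if v is not None}
        report[col] = {"rows_sampled": len(values), "nulls": nulls, "types": sorted(types)}
    return report

# Hypothetical usage: profile a small local extract of a 'customers' table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None)])
print(profile_table(conn, "customers"))
```

Running something like this against each candidate source quickly surfaces the quirks – unexpected nulls, mixed types, surprising volumes – that otherwise appear much later, mid-project.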
Moving data to the cloud
Your new big data initiative might be a good opportunity to push all your data to the cloud, somewhere like AWS S3 or Google Cloud Storage. But of course nothing’s ever that simple, and there are a few major challenges you’ll need to address.
One is ensuring that the network has enough bandwidth to support both the initial data migration and the continual data transfer that follows. Transferring historical data can also present a big challenge, and a more manual approach such as AWS Snowball may be needed to ship the data.
Building up a good estimate of your data transfer requirements and pushing sample loads will help mitigate the risk of unpleasant surprises down the line.
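Even a back-of-the-envelope calculation goes a long way here. The sketch below, using entirely hypothetical figures, converts a historical data volume and available bandwidth into an expected transfer time, which makes the network-versus-Snowball decision easier to discuss:

```python
def transfer_days(data_tb, bandwidth_mbps, utilisation=0.7):
    """Estimate days needed to move data_tb terabytes over a network link,
    assuming only a fraction of nominal bandwidth is usable in practice."""
    data_bits = data_tb * 8 * 10**12              # terabytes -> bits (decimal units)
    effective_bps = bandwidth_mbps * 10**6 * utilisation
    seconds = data_bits / effective_bps
    return seconds / 86400

# Hypothetical example: 50 TB of history over a 1 Gbps link
days = transfer_days(50, 1000)
print(f"{days:.1f} days")  # roughly a week at 70% utilisation
```

If the same sum comes out in months rather than days, that is a strong signal a physical transfer appliance is worth the logistics.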
Lack of internal expertise
Big data projects often exist on the bleeding edge of technology. As such, your organisation might lack engineers with the right skills to conduct such complex projects. In this case, you’ll need to rely heavily on external resources – an option which can pose its own series of risks and additional costs. For example, you’ll need to put training in place to ensure that the knowledge of building and maintaining your system is retained within the organisation.
No involvement from relevant stakeholders
While this kind of project is often sold to the IT department, attracted by the prestige and kudos of building such systems, it might be marketing or accounts that actually uses the finished product on a daily basis. This can lead to projects that are built at great expense but which end up not being optimised for the actual end user.
It is essential to have involvement from all key stakeholders to ensure the project is delivered well for the right users.
Bad business processes
When it comes to big data projects, broken or dysfunctional business processes – especially those propped up by legacy systems – quickly prove unsustainable once the new system is in place.
The project team needs to resolve these deep-rooted issues while building out the larger project. This inevitably leads to delays and a level of complexity which wasn’t necessarily considered at the project outset.
Kicking off a separate project where business processes are reviewed and optimised prior to embarking on the big data project is always a good idea.
Data lakes becoming data swamps
There has been a lot of hype around the data lake and the idea of offering users a data buffet where they can store all sorts of data in different formats. In reality, this can add significant (and unnecessary) complexity to the project, drawing resources away from more essential tasks.
Data lakes are notoriously difficult to productionise, and many wind up sitting around as pet projects within the IT department.
Modern data warehouses have come on in leaps and bounds and now offer a much more practical alternative to data lakes. These warehouses support standard SQL and enforce constraints on relationships and data types, which contribute to a higher quality of data.
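The data-quality benefit comes from ordinary relational constraints doing their job at load time. A minimal sketch of the idea, using SQLite as a stand-in for a cloud warehouse (the table names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce relationships in SQLite

# Types, NOT NULL, CHECK, and foreign keys reject bad rows on the way in,
# rather than letting them silently accumulate as they would in a lake.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL CHECK (total >= 0)
    )""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")  # valid row loads fine

try:
    # Orphaned order: there is no customer 999, so the constraint rejects it
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

In a schema-on-read lake, that orphaned row would land without complaint and only surface as a problem in someone’s report months later.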
Slow data access speeds
Even if you get all your data into the lake, you still need to figure out a way to read it. You might store the data in an elaborate Parquet layout and query it with Apache Hive, but from personal experience, I can say this never works anywhere near as well as you’d hope – especially compared to modern cloud data warehouses such as Amazon Redshift and Snowflake.
It’s not just data access speed, either: from a security perspective, users lose a lot of the fine-grained row-level security that comes with a traditional warehouse.
While there’s no shortage of challenges when building big data systems, they’re far from insurmountable. By addressing even a few of the pitfalls described above, your organisation can reduce the risk of failure for its next project.
Get in touch with us if you have a big data challenge that we can help with. Visit our product page to learn more.