#CEDigital – a small conference about big data

Common definitions of Big Data (such as the one in the English edition of Wikipedia) make it clear that we are talking about enormous volumes of data – “data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage and process data within a tolerable elapsed time”. On 25 September 2019, at Bentley Systems’ digital academy in London, Constructing Excellence held a small conference on big data, and I helped to set the scene.

The big picture

Humanity’s ability to create data is growing almost exponentially. Activities in meteorology, genomics, complex physics, biological and environmental research, internet search, and finance and business informatics (to name just a few) are all spewing out huge volumes of data. In our daily personal and business lives, we are increasingly surrounded by devices that contribute to these volumes (mobile devices, software logs, digital cameras, microphones, RFID readers, wireless sensor networks and streaming instrumentation, among others). Population growth, wider literacy, associated use of mobile devices, and adoption of social media and the ‘Internet of things’ are accelerating the trend.

Ten years ago (2009), we and our hardware and software created just under one zettabyte (that’s a trillion gigabytes) of data – by 2016, this figure had grown to 16.3ZB, and by 2025 it will be ten times as much: 163ZB. A growing proportion of this data (around 80%) is unstructured – data that cannot be neatly defined in rows and columns or in databases, but is instead captured in images, video, audio, PDFs, point-clouds, emails, word-processed documents and the like. Such semi-structured and unstructured data requires more storage, is more difficult to manage and protect using legacy solutions, and is more complex to analyse.

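To put those growth figures in perspective, here is a minimal back-of-the-envelope calculation (a Python sketch, purely illustrative) of the compound annual growth rate implied by the 16.3ZB and 163ZB figures quoted above.

```python
# Illustrative only: sanity-check of the data-volume figures quoted above.
ZB_IN_MB = 1e15  # 1 zettabyte = 10^21 bytes = 10^15 megabytes

data_2016 = 16.3   # zettabytes (reported)
data_2025 = 163.0  # zettabytes (forecast)

years = 2025 - 2016
cagr = (data_2025 / data_2016) ** (1 / years) - 1
print(f"Implied compound annual growth: {cagr:.1%}")          # roughly 29% a year
print(f"163ZB is about {data_2025 * ZB_IN_MB:.2e} megabytes")
```

A tenfold increase over nine years works out at roughly 29% compound growth per year.
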
BIM, the built environment, BI and Big Data

Core BIM processes, by contrast, tend to involve the creation and sharing of highly structured model data held in interrogable databases – and, at a project level, the data sets are usually well within the capabilities of commonly used tools. Yes, common data environments (CDEs) may also hold a wealth of associated unstructured data, but model authoring applications and the numerous workflows related to creating new built assets (or refurbishing existing ones) tend to be founded on structured data. As a result, we can generate a lot of business intelligence (BI) from all this information, holding it in data warehouses and presenting it in reports and dashboards – but BI draws on only a tiny proportion of what might be contained in ‘Big Data’.

Some AEC technology vendors might like you to think that their platforms will deliver ‘big data’ insights, but usually they are just crunching what is in their databases.

  • BI applies descriptive statistics to data with high information density to measure things, detect trends and so on.
  • In contrast, Big Data analytics uses inductive statistics to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets to reveal relationships or dependencies and to make predictions. Crucially, Big Data analytics is primarily (often around 90%) focused on unstructured and semi-structured data – the difference is sketched below.

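To make that contrast concrete, the short Python sketch below (using pandas and scikit-learn, with an invented table of weekly site records – none of this is from the conference itself) shows a descriptive, BI-style summary alongside a simple inductive step that fits a regression and projects it forward.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented, illustrative data: a small, highly structured table of site records
records = pd.DataFrame({
    "week":        [1, 2, 3, 4, 5, 6],
    "open_issues": [12, 15, 11, 18, 22, 25],
    "rfis_raised": [3, 4, 3, 6, 7, 8],
})

# BI / descriptive statistics: measure what has already happened
print(records[["open_issues", "rfis_raised"]].describe())
print("Average week-on-week change in open issues:",
      records["open_issues"].diff().mean())

# Big Data-style inductive step: infer a relationship and use it to predict
model = LinearRegression().fit(records[["week"]], records["open_issues"])
predicted = model.predict(pd.DataFrame({"week": [8]}))[0]
print("Predicted open issues in week 8:", round(predicted, 1))
```

Real Big Data analytics would, of course, run this kind of inference over far larger and far messier (largely unstructured) inputs; the sketch only shows the shift from describing the past to inferring and predicting.
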
I used a water analogy to explain the difference. A data warehouse is like a store of bottled water. This water (data) has been filtered, disinfected and packaged for easy consumption – it is clean, refined and structured, and we have confidence in its quality. By contrast, a data lake is like a large body of water in a more natural state: water constantly streams in from different sources to fill the lake, and people can look at the surface, dip a toe in, dive in, or take samples. That data lake holds a vast amount of water (and other things – from microscopic pollutants to plants and animals); its water is not clean, refined or structured, and large volumes of it may need significant processing. Understanding the health and value of the data lake may also involve looking at its setting and its interdependencies with other systems, and re-appraising it periodically, as it will be constantly changing.

Big Data analytics

Analysing big data is typically a multi-step process involving data and text mining, data optimisation, natural language processing, searching, path/pattern analysis and statistical analysis. Often millions, even billions, of data points need to be processed, so analytics is usually conducted on massively parallel software running across multiple servers (technologies include MapReduce, Hadoop and Apache Spark).

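As a hedged illustration of that massively parallel pattern, the PySpark sketch below runs a classic MapReduce-style word count over a hypothetical folder of unstructured site reports; the file path is an assumption for illustration, not a reference to any real project data.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster this work would be
# distributed across many executors rather than run on a single machine.
spark = SparkSession.builder.appName("site-report-terms").getOrCreate()

# Hypothetical path to a folder of plain-text site reports
reports = spark.sparkContext.textFile("data/site_reports/*.txt")

term_counts = (
    reports.flatMap(lambda line: line.lower().split())  # map: line -> words
           .map(lambda word: (word, 1))                 # map: word -> (word, 1)
           .reduceByKey(lambda a, b: a + b)             # reduce: sum counts per word
)

# Print the ten most frequent terms
for word, count in term_counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, count)

spark.stop()
```

The same map-and-reduce shape scales from a laptop to hundreds of nodes, which is what makes it suitable for data sets “beyond the ability of commonly used software tools”.
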
Human subject matter experts help identify what data might need to be ingested, how that data might need to be linked, and whether additional processing or data might also be needed. Artificial intelligence and machine learning are also exploited, as algorithms engage in anomaly detection, association rule learning, clustering, classification, regression and summarisation. The outcomes from typical big data analytics can be visualised through various dashboard ‘lenses’: groups / fractal maps, links and networks, geographical maps, and lists.

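For a flavour of the machine-learning techniques listed above, here is a small scikit-learn sketch combining two of them, clustering and anomaly detection; the issue ‘features’ are randomly generated stand-ins, not real project data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Invented feature matrix: one row per project issue, with made-up columns
# such as days_open, severity_score and number_of_reassignments.
rng = np.random.default_rng(42)
issues = rng.normal(loc=[10.0, 3.0, 1.0], scale=[4.0, 1.0, 0.5], size=(200, 3))

# Clustering: group similar issues together
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(issues)

# Anomaly detection: flag issues that do not fit the usual pattern (-1 = anomaly)
outliers = IsolationForest(random_state=0).fit_predict(issues)

print("Issues per cluster:", np.bincount(clusters))
print("Anomalous issues flagged:", int((outliers == -1).sum()))
```
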
The information-intensive built environment industry has huge opportunities to exploit the data it collects, and recent UK industry debate about ‘digital twins’ and ‘national digital twins’ hints at what the future might hold. Owner-operators and their project teams often accumulate vast swathes of information, much of it in documents and not always well connected – disciplinary, organisational, contractual, and digital silos often need to be broken down to get the ‘big picture’ of how built projects are planned, designed, constructed and then operated and maintained. Analysing such records across entire portfolios, and even establishing data connections to other portfolios, may yield further insights. Such insights might be even more valuable when decision-makers can also draw on data showing the social, economic and environmental impacts of interactions with the wider built environment, and then start making informed predictions about what new investments might deliver (this is the interconnected vision of the ‘national digital twin’ put forward by the Mark Enzer-led Digital Framework Task Group of the Centre for Digital Built Britain).

A case study: BAM Ireland

At the Constructing Excellence mini-conference, Michael Murphy of BAM Ireland illustrated how contractors might exploit the hitherto under-utilised data they collect while delivering their projects. He highlighted how silo mentalities hinder this process, and mentioned a research finding suggesting “95% of all data captured goes unused in the engineering and construction industries.” In its use of data, he said construction needs to shift from being reactive (responding to events that have already happened) to become proactive (actively identifying risks by analysing an organisation’s processes), and then predictive (analysing processes and the environment to identify potential future problems). Murphy also cited a 2019 KPMG survey of 223 business leaders which forecast that data analytics and predictive capability would be the number one priority for tomorrow’s construction businesses.

“What if every team member, across every discipline, could predict and act to prevent risk, every day?” Murphy asked. BAM worked with Autodesk to develop and apply Autodesk’s Construction IQ technology, which analysed data collected in the BIM 360 suite during project delivery. Like many other construction organisations, BAM was often engaged on multiple projects simultaneously, and so looked to digitise its information capture processes rather than rely on traditional paper-based data management methods (time-consuming to compile, with data rarely used to its full potential).

While using the BIM 360 suite on a seven-project BIM to FM programme to deliver court buildings for Ireland’s Ministry of Justice, BAM began to exploit the thousands of pieces of data its teams were collecting. The initial project and programme dashboards were alarming, though: Murphy said hundreds of high-risk issues appeared to be outstanding on each project, suggesting BAM Ireland was a high-risk contractor. However, further investigation revealed that many issues had been dealt with, but (highlighting a training issue) the BIM 360 users involved had not closed the issues in the platform.

Once users got used to reporting and tracking issues through to closure, they were able to provide a more accurate view of their ongoing project risks, and those of subcontractors engaged on the projects. The analysis and reporting tools then became more useful in answering questions such as: What safety risks are trending on my project? Which projects are carrying more risk? What disciplines drive RFIs in my project? What are the root causes of RFIs in my projects?

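Questions like those can largely be answered by aggregating a well-kept issue log. The pandas sketch below is a loose, illustrative approximation – the column names and data are invented and are not the actual BIM 360 / Construction IQ schema.

```python
import pandas as pd

# Invented issue log; not the real BIM 360 / Construction IQ data model.
issues = pd.DataFrame({
    "project":    ["Court A", "Court A", "Court B", "Court B", "Court C"],
    "type":       ["safety", "RFI", "RFI", "safety", "RFI"],
    "discipline": ["M&E", "Structural", "Architectural", "M&E", "Structural"],
    "risk_score": [8, 3, 4, 9, 2],
    "status":     ["open", "closed", "open", "open", "closed"],
})

# Which projects are carrying more (open) risk?
open_risk = (issues[issues["status"] == "open"]
             .groupby("project")["risk_score"].sum()
             .sort_values(ascending=False))
print(open_risk)

# Which disciplines drive RFIs?
rfi_by_discipline = (issues[issues["type"] == "RFI"]
                     .groupby("discipline").size()
                     .sort_values(ascending=False))
print(rfi_by_discipline)
```

A production tool does this continuously, across live and much larger data sets, and adds predictive models on top – the reactive-to-predictive shift Murphy described.
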
Murphy said BAM Ireland achieved a 20% improvement in quality and safety on site, along with an added capacity to make better decisions supported by the analytics. Managers now have an easily accessible, cross-project dashboard that improves oversight across multiple complex projects. Data capture techniques improved – workflows became 95% digital – reducing paper usage to mandatory legal documents only. And project staff now spend 25% more time focusing on tasks and risk items through the use of Construction IQ. Murphy finished his presentation with a quote from BAM Ireland’s head of digital construction, Paul Brennan:

“While other construction software solutions are simply focused on digitizing paper based workflows, BIM 360 is taking it a step further to truly harness the power of data. This is where our team sees the most value and where we can really start to have a positive impact on improving the challenges our industry is faced with.”