Yes, you read that right. Data science is not only about turning data into information; it can also work the other way around: converting analogue site plans and text from PDFs into structured data, fully automatically. Even hand-drawn maps. Even scans with a hopelessly low resolution.
Clients Antea Group and Nazca IT-Solutions doubted whether it was even possible. Pipple had to scratch its head too. But with scraping, Optical Character Recognition (OCR) and object recognition, the team got it done.
A well-stocked database
Deep in the digital archives of Dutch government bodies are hundreds of thousands of soil reports, waiting for a second life. Once stored, only a fraction of these files ever see the light of day again. That is because digging up an existing soil contamination report takes more time than carrying out a new exploratory drilling. A shame, thought René Rummens, senior advisor at Antea Group. ‘The Netherlands has been drilled full of holes, so there is plenty of information. What you need is a central, well-stocked database.’
Data collection
And so Antea Group and Nazca IT-Solutions together created Bodem Digitaal Op de Kaart (BDOK). BDOK collects up-to-date soil data and bundles it into products for contractors, network operators and consultancy firms. This allows them to optimize their work processes around excavation, whether the soil is contaminated or not. The hardest part is the historical sources. ‘We get all the information from reports that are available for public use. We structure it so that it fits into the database and can be retrieved. Most of these reports sit in government archives as PDF documents. But there are also physical binders among them,’ says René.
Recognize GPS coordinates
The first challenge Pipple was given: automatically converting laboratory results in a PDF into structured data. With the help of text recognition, the team succeeded fairly quickly. Whether they would succeed at the second assignment was a mystery to all parties, René confesses. ‘GPS coordinates are an important parameter in our database. But that’s a relatively new invention. In old reports, drilling locations are simply marked with hand-drawn crosses on a hand-drawn map. Often even an address is missing. The challenge for Pipple was to automatically derive the exact coordinates from such a drawing.’
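The article does not describe how Pipple turned recognized text into records, so the following is only a minimal sketch of the general idea. It assumes the OCR step has already produced plain text and that each result sits on one line as "substance, value, unit". That layout, the `LabResult` class and the `parse_lab_lines` function are hypothetical names chosen for illustration; real soil reports vary far more.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed (hypothetical) line layout: "<substance> <value> <unit>",
# e.g. "Lead 120 mg/kg". A leading "<" marks a value below the detection limit.
LINE_PATTERN = re.compile(
    r"^(?P<substance>[A-Za-z][A-Za-z \-()]*?)\s+"
    r"(?P<below_limit><\s*)?(?P<value>\d+(?:[.,]\d+)?)\s+"
    r"(?P<unit>\S+)$"
)


@dataclass
class LabResult:
    substance: str
    value: float
    unit: str
    below_detection_limit: bool


def parse_lab_lines(ocr_text: str) -> List[LabResult]:
    """Turn OCR output into structured records, skipping lines that do not match."""
    results = []
    for raw_line in ocr_text.splitlines():
        match = LINE_PATTERN.match(raw_line.strip())
        if match is None:
            continue  # headings, footers and OCR noise are ignored
        # Dutch reports often use a decimal comma, so normalize it to a point.
        value = float(match.group("value").replace(",", "."))
        results.append(
            LabResult(
                substance=match.group("substance").strip(),
                value=value,
                unit=match.group("unit"),
                below_detection_limit=match.group("below_limit") is not None,
            )
        )
    return results


# Made-up sample text, standing in for the output of an OCR engine.
sample = """Analysis results sample B01
Lead 120 mg/kg
Zinc 85,5 mg/kg
Mineral oil <35 mg/kg
"""
records = parse_lab_lines(sample)
for record in records:
    print(record)
```

Lines that do not fit the pattern are skipped rather than guessed at, which keeps bad OCR output out of the database at the cost of missing some results.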
Object recognition through machine learning
Pipple used seven hundred reports to build a model. Some were years old, others more recent. The model works in three phases. First, using object recognition, it picks out the map in a report. Then, using open data sources, it identifies the location. Finally, it converts each drilling point into a coordinate. René calls the result miraculous. ‘Within two months, Pipple automatically recognized the drilling locations in 75% of the recent reports, accurate to within a few meters. Machine learning will only make that better. Pipple has succeeded in turning analogue information into structured data. In the archives, hundreds of thousands more PDFs are waiting for this analysis application. This unlocks a mountain of data and lets us work much faster.’
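How the final phase works internally is not described here. One common way to go from a position on a scanned map to a real-world coordinate is to fit an affine transform on a few control points, meaning spots whose pixel position on the scan and whose coordinates on a reference map are both known. The sketch below assumes exactly three such control points and projected coordinates in meters. The function names and all numbers are made up for illustration and are not Pipple's implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Affine = Tuple[List[float], List[float]]


def _det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def _solve3(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system with Cramer's rule."""
    det = _det3(matrix)
    if abs(det) < 1e-12:
        raise ValueError("control points must not lie on one line")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(_det3(replaced) / det)
    return solution


def fit_affine(pixels: Sequence[Point], world: Sequence[Point]) -> Affine:
    """Fit x = a*px + b*py + c and y = d*px + e*py + f from three control points."""
    if len(pixels) != 3 or len(world) != 3:
        raise ValueError("this sketch expects exactly three control points")
    matrix = [[px, py, 1.0] for px, py in pixels]
    x_coeffs = _solve3(matrix, [w[0] for w in world])
    y_coeffs = _solve3(matrix, [w[1] for w in world])
    return x_coeffs, y_coeffs


def pixel_to_world(transform: Affine, pixel: Point) -> Point:
    """Map a pixel position (e.g. a detected cross) to world coordinates."""
    (a, b, c), (d, e, f) = transform
    px, py = pixel
    return (a * px + b * py + c, d * px + e * py + f)


# Made-up control points: 0.5 m per pixel, image y-axis pointing down.
control_pixels = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
control_world = [(155000.0, 463000.0), (155500.0, 463000.0), (155000.0, 462500.0)]

transform = fit_affine(control_pixels, control_world)
drill_point = pixel_to_world(transform, (200.0, 400.0))
print(drill_point)
```

With more than three control points, a least-squares fit would average out small errors in individual matches. On a low-resolution scan, each pixel covers more ground, so a small detection error translates into a larger error in the final coordinate.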