Friday, January 2, 2015

Spark SQL DBF Library

Happy new year all…It’s been a while. I was crazy busy from May till mid December of last year implementing BigData  geospatial solutions at client sites all over the world. Was in Japan a couple of times, Singapore, Malaysia, UK, and do not recall the times I was in Redlands, Texas and DC.  In addition, I’ve been investing heavily in Spark and Scala. Do not recall the last time I implemented a Hadoop MapReduce job !

One of the resolutions for the new year (in addition to the usual eating right, exercising more and the never-off-the-bucket-list biking Mt Ventoux) is to blog more. One post per month as a minimum.

So…to kick to year right, I’ve implemented a library to query DBF files using Spark SQL. With the advent of Spark 1.2, a custom relation (table) can be defined as a SchemaRDD.  A sample implementation is demonstrated by Databrick’s spark-avro, as Avro files have embedded schema and data so it is relatively easy to convert that to a SchemaRDD. We, in the geo community have such a “old” format that encapsulates schema and data; the DBF format. Using the Shapefile project, I was able to create an RDD using the spark context Hadoop file API and the implementation of a DBFInputFormat. Then using the DBFHeader fields information, each record was mapped onto a Row to be processed by SparkSQL.  This is mostly work in progress and is far from been optimized, but it works !


Like usual, all the source code can be downloaded from here. Happy new year all.

No comments: