Friday, January 24, 2014

Hadoop and Shapefiles

Shapefiles are still the ubiquitous way to share and exchange geospatial data. Lately I’ve been getting a lot of requests from BigData Hadoop users to read shapefiles directly off HDFS. After all, the third V (variety) should allow me to do that. Since the shapefile format was developed by Esri, there was always an "uneasiness" in me, as an Esri employee, in using third-party open source tools (GeoTools and JTS) to read these shapefiles when we have just released our geometry API on GitHub. In addition, I always thought that these powerful libraries were too heavy for my needs when I just wanted to plow through a shapefile in the map or reduce phase of a job. So I decided to write my own simple Java implementation that, for now, just reads points and polygons. This 20% implementation should cover 80% of my usage for now. I know that there are a lot of Java implementations on the net that read the shp and dbf formats, but I wanted one that is tailored to my BigData needs, especially when it comes to writable instances, and more importantly, one that generates geometry instances based on our geometry model and API. As usual, all the source code can be found here.
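To give a feel for how little is needed to start plowing through a .shp file, here is a minimal sketch that decodes the fixed 100-byte main file header using only the JDK. It is not the code from my repository: the class name ShpHeader and its fields are made up for this illustration. The byte offsets and the mixed byte order (big-endian file code and length, little-endian version, shape type and bounding box) follow the published shapefile specification. In a Hadoop job the 100 bytes would presumably come from a stream opened on HDFS rather than from a local array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of any Esri API.
final class ShpHeader {
    final int fileLengthBytes; // total .shp length in bytes
    final int version;         // 1000 for all known shapefiles
    final int shapeType;       // 1 = Point, 5 = Polygon, ...
    final double xmin, ymin, xmax, ymax;

    private ShpHeader(int fileLengthBytes, int version, int shapeType,
                      double xmin, double ymin, double xmax, double ymax) {
        this.fileLengthBytes = fileLengthBytes;
        this.version = version;
        this.shapeType = shapeType;
        this.xmin = xmin;
        this.ymin = ymin;
        this.xmax = xmax;
        this.ymax = ymax;
    }

    // Parses the first 100 bytes of a .shp file.
    static ShpHeader parse(byte[] header) {
        if (header.length < 100) {
            throw new IllegalArgumentException("shp header needs 100 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header);

        // File code and file length are stored big-endian.
        buf.order(ByteOrder.BIG_ENDIAN);
        int fileCode = buf.getInt(0);
        if (fileCode != 9994) {
            throw new IllegalArgumentException("not a shapefile, code=" + fileCode);
        }
        int lengthInWords = buf.getInt(24); // length is in 16-bit words

        // Everything else in the header is little-endian.
        buf.order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt(28);
        int shapeType = buf.getInt(32);
        double xmin = buf.getDouble(36);
        double ymin = buf.getDouble(44);
        double xmax = buf.getDouble(52);
        double ymax = buf.getDouble(60);

        return new ShpHeader(lengthInWords * 2, version, shapeType,
                             xmin, ymin, xmax, ymax);
    }
}
```

Calling ShpHeader.parse(bytes) on the first 100 bytes tells you whether the file holds points (shapeType 1) or polygons (shapeType 5), which is enough to decide how to decode the records that follow.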
