Bulk Loading Shapefiles Into Postgres/Postgis

Recently I’ve been doing a fair bit of work with geospatial data, mostly on the data preparation side. While there are common data formats, I have found that because so much of this data are sourced from government agencies, the data are often in many files that can be concatenated.

In this example, I will show how to take a few dozen county-level shapefiles of parcel data from Utah and load it into a single table in Postgres/Postgis.

Step 1: Downloading Shapefiles

The following shell commands come from an in-progress collaboration with a friend, where we are going to analyze daily air quality in Utah over the past several years. Utah is open-data-friendly, providing shapefiles for every parcel of land in Utah.

While it may have been possible to use wget or curl to download every shapefile, they are stored within Google Drive with a bunch of hashed URLs, so I just clicked on each file instead of trying to be clever. So if you want to follow along with this blog post exactly, you’ll need to download the 25 zip files of Utah shapefiles:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
$ ls -lrt utah_lir_shapefiles/
total 408688
-rw-rw-r-- 1 rzwitch rzwitch 954984 Jun 1 13:10 Parcels_Beaver_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 7183466 Jun 1 13:10 Parcels_BoxElder_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 9152777 Jun 1 13:10 Parcels_Cache_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 3279384 Jun 1 13:10 Parcels_Carbon_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 356058 Jun 1 13:10 Parcels_Daggett_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 18908413 Jun 1 13:10 Parcels_Davis_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 3900415 Jun 1 13:10 Parcels_Duchesne_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 2689950 Jun 1 13:10 Parcels_Garfield_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 2156109 Jun 1 13:10 Parcels_Grand_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 8107608 Jun 1 13:10 Parcels_Iron_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 1975537 Jun 1 13:10 Parcels_Juab_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 3273485 Jun 1 13:10 Parcels_Kane_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 2741403 Jun 1 13:10 Parcels_Millard_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 1110627 Jun 1 13:10 Parcels_Morgan_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 2970626 Jun 1 13:10 Parcels_Rich_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 200183664 Jun 1 13:11 Parcels_SaltLake_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 1397522 Jun 1 13:11 Parcels_SanJuan_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 1576757 Jun 1 13:11 Parcels_Sanpete_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 7911261 Jun 1 13:11 Parcels_Summit_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 4480456 Jun 1 13:11 Parcels_Tooele_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 69690149 Jun 1 13:11 Parcels_Utah_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 5025674 Jun 1 13:11 Parcels_Wasatch_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 35896908 Jun 1 13:11 Parcels_Washington_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 298313 Jun 1 13:11 Parcels_Wayne_LIR.zip
-rw-rw-r-- 1 rzwitch rzwitch 23225130 Jun 1 13:11 Parcels_Weber_LIR.zip

Step 2: Bulk Unzip

With all of these files in the same directory at the same level (i.e. no subfolders), it’s pretty easy to bulk unzip the files, with one caveat: to move the contents of the unzipped files into a new directory, you need to use the -d flag:

mkdir utah_lir_shapefiles_unzipped && unzip utah_lir_shapefiles/*.zip -d utah_lir_shapefiles_unzipped

1
2
3
4
5
6
7

The reason I created a new directory (`mkdir`) and then unzipped the files into a new directory is that when doing analysis, I always like to keep the source data separate, so that I always have the option of starting completely over. It also can make regular expression globs easier :)

## Step 3: Creating Postgis Table Definition

After all of the county zip files are unzipped, you get 25 sub-directories structured like the following:

ls -ltr total 10916 -rw-rw-rw- 1 rzwitch rzwitch 67868 Sep 3 2017 Parcels_Beaver_LIR.shx -rw-rw-rw- 1 rzwitch rzwitch 28280 Sep 3 2017 Parcels_Beaver_LIR.shp.xml -rw-rw-rw- 1 rzwitch rzwitch 1503304 Sep 3 2017 Parcels_Beaver_LIR.shp -rw-rw-rw- 1 rzwitch rzwitch 3036 Sep 3 2017 Parcels_Beaver_LIR.sbx -rw-rw-rw- 1 rzwitch rzwitch 83052 Sep 3 2017 Parcels_Beaver_LIR.sbn -rw-rw-rw- 1 rzwitch rzwitch 425 Sep 3 2017 Parcels_Beaver_LIR.prj -rw-rw-rw- 1 rzwitch rzwitch 9471508 Sep 3 2017 Parcels_Beaver_LIR.dbf -rw-rw-rw- 1 rzwitch rzwitch 5 Sep 3 2017 Parcels_Beaver_LIR.cpg

1
2
3

The .shp files from the 25 counties all have the same format, which is very convenient. In this step, we can use the `shp2pgsql` utility that comes with Postgis to read a shapefile, determine the proper schema, then create the table in the database:

shp2pgsql -I -s 26912 -p utah_lir_shapefiles_unzipped/Parcels_Beaver_LIR/Parcels_Beaver_LIR.shp \ utahlirparcels | psql -h localhost -U ;

1
2
3
4
5
6
7

The key flag here is `-p`, which means ‘prepare mode’; the shapefile will get read, a table created, but no data loaded. By not loading the data in this step, it makes looping over the files easier later, as no special logic is required to keep the `Parcels_Beaver_LIR.shp` from being duplicated in Postgis (because it was never loaded in the first place).

## Step 4: Bulk Loading Shapefiles into Postgis

The last steps of the loading process are to 1) get all of the shapefile locations and 2) feed them to shp2pgsql:

for i in $(find utah_lir_shapefiles_unzipped/ -type f -name ‘*.shp’); do shp2pgsql -I -s 26912 -a $i utahlirparcels | psql -h localhost -U ; done; ```

To get all of the shapefile locations, I use find with flags -type f (files type) and name to search for the pattern within the directory. This command goes through the entire set of subdirectories and gets all the .shp files. From there, I iterate over the list of files using for i in..., then pass the value of $i into a similar shp2pgsql as above. However, rather than using flag -p for ‘prepare’, we are now going to use flag -a for ‘append’. This will perform an INSERT INTO utahlirparcels() statement for Postgres, loading in the actual data from the 25 shapefiles.

Spend Time Now To Save Time Later

Like so much of shell scripting, figuring out these commands took longer than I would’ve expected. Certainly, they took longer to figure out than it would’ve taken to copy-paste a shp2pgsql 25 times! But by taking the time upfront to figure out a generic method of looping over shapefiles, the next time (and every time after that) I find myself needing to do this, this code will be available to load multiple shapefiles into Postgis.