PHP Shapefile 3.1.0
Writing performance boost, append capabilities and a few bugfixes
30 October 2019
Exactly two months have passed since v3 release and honestly I wasn’t expecting such a featureful new release this early on. It was actually driven by the great testing and analyzing work conducted by a user who decided to rely on this library for a big project and took his time to analyze where better performance could be achived as well as pointing out a few issues and requesting some new features. Thank you so much, Pablo!
Taking the chance to review the code, I also found and fixed a few nasty bugs that were introduced with v3.
Capability to append records to existing Shapefiles comes with a potentially breaking change, as well as proper DBF field names sanitization. I am aware a minor release is not supposed to bring any breaking change, but I really didn’t feel like jumping to v4 after just 2 months and for such a minor detail that I, in all honesty, don’t think it will break many people’s code.
What’s new in Version 3.1.0
- Writing buffer in
- Capability to append records to existing Shapefiles
- Various new public methods for both
- Sanitize and resolve conflicting and duplicated DBF field names
- Other minor code and performance improvements across the library
c+bfile access mode for
- Convert field names to uppercase for
- Decouple field names sanitization and
- Accept explicit
- Write correct record number in SHP record headers
- Suppress PHP Warnings in
- Fix bug causing a corrupted DBF file when a
- Fix corner case bug when no record has been written yet
Writing buffer in
Records are now not immediately written into files (or, more technically, resource streams). Rather they are buffered into memory and written when they reach a certain amount. This allows up to a 50% reduction in writing time (or even more in some specific cases)!
There is a brand new
Shapefile::OPTION_BUFFERED_RECORDS constructor option for
ShapefileWriter class that specifies the number of records to keep into the memory buffer before writing them. Use a value equal or less than
0 to keep all records into a buffer and write them at once when finalizing the file. Default value is
10, which should suit most scenarios (taking into account an average of total records, complexity of data, system configuration, etc.).
There is also a new
ShapefileWriter::flushBuffer() method, which allows to manually write memory buffers to files. It will effectively reset the record counter that causes automatic buffer flushing according to the value specified with
Shapefile::OPTION_BUFFERED_RECORDS constructor option.
Capability to append records to existing Shapefiles
The library can now append records to an existing Shapefile. This is achieved thanks to a new
Shapefile::OPTION_EXISTING_FILES_MODE constructor option for
ShapefileWriter class, that defines the behaviour with existing files with the same name and can assume the following 3 values:
Shapefile::MODE_PRESERVE: Throws a
Shapefile::ERR_FILE_EXISTSerror (this is the default)
Shapefile::MODE_OVERWRITE: Overwrites existing files
Shapefile::MODE_APPEND: Appends new records to existing files
Of course this feature made obsolete and useless
Shapefile::OPTION_OVERWRITE_EXISTING_FILES constructor option for
ShapefileWriter class, which has been removed. This might break some existing code (yet very young, 2 months old at max), but I think it was worth it.
Various new public methods for both
ShapefileWriter classes now expose 3 new methods:
Also, a few methods that were only available to
ShapefileReader are now exposed by
They can come in handy when
Shapefile::OPTION_EXISTING_FILES_MODE is set to
Shapefile::MODE_APPEND, in order to check existing files definition.
Sanitize and resolve conflicting and duplicated DBF field names
DBF field names can be maximum 10 characters and contain only letters, numbers and underscores.
The library will now always sanitize invalid and/or duplicated field names added with any
ShapefileWriter field-adding method (addCharField(), addNumericField(), etc.): names will be truncated at 10 characters, not allowed characters replaced with underscores and in case of resulting duplicated names or conflicts, a serial number from
99 will be added.
This feature comes with another potential (or should I say virtual?) breaking change:
$flag_sanitize_name parameter was removed from all
ShapefileWriter field-adding methods. It is actually not a very big deal, since it would only affect people relying on a custom sanitization function of theirs that wasn’t working well. Since the library wouldn’t allow illegal field names anyways, a
Shapefile::ERR_DBF_FIELD_NAME_NOT_UNIQUE error type was being thrown in case the field name was not valid.
Shapefile::ERR_DBF_FIELD_NAME_NOT_UNIQUE error type was removed too.
Shapefile::ERR_DBF_FIELD_NAME_NOT_VALID error type is now defined as
"Too many field names conflicting", that is, more than 100!
Other minor code and performance improvements across the library
Since at the beginning this release was focused on performance, a few tiny touches here and there in the codebase were made in order to improve it generally. Most notably the DBF values enconding logic has been changed when parsing numbers, relying on PHP native
number_format() for non-textual numeric input.
c+b file access mode for
The addition of appending records to existing files required reading and not just writing access to them. Default file access mode for
ShapefileWriter class is now
c+b. This should not be a problem, unless you have some weird writing-but-not-reading permission on files or directories (in that case, fix them please, that’s not useful for shapefiles…).
Convert field names to uppercase for
Geometry data when
Shapefile::OPTION_DBF_FORCE_ALL_CAPS is enabled
That was a big bug that went unnoticed since v3: when
Shapefile::OPTION_DBF_FORCE_ALL_CAPS is enabled all field names must silently be considered as uppercase when adding a
Geometry to a Shapefile.
Decouple field names sanitization and
Another silly bug causing
Shapefile::OPTION_DBF_FORCE_ALL_CAPS to be tied to field name sanitization. Fixed now.
null values when
Shapefile::OPTION_ENFORCE_GEOMETRY_DATA_STRUCTURE is enabled
Shapefile::ERR_GEOM_MISSING_FIELD error was being erroneously raised when a field had an explicit
null value and
Shapefile::OPTION_ENFORCE_GEOMETRY_DATA_STRUCTURE was enabled.
Write correct record number in SHP record headers
There was a bug causing wrong record number to be written in SHP record headers (record count in Shapefiles starts from
1, not from
Suppress PHP Warnings in
Who needs a PHP Warning when a
ShapefileException is thrown anyways?
Fix bug causing a corrupted DBF file when a
Shapefile::ERR_GEOM_MISSING_FIELD is raised
This one was tricky and it became very evident after file-append capabilities: a Shapefile record is actually spread across multiple files, so the library must make sure that no file gets corrupted when an exception is raised and the execution interrupted. Now all records already written will be successfully included in the output file (unless something very nasty happens, like killing the process at OS level…).
Fix corner case bug when no record has been written yet
If something goes wrong or you are intentionally trying to create an empty Shapefile without even specifying its type, this situation is now handled correctly, resulting in an empty Shapefile of type
Wait, what about a reading buffer in
I tried to apply some buffering strategies to ShapefileReader, but it turned out it did not improve its performance and in some case it even worsened it.
ShapefileReader class is able to randomly access the Shapefile by design (a great feature indeed) and the only viable solution was a kludgy speculative buffer, keeping in mind that there can be garbage (A.K.A. unused bytes) between one record and another.
The result brought some overhead that hardly compensated with a better throughput and the code was messy, so I ultimately decided to ditch it.
Giving some technical insight but cutting it short: PHP streams are complicated beasts and when reading a file resource the underlying OS and its filesystem already take care of many aspects regarding buffering and optimization. Better relying on them, because it is much easier to mess things up than improve them.
Download and documentation
Go to the Lab page: PHP Shapefile