Lab Fellow Ben Roberts-Pierel highlights one of his favorite data exploration tools

As the scientific community expands the sources and quantity of data available for analysis, methods and tools for speeding up operations are becoming increasingly important. One such tool is the Swifter package for Pandas. For those of you who do not spend exorbitant amounts of time programming, Pandas is a basic but powerful Python module that reads file types like CSVs into data frames and then allows for a huge range of processing options. As anybody who has used Pandas (or even Excel) for scientific data analysis will know, simply adding another sensor or a few more years of high temporal resolution data can dramatically increase the size of your dataset. Suddenly we are trying to apply operations to millions and millions of rows of data, which can take a great deal of time.

This is where the Swifter module comes in. From the GitHub page, Swifter is “A package that efficiently applies any function to a Pandas dataframe or series in the fastest available manner.” In other words, the main function from this package, swiftapply, looks at an operation given to Pandas through the apply function and decides, based on the function and the dataset size, which approach is fastest: vectorizing the operation (much faster, when the function allows it), using conventional Pandas apply, or, when vectorization is not possible, relying on Dask to scale the operation across multiple cores. The figure below (from GitHub) gives a simple example of speed comparisons, with more examples and a notebook here.

Those wanting to read more about how it works and why it's a useful tool can consult the GitHub repo or a couple of short articles on its use: link, link.
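
For readers who like to see the idea in code, the decision described above can be sketched in plain Pandas. This is only an illustration of the "try vectorizing first, otherwise fall back to apply" logic, not Swifter's actual implementation: the helper name apply_fastest and its sample_size parameter are made up for this example, and the Dask step for spreading work across cores is left out entirely.

```python
import pandas as pd


def apply_fastest(series, func, sample_size=1000):
    """Hypothetical helper (not part of Swifter): pick a strategy for func.

    Returns the result together with the name of the strategy used.
    """
    sample = series.iloc[:sample_size]
    try:
        # Call func once on the whole sample (the vectorized path) ...
        vectorized = func(sample)
        # ... and once per element, then check that both give the same answer.
        elementwise = sample.apply(func)
        if isinstance(vectorized, pd.Series) and vectorized.equals(elementwise):
            return func(series), "vectorized"
    except Exception:
        # func could not be checked on a whole Series, so treat it as
        # not vectorizable and use the fallback below.
        pass
    # Fallback: conventional element-by-element Pandas apply.
    return series.apply(func), "apply"


readings = pd.Series([1, 2, 3])

# Plain arithmetic works on a whole Series, so the vectorized path is taken.
scaled, how_scaled = apply_fastest(readings, lambda x: x * 2 + 1)

# A Python `if` needs a single value, so this one falls back to apply.
labels, how_labels = apply_fastest(readings, lambda x: "high" if x > 2 else "low")

print(how_scaled, scaled.tolist())
print(how_labels, labels.tolist())
```

On a three-row series the two paths are indistinguishable in speed; the difference only becomes noticeable at the millions-of-rows scale discussed above, where a single vectorized call avoids running Python code once per row.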