Data collection systems today are awash in data, driven by increased logging, the deployment of ever more IoT services, and improving global connectivity. The biggest challenge facing data scientists is figuring out what all of that data is, what it means, and how to use it in meaningful ways. Metadata is a great way to help data scientists and ML practitioners accomplish these tasks.
What is metadata?
By definition, metadata is data about your data (“Metadata,” 2021). In its simplest form, it could be a data dictionary or schema that provides guidance on the data contained in a given dataset. All possible values for a nominal collection, ranges for numeric collections, datetime formats, and periodicity of time series data are but a few examples. In a more ideal metadata scenario, you would also have the theoretically possible range for numeric data documented, as well as the operational limitations of the data collection device (e.g., if a temperature sensor is reliable from -20 °C to 100 °C, record that along with what happens when a reading goes above or below the designed range). Better still would be metadata that provides information like sensor specs, model info, and last calibration date.
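As a rough illustration of what such a machine-readable data dictionary entry might look like, here is a minimal Python sketch. The `ColumnMetadata` class, its field names, the sensor model, the calibration date, and the out-of-range behavior are all hypothetical and chosen only for illustration; only the -20 °C to 100 °C reliable range comes from the example above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ColumnMetadata:
    """Hypothetical data dictionary entry for one numeric column."""
    name: str
    unit: str
    theoretical_min: Optional[float]  # physically possible bound, None = unbounded
    theoretical_max: Optional[float]
    reliable_min: float               # designed operating range of the device
    reliable_max: float
    out_of_range_behavior: str
    sensor_model: str
    last_calibrated: date

    def classify(self, value: float) -> str:
        """Label a reading using the documented ranges."""
        if self.theoretical_min is not None and value < self.theoretical_min:
            return "impossible"
        if self.theoretical_max is not None and value > self.theoretical_max:
            return "impossible"
        if self.reliable_min <= value <= self.reliable_max:
            return "reliable"
        return "out_of_designed_range"


temp_c = ColumnMetadata(
    name="temp_c",
    unit="degC",
    theoretical_min=-273.15,          # absolute zero
    theoretical_max=None,
    reliable_min=-20.0,
    reliable_max=100.0,
    out_of_range_behavior="reading saturates at the nearest limit",  # assumed
    sensor_model="EXAMPLE-T100",      # hypothetical model name
    last_calibrated=date(2021, 6, 1),  # hypothetical date
)

for reading in (25.0, 150.0, -300.0):
    print(reading, temp_c.classify(reading))
```

With the ranges written down this way, a pipeline can flag suspect readings automatically instead of relying on someone remembering the sensor's limits.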
Sadly, most datasets today provide nowhere near that level of detail, and when they do, it’s often not in a machine-readable format that would support automation of any kind. What we typically see are comma- or tab-separated files (CSV/TSV) where at best we have column names that provide some implicit metadata (temp_c, rel_humidity, num_dependents, etc.), and at worst no headings at all or something like col1, col2, col3… While these formats offer easy viewing, parsing, and interoperability, what has been lost is a very important part of data collection and provenance that would make the data much more useful and relevant.
How can we fix this?
While metadata is sorely lacking in many datasets and data sources, the fix is relatively simple: actually use the capabilities of the many existing data collection and storage tools that have been around for a long time.
Data catalog systems
Many well-established data catalog systems like CKAN have provided structured metadata capabilities for years, yet most catalog entries I’ve run across have little or no metadata associated with the dataset beyond when it was uploaded and who the contributor was. One nice feature of data catalog systems is that the metadata resides alongside the data files, so even if the data was collected in a format like CSV you have access to structured metadata bundled with it so you don’t have to muddle with the CSV file itself. Many data catalog systems also provide API access to the dataset record, allowing for automation of data retrieval and merging of metadata by a client system.
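To make the API point concrete, here is a minimal sketch of a client pulling a dataset record from a CKAN instance through its Action API (`package_show`). The base URL and dataset id are placeholders, the helper function names are my own, and which fields are actually populated depends entirely on the catalog and the contributor.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_show_url(base_url: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL for a single dataset record."""
    query = urlencode({"id": dataset_id})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def fetch_dataset_record(base_url: str, dataset_id: str, timeout: float = 10.0) -> dict:
    """Fetch the record over HTTP; raises if the catalog reports failure."""
    with urlopen(package_show_url(base_url, dataset_id), timeout=timeout) as resp:
        payload = json.load(resp)
    if not payload.get("success"):
        raise RuntimeError(f"CKAN request failed: {payload.get('error')}")
    return payload["result"]


def summarize_record(record: dict) -> dict:
    """Pull out the metadata a client would merge with the data files."""
    return {
        "title": record.get("title"),
        "license": record.get("license_id"),
        "modified": record.get("metadata_modified"),
        # Custom key/value metadata lives in the record's "extras" list.
        "extras": {e["key"]: e["value"] for e in record.get("extras", [])},
        "resources": [(r.get("format"), r.get("url")) for r in record.get("resources", [])],
    }


# Hand-written record standing in for a real API response (illustrative only).
sample_record = {
    "title": "Rooftop weather station",
    "license_id": "cc-by",
    "metadata_modified": "2021-12-01T08:00:00",
    "extras": [{"key": "temp_c_reliable_range", "value": "-20..100"}],
    "resources": [{"format": "CSV", "url": "https://catalog.example.org/readings.csv"}],
}
print(summarize_record(sample_record))
```

Against a live catalog, `summarize_record(fetch_dataset_record("https://catalog.example.org", "rooftop-weather"))` would follow the same path, assuming that instance and dataset exist.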
An added bonus? Many systems like CKAN and its brethren are open source and are able to run on modest servers with plenty of storage available to them either locally or via block storage services.
Existing file formats
While simple to generate and process, CSV is far from the ideal storage format for data science. Building a system around CSV means you have to explicitly code the schema based on whatever metadata is provided or can be gathered, and even that is often not enough. Is the date format MM/DD/YYYY or DD/MM/YYYY? If column #3 is sparse, how many lines does our CSV reader have to read ahead to auto-detect the type and range? How much numeric precision did we really need from the source, and how much were we actually given when truncated for text output?
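Each of those three questions can be reproduced in a few lines of standard-library Python. The sample values and the `infer_type` helper below are made up for illustration; real CSV readers use more elaborate inference, but they face the same ambiguity.

```python
import csv
import io
from datetime import datetime

# 1. Date ambiguity: the same text is valid under both conventions.
raw_date = "03/04/2021"
as_month_first = datetime.strptime(raw_date, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw_date, "%d/%m/%Y").date()
print(as_month_first, as_day_first)  # two different days from one string

# 2. Sparse column: the inferred type depends on how far we read ahead.
csv_text = "id,reading\n1,\n2,\n3,\n4,7\n5,7.5\n"
values = [row["reading"] for row in csv.DictReader(io.StringIO(csv_text))]


def infer_type(column_values, lookahead):
    """Naive type inference over the first `lookahead` rows (illustrative)."""
    sample = [v for v in column_values[:lookahead] if v != ""]
    if not sample:
        return "unknown"
    for caster, label in ((int, "int"), (float, "float")):
        try:
            for v in sample:
                caster(v)
            return label
        except ValueError:
            continue
    return "str"


for lookahead in (3, 4, 5):
    print(lookahead, infer_type(values, lookahead))

# 3. Precision: text output truncated to two decimals cannot be undone.
measured = 0.1 + 0.2
written = f"{measured:.2f}"
restored = float(written)
print(measured == restored)
```

None of these ambiguities can be resolved from the file alone; the answer has to come from metadata recorded somewhere else.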
File formats like HDF5 and Parquet directly support explicit metadata in their schemas, offer high-compression storage options, store numeric data in standard binary formats that preserve precision, and have wide support in data processing and analysis systems. Another format, ARFF from the Weka community, keeps the ease of use of CSV files while allowing metadata to be embedded in cleartext. The Weka toolkit even provides command-line and interactive tools for converting common formats to and from ARFF. Unfortunately, ARFF is not as widely supported outside of Weka as the other formats, but it is a viable option for bundling metadata with cleartext data when it makes sense.
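To show how ARFF bundles metadata with cleartext data, here is a small sketch that writes an ARFF document by hand using only the standard library. The `to_arff` helper, the relation name, and the sample rows are hypothetical; the sketch also does no quoting, so it assumes names and values contain no spaces or commas. For real work, Weka's own converters are the safer route.

```python
def to_arff(relation, attributes, rows, notes=()):
    """Render rows as ARFF text.

    attributes: list of (name, spec) pairs, where spec is either an ARFF
    type string such as "numeric" or a list of allowed nominal values.
    """
    lines = [f"% {note}" for note in notes]  # '%' starts an ARFF comment
    lines.append(f"@relation {relation}")
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            spec = "{" + ",".join(spec) + "}"  # nominal value set
        lines.append(f"@attribute {name} {spec}")
    lines.append("@data")
    for row in rows:
        # '?' is ARFF's marker for a missing value.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    relation="rooftop_weather",  # hypothetical dataset name
    attributes=[
        ("recorded_on", 'date "yyyy-MM-dd"'),
        ("temp_c", "numeric"),
        ("status", ["ok", "fault"]),
    ],
    rows=[
        ("2021-12-01", 4.5, "ok"),
        ("2021-12-02", None, "fault"),
    ],
    notes=["temp_c sensor is reliable from -20 to 100 degrees C"],
)
print(arff_text)
```

The header carries the date format, the numeric type, and the full set of nominal values, which is exactly the information a bare CSV file forces a reader to guess.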
Databases, by their very nature, support quite a bit of metadata via table schemas. Many systems also allow custom metadata per column beyond what the database itself requires. Database management systems also support a large number of clients and can be tuned for extremely high performance, scalability, and stability, which makes them an excellent platform for data collection.
While SQL may not represent metadata consistently across DBMS vendors, exporting data to a common format that can store the DBMS-managed metadata makes for a viable alternative for many data collection efforts.
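As a small, vendor-specific illustration of both points, the sketch below uses Python's built-in `sqlite3` module: the table schema records types, nullability, and a valid range, and the schema is then exported to JSON as one possible common format. The table, column names, and JSON layout are assumptions for the example, and other database systems expose their schema metadata through different mechanisms.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE readings (
        sensor_id   TEXT NOT NULL,
        recorded_at TEXT NOT NULL,
        temp_c      REAL CHECK (temp_c BETWEEN -20 AND 100)
    )
    """
)


def describe_table(connection, table):
    """Read column metadata back out of SQLite.

    PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so it must be trusted input.
    """
    columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"name": c[1], "type": c[2], "nullable": not c[3], "primary_key": bool(c[5])}
        for c in columns
    ]


schema = describe_table(conn, "readings")

# The schema also enforces the documented range at write time.
try:
    conn.execute("INSERT INTO readings VALUES ('s1', '2021-12-01T08:00:00', 150.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Export the DBMS-managed metadata as a sidecar next to the exported data.
sidecar = json.dumps({"table": "readings", "columns": schema}, indent=2)
print(sidecar)
print("out-of-range insert rejected:", rejected)
```

The JSON sidecar is only one choice; the same dictionary could just as easily be written into a format that carries metadata in the file itself.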
Feature stores are a relatively new trend, taking the concept of data lakes and oceans to the next level by providing access to preprocessed and/or raw data at the feature level (Hirschtein, 2020). Often these feature store elements have provenance, engineering/ETL pipelines, and descriptive statistics associated with them, and they support direct querying for subsets of the available data. Although they are a more complex way to automate access to available data sources, they are becoming a mainstay in data science and machine learning. The downside is that, due to their compute and storage requirements, feature stores are often only available in a cloud setting, which introduces potential vendor lock-in and ingress/egress costs for external integrations. They also often miss the mark on context metadata, focusing on the specific data point rather than on how it fits into the grander scheme of its parent data source.
To better support data science efforts and un-/self-supervised learning, we need to collectively do a better job of providing metadata for our data sources. The tools exist; we need to make the gathering and inclusion of structured metadata part of our data collection and sourcing workflows. There really is no reason to ever have to guess what col1 contains in a CSV file when we have so many better tools at our disposal.
Hirschtein, A. (2020, April 9). What are Feature Stores and Why Are They Critical for Scaling Data Science? Medium. https://towardsdatascience.com/what-are-feature-stores-and-why-are-they-critical-for-scaling-data-science-3f9156f7ab4
Metadata. (2021). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Metadata&oldid=1059734118