Data file format guidelines
Data provided for bulk upload shall follow these guidelines.
Fit for purpose
Whatever the format, the data provided needs to be complete and detailed enough for the intended use.
For example, provide CSV files, not XLSX. Human readability is welcome, but it is not the purpose of these files. Tools made for humans, such as spreadsheets, apply their own interpretation to the data, so what is displayed may not be what is in the file. That makes them unreliable for data transfer.
Machine generated is better than manually generated. The format must not vary without consultation.
It should be possible to deduce unique (plant, device, time, metric) for each value from the information that is in the file and the filename. Repeated strings can be freely included as long as files are compressed for transfer.
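As an illustration of the uniqueness requirement, here is a minimal sketch of a check a data provider could run before sending a file. It assumes the rows have already been parsed into dictionaries; the key names and the sample values (PLANT_A, WTG0037, ambient_temp) are hypothetical.

```python
from collections import Counter

def duplicate_keys(rows):
    """Return the (plant, device, timestamp, metric) keys that occur more than once."""
    counts = Counter(
        (r["plant"], r["device"], r["timestamp"], r["metric"]) for r in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical sample rows; the third repeats the key of the first.
rows = [
    {"plant": "PLANT_A", "device": "WTG0037", "timestamp": "2020-01-01T00:00:00",
     "metric": "ambient_temp", "value": "4.2"},
    {"plant": "PLANT_A", "device": "WTG0037", "timestamp": "2020-01-01T00:10:00",
     "metric": "ambient_temp", "value": "4.4"},
    {"plant": "PLANT_A", "device": "WTG0037", "timestamp": "2020-01-01T00:00:00",
     "metric": "ambient_temp", "value": "4.3"},
]

print(duplicate_keys(rows))
```

An empty result means every value in the file can be attributed to exactly one (plant, device, time, metric) combination.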
Easy to parse
- Use standard formats, e.g. for timestamps, ISO 8601: yyyy-MM-ddTHH:mm:ss
- Use a period for the decimal marker. Do not use a comma.
- Quote any text fields that may include the delimiter character, e.g. 123123, "A good, bad error", 99898
- Use ASCII characters. Non-ASCII characters have variable encodings and cause confusion in this context.
- Put different categories of data in different fields, or at least clearly delimit them. A hard-to-parse example is: NDW_GE15xxx_WTG0037_Ambient_Temp_1. This contains site, device and metric fields but uses the same delimiter, '_', both between the fields and between the words within a field.
- Row counts: do not include in the file. If providing for validation, send separately.
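The parsing rules above can be sketched with Python's standard csv module. This is only one possible way to produce a conforming line; the helper name format_row and the three-decimal precision are assumptions, not requirements.

```python
import csv
import io
from datetime import datetime

def format_row(code, message, value, when):
    """Format one record: period decimal marker, ISO 8601 timestamp."""
    return [
        code,
        message,
        f"{value:.3f}",                        # f-strings always use '.' as the decimal marker
        when.strftime("%Y-%m-%dT%H:%M:%S"),    # yyyy-MM-ddTHH:mm:ss
    ]

buffer = io.StringIO()
# QUOTE_MINIMAL quotes only the fields that contain the delimiter or a quote.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(format_row(123123, "A good, bad error", 99898.5,
                           datetime(2020, 1, 2, 3, 4, 5)))

line = buffer.getvalue()
print(line, end="")
```

Letting a CSV library handle the quoting avoids hand-written escaping, which is a common source of files that cannot be parsed.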
Data extraction error messages
Writing an error message generated by the data extraction process in place of a data value only ensures that the file will take longer to parse. Error messages seen instead of values include: No more values, I/O Timeout, Bad, Not Connect, Comm Fail, RPC Resolver is Off-Line, Intf Shut. These error messages are only relevant to the data extractor; there is nothing the data load can do but ignore the value.
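A minimal sketch of how an extraction script might drop these messages before writing the file. The list holds only the messages named above; replacing them with an empty cell (rather than omitting the row) is an assumption, and clean_value is a hypothetical helper name.

```python
# Error strings named in these guidelines; a real extractor may produce others.
EXTRACTION_ERRORS = {
    "No more values",
    "I/O Timeout",
    "Bad",
    "Not Connect",
    "Comm Fail",
    "RPC Resolver is Off-Line",
    "Intf Shut",
}

def clean_value(raw):
    """Return the value text, or an empty string if it is an extraction error."""
    text = raw.strip()
    # Assumption: an empty cell is the agreed way to represent "no value".
    return "" if text in EXTRACTION_ERRORS else text

print([clean_value(v) for v in ["12.5", "I/O Timeout", " 7.25 ", "Comm Fail"]])
```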
Standard column formats
There are two standard formats.
Modified third normal form
The column structure is fixed: plant; device; metric; timestamp; mean value; min value; max value; stdev value. Where a metric does not have mean/min/max/stdev, e.g. a status code, the remaining columns should be present but their cells will be empty. Text metrics can be handled in the same fashion.
This is the preferred file format.
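For illustration, a sketch of writing rows in this fixed layout. Several details are assumptions rather than part of the specification: the header names, the comma delimiter, the sample values, and placing a status code in the mean value column while leaving min, max and stdev empty.

```python
import csv
import io

# Hypothetical header names for the fixed column structure.
COLUMNS = ["plant", "device", "metric", "timestamp", "mean", "min", "max", "stdev"]

def make_row(plant, device, metric, timestamp, mean, minimum=None, maximum=None, stdev=None):
    """Build one row; statistics that do not apply are left as empty cells."""
    cells = [plant, device, metric, timestamp, mean, minimum, maximum, stdev]
    return ["" if cell is None else str(cell) for cell in cells]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(COLUMNS)
writer.writerow(make_row("PLANT_A", "WTG0037", "wind_speed",
                         "2020-01-01T00:10:00", 7.3, 5.1, 9.8, 1.2))
# Assumption: a status code goes in the mean column; the other cells stay empty.
writer.writerow(make_row("PLANT_A", "WTG0037", "status_code",
                         "2020-01-01T00:10:00", 12))

text = out.getvalue()
print(text, end="")
```

Every row has the same eight fields, so a loader can parse the file without knowing in advance which metrics it contains.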
In the second format the column structure is variable, e.g.:
plant,device,timestamp,toweraccrmsmean,yawerrmean
Each column header is mapped to a metric. Columns may be added, removed or rearranged as needed. However, all files provided MUST have the same column structure.
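The header-to-metric mapping can be sketched as follows. The target metric names in METRIC_MAP are invented for the example, as are the sample values; the real mapping is whatever is agreed during on-boarding.

```python
import csv
import io

# Hypothetical mapping from column headers to agreed metric names.
METRIC_MAP = {
    "toweraccrmsmean": "tower_acceleration_rms_mean",
    "yawerrmean": "yaw_error_mean",
}
KEY_COLUMNS = ("plant", "device", "timestamp")

def wide_to_long(text):
    """Yield one (plant, device, timestamp, metric, value) tuple per data cell."""
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column in KEY_COLUMNS:
                continue
            # An unmapped header raises KeyError, which flags a format change.
            yield (record["plant"], record["device"], record["timestamp"],
                   METRIC_MAP[column], value)

sample = (
    "plant,device,timestamp,toweraccrmsmean,yawerrmean\n"
    "PLANT_A,WTG0037,2020-01-01T00:10:00,0.012,-1.5\n"
)
long_rows = list(wide_to_long(sample))
for row in long_rows:
    print(row)
```

Because the columns are looked up by header rather than by position, rearranging them does not affect the result, but an unexpected header stops the load. That is one reason the column structure should not change without consultation.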
- We need standard deviation for wind speed, hub speed and real power.
File names shall be unique and meaningful. Include:
- plant
- device (if for a single device)
- generation time as YYYYMMDDhhmmss
E.g. DataOwnerId.plant.device.YYYYMMDDhhmmss.csv
The date range of values included in the file is also useful.
File names should not include spaces. Restrict file name characters to: A-Z, a-z, 0-9, _ and period.
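A sketch of building and checking a file name against these rules. The function name build_file_name and the sample identifiers are hypothetical; the pattern follows the example DataOwnerId.plant.device.YYYYMMDDhhmmss.csv.

```python
import re
from datetime import datetime

# Allowed characters: A-Z, a-z, 0-9, underscore and period.
ALLOWED = re.compile(r"[A-Za-z0-9_.]+")

def build_file_name(owner_id, plant, device, generated):
    """Return DataOwnerId.plant.device.YYYYMMDDhhmmss.csv or raise ValueError."""
    stamp = generated.strftime("%Y%m%d%H%M%S")
    name = ".".join([owner_id, plant, device, stamp]) + ".csv"
    if not ALLOWED.fullmatch(name):
        raise ValueError(f"file name contains disallowed characters: {name!r}")
    return name

file_name = build_file_name("Owner1", "PLANT_A", "WTG0037",
                            datetime(2020, 1, 2, 3, 4, 5))
print(file_name)
```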
Files should be compressed before upload using either gzip or bzip2.
Files compressed using other programs (e.g. rar, zip, compress) may be rejected.
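As a sketch, gzip compression can be done with Python's standard library instead of an external program. The helper name gzip_file and the sample file are assumptions made for the example.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def gzip_file(path):
    """Compress path to path + '.gz' and return the new path."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return target

# Demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "Owner1.PLANT_A.20200102030405.csv"
    original.write_text("plant,device\nPLANT_A,WTG0037\n", encoding="ascii")
    compressed = gzip_file(original)
    compressed_name = compressed.name
    with gzip.open(compressed, "rt", encoding="ascii") as handle:
        round_trip = handle.read()

print(compressed_name)
```

The bz2 module in the standard library offers the same open interface for bzip2 output.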
- Discover the full list of plant/fleet identifiers
- Which turbine identifiers will be in the data files? The customer's internal identifiers, as used in data extracts, must be mapped to DCL identifiers.
A map of customer metric names to Sentient names must be worked out with the customer during on-boarding. Changes are to be managed through consultation.
These guidelines should be provided to customers to help explain what is required. Before any large volume of data is transferred, the customer should provide sample files to verify that they will meet the requirements.