SampleRecord

Description

Samples the records of a FlowFile using a specified sampling strategy (such as Reservoir Sampling). Depending on the strategy, the resulting FlowFile contains a fixed number of records (reservoir-based algorithms), a random subset of the total number of records (probabilistic sampling), or a deterministic number of records (interval sampling).
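
To illustrate why a reservoir-based strategy yields a fixed number of records, the following is a minimal sketch of single-pass reservoir sampling (the classic "Algorithm R"). It is not taken from the processor's source; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for this illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], size: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of at most `size` records in one pass.

    Hypothetical helper for illustration; not the processor's implementation.
    """
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    reservoir: List[T] = []
    for index, record in enumerate(records):
        if index < size:
            # Fill the reservoir with the first `size` records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, index]; the record is kept only if the slot
            # falls inside the reservoir, i.e. with probability size/(index+1).
            slot = rng.randint(0, index)
            if slot < size:
                reservoir[slot] = record
    return reservoir
```

In this sketch the sample never holds more than `size` records, and an input with fewer than `size` records is returned unchanged.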

Tags

interval, range, record, reservoir, sample

Properties

In the list below, required properties are shown with an asterisk (*). Other properties are considered optional. The table also indicates any default values and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Record Reader * | record-reader | Controller Service:
RecordReaderFactory

Implementations:
AvroReader
CEFReader
CSVReader
ExcelReader
GrokReader
JsonPathReader
JsonTreeReader
ReaderLookup
ScriptedReader
Syslog5424Reader
SyslogReader
WindowsEventLogReader
XMLReader
YamlTreeReader
Specifies the Controller Service to use for parsing incoming data and determining the data's schema
Record Writer * | record-writer | Controller Service:
RecordSetWriterFactory

Implementations:
AvroRecordSetWriter
CSVRecordSetWriter
FreeFormTextRecordSetWriter
JsonRecordSetWriter
RecordSetWriterLookup
ScriptedRecordSetWriter
XMLRecordSetWriter
Specifies the Controller Service to use for writing results to a FlowFile
Sampling Strategy * | sample-record-sampling-strategy | Reservoir Sampling
  • Interval Sampling
  • Range Sampling
  • Probabilistic Sampling
  • Reservoir Sampling
Specifies which method to use for sampling records from the incoming FlowFile
Sampling Interval * | sample-record-interval | Specifies the number of records to skip before writing a record to the outgoing FlowFile. This property is only used if Sampling Strategy is set to Interval Sampling. A value of zero (0) causes no records to be included in the outgoing FlowFile, a value of one (1) causes all records to be included, a value of two (2) causes half the records to be included, and so on.

Supports Expression Language, using FlowFile attributes and Environment variables.

This property is only considered if:
  • the property Sampling Strategy has a value of interval
Sampling Range * | sample-record-range | Specifies the range of records to include in the sample, from 1 to the total number of records. An example is '3,6-8,20-', which includes the third record, the sixth, seventh, and eighth records, and all records from the twentieth record on. Commas separate intervals that don't overlap, and an interval can be between two numbers (e.g. 6-8), up to a given number (e.g. -5), or from a number to the number of the last record (e.g. 20-). If this property is unset, all records will be included.

Supports Expression Language, using FlowFile attributes and Environment variables.

This property is only considered if:
  • the property Sampling Strategy has a value of range
Sampling Probability * | sample-record-probability | Specifies the probability (as a percent from 0-100) of a record being included in the outgoing FlowFile. This property is only used if Sampling Strategy is set to Probabilistic Sampling. A value of zero (0) causes no records to be included in the outgoing FlowFile, and a value of 100 causes all records to be included in the outgoing FlowFile.

Supports Expression Language, using FlowFile attributes and Environment variables.

This property is only considered if:
  • the property Sampling Strategy has a value of probabilistic
Reservoir Size * | sample-record-reservoir | Specifies the number of records to write to the outgoing FlowFile. This property is only used if Sampling Strategy is set to a reservoir-based strategy such as Reservoir Sampling.

Supports Expression Language, using FlowFile attributes and Environment variables.

This property is only considered if:
  • the property Sampling Strategy has a value of reservoir
Random Seed | sample-record-random-seed | Specifies a particular number to use as the seed for the random number generator (used by probabilistic strategies). Setting this property will ensure the same records are selected even when using probabilistic strategies.

Supports Expression Language, using FlowFile attributes and Environment variables.

This property is only considered if:
  • the property Sampling Strategy has a value of probabilistic or reservoir
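
The interaction between Sampling Probability and Random Seed can be sketched as follows. This is an illustrative approximation written for this page, not the processor's code; `probabilistic_sample` and its parameters are hypothetical names. It assumes each record is accepted or rejected independently.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def probabilistic_sample(records: Iterable[T], probability_percent: float,
                         seed: Optional[int] = None) -> Iterator[T]:
    """Yield each record with the given probability (a percent from 0 to 100).

    Hypothetical helper for illustration; not the processor's implementation.
    """
    if not 0 <= probability_percent <= 100:
        raise ValueError("probability_percent must be between 0 and 100")
    rng = random.Random(seed)  # the same seed reproduces the same selection
    threshold = probability_percent / 100.0
    for record in records:
        # rng.random() lies in [0.0, 1.0), so a threshold of 0.0 never
        # selects a record and a threshold of 1.0 always selects it.
        if rng.random() < threshold:
            yield record
```

With a fixed `seed`, two runs over the same input select the same records; without one, the selection varies from run to run.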

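The two deterministic strategies described in the table above, Interval Sampling and Range Sampling, can be approximated with the sketch below. The helper names (`interval_sample`, `parse_ranges`, `range_sample`) are hypothetical, and the choice to keep the first record of each interval is an assumption; the property description only states the resulting proportions (0 keeps nothing, 1 keeps everything, 2 keeps half).

```python
from typing import Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], interval: int) -> Iterator[T]:
    """Yield one record per `interval` records; an interval of 0 yields none."""
    if interval <= 0:
        return
    for index, record in enumerate(records):
        # Assumption: the first record of each interval is the one kept.
        if index % interval == 0:
            yield record


def parse_ranges(expression: str) -> List[Tuple[int, Optional[int]]]:
    """Parse e.g. '3,6-8,20-' into inclusive 1-based (start, end) pairs.

    An end of None means "through the last record".
    """
    ranges: List[Tuple[int, Optional[int]]] = []
    for part in expression.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            start = int(low) if low.strip() else 1        # '-5' starts at 1
            end = int(high) if high.strip() else None     # '20-' is open-ended
        else:
            start = end = int(part)                       # single record, e.g. '3'
        ranges.append((start, end))
    return ranges


def range_sample(records: Iterable[T], expression: str) -> Iterator[T]:
    """Yield the records whose 1-based position falls in any parsed range."""
    ranges = parse_ranges(expression)
    if not ranges:
        # An unset range includes all records, per the property description.
        yield from records
        return
    for position, record in enumerate(records, start=1):
        if any(start <= position and (end is None or position <= end)
               for start, end in ranges):
            yield record


# The example from the property description, applied to records 1..25:
print(list(range_sample(range(1, 26), "3,6-8,20-")))
# prints [3, 6, 7, 8, 20, 21, 22, 23, 24, 25]
```
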
Dynamic Properties

This component does not support dynamic properties.

Relationships

Name | Description
failure | If a FlowFile fails processing for any reason (for example, any record is not valid), the original FlowFile will be routed to this relationship
original | The original FlowFile is routed to this relationship if sampling is successful
success | The FlowFile is routed to this relationship if the sampling completed successfully

Reads Attributes

This processor does not read attributes.

Writes Attributes

Name | Description
mime.type | The MIME type indicated by the record writer
record.count | The number of records in the resulting FlowFile

State Management

This component does not store state.

Restricted

This component is not restricted.

Input Requirement

This component requires an incoming relationship.

System Resource Considerations

Scope | Description
MEMORY | An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.