SCARF — Simple Combinatorial Association Rule Finder

Version 1.0

SCARF is a program written by Balázs Szalkai in 2012-13 for finding combinatorial association rules in data tables. It is released under GNU GPL version 3.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
This help applies for Linux. Usage is very similar under Windows.

Introduction

Combinatorial association rules are used for discovering relationships between multiple parameters. For instance, if the rows of a data table correspond to transactions, where the items bought are marked with X characters, then association rules can demonstrate which items are frequently bought together.

An example rule could be beer=X & (milk=X or bread=X) ---> banana=X. This rule states that if someone buys beer and either milk or bread, then they will buy banana as well.

Of course there are no rules without exceptions. There are no "false" or "true" rules, but some of them are "more true" than others. The strength of a rule can be evaluated through multiple mathematical quantities like universe, support, lift or p-value (see the Glossary).

Usage

Run SCARF from a command line like this:

./scarf <parameter-file>
where:

SCARF will return the rules ordered by their goodness decreasing. (see Glossary) SCARF can produce both plain-text and XML output. The XML output can be viewed in a browser, thanks to the corresponding XML stylesheet.

Parameter file format

A parameter file must be a text file containing key-value pairs separated by a single newline, and optional empty lines which may separate these pairs. An example is listed below:

table
maintable.csv
output
out.txt
xml_output
out.xml
nthreads
4
pattern
1 2
rulecount
2000
min_universe
600
min_support
65
min_confidence
0.5
min_lift
1.2
max_pvalue
0.05
simpl_param
0.98

column
age scaled lhs
column
eyecolor generic lhs
column
haircolor generic lhs
column
height scaled lhs
column
beautifulness scaled rhs
The following keys are available:

tablePath to the input table (see table format below). Relative paths are interpreted relative to the location of the parameter file.
outputPath to the plain text output file (optional). If not given, the output is written on the standard output. Relative paths are interpreted relative to the location of the parameter file.
xml_outputPath to the XML output file (optional). If not given, no XML output is generated. Relative paths are interpreted relative to the location of the parameter file.
nthreadsLevel of concurrency (number of parallel threads). If not given, or zero, then the number of threads will be equal to the number of logical CPUs.
patternThe pattern for the rules to mine. This must be numbers separated by single spaces. The pattern can be thought of as several OR clauses AND'ed together. Each number denotes the number of subclauses for the corresponding OR clause. For example, a value of 1 1 2 may make SCARF produce rules like age=B & eyecolor=b & (haircolor=b or height=CDE) --> beautifulness=D. Note that produced rules may be simpler than this pattern if that's optimal, but you will never get more complicated rules than the pattern you specify! More complicated patterns with large OR clauses will cause SCARF run slower.
rulecountThe maximum number of rules to output.
min_universeMinimum universe requirement for a rule (see Glossary).
min_lhs_supportMinimum LHS support requirement for a rule (see Glossary).
min_rhs_supportMinimum RHS support requirement for a rule (see Glossary).
min_supportMinimum support requirement for a rule (see Glossary).
min_confidenceMinimum confidence requirement for a rule (see Glossary).
min_liftMinimum lift requirement for a rule (see Glossary).
min_leverageMinimum leverage requirement for a rule (see Glossary).
max_pvalueMaximum p-value requirement for a rule (see Glossary).
simpl_paramSCARF will attempt to simplify resulting rules to compensate for database errors, sparsity and possible overfitting. Simpler rules are generally more robust to database errors. E.g. if the value of this parameter is 0.98, then SCARF will remove a single elementary clause from a rule if the rule's goodness does not decrease by more than 2%, it will remove two elementary clauses from a rule if the rule's goodness does not decrease by more than 4%, etc.
columnThe descriptor for a column in your input table. The format of the value for this key must be column_name column_type column_place.
  • column_name is the name of the column, as specified in the first line of the database file. Alternatively, you can refer to a column by a hashmark followed by a zero-based column index. That is, #2 is an alternative name for the third column.
  • column_type may be one of the following: all, range:<ORDERING>, single.
    • all: All value sets are considered for this column. Most suitable for less variable columns, i.e. where the number of possible values is around four or less.
      The program will run slow if you use this column type for columns with lots of possible values!
    • range:<ORDERING>: The values for this column are ordered. Only "less than x" and "greater than x" type value sets are considered for this column. <ORDERING> must be a character sequence which enumerates the values for this column in the order from minimum to maximum.
      For example, range:lnh means that this column may take the values n, h, l which may mean normal, high, low respectively, and are ordered the following way: low<normal<high. This column may yield the elementary clauses column=nh, column=h, column=l, column=ln.
      This column type is suitable for columns having values which represent a discretized numerical measurement (higher ASCII values represent higher numeric values), e.g. for columns like age, height, education level, etc.
    • single: Most suitable for columns with lots of possible values which are unordered. Only single-item value sets are considered.
      For example, if such a column takes the values abxy, then the possible elementary clauses are column=a, column=b, column=x, column=y.
    To sum up, the column type should be
    • range:<ORDERING> if there is an ordering defined on the possible values
    • all if there are only very few possible values
    • single in all other cases
  • column_place may be lhs, rhs or lhsrhs, and it specifies whether this column may occur only on the LHS, only on the RHS or on both the LHS and RHS. If you wish that a column would neither occur on the LHS nor on the RHS of rules, then do not input any column entry in the parameter file for that column.

The format of the input table

The input table must be a comma- or semicolon-separated CSV file.

The first row must contain the column names separated by the separator character.
Column names may only contain English letters, numbers, underscores, dots and square brackets. In other words, they must match the regular expression ^[a-zA-Z0-9_\.\[\]]+$.

The other rows must contain the data itself.
All attribute values are single characters in SCARF. This means that each cell should be empty or one character long. Cells longer than a character generate a warning, and only the first non-whitespace character of these cells is considered.

Empty cells are used to represent N/A (data not available). Thus you can supply sparse data tables for SCARF if some measurements were not taken for the whole population.

Glossary

ColumnA column of the data table corresponds to an attribute of the entries in the data table. Columns may also be called fields or variables.
A column may have N/A values for some rows, where the corresponding attribute is unknown or missing. This is designated with an empty cell.
ConfidenceThe probability that the RHS is true, supposed that the LHS is true.
In mathematical terms: the conditional probability Pr(LHS and RHS) / Pr(LHS), where the event space consists of the rows of the universe with equal probability assigned.
E-valueBecause SCARF tries out a lot of rule candidates during execution, it is possible that a rule will get a good (i.e. small) p-value by mere chance. Thus the E-value is a more accurate measure of the randomness of a rule. It equals to the p-value, multiplied by the number of all the possible rules examined by SCARF.
The E-value can be viewed as an experiment-wide p-value. The smaller the E-value, the more related the LHS and RHS of a rule.
GoodnessThe goodness of a rule is defined as its lift.
LeverageMeasures the absolute level of dependence between the LHS and the RHS.
In mathematical terms: the difference Pr(LHS and RHS) - Pr(LHS)*Pr(RHS).
LHSLeft-hand side.
LHS supportThe number of data rows in the universe, where the LHS of a rule is true.
LiftMeasures the relative level of dependence between the LHS and the RHS.
In mathematical terms: the ratio Pr(LHS and RHS) / (Pr(LHS)*Pr(RHS)).
N/AData not available. Refers to an unknown (missing) value in the data table.
p-valueThe Chi-Squared test is a well-known method which measures the dependence between the LHS and the RHS. It yields a number between 0 and 1 called the p-value. The smaller the p-value, the more related the LHS and RHS of a rule.
RowA data row is an entry (or record) in the data table. Each row is an assignment of some values to the columns.
RHSRight-hand side.
RHS supportThe number of data rows in the universe, where the RHS of a rule is true.
SupportThe number of data rows in the universe, where both the LHS and RHS are true.
UniverseThe (number of) data rows where all columns occurring in a rule have a non-N/A value.