This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
This help applies to Linux. Usage is very similar under Windows.
Combinatorial association rules are used for discovering relationships between multiple parameters. For instance, if the rows of a data table correspond to transactions, where the items bought are marked with X characters, then association rules can demonstrate which items are frequently bought together.
An example rule could be beer=X & (milk=X or bread=X) ---> banana=X. This rule states that if someone buys beer and either milk or bread, then they will buy banana as well.
Of course there are no rules without exceptions. There are no "false" or "true" rules, but some of them are "more true" than others. The strength of a rule can be evaluated through multiple mathematical quantities like universe, support, lift or p-value (see the Glossary).
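The example rule above can be checked against individual rows with a short Python sketch. It is only an illustration: the list-of-OR-clauses representation, the helper names lhs_holds and rhs_holds, the "-" mark for an item that was not bought, and the toy transactions are all assumptions made for this example, not SCARF's internals or real data.

```python
# The LHS is modelled as a list of OR clauses that are AND'ed together.
# Each elementary clause is a (column, value) pair.
# Illustrative representation only; not SCARF's internal data structure.
LHS = [[("beer", "X")], [("milk", "X"), ("bread", "X")]]
RHS = ("banana", "X")


def lhs_holds(row, lhs):
    """True if every OR clause has at least one satisfied elementary clause."""
    return all(any(row.get(col) == val for col, val in clause) for clause in lhs)


def rhs_holds(row, rhs):
    """True if the row has the required value in the RHS column."""
    col, val = rhs
    return row.get(col) == val


# Toy transactions, made up for illustration. "X" marks a bought item;
# "-" (an assumed mark) means the item was not bought.
transactions = [
    {"beer": "X", "milk": "X", "bread": "-", "banana": "X"},
    {"beer": "X", "milk": "-", "bread": "X", "banana": "X"},
    {"beer": "X", "milk": "-", "bread": "-", "banana": "X"},
    {"beer": "-", "milk": "X", "bread": "X", "banana": "-"},
    {"beer": "X", "milk": "X", "bread": "X", "banana": "-"},
]

matching = [t for t in transactions if lhs_holds(t, LHS)]
confirmed = [t for t in matching if rhs_holds(t, RHS)]
print(f"LHS holds in {len(matching)} rows; RHS also holds in {len(confirmed)} of them")
```

On this toy data the LHS holds in three rows and the RHS holds in two of those, so the last transaction is an exception to the rule.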
./scarf <parameter-file>

where <parameter-file> is the path to a parameter file such as the following:

    table maintable.csv
    output out.txt
    xml_output out.xml
    nthreads 4
    pattern 1 2
    rulecount 2000
    min_universe 600
    min_support 65
    min_confidence 0.5
    min_lift 1.2
    max_pvalue 0.05
    simpl_param 0.98
    column age scaled lhs
    column eyecolor generic lhs
    column haircolor generic lhs
    column height scaled lhs
    column beautifulness scaled rhs

The following keys are available:
table | Path to the input table (see table format below). Relative paths are interpreted relative to the location of the parameter file. |
output | Path to the plain text output file (optional). If not given, the output is written on the standard output. Relative paths are interpreted relative to the location of the parameter file. |
xml_output | Path to the XML output file (optional). If not given, no XML output is generated. Relative paths are interpreted relative to the location of the parameter file. |
nthreads | Level of concurrency (number of parallel threads). If not given, or zero, then the number of threads will be equal to the number of logical CPUs. |
pattern | The pattern for the rules to mine. This must be numbers separated by single spaces. The pattern can be thought of as several OR clauses AND'ed together. Each number denotes the number of subclauses for the corresponding OR clause. For example, a value of 1 1 2 may make SCARF produce rules like age=B & eyecolor=b & (haircolor=b or height=CDE) --> beautifulness=D. Note that the rules produced may be simpler than this pattern if that is optimal, but they will never be more complicated than the pattern you specify. More complicated patterns with large OR clauses will cause SCARF to run slower. |
rulecount | The maximum number of rules to output. |
min_universe | Minimum universe requirement for a rule (see Glossary). |
min_lhs_support | Minimum LHS support requirement for a rule (see Glossary). |
min_rhs_support | Minimum RHS support requirement for a rule (see Glossary). |
min_support | Minimum support requirement for a rule (see Glossary). |
min_confidence | Minimum confidence requirement for a rule (see Glossary). |
min_lift | Minimum lift requirement for a rule (see Glossary). |
min_leverage | Minimum leverage requirement for a rule (see Glossary). |
max_pvalue | Maximum p-value requirement for a rule (see Glossary). |
simpl_param | SCARF will attempt to simplify resulting rules to compensate for database errors, sparsity and possible overfitting. Simpler rules are generally more robust to database errors. E.g. if the value of this parameter is 0.98, then SCARF will remove a single elementary clause from a rule if the rule's goodness does not decrease by more than 2%, it will remove two elementary clauses from a rule if the rule's goodness does not decrease by more than 4%, etc. |
column | The descriptor for a column in your input table. The format of the value for this key must be column_name column_type column_place. |
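The keys above can be read with a few lines of Python. This is a minimal sketch, assuming that the parameter file holds one key-value pair per line, with the key as the first whitespace-delimited token. The function names parse_parameters and resolve are hypothetical and are not part of SCARF.

```python
from pathlib import Path


def parse_parameters(text):
    """Parse 'key value' lines; repeated 'column' keys are collected in a list.

    Assumes one key-value pair per line (an assumption of this sketch).
    """
    params = {"column": []}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if not parts:
            continue  # skip blank lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "column":
            # Value format: column_name column_type column_place
            name, ctype, place = value.split()
            params["column"].append({"name": name, "type": ctype, "place": place})
        elif key == "pattern":
            # Numbers separated by single spaces, one per OR clause
            params[key] = [int(n) for n in value.split(" ")]
        else:
            params[key] = value
    return params


def resolve(param_file, path):
    """Interpret a relative path relative to the parameter file's location."""
    return Path(param_file).parent / path


sample = """\
table maintable.csv
pattern 1 2
min_lift 1.2
column age scaled lhs
column beautifulness scaled rhs
"""
params = parse_parameters(sample)
print(params["pattern"], [c["name"] for c in params["column"]])
print(resolve("/data/run/params.txt", params["table"]))
```

All values other than pattern and column are kept as strings here; converting thresholds such as min_lift to numbers is left to the caller.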
The input table must be a comma- or semicolon-separated CSV file.
The first row must contain the column names separated by the separator character.
Column names may only contain English letters, numbers, underscores, dots and square brackets. In other words, they must match the regular expression ^[a-zA-Z0-9_\.\[\]]+$.
The other rows must contain the data itself.
All attribute values are single characters in SCARF. This means that each cell should be empty or one character long. Cells longer than a character generate a warning, and only the first non-whitespace character of these cells is considered.
Empty cells are used to represent N/A (data not available). Thus you can supply sparse data tables to SCARF if some measurements were not taken for the whole population.
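The table format rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not SCARF's actual reader: the function name read_table is hypothetical, the separator is guessed from the header row, and whitespace-only cells are treated as empty.

```python
import csv
import io
import re
import warnings

# Column names: English letters, numbers, underscores, dots, square brackets.
COLUMN_NAME = re.compile(r"^[a-zA-Z0-9_\.\[\]]+$")


def read_table(text):
    """Return (column_names, rows); each row maps column name -> char or None."""
    header_line = text.splitlines()[0]
    # Assumption: a semicolon in the header means a semicolon-separated file.
    sep = ";" if ";" in header_line else ","
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    names = next(reader)
    for name in names:
        if not COLUMN_NAME.match(name):
            raise ValueError(f"invalid column name: {name!r}")
    rows = []
    for record in reader:
        row = {}
        for name, cell in zip(names, record):
            stripped = cell.strip()
            if len(stripped) > 1:
                warnings.warn(f"cell {cell!r} in column {name} is longer than one character")
            # Empty cell -> None (N/A); otherwise keep the first
            # non-whitespace character only.
            row[name] = stripped[0] if stripped else None
        rows.append(row)
    return names, rows


names, rows = read_table("age;eyecolor;height\nA;b;C\nB;;D\n")
print(names)
print(rows)
```

In the usage example, the second data row has an empty eyecolor cell, which is stored as None to represent N/A.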
Column | A column of the data table corresponds to an attribute of the entries in the data table. Columns may also be called fields or variables. A column may have N/A values for some rows, where the corresponding attribute is unknown or missing. This is designated with an empty cell. |
Confidence | The probability that the RHS is true, given that the LHS is true. In mathematical terms: the conditional probability Pr(LHS and RHS) / Pr(LHS), where the event space consists of the rows of the universe with equal probability assigned. |
E-value | Because SCARF tries out a lot of rule candidates during execution, it is possible that a rule will get a good (i.e. small) p-value by mere chance. Thus the E-value is a more accurate measure of the randomness of a rule. It equals the p-value multiplied by the number of all the possible rules examined by SCARF. The E-value can be viewed as an experiment-wide p-value. The smaller the E-value, the more related the LHS and RHS of a rule. |
Goodness | The goodness of a rule is defined as its lift. |
Leverage | Measures the absolute level of dependence between the LHS and the RHS. In mathematical terms: the difference Pr(LHS and RHS) - Pr(LHS)*Pr(RHS). |
LHS | Left-hand side. |
LHS support | The number of data rows in the universe, where the LHS of a rule is true. |
Lift | Measures the relative level of dependence between the LHS and the RHS. In mathematical terms: the ratio Pr(LHS and RHS) / (Pr(LHS)*Pr(RHS)). |
N/A | Data not available. Refers to an unknown (missing) value in the data table. |
p-value | The chi-squared test is a well-known method which measures the dependence between the LHS and the RHS. It yields a number between 0 and 1 called the p-value. The smaller the p-value, the more related the LHS and RHS of a rule. |
Row | A data row is an entry (or record) in the data table. Each row is an assignment of some values to the columns. |
RHS | Right-hand side. |
RHS support | The number of data rows in the universe, where the RHS of a rule is true. |
Support | The number of data rows in the universe, where both the LHS and RHS are true. |
Universe | The (number of) data rows where all columns occurring in a rule have a non-N/A value. |
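The glossary quantities can be computed from four counts, as in the Python sketch below. The confidence, lift, leverage and E-value lines follow the definitions above. The p-value line is an assumption: it uses the Pearson chi-squared statistic of the 2x2 table (LHS true/false against RHS true/false) with one degree of freedom and no continuity correction, which may differ from the variant SCARF applies. The function name rule_metrics and the counts in the usage example are made up for illustration.

```python
import math


def rule_metrics(universe, lhs_support, rhs_support, support, rules_examined=1):
    """Compute rule quality measures from row counts within the universe.

    Degenerate cases (LHS or RHS true in no rows or in all rows) are not
    handled and would raise ZeroDivisionError.
    """
    p_lhs = lhs_support / universe
    p_rhs = rhs_support / universe
    p_both = support / universe

    confidence = p_both / p_lhs
    lift = p_both / (p_lhs * p_rhs)
    leverage = p_both - p_lhs * p_rhs

    # Assumption: 2x2 Pearson chi-squared statistic, no continuity correction.
    chi2 = universe * leverage ** 2 / (p_lhs * (1 - p_lhs) * p_rhs * (1 - p_rhs))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # E-value: p-value multiplied by the number of rules examined.
    e_value = p_value * rules_examined

    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "p_value": p_value,
        "e_value": e_value,
    }


# Made-up counts: 1000 rows in the universe, LHS true in 200,
# RHS true in 300, both true in 120.
metrics = rule_metrics(1000, 200, 300, 120, rules_examined=50)
for name, value in metrics.items():
    print(f"{name}: {value:.4g}")
```

With these counts the confidence is 0.6, the lift is 2 and the leverage is 0.06. When the LHS and RHS are independent, the leverage is 0, the lift is 1 and this p-value is 1.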