# SCARF

## Simple Combinatorial Association Rule Finder


SCARF is an online tool provided by the PIT Bioinformatics Group for finding (or "mining") combinatorial association rules in data tables.

## Introduction

Combinatorial association rules are used for discovering relationships between attributes. For instance, if the rows of a data table correspond to transactions, where the items bought are marked with X characters, then association rules can demonstrate which items are frequently bought together.

An example rule could be beer=X & (milk=X or bread=X) ---> banana=X. This rule states that if someone buys beer and either milk or bread, then they tend to buy bananas as well.

There are no "completely false" or "completely true" rules. The strength of a rule can be evaluated through multiple mathematical quantities like universe, support, lift or p-value (see the Glossary).

## Using SCARF

First, you will need to upload a data table. The data table must consist of a header and multiple records; the accepted format is described in the section "The format of the input table". An example data table is available for illustration.

After you have selected the data table, you will have to specify some parameters for rule mining:

• Pattern: A wildcard rule template such as ☐ & (☐ or ☐) ⇒ ☐. The boxes will be filled with elementary equalities (e.g. banana=X).
• Max. rule count: The maximum number of rules to output.
• Min. universe: Minimum universe requirement for a rule (see Glossary).
• Min. support: Minimum support requirement for a rule (see Glossary).
• Min. confidence: Minimum confidence requirement for a rule (see Glossary).
• Min. lift: Minimum lift requirement for a rule (see Glossary).
• Max. p-value: Maximum p-value requirement for a rule (see Glossary).
• Columns: Here you can specify which columns (i.e. transaction attributes) you want to allow on the LHS and the RHS of the rules, respectively. E.g. you can allow 'beer' and 'water' on the left-hand side, and 'apples' and 'oranges' on the right-hand side. In other words, you can use this option to specify which attributes you would like to infer from which attributes.
• Name/Email: Rule mining is a complex process and usually takes several minutes, so results are not available instantly. Therefore we need your name and e-mail address to notify you by e-mail when your results are ready.
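To make the Pattern and Columns parameters more concrete, the following minimal Python sketch enumerates the ways the template ☐ & (☐ or ☐) ⇒ ☐ can be filled with elementary equalities. The function name `candidate_rules`, the restriction that each column occurs at most once per rule, and the enumeration order are assumptions made for illustration; how SCARF actually generates its candidates is not described here.

```python
from itertools import combinations

def candidate_rules(lhs_columns, rhs_columns, values):
    """Yield (a, b, c, d) fillings of the template  a & (b or c) => d.

    Each of a, b, c, d is an elementary equality written as (column, value).
    """
    lhs_atoms = [(col, v) for col in lhs_columns for v in values]
    rhs_atoms = [(col, v) for col in rhs_columns for v in values]
    for a in lhs_atoms:
        # "b or c" is symmetric, so unordered pairs avoid duplicate rules.
        for b, c in combinations(lhs_atoms, 2):
            used = {a[0], b[0], c[0]}
            if len(used) < 3:
                continue  # assumption: a column appears at most once per rule
            for d in rhs_atoms:
                if d[0] not in used:
                    yield a, b, c, d

# Allow beer, milk and bread on the LHS and banana on the RHS.
rules = list(candidate_rules(["beer", "milk", "bread"], ["banana"], ["X"]))
for a, b, c, d in rules:
    print(f"{a[0]}={a[1]} & ({b[0]}={b[1]} or {c[0]}={c[1]}) => {d[0]}={d[1]}")
```

Even this small sketch shows why restricting the allowed columns matters: the number of candidates grows quickly with the number of LHS columns and attribute values.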

## The format of the input table

The input table must be a comma- or semicolon-separated CSV file.

• The separator char is automatically recognized.
• The character set should be ASCII. Non-ASCII characters (e.g. Unicode text encoded as UTF-8) are not recognized.
• Leading and trailing whitespace in cells is ignored.

The first row must contain the column names separated by the separator character.
Column names may only contain English letters, numbers, underscores, dots and square brackets. In technical terms, they must match the regular expression ^[a-zA-Z0-9_\.]+$.
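A header can be checked against this rule before uploading. The sketch below is a hypothetical helper (`invalid_column_names` is not part of SCARF) that applies the character class from the regular expression quoted above. Taken literally, that expression does not admit square brackets, so this sketch rejects them as well.

```python
import re

# Character class copied from the regular expression quoted above.
COLUMN_NAME = re.compile(r"[a-zA-Z0-9_\.]+")

def invalid_column_names(header):
    """Return the column names that do not match the allowed pattern."""
    return [name for name in header if not COLUMN_NAME.fullmatch(name)]

bad = invalid_column_names(["beer", "milk_2", "bread.v1", "ice cream", ""])
print(bad)
```

`fullmatch` is used instead of `match` with `^...$` so that a trailing newline in a name is not silently accepted.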

The other rows must contain the data itself.
All attribute values are single characters in SCARF. This means that each cell should be empty or one character long. Cells longer than one character generate a warning, and only the first non-whitespace character of these cells is considered.

Empty cells are used to represent N/A (data not available). Thus you can supply sparse data tables for SCARF if some measurements were not taken for the whole population.
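The format rules above can be sketched as a small Python reader. This is only an illustration of the described behaviour, not SCARF's own parser: the function name `read_table` and the separator heuristic (whichever of comma and semicolon occurs more often in the header) are assumptions, and the warning for over-long cells is omitted.

```python
import csv
import io

def read_table(text):
    """Parse CSV text into (header, rows); empty cells become None (N/A)."""
    first_line = text.splitlines()[0]
    # Assumption: the separator is the one that occurs more often in the header.
    sep = ";" if first_line.count(";") > first_line.count(",") else ","
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = [name.strip() for name in next(reader)]
    rows = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        cells = [cell.strip() for cell in record]
        # After stripping, cell[0] is the first non-whitespace character.
        rows.append({name: (cell[0] if cell else None)
                     for name, cell in zip(header, cells)})
    return header, rows

sample = "beer;milk;banana\nX; X ;\n;X;X\nyes;;X\n"
header, rows = read_table(sample)
for row in rows:
    print(row)
```

In the sample, the cell `yes` is reduced to its first character, and every empty cell is kept as `None` so that later steps can treat it as N/A.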

## Glossary

• Column: A column of the data table corresponds to an attribute of the entries in the data table. Columns may also be called fields or variables. A column may have N/A values for some rows, where the corresponding attribute is unknown or missing. This is designated with an empty cell.
• Confidence: The probability that the RHS is true, given that the LHS is true. In mathematical terms: the conditional probability Pr(LHS and RHS) / Pr(LHS), where the event space consists of the rows of the universe with equal probability assigned.
• E-value: Because SCARF tries out a lot of rule candidates during execution, it is possible that a rule will get a good (i.e. small) p-value by mere chance. Thus the E-value is a more accurate measure of the randomness of a rule. It equals the p-value multiplied by the number of all the possible rules examined by SCARF. The E-value can be viewed as an experiment-wide p-value. The smaller the E-value, the more related the LHS and RHS of a rule.
• Goodness: The goodness of a rule is defined as its lift.
• Leverage: Measures the absolute level of dependence between the LHS and the RHS. In mathematical terms: the difference Pr(LHS and RHS) - Pr(LHS)*Pr(RHS).
• LHS: Left-hand side.
• LHS support: The number of data rows in the universe where the LHS of a rule is true.
• Lift: Measures the relative level of dependence between the LHS and the RHS. In mathematical terms: the ratio Pr(LHS and RHS) / (Pr(LHS)*Pr(RHS)).
• N/A: Data not available. Refers to an unknown (missing) value in the data table.
• p-value: The Chi-Squared test is a well-known method which measures the dependence between the LHS and the RHS. It yields a number between 0 and 1 called the p-value. The smaller the p-value, the more related the LHS and RHS of a rule.
• Row: A data row is an entry (or record) in the data table. Each row is an assignment of some values to the columns.
• RHS: Right-hand side.
• RHS support: The number of data rows in the universe where the RHS of a rule is true.
• Support: The number of data rows in the universe where both the LHS and RHS are true.
• Universe: The (number of) data rows where all columns occurring in a rule have a non-N/A value.
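As a worked illustration of these definitions, the Python sketch below evaluates the rule beer=X & (milk=X or bread=X) ---> banana=X on a small made-up table. Several details are assumptions rather than documented SCARF behaviour: the name `rule_metrics`, the use of "-" to mark an item that was not bought, and the choice of the plain Pearson chi-squared test on the 2x2 LHS/RHS table (one degree of freedom, no continuity correction).

```python
from math import erfc, sqrt

def rule_metrics(rows, columns, lhs, rhs, n_rules_examined=1):
    """Evaluate one rule; lhs and rhs are predicates over a row dict."""
    # Universe: rows where every column used by the rule has a value.
    universe = [r for r in rows if all(r[col] is not None for col in columns)]
    n = len(universe)
    n_lhs = sum(1 for r in universe if lhs(r))
    n_rhs = sum(1 for r in universe if rhs(r))
    support = sum(1 for r in universe if lhs(r) and rhs(r))
    if n_lhs == 0 or n_rhs == 0:
        return None  # confidence and lift are undefined
    p_lhs, p_rhs, p_both = n_lhs / n, n_rhs / n, support / n
    # 2x2 contingency table of LHS true/false against RHS true/false.
    a, b = support, n_lhs - support
    c, d = n_rhs - support, n - n_lhs - n_rhs + support
    denom = n_lhs * (n - n_lhs) * n_rhs * (n - n_rhs)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return {
        "universe": n, "lhs_support": n_lhs, "rhs_support": n_rhs,
        "support": support,
        "confidence": p_both / p_lhs,
        "lift": p_both / (p_lhs * p_rhs),
        "leverage": p_both - p_lhs * p_rhs,
        "p_value": p_value,
        "e_value": p_value * n_rules_examined,
    }

names = ["beer", "milk", "bread", "banana"]
# Made-up data: "X" = bought, "-" = not bought (assumption), " " = N/A.
raw = ["XX-X", "X-XX", "XXXX", "X---", "-XX-", "---X", "XX--", "-X- ", "--X-"]
table = [{n: (None if ch == " " else ch) for n, ch in zip(names, s)} for s in raw]

m = rule_metrics(
    table, names,
    lhs=lambda r: r["beer"] == "X" and (r["milk"] == "X" or r["bread"] == "X"),
    rhs=lambda r: r["banana"] == "X",
)
print(m)
```

One of the nine rows has no value for banana, so it falls outside the universe of this rule and does not contribute to any of the counts.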
