Simple Combinatorial Association Rule Finder
SCARF is an online tool provided by the PIT Bioinformatics Group for finding (or "mining") combinatorial association rules in data tables.
By using the service you accept our terms of use.
Introduction
Combinatorial association rules are used for discovering relationships between attributes. For instance, if the rows of a data table correspond to transactions, where the items bought are marked with X characters, then association rules can demonstrate which items are frequently bought together.
An example rule could be beer=X & (milk=X or bread=X) ---> banana=X. This rule states that if someone buys beer and either milk or bread, then they tend to buy bananas as well.
There are no "completely false" or "completely true" rules. The strength of a rule can be evaluated through multiple mathematical quantities like universe, support, lift or p-value (see the Glossary).
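As a rough illustration of how these quantities follow from the Glossary definitions, the sketch below evaluates the example rule on a small made-up table. The data, the use of "-" for "not bought", and the helper names `lhs` and `rhs` are assumptions for this example only; this is not SCARF's own code.

```python
# Toy transactions (hypothetical data): "X" = bought, "-" = not bought, "" = N/A.
rows = [
    {"beer": "X", "milk": "X", "bread": "-", "banana": "X"},
    {"beer": "X", "milk": "-", "bread": "X", "banana": "X"},
    {"beer": "X", "milk": "-", "bread": "-", "banana": "-"},
    {"beer": "-", "milk": "X", "bread": "X", "banana": "X"},
    {"beer": "X", "milk": "X", "bread": "X", "banana": "-"},
    {"beer": "-", "milk": "-", "bread": "-", "banana": "-"},
    {"beer": "X", "milk": "",  "bread": "X", "banana": "X"},  # milk is N/A
]

def lhs(row):
    """Left-hand side: beer=X & (milk=X or bread=X)."""
    return row["beer"] == "X" and (row["milk"] == "X" or row["bread"] == "X")

def rhs(row):
    """Right-hand side: banana=X."""
    return row["banana"] == "X"

# Universe: rows where every column occurring in the rule is non-N/A.
columns = ["beer", "milk", "bread", "banana"]
universe = [r for r in rows if all(r[c] != "" for c in columns)]

n = len(universe)
lhs_support = sum(lhs(r) for r in universe)
rhs_support = sum(rhs(r) for r in universe)
support = sum(lhs(r) and rhs(r) for r in universe)

# Probabilities are relative frequencies over the universe.
confidence = support / lhs_support
lift = (support / n) / ((lhs_support / n) * (rhs_support / n))

print(n, support, confidence, lift)
```

In this toy table the last row is excluded from the universe because its milk cell is empty, so the rule is judged on six rows only.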
Using SCARF
First, you will need to upload a data table. The data table must consist of a header and multiple records. The accepted table format is described in the section "The format of the input table". An example data table is provided for illustration.
After you have selected the data table, you will have to specify some parameters for rule mining:
Pattern | A wildcard rule template such as ☐ & (☐ or ☐) ⇒ ☐. The boxes will be filled with elementary equalities (e.g. banana=X). |
Max. rule count | The maximum number of rules to output. |
Min. universe | Minimum universe requirement for a rule (see Glossary). |
Min. support | Minimum support requirement for a rule (see Glossary). |
Min. confidence | Minimum confidence requirement for a rule (see Glossary). |
Min. lift | Minimum lift requirement for a rule (see Glossary). |
Max. p-value | Maximum p-value requirement for a rule (see Glossary). |
Columns | Here you can specify which columns (i.e. transaction attributes) you want to allow on the LHS/RHS of the rules, respectively. E.g. you can allow 'beer' and 'water' on the left-hand side, and 'apples' and 'oranges' on the right-hand side. In other words, you can use this option to specify which attributes you would like to infer from which attributes. |
Name/Email | Rule mining is a complex process and usually takes several minutes. Results are not available instantly. Therefore we need your name and e-mail address to send you an e-mail when your results are ready. |
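To make the role of the threshold parameters concrete, here is a minimal sketch of how a list of already-evaluated rules could be filtered. The `Rule` class, the `select_rules` function and the sample numbers are hypothetical, and ranking the surviving rules by lift is an assumption (the Glossary only says that the goodness of a rule is its lift); SCARF's actual implementation may differ.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """Hypothetical container for a rule and its already-computed statistics."""
    text: str
    universe: int      # number of rows in the universe
    support: int       # number of rows where LHS and RHS are both true
    confidence: float
    lift: float
    p_value: float

def select_rules(rules, *, max_rule_count, min_universe, min_support,
                 min_confidence, min_lift, max_p_value):
    """Keep the rules that satisfy every threshold, best lift first."""
    kept = [
        r for r in rules
        if r.universe >= min_universe
        and r.support >= min_support
        and r.confidence >= min_confidence
        and r.lift >= min_lift
        and r.p_value <= max_p_value
    ]
    # Assumption: rules are ranked by lift ("goodness" in the Glossary).
    kept.sort(key=lambda r: r.lift, reverse=True)
    return kept[:max_rule_count]

# Made-up candidate rules for illustration.
candidates = [
    Rule("a", universe=100, support=30, confidence=0.9, lift=2.0, p_value=0.001),
    Rule("b", universe=100, support=5,  confidence=0.9, lift=3.0, p_value=0.001),
    Rule("c", universe=100, support=40, confidence=0.8, lift=2.5, p_value=0.01),
    Rule("d", universe=100, support=40, confidence=0.8, lift=1.5, p_value=0.2),
]
selected = select_rules(candidates, max_rule_count=1, min_universe=50,
                        min_support=10, min_confidence=0.5, min_lift=1.2,
                        max_p_value=0.05)
print([r.text for r in selected])
```

Here rule "b" is dropped for insufficient support and rule "d" for its large p-value; of the two remaining rules, only the one with the higher lift fits within the maximum rule count.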
The format of the input table
The input table must be a comma- or semicolon-separated CSV file.
- The separator char is automatically recognized.
- The character set should be ASCII. Non-ASCII characters (for example, UTF-8 encoded accented letters) are not recognized.
- Leading and trailing whitespace in cells is ignored.
The first row must contain the column names separated by the separator character.
Column names may only contain English letters, numbers, underscores, dots and square brackets. In technical terms, they must match the regular expression ^[a-zA-Z0-9_\.\[\]]+$.
The other rows must contain the data itself.
All attribute values are single characters in SCARF. This means that each cell should be empty or one character long. Cells longer than one character generate a warning, and only the first non-whitespace character of these cells is considered.
Empty cells are used to represent N/A (data not available). Thus you can supply sparse data tables for SCARF if some measurements were not taken for the whole population.
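The format rules above can be sketched as a small reader. This is only an illustration under stated assumptions: the function name `read_table` is hypothetical, the separator is guessed by counting commas and semicolons in the header line (the text does not say how SCARF detects it), whitespace is also trimmed from column names, and cells missing at the end of a record are treated as empty.

```python
import csv
import io
import re
import warnings

# Regular expression for column names, as quoted in the text.
COLUMN_NAME = re.compile(r"^[a-zA-Z0-9_\.\[\]]+$")

def read_table(text):
    """Parse a comma- or semicolon-separated table into (header, rows).

    Each row is a dict mapping column name to a single character,
    or to None for an empty (N/A) cell.
    """
    header_line = text.splitlines()[0]
    # Assumed heuristic: use whichever separator is more frequent in the header.
    sep = ";" if header_line.count(";") > header_line.count(",") else ","

    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = [name.strip() for name in next(reader)]
    for name in header:
        if not COLUMN_NAME.match(name):
            raise ValueError(f"invalid column name: {name!r}")

    rows = []
    for line_no, record in enumerate(reader, start=2):
        if not record:
            continue  # skip blank lines
        row = {}
        for i, name in enumerate(header):
            # Assumption: cells missing at the end of a record count as empty.
            cell = record[i].strip() if i < len(record) else ""
            if len(cell) > 1:
                warnings.warn(f"line {line_no}, column {name}: "
                              f"only the first character of {cell!r} is used")
            row[name] = cell[0] if cell else None
        rows.append(row)
    return header, rows

header, rows = read_table("beer;milk\nX; -\n ;Xy\n")
print(header, rows)
```

In the sample input the cell "Xy" triggers a warning and is read as "X", while the whitespace-only cell becomes N/A.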
Glossary
Column | A column of the data table corresponds to an attribute of the entries in the data table. Columns may also be called fields or variables. A column may have N/A values for some rows, where the corresponding attribute is unknown or missing. This is designated with an empty cell. |
Confidence | The probability that the RHS is true, given that the LHS is true. In mathematical terms: the conditional probability Pr(LHS and RHS) / Pr(LHS), where the event space consists of the rows of the universe with equal probability assigned. |
E-value | Because SCARF tries out a lot of rule candidates during execution, it is possible that a rule will get a good (i.e. small) p-value by mere chance. The E-value corrects for this and is therefore a more reliable measure of whether a rule could have arisen at random. It equals the p-value multiplied by the number of all the possible rules examined by SCARF. The E-value can be viewed as an experiment-wide p-value. The smaller the E-value, the more related the LHS and RHS of a rule. |
Goodness | The goodness of a rule is defined as its lift. |
Leverage | Measures the absolute level of dependence between the LHS and the RHS. In mathematical terms: the difference Pr(LHS and RHS) - Pr(LHS)*Pr(RHS). |
LHS | Left-hand side. |
LHS support | The number of data rows in the universe, where the LHS of a rule is true. |
Lift | Measures the relative level of dependence between the LHS and the RHS. In mathematical terms: the ratio Pr(LHS and RHS) / (Pr(LHS)*Pr(RHS)). |
N/A | Data not available. Refers to an unknown (missing) value in the data table. |
p-value | The Chi-Squared test is a well-known method which measures the dependence between the LHS and the RHS. It yields a number between 0 and 1 called the p-value. The smaller the p-value, the more related the LHS and RHS of a rule. |
Row | A data row is an entry (or record) in the data table. Each row is an assignment of some values to the columns. |
RHS | Right-hand side. |
RHS support | The number of data rows in the universe, where the RHS of a rule is true. |
Support | The number of data rows in the universe, where both the LHS and RHS are true. |
Universe | The (number of) data rows where all columns occurring in a rule have a non-N/A value. |
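The p-value and E-value entries can be illustrated from the four counts defined above (universe, LHS support, RHS support, support). The sketch below uses Pearson's chi-squared test on the 2x2 table of LHS true/false versus RHS true/false, with one degree of freedom and no continuity correction. Whether SCARF applies a correction, and how it treats degenerate tables, is not stated here, so those choices and the function names are assumptions.

```python
import math

def chi_squared_p_value(universe, lhs_support, rhs_support, support):
    """Pearson chi-squared p-value for the 2x2 LHS/RHS contingency table."""
    a = support                                          # LHS true,  RHS true
    b = lhs_support - support                            # LHS true,  RHS false
    c = rhs_support - support                            # LHS false, RHS true
    d = universe - lhs_support - rhs_support + support   # LHS false, RHS false

    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        # Degenerate table: the test is undefined; returning 1.0 is an
        # assumed convention meaning "no evidence of dependence".
        return 1.0

    chi2 = (universe * (a * d - b * c) ** 2
            / (row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def e_value(p_value, rules_examined):
    """E-value: the p-value multiplied by the number of rules examined."""
    return p_value * rules_examined

# Made-up counts: 100 rows in the universe, LHS true in 50, RHS true in 50,
# both true in 40.
p = chi_squared_p_value(100, 50, 50, 40)
print(p, e_value(p, 10_000))
```

With independent sides (for example, both true in 25 of the 100 rows) the statistic is zero and the p-value is 1.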
Terms of use
You can use this service only if you accept the following terms: We do not guarantee anything about this service: We do not state anything about the usability of this service, and we do not state that the results that we may return can be used for any purpose. We cannot guarantee that this service will be available in the future, and we cannot guarantee that your query would generate any output at all.
Privacy: We will not give out your data to anyone, and, normally, only you can retrieve the results of your query using the unique webpage identifier generated for you. However, we cannot guarantee that others do not intercept the traffic between you and our server. Therefore, do not use our webserver for proprietary data analysis: we cannot guarantee the integrity and safety of your data.
How to cite?
Balazs Szalkai, Vince K. Grolmusz, Vince I. Grolmusz: Identifying Combinatorial Biomarkers by Association Rule Mining in the CAMD Alzheimer's Database, Archives of Gerontology and Geriatrics Vol. 73, pp. 300-307 (2017), https://doi.org/10.1016/j.archger.2017.08.006
Balazs Szalkai, Vince Grolmusz: SCARF: A Biomedical Association Rule Finding Webserver, arXiv preprint arXiv:1709.09850 (2017)