GEDEVO-M is a tool for multiple network alignment. Start ./gedevo-m with no parameters for a short description of the command line params. See also test scripts for examples of GEDEVO with set parameters. GEDEVO-M v0.1 supplementary README [29th Jan 2013] this version of GEDEVO-M does not support: * directed edges, i.e. all edges hangled as undirected; * greedy initialization; * pre-matching. The remaining text is taken from the README file of GEDEVO (see http://gedevo.mpi-inf.mpg.de/) =========================================================== = FILE FORMATS = =========================================================== *** edge list file *** Text file with protein interactions of a network. Each line represents a directed interaction. Two entries in the opposite direction each are considered an undirected edge. Protein names are truncated at the first pipe `|` character, and are case sensitive! - This is handled in the same way for every other file type, too. It looks like this: protein1 protein2 protein2 protein3 protein2 protein1 [...] *** NTW file *** Text file with protein interactions. Each line represents a directed interaction. Two entries in the opposite direction each are considered an undirected edge. A .ntw file can store multiple networks, therefore the first word in each line is the network name this interaction belongs to. It looks like this: Header Line group_name protein1 protein2 group_name protein2 protein3 group_name protein2 protein1 [...] The header line is ignored. Anything else after the first 3 words per line is ignored. *** SIF file *** Text file with protein interactions. Common format for import/export with Cytoscape. Only one network can be stored in a .sif file. It looks like this: protein1 interaction protein2 protein2 interaction protein3 protein42 interaction protein55 [...] An interaction can be one of the following: DirectedEdge directed d UndirectedEdge undirected u The interaction type is not case sensitive. If the interaction type is none of the above, undirected is assumed. *** SIGS file *** Text file with graphlet signature vectors for one network. It looks like this: proteinName 1 3 4 4 1 2 3 2 4 6 6 3 2 1 3 5 3 2 1 [...] [...] The first word is the protein name, followed by 73 integer numbers which represent the number of occurances of each graphlet. *** DIST file *** Text file with a matrix of numbers, used for precomputed BLAST E-values. It looks like this: 3 5 1 proteinA1 1 proteinA2 1 proteinA3 2 proteinB1 2 proteinB2 2 proteinB3 2 proteinB4 2 proteinB5 0.11 0.12 0.13 0.14 0.15 0.21 0.22 0.23 0.24 0.25 0.31 0.32 0.33 0.34 0.35 Assuming two networks A (3 nodes) and B (5 nodes). 3 and 5, respectively, are the number of nodes in each group (this directly corresponds with the --groups setting. For this matrix, it is important which network is the first one!). The two initial numbers are followed by the protein names in the order they appear (group 1 ~ columns, group 2 ~ rows). The rest are the individual distances. For example: d(proteinA2, proteinB1) = 0.21 if used, DIST files have to be spcified for all pairs of input networks. For example for four networks net1, net2, net3, net4 use options --blast net1 net2 file12.dist \ --blast net1 net3 file13.dist \ --blast net1 net4 file14.dist \ --blast net2 net3 file22.dist \ --blast net2 net4 file23.dist \ --blast net3 net4 file34.dist =========================================================== = INTERNAL CONFIG OPTIONS = =========================================================== These are the available parameters for --config opt=value (or -c opt=value). If in doubt, leave these at the defaults, or grep the source code for the variable of interest. The recongized options are listed with their default values. Some of these have a direct equivalent command line parameter, for example: --pop = --config maxAgents=N *** Final weights *** * weightGED = 1 * weightGraphlets = 0 * weightNodeDist = 0 * weightPairsum = 0 These weights scale the relative influence of every scoring function when evaluating the final score of an individual; the resulting value defines survival chance. If a weight is zero, the corresponding function is not used for the final score. weightNodeDist is the simple node degree distance, which is equal to the 2-graphlet signature distance. It is automatically used instead of the graphlet score if no graphlet signatures are available, but the graphlet score is used. weightPairsum is all pair scores summed up (see below). *** Pairwise weights *** * pairWeightGED = 1 * pairWeightGraphlets = 1 * pairWeightNodeDist = 0 These are similar to the final weights, but they are applied to every pair; the resulting value measures the difference of one specific node pair and the probability of it being picked up and re-assigned by the EA during offspring generation. For example, pairs with a low score are likely to be kept intact, whereas pairs with a high score are likely to be broken and re-assigned. It is important that the pair score and the final score use sane multipliers, that is, offspring generation should go in a direction that will be considered good in the evaluation phase. In other words, both scores must correlate, otherwise the EA will work against itself and give no useful results. pairWeightNodeDist is the same as above, but per-pair. *** BLAST threshold *** * thresholdBLAST = 0.000001 If the BLAST score for a given pair is below or equal to this threshold, it is used exclusively to score this pair, otherwise a combination of the other pairwise scores is used instead, appropriately scaled by their weights. ** Default pair score ** * pairNullValue = 1 This is used for scores where no better value is available, i.e. when mapping a valid node against nil. This value controls how "forcefully" the two graphs are aligned, in other words, how high the penalty for mapping a node to nil is. A value of 1 is the worst score possible, so the algorithm will try to prevent node <-> nil pairs at all costs. Lower values will cause exceptionally bad pairs to be considered worse than mapping against nil, so preferably hard-to-align nodes will be paired with nil and thus counted as inserted or deleted. In short: a value of 1 produces an alignment as compact as possible, therefore minimizing the GED, lower values produce more biologically meaningful alignments, and very low values align only nodes that fit really well, thus causing the alignment to fall apart into unconnected subgraphs. *** Data preprocessing *** * forceUndirectedEdges = 0 If enabled, any directed edges will be transformed into undirected edges. This is useful when the input network data are known to be undirected, but only edges in one direction are present. Additionally, if only undirected edges exist, a slightly faster implementation for GED score calculation is automatically chosen. * matchSameNames = 1 This allows pre-matching proteins that, based on their name, are known to be equal. Pre-matched pairs will never be separated, even if the other data are incomplete and the resulting score would cause these pairs to be broken quickly. (Names are case sensitive!) *** GED penalties *** * edgeAddedGEDScore = 1 * edgeRemovedGEDScore = 1 * edgeSubstitutedGEDScore = 0 * edgeFlippedGEDScore = 0.8 * edgeDirectedToUndirectedGEDScore = 0.2 * edgeUndirectedToDirectedGEDScore = 0.2 * nodeAddedGEDScore = 0 * nodeRemovedGEDScore = 0 The raw GED for a mapping is calculated from the number of removed/added/substituted/flipped edges required to transform one graph into the other graph, given this mapping. The raw counts are multiplied with these penalties and summed up, producing the actual GED score. A substitution is desirable and should not contribute to the GED score; the same is true for node insertions and deletions, because any of the latter already cause insertion/deletion of all edges associated with that node. *** Greedy settings *** (Do not use this, it is really slow and not necessary!) * evo_greedyInitOnInit = 0 If enabled, the first iteration will perform a greedy initialization step. * evo_greedyInitPerRound = 0 Because greedy initialization is a costly operation, this is a special setting that defines the relative number of individuals to mutate using greedy. The value can be in [0 .. 1], where 0 means greedy is not used and 1 means that all individuals use greedy. However, individuals initialized with this method are not mutated otherwise, therefore a value of 1 is entirely useless. A value of 0.0075 has been observed to improve convergence for a limited number of early iterations; values above 0.1 are too slow to have any benefit in practice. * evo_greedyInitScoreLimit = 0.4 This setting defines the upper score limit that greedy considers worth keeping. If a pair of nodes has a score worse than this value, it is ignored as a potential match. If there are no usable matches satisfying the threshold, the node in question is mapped to nil. *** Population settings *** * maxAgents = 400 This controls how many new individuals are created in each offspring generation phase. The number of new individuals in each iteration has the greatest impact on convergence and execution speed, and some impact on memory utilization. Each individual has to be created and later evaluated, both taking most of the runtime of the algorithm. A value of 100 - 1000, depending on the network size, has been found to work well (1000 for smaller networks up to 500 nodes, 200 or more for large graphs up to 10000 nodes). * basicHealth = 100 * maxHealthDrop = 100 The first value is used as the initial health for each individual, which is then decreased by the second value relative to it. The worst individual will be reduced by the full amount, better ones by linearly interpolated amounts. This setting is only useful to prolong or shorten survival times, and has no direct influence on convergence, only on the total number of individuals existing at any given time. *** Miscellaneous *** * numThreads = 0 Number of worker threads to use. By default the program will detect the number of available CPUs automatically and create one thread per CPU.