We make the search index available as downloadable files in several formats, including JSON objects designed for network graphing.
Our sitemap scraper runs four times a day. Logs from each run can be viewed online. page
We distribute the index files individually and as a single 35 MB compressed tar file. tgz
tar czf public/sites.tgz sites *.txt
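For readers who download the archive, here is a minimal Python sketch that lists the sites it contains without unpacking anything. It assumes member names begin with `sites/<site>/`, as the tar command above suggests. The function name `sites_in_archive` is only illustrative.

```python
import tarfile


def sites_in_archive(path):
    """Return the sorted site names found under sites/ in a gzipped tar file.

    Assumption: members are stored as sites/<site>/..., which is what
    `tar czf public/sites.tgz sites *.txt` would produce.
    """
    sites = set()
    with tarfile.open(path, "r:gz") as archive:
        for name in archive.getnames():
            parts = name.split("/")
            # Skip the top-level rollup files and the bare "sites" entry.
            if len(parts) >= 2 and parts[0] == "sites" and parts[1]:
                sites.add(parts[1])
    return sorted(sites)
```

Reading only the member names avoids writing untrusted paths to disk.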
# Index
The index is organized as a collection of text files containing the unique words extracted from various fields of each page. These are grouped into directories by site and then by page within each site. txt
sites/
  fed.wiki.org/
    words.txt
    links.txt
    sites.txt
    items.txt
    pages/
      how-to-wiki/
        words.txt
        links.txt
        sites.txt
        items.txt
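The layout above lends itself to a simple word lookup. The following Python sketch loads each site's `words.txt` into a set and reports which sites contain a given word. It assumes the files hold whitespace-separated words, typically one per line, which the source does not spell out. The names `load_site_words` and `sites_containing` are illustrative, not part of the scraper.

```python
from pathlib import Path


def load_site_words(root):
    """Map each site directory name under `root` to the set of words in its words.txt.

    Assumption: words.txt holds whitespace-separated words.
    """
    index = {}
    for site_dir in sorted(Path(root).iterdir()):
        words_file = site_dir / "words.txt"
        if site_dir.is_dir() and words_file.is_file():
            index[site_dir.name] = set(words_file.read_text(encoding="utf-8").split())
    return index


def sites_containing(index, word):
    """Return the sorted names of the sites whose word set includes `word`."""
    return sorted(site for site, words in index.items() if word in words)
```

The same approach would apply one level down, reading `pages/<slug>/words.txt` to narrow a hit to individual pages.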
We include federation-wide rollups of these files, which we don't use for search but maintain anyway.
words.txt links.txt sites.txt items.txt
# Counts
We accumulate various counts in another text file, with one line of JSON for each scrape. txt
counts.txt
Here we show a sample line after being formatted as indented text. The scan counts are read from logs while the index counts are line counts of the site rollup text files.
{
  "date": 1441545903,
  "scan": {
    "sites": 676,
    "pages": 31983
  },
  "index": {
    "counts": 3,
    "items": 258354,
    "links": 48549,
    "plugins": 48,
    "sites": 776,
    "words": 115927
  }
}
See Sitemap Scrape Statistics for counts plotted.
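Because counts.txt holds one JSON object per line, it can be read one record at a time. This Python sketch parses the file and summarizes each scrape. It assumes `date` is a Unix timestamp in seconds, which fits the sample value but is not stated in the text. The function names are illustrative.

```python
import json
from datetime import datetime, timezone


def read_counts(path):
    """Parse a file with one JSON object per line, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(record):
    """Return (UTC date, sites scanned, pages scanned) for one scrape record.

    Assumption: record["date"] is seconds since the Unix epoch.
    """
    when = datetime.fromtimestamp(record["date"], tz=timezone.utc)
    return when.date().isoformat(), record["scan"]["sites"], record["scan"]["pages"]
```

Applied to the sample line, `summarize` would return `("2015-09-06", 676, 31983)`.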
# Graphs
We aggregate information from the index into single files representing graphs as nodes and arcs, in two forms.
In the first form, nodes are site names and arcs lead to remote sites. json
"fed.wiki.org": {
  "pages": 86,
  "links": [
    "design.fed.wiki.org",
    "fed.coevolving.com",
    "fed.wiki.org",
    "fedwiki.rodwell.me",
    "glossary.asia.wiki.org"
  ]
}
In the second form, nodes are page slugs and arcs are internal links. json
"how-to-wiki": {
  "forks": 37,
  "links": [
    "add-pages",
    "add-paragraphs",
    "copy-pages",
    "find-sites",
    "follow-links"
  ]
}
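Both graph forms record only outgoing arcs, so following links backward requires inverting them. The Python sketch below does this for either form. It assumes the whole file parses to one JSON object keyed by node name, each value carrying a `links` array as in the samples. The helper name `invert_links` is illustrative.

```python
from collections import defaultdict


def invert_links(graph):
    """Build a map from each link target to the sorted nodes that link to it.

    Assumption: `graph` is a dict of node name -> {"links": [...], ...}.
    Self-links are ignored.
    """
    inbound = defaultdict(set)
    for node, info in graph.items():
        for target in info.get("links", []):
            if target != node:
                inbound[target].add(node)
    return {target: sorted(sources) for target, sources in inbound.items()}


# Usage with a fragment shaped like the slug graph sample.
example = {
    "how-to-wiki": {"forks": 37, "links": ["add-pages", "copy-pages"]},
    "add-pages": {"forks": 5, "links": ["how-to-wiki", "copy-pages"]},
}
backlinks = invert_links(example)
```

In this fragment, `backlinks["copy-pages"]` lists both `add-pages` and `how-to-wiki`. The `forks` value for `add-pages` is made up for the example.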
We offer JavaScript versions of the aggregated graph data files that can be included in a web page with a script tag. site-web.js slug-web.js
# Applications
Title Network Browser allows one to navigate from page title to page title, following links forward and backward.
Site Network Diagram shows all visible sites connected by arcs where there are neighborhood citations.
Recent Activity Report shows sites found to have new activity in the last week.
Neo4J with batch loading and experimental interactive query plugin.