Articles are grouped by volume, and they're indexed at https://theoryofcomputing.org/articles/main/
Each article page links to a source.zip file containing all the information about that
specific article. The idea is to download all these zip files and extract info from them.

There are instructions for mirroring with rsync, but they are outdated. Therefore we need
to scrape the website using wget.

Parsing LaTeX from Python is a nightmare (there is no suitable module to be found, and not
all papers use the same LaTeX snippets), therefore some data is extracted from the articles'
HTML pages instead (they use Google Scholar citation_* <meta> tags).
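
As an illustration, here is a minimal sketch of pulling those tags out of a mirrored
page with Python's standard html.parser (the file path is hypothetical; the actual
extraction lives in toc.py):

    from html.parser import HTMLParser

    class CitationMetaParser(HTMLParser):
        # Collects the Google Scholar citation_* <meta> tags from an article page.
        def __init__(self):
            super().__init__()
            self.citations = {}

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            name = attrs.get("name") or ""
            if name.startswith("citation_"):
                # Some tags repeat (e.g. citation_author), so keep every value.
                self.citations.setdefault(name, []).append(attrs.get("content", ""))

    # Hypothetical path produced by the wget mirror below:
    with open("theoryofcomputing.org/articles/v001a001/index.html") as f:
        parser = CitationMetaParser()
        parser.feed(f.read())
    print(parser.citations)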


Mirror the whole website:

    wget --mirror https://theoryofcomputing.org


Decompress all "source.zip" archives into "source.zip.decompressed":

    find . -type f -name "source.zip" -exec unzip -d "{}.decompressed" "{}" \;


Extract data from the mirror and create the nodes:

    mkdir --parents pdf/theoryofcomputing.org
    mkdir nodes
    find . -type d -regex ".*/articles/v[0-9][0-9][0-9]a[0-9][0-9][0-9]$" -exec ./toc.py {} \;
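
toc.py itself lives in this repository and is not reproduced here. Purely to illustrate
the calling convention above, a hypothetical skeleton (the index.html name, the location
of source.zip, and the output layout are assumptions, not the actual script):

    #!/usr/bin/env python3
    # Hypothetical skeleton of a script invoked as: ./toc.py <article-directory>
    import sys
    from pathlib import Path

    article_dir = Path(sys.argv[1])  # e.g. ./theoryofcomputing.org/articles/v001a001

    # Metadata comes from the citation_* <meta> tags in the article page
    # (see the parsing sketch above); assuming the page is saved as index.html.
    html = (article_dir / "index.html").read_text()

    # The unpacked LaTeX sources, assuming source.zip sat next to the page
    # and was decompressed by the unzip step above.
    sources = article_dir / "source.zip.decompressed"
    if sources.is_dir():
        for path in sources.iterdir():
            print(path)

    # ... then write a node under ./nodes/ and copy the PDF under ./pdf/ ...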