Scraping btcbase.org

07/08/19, modified 07/08/19

I wanted to add a comment to trilema, but apparently it looks like spam. So here is the comment for http://trilema.com/2019/trilema-goes-dark.

Currently I'm stuck on reversing the original logs from the output at btcbase.org, because of three problems:

1. The new format using block quotes [link][text] sometimes produces links that are ambiguous and cannot be distinguished from the old way (for example [link][link], or [absolute btcbase link][text] vs. [relative btcbase link][text]).
2. Sometimes the conversion to HTML went wrong, for example in http://btcbase.org/log/2016-12-29#1592569. I could make some exceptions for these.
3. It seems that the btcbase.org HTML output code moves trailing punctuation into the description of the link; that is, commas, dots, closing brackets and the like are all eaten and put in the description of the link.
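
For the third problem, one possible way to undo the damage is to peel the trailing punctuation back off the link description before reconstructing the log line. The following is only a sketch under assumptions: the function name `split_trailing` and the set of punctuation characters are hypothetical and not taken from scrape_log.py, and the rule for closing brackets (keep them when they are balanced inside the text) is a guess at what the original line looked like.

```python
# Punctuation assumed to be swallowed into the link description (a guess).
TRAILING = ".,;:!?)]}"

# Closing bracket -> matching opening bracket.
CLOSERS = {")": "(", "]": "[", "}": "{"}


def split_trailing(text):
    """Split a link description into (core, tail).

    `tail` is the run of trailing punctuation that most likely stood
    outside the link in the original log line. A closing bracket is left
    in `core` when it has a matching opener earlier in the text, since it
    then probably belongs to the link itself.
    """
    end = len(text)
    while end > 0 and text[end - 1] in TRAILING:
        ch = text[end - 1]
        if ch in CLOSERS:
            opener = CLOSERS[ch]
            # Balanced so far: treat the bracket as part of the link text.
            if text[:end].count(opener) >= text[:end].count(ch):
                break
        end -= 1
    return text[:end], text[end:]


if __name__ == "__main__":
    for sample in ("http://example.com/foo).",
                   "http://example.com/wiki/Foo_(bar),",
                   "plain"):
        print(split_trailing(sample))
```

The tail can then be written back after the link when the log line is rebuilt. This does not help with the first problem: a description that equals its own URL stays ambiguous whichever way the punctuation is split.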

Here are my scripts, to preserve the current status:

The scraper: http://ave1.org/tarpit/log/scrape_log.py

To do all: http://ave1.org/tarpit/log/scrape_all.py
