If you missed Awoo's thread, imgur is deleting all anonymously uploaded images on May 15th. This most likely includes the majority of image content posted to leftist subreddits.
If you want to save any imgur content from a subreddit, now is your last chance.
This is the script I am using, with contributions from u/captcha.
Using it requires you to download a pushshift archive of the subreddit you want to extract images from. For content up until 2022, this is easiest to obtain from redarcs.
Install the script's dependencies with
pip install httpx zstandard
(argparse, asyncio, and json are part of the Python standard library and don't need installing.)
Then run the script with the first argument being a path to the .zst file that you have downloaded.
Optionally, you can limit the script to downloading from a single domain, which is useful given the limited time frame to grab from imgur specifically. To do this, add the argument --domain <domain>
Some other options:
--subreddit <subreddit name>
If you get a .zst file from the pushshift torrent archives, you can use this to filter to only the subreddits you want
--concurrency <integer>
Max number of requests to run concurrently
-enable_retries
After attempting to download every image in the subreddit, retry any downloads that failed
--max_retries <integer>
Maximum number of times to reattempt downloading files before giving up
--retry_hesitate <integer>
When reattempting downloads, wait this many seconds first; use this if you are being rate limited
--retry_cooldown <integer>
How long, in minutes, to wait between each set of redownload attempts. Use this to try again considerably later if hosts are unreliable; good to run during off-peak hours.
-enable_proxy
I added this to toggle WireGuard proxies to work around potential IP blocks. Don't use this unless you know what you're doing.
Example:
python script.py Chapotraphouse_submissions.zst --domain i.imgur.com -enable_retries
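For reference, capping simultaneous requests (what --concurrency controls) is typically done with an asyncio.Semaphore. A minimal sketch, with a stub standing in for the real HTTP download:

```python
import asyncio


async def fetch(url):
    # Stand-in for the real HTTP download (e.g. an httpx request).
    await asyncio.sleep(0.01)
    return url


async def download_all(urls, concurrency=4):
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url):
        async with sem:  # at most `concurrency` fetches run at once
            return await fetch(url)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(u) for u in urls))


results = asyncio.run(download_all([f"img{i}" for i in range(10)], concurrency=3))
```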
I'm going to miss all those DIY, what-is-this-thing, and how-to posts in the future
```python
async def defer():
    await get_image(http_client, post)
    con_lock.release()

await con_lock.acquire()
create_task(defer())
if not os.path.isfile(post["file_name"]):
    undownloaded_images.append(post)
    break
```
I think that last `if not` block should be in the `defer` function. `create_task` forks the `defer()` coroutine to the background and continues on, so you now have a race condition between when the `get_image` coroutine downloads the file and when the `main` coroutine checks if the file exists. The `main` coroutine should always win that race, so your script is probably marking every file as not downloaded and never exits. By putting the check in the `defer` coroutine you ensure the download has finished before you check whether the file exists.

Also, don't just await the created task. That would work, but it means you're waiting for the first request to finish before starting the next.
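A self-contained sketch of the fixed pattern, with the existence check moved inside `defer`; `get_image` is a stub here, and a semaphore stands in for the script's lock (the real names come from the script above):

```python
import asyncio
import os

undownloaded_images = []


async def get_image(post):
    # Stub: the real function would download post["url"] to post["file_name"].
    await asyncio.sleep(0)


async def main(posts, concurrency=4):
    sem = asyncio.Semaphore(concurrency)

    async def defer(post):
        async with sem:
            await get_image(post)
        # The check now runs only after the download attempt has finished,
        # so there is no race with the main coroutine.
        if not os.path.isfile(post["file_name"]):
            undownloaded_images.append(post)

    tasks = [asyncio.create_task(defer(p)) for p in posts]
    await asyncio.gather(*tasks)


posts = [{"url": "https://i.imgur.com/x.jpg",
          "file_name": "nonexistent_test_file_xyz.jpg"}]
asyncio.run(main(posts))
# The stub never writes the file, so the post lands in undownloaded_images.
```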
Brief summary of how asyncio works:

`async def` means the function returns a `Coroutine`, which is code that can be executed; if you look into its type, it's based off `Generator`.

`await` runs a coroutine and suspends the current code until the coroutine is finished.

`create_task` runs a coroutine but does not block until it finishes. It does return a `Task`, which you can use to check on its progress, cancel it, etc.
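The difference between awaiting directly and using `create_task` shows up in total runtime; a small demonstration (timings are illustrative):

```python
import asyncio
import time


async def work(n):
    await asyncio.sleep(0.1)
    return n


async def sequential():
    # Awaiting each coroutine directly runs them one after another.
    return [await work(i) for i in range(3)]


async def concurrent():
    # create_task starts all of them immediately; gather waits for all.
    tasks = [asyncio.create_task(work(i)) for i in range(3)]
    return await asyncio.gather(*tasks)


start = time.monotonic()
asyncio.run(sequential())   # roughly 0.3 s: three sleeps back to back
seq_time = time.monotonic() - start

start = time.monotonic()
asyncio.run(concurrent())   # roughly 0.1 s: the three sleeps overlap
con_time = time.monotonic() - start
```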