If you missed Awoo's thread, Imgur is deleting all anonymously uploaded images on May 15th. This most likely includes the majority of image content posted to leftist subreddits.

If you want to save any imgur content from a subreddit, now is your last chance.

This is the script I am using, with contributions from u/captcha.

Using it requires you to download a pushshift archive of the subreddit you want to extract images from. For content up until 2022 this is easiest to obtain from redarcs.

Install the script's third-party dependencies with

pip install httpx zstandard

argparse, asyncio and json ship with Python, so they don't need to be installed.

Then run the script with the first argument being a path to the .zst file that you have downloaded. Optionally you can limit the script to only download from a single domain, which is useful given the limited time frame to download from imgur specifically. To do this add the arguments --domain <domain>
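For anyone curious what the domain filter amounts to, here is a rough sketch, not the script's actual code. It assumes the archive has already been decompressed into lines of JSON (one submission per line) and that each submission has "url" and "subreddit" fields; the function name iter_matching_posts is made up for illustration.

```python
import json
from urllib.parse import urlparse


def iter_matching_posts(lines, domain=None, subreddit=None):
    """Yield decoded submissions, optionally filtered by URL host and subreddit.

    `lines` is any iterable of JSON strings, one submission per line
    (an assumption about the archive layout, not taken from the script).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt records
        if not isinstance(post, dict):
            continue

        # Compare the host of the linked URL against the requested domain.
        host = urlparse(post.get("url") or "").hostname or ""
        if domain is not None and host.lower() != domain.lower():
            continue

        # Subreddit names are compared case-insensitively.
        name = post.get("subreddit") or ""
        if subreddit is not None and name.lower() != subreddit.lower():
            continue

        yield post
```

Filtering on the exact host means i.imgur.com and imgur.com are treated as different domains, which matches how the --domain i.imgur.com example is written.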

Some other options

--subreddit <subreddit name> If you get a zst file from the pushshift torrent archives, you can use this to filter to only the subreddits you want

--concurrency <integer> Max number of requests to run concurrently

-enable_retries After attempting to download every image on the subreddit, reattempt to download any that failed to download

--max_retries <integer> Maximum number of times to reattempt downloading files before giving up

--retry_hesitate <integer> When reattempting downloads, wait this many seconds first. Use this if you are being rate limited

--retry_cooldown <integer> How long, in minutes, to wait between each set of redownload attempts. Use this to try again considerably later if hosts are unreliable. Good to run during off-peak hours.

-enable_proxy I added this to toggle WireGuard proxies to work around potential IP blocks. Don't use this unless you know what you're doing.

example:

python script.py Chapotraphouse_submissions.zst --domain i.imgur.com -enable_retries

  • captcha [any] · 2 years ago
        async def defer():
            await get_image(http_client, post)
            con_lock.release()
    
        await con_lock.acquire()
        create_task(defer())
                        
        if not os.path.isfile(post["file_name"]):
            undownloaded_images.append(post)
            break
    

    I think that last "if not" block should be inside the defer function. create_task hands the defer() coroutine to the event loop to run in the background and carries on immediately. So you now have a race condition between the get_image coroutine downloading the file and the main coroutine checking whether the file exists. The main coroutine should always win that race, because the new task can't start running until the main coroutine next awaits something, so your script is probably marking every file as not downloaded and never exiting.

    By putting the check in the defer coroutine you can ensure that the download has finished before you check if the file exists.

    Also, don't try awaiting the task right after creating it. That would work, but it would mean you're waiting for each request to finish before starting the next.
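A minimal sketch of the suggested fix, under a few assumptions: con_lock is taken to be an asyncio.Semaphore, get_image is passed in as a plain async callable taking only the post (the real one also takes http_client), and download_all is a made-up wrapper name. The tasks are only awaited together at the very end, after all of them have been started, so requests still overlap.

```python
import asyncio
import os


async def download_all(posts, get_image, concurrency=4):
    """Run get_image(post) for every post, at most `concurrency` at a time.

    Returns the posts whose file is still missing after their own
    download attempt has finished.
    """
    con_lock = asyncio.Semaphore(concurrency)  # assumed type of con_lock
    undownloaded_images = []
    tasks = []  # keep references so the tasks are not garbage collected

    async def defer(post):
        try:
            await get_image(post)
        except Exception:
            pass  # a failed download is picked up by the file check below
        finally:
            con_lock.release()  # free the slot for the next download
        # This runs only after get_image has finished for this post,
        # so there is no race with the download itself.
        if not os.path.isfile(post["file_name"]):
            undownloaded_images.append(post)

    for post in posts:
        await con_lock.acquire()  # wait for a free slot
        tasks.append(asyncio.create_task(defer(post)))  # start, don't await yet

    await asyncio.gather(*tasks)  # wait for the last downloads to finish
    return undownloaded_images
```

The returned list is what a retry pass would then work through.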

    • captcha [any] · 2 years ago

      Brief summary of how asyncio works:

      • async def means that calling the function returns a Coroutine, which is code that can be executed later. If you look into its type, it's based off Generator.
      • await runs a coroutine and pauses the code you are currently executing until that coroutine has finished.
      • create_task schedules a coroutine to run but does not block until it finishes. It returns a Task, which you can use to check on progress, cancel it, etc.
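A small illustration of the difference between await and create_task, using only the standard library. The function names (work, sequential, concurrent) are invented for this example and have nothing to do with the download script.

```python
import asyncio


async def work(name, log):
    """Pretend to do some I/O, recording when we start and finish."""
    log.append(f"{name} start")
    await asyncio.sleep(0.01)  # stands in for a network request
    log.append(f"{name} end")


async def sequential():
    log = []
    # Each await blocks this coroutine until the awaited one has finished,
    # so "b" cannot start before "a" has ended.
    await work("a", log)
    await work("b", log)
    return log


async def concurrent():
    log = []
    # create_task schedules both coroutines without waiting for either.
    task_a = asyncio.create_task(work("a", log))
    task_b = asyncio.create_task(work("b", log))
    # Awaiting the Task objects afterwards just waits for them to complete.
    await task_a
    await task_b
    return log
```

Running asyncio.run(sequential()) records a's start and end before b starts, while asyncio.run(concurrent()) records both starts before either end.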