• dRLY [none/use name]
    ·
    11 months ago

    I think the only issue with a txt version is that you couldn't view anything that isn't regular text: scans of originals, handwritten material, or images. Most of the docs will be just text, of course, but it would still be easy to lose information. What is your main issue with them being in PDF?

    • Amerikan Pharaoh@lemmygrad.ml
      ·
      11 months ago

      can't control-f for names we feel WOULD turn up in this doc, which means about 1200 pages now have to be trawled through entirely manually

          • IzyaKatzmann [he/him]
            ·
            edit-2
            11 months ago

            pdf2text and tesseract; i believe pdf2text uses tesseract. i have them installed on an apple silicon mac via homebrew (e.g. brew install tesseract or brew install pdf2text)

            could probably also use some ai computer vision package like opencv (i haven't checked; i remember looking around before settling on pdf2text).

            when i used pdf2text it was on pdf slides my prof provided; she ONLY gave pdfs, something about copyright and IP. super interesting prof, great scientist, great researcher, actually a member of some cool orgs like Linnaeus Society, but annoying about her lecture files.

            EDIT: if anyone wants it enough i can try to do a proof-of-concept for like ~15 random pages of a random doc and see how well it goes
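
            rough sketch of what that proof-of-concept could look like. this is an assumption-heavy example, not the pdf2text setup described above: it assumes the pdftoppm (from poppler) and tesseract command-line tools are installed and on PATH, and the function names ocr_pdf_pages and find_names are made up for illustration. it rasterizes a page range, OCRs each page, then does the "control-f" over the recognized text.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def ocr_pdf_pages(pdf_path, first_page, last_page, dpi=300):
    """OCR a page range of a PDF and return {page_number: text}.

    Assumes the `pdftoppm` (poppler) and `tesseract` CLI tools are on PATH.
    """
    pages = {}
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Rasterize the requested pages to PNG files named page-<N>.png.
        subprocess.run(
            ["pdftoppm", "-r", str(dpi), "-f", str(first_page),
             "-l", str(last_page), "-png", str(pdf_path), str(prefix)],
            check=True,
        )
        for image in sorted(Path(tmp).glob("page-*.png")):
            # The page number is the (possibly zero-padded) suffix after the last dash.
            page_number = int(image.stem.rsplit("-", 1)[1])
            # Passing "stdout" as the output base makes tesseract print the text.
            result = subprocess.run(
                ["tesseract", str(image), "stdout"],
                check=True, capture_output=True, text=True,
            )
            pages[page_number] = result.stdout
    return pages


def find_names(pages, names):
    """Return {name: [page numbers]} with case-insensitive, whitespace-tolerant matching."""
    hits = {name: [] for name in names}
    for page_number in sorted(pages):
        # Collapse line breaks and runs of spaces so names split across lines still match.
        flat = re.sub(r"\s+", " ", pages[page_number]).lower()
        for name in names:
            needle = re.sub(r"\s+", " ", name).strip().lower()
            if needle and needle in flat:
                hits[name].append(page_number)
    return hits
```

            usage would be something like find_names(ocr_pdf_pages("doc.pdf", 1, 15), ["Some Name"]), where the file name and the name are placeholders. OCR misreads will cause missed hits, especially on handwriting and poor scans, so an empty result doesn't prove a name is absent.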