• dead [he/him]
    hexagon
    hexbear
    28
    6 months ago

    All the documents are here.

    https://www.courtlistener.com/docket/4355835/giuffre-v-maxwell/?filed_after=&filed_before=&entry_gte=&entry_lte=&order_by=desc

    • InevitableSwing [none/use name]
      hexbear
      17
      edit-2
      6 months ago

      Some site with a name like CourtToText.com should exist because PDFs are the devil.

      ---

      [Edit. Much better would be if there was a law that any court PDF had to have a .txt version too.]

      • dRLY [none/use name]
        hexbear
        16
        6 months ago

        I think the only issue with a txt version would come down to not being able to view the items that aren't actually regular text (scans of originals and the ability to see stuff that was handwritten or whatever) or images. Of course most of the docs will be just text, but it would be easy to lose information. What is your main issue with them being in PDF?

        • SUPAVILLAIN@lemmygrad.ml
          hexbear
          6
          6 months ago

          can't control-f names we feel WOULD turn up in this doc, which means about 1200 pages now have to be trawled through entirely manually

              • IzyaKatzmann [he/him]
                hexbear
                3
                edit-2
                6 months ago

                pdftotext and tesseract. as far as i know pdftotext (part of poppler) only pulls out text that's already embedded in the pdf, so scanned pages still need tesseract for the ocr. i have them installed on an apple silicon mac with homebrew (e.g. brew install tesseract or brew install poppler)

                could probably use some ai computer vision package (i haven't checked, i remember looking around before settling on pdftotext) like opencv.

                when i used pdftotext it was with pdf slides my prof provided, they ONLY gave pdfs. something about copyright and IP. super interesting prof, great scientist, great researcher, actually a member of some cool orgs like Linnaeus Society, and annoying with her lecture files.

                EDIT: if anyone wants it enough i can try to do a proof-of-concept for like ~15 random pages of a random doc and see how well it goes
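
A rough sketch of what that proof-of-concept could look like, for anyone who wants to try it. It assumes the tool meant above is poppler's `pdftotext` (which only reads an embedded text layer), plus `pdftoppm` and `tesseract` on the PATH for pages that turn out to be scans. The function names, the `min_chars` cutoff, and the names in the usage line are made up for illustration, and OCR quality on real court scans is untested here.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path


def split_pages(text):
    """Split pdftotext output into pages; it puts a form feed after each page."""
    pages = text.split("\f")
    # The form feed after the last page leaves an empty tail; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def find_names(pages, names):
    """Map each name to the 1-based page numbers where it appears."""
    hits = {name: [] for name in names}
    for number, page in enumerate(pages, start=1):
        # Collapse whitespace so a name wrapped across lines still matches.
        flat = re.sub(r"\s+", " ", page).lower()
        for name in names:
            if " ".join(name.lower().split()) in flat:
                hits[name].append(number)
    return hits


def extract_text(pdf_path):
    """Read the embedded text layer; '-' sends the output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_page(pdf_path, page_number, workdir):
    """Render one page to PNG with pdftoppm, then OCR it with tesseract."""
    prefix = str(Path(workdir) / f"page-{page_number}")
    subprocess.run(
        ["pdftoppm", "-r", "300", "-png", "-singlefile",
         "-f", str(page_number), "-l", str(page_number), pdf_path, prefix],
        check=True,
    )
    result = subprocess.run(
        ["tesseract", prefix + ".png", "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pdf_to_pages(pdf_path, min_chars=20):
    """Use the text layer where there is one, OCR where a page looks empty."""
    pages = split_pages(extract_text(pdf_path))
    with tempfile.TemporaryDirectory() as workdir:
        for index, page in enumerate(pages):
            # min_chars is an arbitrary cutoff for "this page is probably a scan".
            if len(page.strip()) < min_chars:
                pages[index] = ocr_page(pdf_path, index + 1, workdir)
    return pages


if __name__ == "__main__" and len(sys.argv) > 2:
    found = find_names(pdf_to_pages(sys.argv[1]), sys.argv[2:])
    for name, page_numbers in found.items():
        print(f"{name}: pages {page_numbers}")
```

Saved as something like `search_pdf.py` (hypothetical filename), it would run as `python search_pdf.py some-filing.pdf "First Last" "Other Name"`. The search is a plain case-insensitive substring match, so OCR typos and alternate spellings will slip through; treat a miss as "not found by this script", not as proof the name isn't in there.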