rust-pdf-extractor - Turning PDF Documents into Plain Text
It's the last day of July 2024. Just where is the time going? It's midweek, so I'm concluding the month with my Rust program of the week.
I wrote this a couple of weeks ago, pretty much all from my own head with the barest minimum of ChatGPT suggestions. I wrote it for use at work, where I want to get PDF content into a more malleable format. There are some technical resources used in my line of work that I'd like to convert into web-based content, so I thought having something that could take a PDF and convert it to plain text would be useful.
This project leverages two community crates:
- clap (for CLI input)
- pdf_extract (for doing the heavy lifting of the conversion)
In addition to these crates, I lean on the standard library for file I/O.
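For anyone wanting to build this, the dependency section of Cargo.toml might look roughly like the fragment below. This is a sketch rather than the exact manifest used here: the version numbers are assumptions, the derive feature is what clap needs for `#[derive(Parser)]` to work, and the crate is published as pdf-extract even though it is imported as pdf_extract.

```toml
# Cargo.toml (dependency section only; version numbers are assumptions)
[dependencies]
clap = { version = "4", features = ["derive"] }
pdf-extract = "0.7"
```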
I may eventually convert this project into an API of some sort, but that will take me a while to figure out.
// src/main.rs
// dependencies
use clap::Parser;
use std::fs::File;
use std::io::{self, prelude::*};
use std::path::Path;
#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    #[arg(short, long)]
    input: String,
    #[arg(short, long)]
    output: String,
}
// function to extract the text content of the pdf bytes and return it as a String
fn extract_content(input: Vec<u8>) -> Result<String, Box<dyn std::error::Error>> {
    let content = pdf_extract::extract_text_from_mem(&input)?;
    Ok(content)
}
// function to read the input file contents
fn read_input(
    input_file: String,
    stdout: &mut dyn Write,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    writeln!(stdout, "Input file name: {}", &input_file)?;
    let content = std::fs::read(input_file)?;
    Ok(content)
}
// function to write the output file after extraction
fn write_output(
    output_file: String,
    output: String,
    stdout: &mut dyn Write,
) -> Result<(), Box<dyn std::error::Error>> {
    let path = Path::new(&output_file);
    let mut file = File::create(path)?;
    file.write_all(output.as_bytes())?;
    writeln!(stdout, "Output file name: {}", output_file)?;
    Ok(())
}
// main function
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse();
    let mut stdout = io::stdout();
    let pdf = read_input(args.input, &mut stdout)?;
    let text = extract_content(pdf)?;
    write_output(args.output, text, &mut stdout)?;
    writeln!(stdout, "Conversion from pdf to plain text completed successfully.")?;
    Ok(())
}
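One side effect of passing stdout around as `&mut dyn Write` is that the status messages can be pointed at any writer, including an in-memory buffer, which should make them easy to check in a test. Here is a minimal sketch of that idea using only the standard library. The `report` and `capture_report` helpers are hypothetical names made up for illustration and are not part of the program above.

```rust
use std::error::Error;
use std::io::Write;

// Hypothetical helper mirroring the status lines printed by read_input and
// write_output: it writes one line to whatever sink the caller supplies.
fn report(label: &str, file_name: &str, out: &mut dyn Write) -> Result<(), Box<dyn Error>> {
    writeln!(out, "{} file name: {}", label, file_name)?;
    Ok(())
}

// Hypothetical wrapper that captures the status line in memory instead of
// printing it, relying on the fact that Vec<u8> implements std::io::Write.
fn capture_report(label: &str, file_name: &str) -> Result<String, Box<dyn Error>> {
    let mut buffer: Vec<u8> = Vec::new();
    report(label, file_name, &mut buffer)?;
    Ok(String::from_utf8(buffer)?)
}
```

Calling `capture_report("Input", "manual.pdf")` should hand back the same line the program would otherwise print to the terminal, so a unit test can compare it against an expected string without touching real stdout.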
I haven't put this code into a GitHub repo yet, but I will eventually.