Reading Files in Python

I needed to read the contents of .docx, .pdf, and plain-text files in Python and pass them to the Azure OpenAI Service, so I put together this summary.

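import os
import sys
from docx import Document
from pypdf import PdfReader

# .docx: join the text of every paragraph with newlines
def read_docx(filepath):
    doc = Document(filepath)
    full_text = []
    for para in doc.paragraphs:
        full_text.append(para.text)
    return "\n".join(full_text)

# .pdf: concatenate the text extracted from each page
def read_pdf(filepath):
    reader = PdfReader(filepath)
    full_text = ""
    for p in reader.pages:
        full_text += p.extract_text()
    return full_text

# .txt, .md etc... (assumes the file is UTF-8 encoded)
def read_txt(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

def main(filepath):
    # Choose a reader from the file extension, ignoring case (.PDF, .Docx, ...)
    _, ext = os.path.splitext(filepath)
    ext = ext.lower()

    if ext == '.docx':
        t = read_docx(filepath)
    elif ext == '.pdf':
        t = read_pdf(filepath)
    else:
        t = read_txt(filepath)

    print(t)


if __name__ == "__main__":
    # Expect exactly one argument: the path of the file to read
    if len(sys.argv) != 2:
        sys.exit(f"usage: python {sys.argv[0]} <filepath>")
    filepath = sys.argv[1]
    main(filepath)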

https://gist.github.com/kenzo0107/456439de57b3640c053cf369ca42f358
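As a variation on the extension check in `main`, the branching could also be written as a lookup table. The sketch below is only one possible arrangement, not the code from the gist: `read_text_file`, `read_file`, and the `readers` parameter are names made up for illustration, and it uses only the standard library so that it runs without `python-docx` or `pypdf` installed.

```python
import os
import tempfile


def read_text_file(filepath, encoding="utf-8"):
    """Read a plain-text file (.txt, .md, ...) and return its contents."""
    with open(filepath, "r", encoding=encoding) as f:
        return f.read()


def read_file(filepath, readers=None, default=read_text_file):
    """Pick a reader by lower-cased extension; fall back to `default`."""
    readers = readers or {}
    _, ext = os.path.splitext(filepath)
    reader = readers.get(ext.lower(), default)
    return reader(filepath)


# Small self-check with a temporary Markdown file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "NOTE.MD")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("# title\nbody\n")

    # No reader registered for ".md": the plain-text default is used.
    sample_text = read_file(sample_path)

    # A hypothetical reader registered for ".md" takes priority over the default.
    sample_upper = read_file(
        sample_path,
        readers={".md": lambda p: read_text_file(p).upper()},
    )
```

With the functions from this post, the table would presumably look like `readers={".docx": read_docx, ".pdf": read_pdf}`, so supporting another format means adding one entry rather than another `elif`.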

I have also written before about reading a file line by line, parsing YAML, and so on, so here is that post for reference as well.
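The earlier post is not reproduced here, so the following is not its code. It is just a minimal sketch of what reading a file line by line can look like with the standard library, assuming a UTF-8 text file; `iter_lines` is a hypothetical helper name.

```python
import os
import tempfile


def iter_lines(filepath, encoding="utf-8"):
    """Yield each line of a text file without its trailing newline."""
    with open(filepath, "r", encoding=encoding) as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file never has to fit in memory at once.
        for line in f:
            yield line.rstrip("\n")


# Small self-check with a temporary file (the last line has no newline).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first\nsecond\n\nfourth")
    lines = list(iter_lines(path))
```

Compared with `file.read()`, this is mainly useful for large files or when each line is processed independently.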

That’s all.
I hope you find this helpful.

Author

Kenzo Tanaka

Posted on

2024-05-28

Licensed under