Amazon Invoice Automation (Private Repository)
Log into Amazon
Log into Amazon using the following base snippet. Store data in session which you can use for future queries. Check response content to determine if its a successful login.
#!/usr/bin/env python3
import requests
session = requests.Session()
data = {'email':'email_id@gmail.com', 'password':'password'}
header={'User-Agent' : 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data, headers=header)
return response.content
Scrape Order Details Page for Order Numbers
Go to page
After successful login, go to order details page
https://www.amazon.com/gp/css/order-history
.
By default it will load order details for last 6 months which is sufficient for monthly automated invoice preparation. If required more, you can change some query parameters easily to suit your needs.
Count number of orders
Scrape Invoice links
Next find out all anchor tags with Invoice
text inside them as well as a link that points to https://www.amazon.com/gp/css/summary/print.html/ref=oh_aui_pi_o00_?orderID=123-1234567-1234567
, where order number is given by 123-1234567-1234567
. In the link, base url (//www.amazon.com
) might be omitted; so look for /gp/css/summary/print.html/ref=oh_aui_pi_o00_
.
One sample link looks like: <a class="a-link-normal" href="/gp/css/summary/print.html/ref=oh_aui_pi_o00_?ie=UTF8&orderID=123-1234567-1234567">Invoice</a>
Look into multiple pages
Number of orders might be more than 10 and hence we need to deal with pagination issues.
This is the url when I click page 2: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_1_2?ie=UTF8&orderFilter=months-6&search=&startIndex=10
This is the url for page 3: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_2_3?ie=UTF8&orderFilter=months-6&search=&startIndex=20
Explicit click to page 1: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_3_1?ie=UTF8&orderFilter=months-6&search=&startIndex=0
Clicking on next
from page 1 to 2: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_1_2?ie=UTF8&orderFilter=months-6&search=&startIndex=10
Clicking on next
from page 2 to 3: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_2_3?ie=UTF8&orderFilter=months-6&search=&startIndex=20
Clicking on previous
from page 3 to 2: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_3_2?ie=UTF8&orderFilter=months-6&search=&startIndex=10
Clicking on previous
from page 2 to 1: https://www.amazon.com/gp/your-account/order-history/ref=oh_aui_pagination_2_1?ie=UTF8&orderFilter=months-6&search=&startIndex=0
Scrape invoices (order numbers and corresponding links) till you hit the total invoice number found on first order history
page all while discarding duplicate entries.
Go to order invoice page
Next, using the already stored requests.Session
go to the invoice urls one by one. Sample url: https://www.amazon.com/gp/css/summary/print.html/ref=oh_aui_pi_o00_?orderID=123-1234567-1234567
invoice_url = 'https://www.amazon.com/gp/css/summary/print.html/ref=oh_aui_pi_o00_?orderID=123-1234567-1234567'
response = session.post(invoice_url, data, headers=header)
return response.content
Print the Invoices
You already have the response content or, the html content returned by amazon for that particular order number. Simply use that with the help of pdfkit
to save file as a pdf with possibly the order number as part of filename.
import pdfkit
order_id = '123-1234567-1234567'
pdf_file = 'invoice_' + str(order_id) + '.pdf'
html_content = response.content
# pdfkit.from_url('http://amazon.com/...', pdf_file)
# pdfkit.from_file('test.html', pdf_file)
pdfkit.from_string(html_content, pdf_file)