Form scraping using regex and some XPath.
import requests, regex
from pprint import pprint
from lxml import html
from lxml.etree import XPath
URL = 'http://httpbin.org/forms/post'
resp = requests.get(URL)
respText = resp.text
resp.close()
print(resp.status_code)
respTree = html.fromstring(respText)
inputs = respTree.xpath("//input")
pprint(inputs)
[<InputElement 4657778 name='custname' type='text'>,
<InputElement 4657868 name='custtel' type='tel'>,
<InputElement 4657958 name='custemail' type='email'>,
<InputElement 46579a8 name='size' type='radio'>,
<InputElement 46579f8 name='size' type='radio'>,
<InputElement 4660278 name='size' type='radio'>,
<InputElement 46609f8 name='topping' type='checkbox'>,
<InputElement 4660a98 name='topping' type='checkbox'>,
<InputElement 4660ae8 name='topping' type='checkbox'>,
<InputElement 4660b38 name='topping' type='checkbox'>,
<InputElement 4660b88 name='delivery' type='time'>]
print(type(inputs))
print(type(inputs[0]))
<class 'list'>
<class 'lxml.html.InputElement'>
for x in inputs:
    print(x)
<InputElement 4657778 name='custname' type='text'>
<InputElement 4657868 name='custtel' type='tel'>
<InputElement 4657958 name='custemail' type='email'>
<InputElement 46579a8 name='size' type='radio'>
<InputElement 46579f8 name='size' type='radio'>
<InputElement 4660278 name='size' type='radio'>
<InputElement 46609f8 name='topping' type='checkbox'>
<InputElement 4660a98 name='topping' type='checkbox'>
<InputElement 4660ae8 name='topping' type='checkbox'>
<InputElement 4660b38 name='topping' type='checkbox'>
<InputElement 4660b88 name='delivery' type='time'>
You need to convert an element to a string before you can split it into another list…
firstA = inputs[0]
firstB = str(inputs[0])
print(type(firstA))
print(type(firstB))
<class 'lxml.html.InputElement'>
<class 'str'>
itemSplit = firstB.split()
itemSplit
['<InputElement', '4657778', "name='custname'", "type='text'>"]
Now you can get at the name and type.
Notice that I used a capital T for the variable: “type” is a Python built-in function (not a keyword, strictly speaking), and reusing the name would shadow it.
name = itemSplit[2]
Type = itemSplit[3]
print(name)
print(Type)
name='custname'
type='text'>
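If splitting the repr feels fragile, lxml elements can also hand over their attributes directly. Here is a minimal sketch, assuming a small made-up snippet in place of the live httpbin page (the snippet and the variable names are mine, only for illustration):

```python
from lxml import html

# A tiny stand-in for the httpbin form; the markup is illustrative only.
snippet = '<form><input name="custname"><input type=tel name="custtel"></form>'
tree = html.fromstring(snippet)
found = tree.xpath("//input")

for el in found:
    # .get() reads the attribute as written in the HTML (None if it is absent);
    # .type is a property of lxml's InputElement that falls back to 'text'.
    print(el.get("name"), el.get("type"), el.type)

names = [el.get("name") for el in found]
```

This also explains why the repr shows type='text' for custname even though the HTML never sets a type on that input.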
Or just regex it…
Note that findall does not return a match object; it returns a plain list of the matched strings.
c = regex.findall(r"(?<=name=').*?(?=')", firstB)
print(c)
print(type(c))
print(c[0])
['custname']
<class 'list'>
custname
t = regex.findall(r"(?<=type=').*?(?=')", firstB)
print(t[0])
Let’s put everything into a list with regex instead.
form = respTree.xpath("//form[@method='post']")
print(type(form))
print(type(form[0]))
print(str(form[0]))
<class 'list'>
<class 'lxml.html.FormElement'>
<Element form at 0x54d0c28>
Not what we expected.
Hmmm… Well, this is a pain! Let’s just try regex, and I will explain all the XPath stuff later. A hint, though: the “IO” package/module.
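For the curious, str() on a form element only gives you its repr. One possible way to get the markup back is lxml's tostring. This is a rough sketch on a made-up snippet (the sample values "Ada" and "ring twice" are assumptions of mine, not part of the httpbin form):

```python
from lxml import html

# Illustrative markup only; the real page has more fields.
snippet = (
    '<form method="post" action="/post">'
    '<input name="custname" value="Ada">'
    '<input type=radio name=size value="small" checked>'
    '<textarea name="comments">ring twice</textarea>'
    '</form>'
)
form = html.fromstring(snippet)

# tostring() serializes the element back to markup; str() only gives the repr.
markup = html.tostring(form, encoding="unicode")
print(markup)

# form_values() lists (name, value) pairs for the fields that would be submitted.
pairs = form.form_values()
print(pairs)
```

Once you have the markup as a string, the regex approach works on just the form instead of the whole page.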
allTypes = regex.findall(r"(?<=type=').*?(?=')", resp.text)
allTypes
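Oops! What happened? We got nothing back.
We closed the connection like good boys and girls, but that is not actually the problem: requests had already read the body, so resp.text still works after close().
Still, good thing we stuck it in a variable!! We will use respText from here on.
Do you see what else?
Look at the regex closely.
Here is the HTML so we can see what we are doing.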
('<!DOCTYPE html>\n'
'<html>\n'
' <head>\n'
' </head>\n'
' <body>\n'
' <!-- Example form from HTML5 spec '
"http://www.w3.org/TR/html5/forms.html#writing-a-form's-user-interface -->\n"
' <form method="post" action="/post">\n'
' <p><label>Customer name: <input name="custname"></label></p>\n'
' <p><label>Telephone: <input type=tel name="custtel"></label></p>\n'
' <p><label>E-mail address: <input type=email '
'name="custemail"></label></p>\n'
' <fieldset>\n'
' <legend> Pizza Size </legend>\n'
' <p><label> <input type=radio name=size value="small"> Small '
'</label></p>\n'
' <p><label> <input type=radio name=size value="medium"> Medium '
'</label></p>\n'
' <p><label> <input type=radio name=size value="large"> Large '
'</label></p>\n'
' </fieldset>\n'
' <fieldset>\n'
' <legend> Pizza Toppings </legend>\n'
' <p><label> <input type=checkbox name="topping" value="bacon"> Bacon '
'</label></p>\n'
' <p><label> <input type=checkbox name="topping" value="cheese"> Extra '
'Cheese </label></p>\n'
' <p><label> <input type=checkbox name="topping" value="onion"> Onion '
'</label></p>\n'
' <p><label> <input type=checkbox name="topping" value="mushroom"> '
'Mushroom </label></p>\n'
' </fieldset>\n'
' <p><label>Preferred delivery time: <input type=time min="11:00" '
'max="21:00" step="900" name="delivery"></label></p>\n'
' <p><label>Delivery instructions: <textarea '
'name="comments"></textarea></label></p>\n'
' <p><button>Submit order</button></p>\n'
' </form>\n'
' </body>\n'
'</html>')
Notice the quotes?
The HTML uses double quotes around attribute values, so I switched the quotes in the pattern. Now we can use the regex!
allNames = regex.findall(r'(?<=name=").*?(?=")', respText)
allNames
['custname',
'custtel',
'custemail',
'topping',
'topping',
'topping',
'topping',
'delivery',
'comments']
allValues = regex.findall(r'(?<=value=").*?(?=")', respText)
allValues
['small', 'medium', 'large', 'bacon', 'cheese', 'onion', 'mushroom']
allTypes = regex.findall(r'(?<=type=).*?(?=\s)', respText)
allTypes
['tel',
'email',
'radio',
'radio',
'radio',
'checkbox',
'checkbox',
'checkbox',
'checkbox',
'time']
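The three patterns above each assume one quoting style, which is why name=size never shows up in allNames. A sketch of a single pattern that accepts double quotes, single quotes, or no quotes at all, using the standard “re” module and a made-up sample string (the helper name attr_values is hypothetical):

```python
import re

# Made-up sample mixing unquoted, double-quoted and single-quoted attributes.
sample = (
    '<input type=radio name=size value="small">'
    '<input type=checkbox name="topping" value="bacon">'
    "<input type='time' name='delivery'>"
)

def attr_values(attr, text):
    """Return every value of `attr`, whether or not it is quoted."""
    # Optional opening quote, then capture up to a quote, whitespace or '>'.
    pattern = r'''\b''' + re.escape(attr) + r'''=["']?([^"'\s>]+)'''
    return re.findall(pattern, text)

names = attr_values("name", sample)
types = attr_values("type", sample)
print(names)
print(types)
```

It is still a regex on HTML, so it will miss edge cases such as empty values or values containing spaces.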
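This is not looking good, my lists are uneven 😦 (the size inputs have unquoted names, the first input and the textarea have no type, and only the radio buttons and checkboxes have values).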
print('Names ' + str(len(allNames)))
print('Types ' + str(len(allTypes)))
print('Values ' + str(len(allValues)))
Names 9
Types 10
Values 7
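One way around uneven lists is to stop collecting each attribute separately and build one record per input tag instead. A minimal sketch with the standard “re” module on a shortened, made-up sample (TAG, ATTR and records are names I picked for illustration):

```python
import re

# Three inputs in the same styles the httpbin form uses.
sample = (
    '<input name="custname">'
    '<input type=tel name="custtel">'
    '<input type=radio name=size value="small">'
)

TAG = re.compile(r"<input\b[^>]*>")
# key="double quoted" | key='single quoted' | key=bare
ATTR = re.compile(r'''([\w-]+)=(?:"([^"]*)"|'([^']*)'|([^\s>]+))''')

records = []
for tag in TAG.findall(sample):
    attrs = {}
    for key, dq, sq, bare in ATTR.findall(tag):
        # Only one of the three capture groups is non-empty per match.
        attrs[key] = dq or sq or bare
    records.append({
        "name": attrs.get("name"),
        "type": attrs.get("type", "text"),  # HTML treats a missing type as text
        "value": attrs.get("value"),        # None when the tag has no value
    })

for rec in records:
    print(rec)
```

Every input now yields exactly one row, so name, type and value stay lined up even when an attribute is missing.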
Notice I converted integers into strings there?
The “len” function returns an int, so I wrapped it in str() before joining it to a string with +.
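If the str() calls feel noisy, an f-string does the same conversion for you. A tiny sketch with stand-in data (the two-item list is just for illustration):

```python
allNames = ["custname", "custtel"]  # stand-in data, not the real result

# An f-string formats the int for you, no str() needed.
line = f"Names {len(allNames)}"
print(line)
```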
allLabels = regex.findall(r'(?<=<label>).*?(?=</label>)', respText)
allLabels
['Customer name: <input name="custname">',
'Telephone: <input type=tel name="custtel">',
'E-mail address: <input type=email name="custemail">',
' <input type=radio name=size value="small"> Small ',
' <input type=radio name=size value="medium"> Medium ',
' <input type=radio name=size value="large"> Large ',
' <input type=checkbox name="topping" value="bacon"> Bacon ',
' <input type=checkbox name="topping" value="cheese"> Extra Cheese ',
' <input type=checkbox name="topping" value="onion"> Onion ',
' <input type=checkbox name="topping" value="mushroom"> Mushroom ',
'Preferred delivery time: <input type=time min="11:00" max="21:00" step="900" name="delivery">',
'Delivery instructions: <textarea name="comments"></textarea>']
So what should I use?
The great thing is that it is totally up to you and your needs.
Now you know several ways, and yes, there are several more.
This regex syntax works in the “re” package too.
I used the third-party “regex” package, which is designed as a backwards-compatible alternative to the standard “re” module with extra features.
Just “pip install regex” to get it.
As for XPath, I will be doing a separate tutorial on it, as it is more complex.
What to do now?
The obvious utility is to just look at the form and write the POST code manually. Otherwise, think outside the box. 😉
Think about how you can automate this for most pages…
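As a starting point for the manual route, here is a rough sketch of a payload built from the field names we scraped. The values are made up, and the request is only prepared, not sent, so you can look at the encoded body first:

```python
import requests

# Field names come from the form; the values are invented sample data.
payload = {
    "custname": "Ada",
    "size": "medium",
    "topping": ["bacon", "cheese"],  # a list repeats the field name in the body
}

# The form's action is "/post", so the target is http://httpbin.org/post.
# Build the request without sending it, so the encoded body can be inspected.
prepared = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])

# To actually submit it, something like requests.post(url, data=payload) would do.
```

Swap the hand-written payload for one generated from the scraped names and you are most of the way to automating it.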