Scraping a Form for Input Fields Python 3.4

xpath, python, regex, ipython notebook

Form scrape using Regex and some Xpath.

Twitter @CodeDocta

 

import requests, regex
from pprint import pprint
from lxml import html
from lxml.etree import XPath


URL = 'http://httpbin.org/forms/post'
resp = requests.get(URL)
respText = resp.text
resp.close()
print(resp.status_code)


200


respTree = html.fromstring(respText)
inputs = respTree.xpath("//input")
pprint(inputs)


[<InputElement 4657778 name='custname' type='text'>,
 <InputElement 4657868 name='custtel' type='tel'>,
 <InputElement 4657958 name='custemail' type='email'>,
 <InputElement 46579a8 name='size' type='radio'>,
 <InputElement 46579f8 name='size' type='radio'>,
 <InputElement 4660278 name='size' type='radio'>,
 <InputElement 46609f8 name='topping' type='checkbox'>,
 <InputElement 4660a98 name='topping' type='checkbox'>,
 <InputElement 4660ae8 name='topping' type='checkbox'>,
 <InputElement 4660b38 name='topping' type='checkbox'>,
 <InputElement 4660b88 name='delivery' type='time'>]


print(type(inputs))
print(type(inputs[0]))


<class 'list'>
<class 'lxml.html.InputElement'>


for x in inputs:
    print(x)
   


<InputElement 4657778 name='custname' type='text'>
<InputElement 4657868 name='custtel' type='tel'>
<InputElement 4657958 name='custemail' type='email'>
<InputElement 46579a8 name='size' type='radio'>
<InputElement 46579f8 name='size' type='radio'>
<InputElement 4660278 name='size' type='radio'>
<InputElement 46609f8 name='topping' type='checkbox'>
<InputElement 4660a98 name='topping' type='checkbox'>
<InputElement 4660ae8 name='topping' type='checkbox'>
<InputElement 4660b38 name='topping' type='checkbox'>
<InputElement 4660b88 name='delivery' type='time'>

You need to convert an element to a string before you can split it into another list…

 

firstA = inputs[0]
firstB = str(inputs[0])
print(type(firstA))
print(type(firstB))


<class 'lxml.html.InputElement'>
<class 'str'>


itemSplit = firstB.split()
itemSplit


['<InputElement', '4657778', "name='custname'", "type='text'>"]

Now you can get at the name and type.

Notice… I did not name the variable lowercase “type”, because “type” is a Python built-in function (not a keyword) and I don’t want to shadow it.

 

name = itemSplit[2]
Type = itemSplit[3]

print(name)
print(Type)


name='custname'
type='text'>
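Splitting the printed form of an element works, but it depends on how lxml happens to print it. As a hedged alternative, here is a minimal sketch that reads the attributes straight off each element with `.get()`. The `SAMPLE` markup is a made-up stand-in for the httpbin form so the sketch runs offline.

```python
from lxml import html

# Assumption: a tiny stand-in for the httpbin form, so no network is needed.
SAMPLE = ('<html><body><form method="post">'
          '<input name="custname">'
          '<input type=tel name="custtel">'
          '</form></body></html>')

tree = html.fromstring(SAMPLE)

fields = []
for el in tree.xpath("//input"):
    # .get() reads the raw attribute and returns None when it is absent.
    fields.append((el.get("name"), el.get("type")))

print(fields)
```

Note that the first input has no type attribute in the HTML, so `.get("type")` gives None for it, even though the printed element shows type='text'.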

Or just regex it…

As you can see, findall returns a list of the matching strings.

 

c = regex.findall(r"(?<=name=').*?(?=')", firstB)
print(c)

print(type(c))
print(c[0])


['custname']
<class 'list'>
custname


t = regex.findall(r"(?<=type=').*?(?=')", firstB)
print(t[0])


text

Now you can loop through the inputs list, convert each element to a string, and add it to another list, or…

just Xpath the //form and regex what you need.

Let’s put everything into a list with regex instead.

But first I will show you the form real quick…

 

form = respTree.xpath("//form[@method='post']")
print(type(form))
print(type(form[0]))
print(str(form[0]))


<class 'list'>
<class 'lxml.html.FormElement'>
<Element form at 0x54d0c28>

Not what we expected: str() only gives us the element’s label, not its HTML.

Hmmm… Well, this is a pain!! Let’s just try regex, and I will explain all the Xpath stuff later… I’ll give you a hint though: the “IO” package/module.
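One way to see the form's markup in the meantime (not necessarily the approach hinted at above) is `lxml.html.tostring`, which serializes an element back to HTML. This is a minimal sketch; the `SAMPLE` string is an assumed stand-in for the real page.

```python
from lxml import html

# Assumption: a shortened stand-in for the httpbin page.
SAMPLE = ('<html><body>'
          '<form method="post" action="/post"><input name="custname"></form>'
          '</body></html>')

tree = html.fromstring(SAMPLE)
form = tree.xpath("//form[@method='post']")[0]

# encoding="unicode" asks for a str instead of bytes.
markup = html.tostring(form, encoding="unicode")
print(markup)
```

The resulting string can then be handed to a regex in the same way as the raw response text.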

 

allTypes = regex.findall(r"(?<=type=').*?(?=')", resp.text)
allTypes


[]

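Oops! What happened?

We closed the connection like good boys and girls, but that is not the cause: requests had already read the body, so resp.text still holds the same HTML that we stuck in respText.

The real problem is in the regex.

Do you see it?

Look at the regex closely: it expects single quotes around the attribute value.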

Here is the HTML so we can see what we are doing.

 

pprint(respText)


('<!DOCTYPE html>\n'
 '<html>\n'
 '  <head>\n'
 '  </head>\n'
 '  <body>\n'
 '  <!-- Example form from HTML5 spec '
 "http://www.w3.org/TR/html5/forms.html#writing-a-form's-user-interface -->\n"
 '  <form method="post" action="/post">\n'
 '   <p><label>Customer name: <input name="custname"></label></p>\n'
 '   <p><label>Telephone: <input type=tel name="custtel"></label></p>\n'
 '   <p><label>E-mail address: <input type=email '
 'name="custemail"></label></p>\n'
 '   <fieldset>\n'
 '    <legend> Pizza Size </legend>\n'
 '    <p><label> <input type=radio name=size value="small"> Small '
 '</label></p>\n'
 '    <p><label> <input type=radio name=size value="medium"> Medium '
 '</label></p>\n'
 '    <p><label> <input type=radio name=size value="large"> Large '
 '</label></p>\n'
 '   </fieldset>\n'
 '   <fieldset>\n'
 '    <legend> Pizza Toppings </legend>\n'
 '    <p><label> <input type=checkbox name="topping" value="bacon"> Bacon '
 '</label></p>\n'
 '    <p><label> <input type=checkbox name="topping" value="cheese"> Extra '
 'Cheese </label></p>\n'
 '    <p><label> <input type=checkbox name="topping" value="onion"> Onion '
 '</label></p>\n'
 '    <p><label> <input type=checkbox name="topping" value="mushroom"> '
 'Mushroom </label></p>\n'
 '   </fieldset>\n'
 '   <p><label>Preferred delivery time: <input type=time min="11:00" '
 'max="21:00" step="900" name="delivery"></label></p>\n'
 '   <p><label>Delivery instructions: <textarea '
 'name="comments"></textarea></label></p>\n'
 '   <p><button>Submit order</button></p>\n'
 '  </form>\n'
 '  </body>\n'
 '</html>')

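Notice the quotes?

The raw HTML uses double quotes (and sometimes no quotes at all), while the printed elements used single quotes. I switched the quotes in the pattern, and now we can use the regex!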

 

allNames = regex.findall(r'(?<=name=").*?(?=")', respText)
allNames


['custname',
 'custtel',
 'custemail',
 'topping',
 'topping',
 'topping',
 'topping',
 'delivery',
 'comments']


allValues = regex.findall(r'(?<=value=").*?(?=")', respText)
allValues


['small', 'medium', 'large', 'bacon', 'cheese', 'onion', 'mushroom']


allTypes = regex.findall(r'(?<=type=).*?(?=\s)', respText)
allTypes


['tel',
 'email',
 'radio',
 'radio',
 'radio',
 'checkbox',
 'checkbox',
 'checkbox',
 'checkbox',
 'time']
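Each of the three patterns above assumes one quoting style. As a hedged sketch, a single pattern can accept double quotes, single quotes, or no quotes. The helper name `attr_values` and the `sample` string are made up for illustration, and it uses the standard “re” module, since the syntax is the same.

```python
import re

def attr_values(attr, text):
    """Return every value of `attr`, whether written "v", 'v', or v."""
    pattern = (r'\b' + re.escape(attr) +
               r'''=(?:"([^"]*)"|'([^']*)'|([^\s>]+))''')
    # findall returns one tuple per match; only one group is non-empty.
    return [a or b or c for a, b, c in re.findall(pattern, text)]

# Assumption: a small sample that mixes all three quoting styles.
sample = ('<input type=tel name="custtel">'
          "<input type='radio' name=size value=\"small\">")

print(attr_values("type", sample))
print(attr_values("name", sample))
```

This is still regex on HTML, so it can be fooled by attribute-like text inside comments or scripts.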

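This is not looking good, my lists are uneven 😦

The names pattern skips the three radio buttons (name=size has no quotes), the first input has no type attribute at all, and only the radio buttons and checkboxes carry a value.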

 

print('Names ' + str(len(allNames)))
print('Types ' + str(len(allTypes)))
print('Values ' + str(len(allValues)))


Names 9
Types 10
Values 7
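The lists come out uneven because each pattern scans the whole page on its own. A hedged alternative is to collect one record per input, so the name, type, and value stay together. This sketch uses the standard library's `html.parser`; the class name `InputCollector`, the `sample` markup, and the choice to default a missing type to "text" are my assumptions.

```python
from html.parser import HTMLParser

class InputCollector(HTMLParser):
    """Collects one dict per <input> tag so its attributes stay paired."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            a = dict(attrs)
            self.records.append({
                "name": a.get("name"),
                # Assumption: treat a missing type as "text".
                "type": a.get("type", "text"),
                "value": a.get("value"),
            })

# Assumption: three inputs copied from the form, enough to show the idea.
sample = ('<input name="custname">'
          '<input type=radio name=size value="small">'
          '<input type=checkbox name="topping" value="bacon">')

collector = InputCollector()
collector.feed(sample)
for record in collector.records:
    print(record)
```

With one record per input there is nothing to line up afterwards, whatever the quoting style.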

Notice I converted integers into strings there?

The “len” function returns an int, so I wrapped it in str() before joining it to the label with +.
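For comparison, here is a small sketch of three ways to print a label with a count. The `names` list is made up for the example.

```python
names = ['custname', 'custtel']  # assumption: any list will do

joined = 'Names ' + str(len(names))          # explicit conversion
formatted = 'Names {}'.format(len(names))    # format() converts for you

print(joined)
print(formatted)
print('Names', len(names))  # print() accepts mixed types directly
```

All three print the same line; which one to use is mostly a matter of taste.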

 

allLabels = regex.findall(r'(?<=<label>).*?(?=</label>)', respText)
allLabels


['Customer name: <input name="custname">',
 'Telephone: <input type=tel name="custtel">',
 'E-mail address: <input type=email name="custemail">',
 ' <input type=radio name=size value="small"> Small ',
 ' <input type=radio name=size value="medium"> Medium ',
 ' <input type=radio name=size value="large"> Large ',
 ' <input type=checkbox name="topping" value="bacon"> Bacon ',
 ' <input type=checkbox name="topping" value="cheese"> Extra Cheese ',
 ' <input type=checkbox name="topping" value="onion"> Onion ',
 ' <input type=checkbox name="topping" value="mushroom"> Mushroom ',
 'Preferred delivery time: <input type=time min="11:00" max="21:00" step="900" name="delivery">',
 'Delivery instructions: <textarea name="comments"></textarea>']
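The label entries still contain the input tags. If only the visible text is wanted, one hedged option is to strip anything that looks like a tag. The helper name `label_text` is made up, and the `labels` list is a short excerpt of the output above.

```python
import re

def label_text(fragment):
    """Drop anything that looks like a tag and trim the whitespace."""
    return re.sub(r'<[^>]+>', '', fragment).strip()

# Assumption: a few entries copied from the allLabels output.
labels = ['Customer name: <input name="custname">',
          ' <input type=radio name=size value="small"> Small ',
          'Delivery instructions: <textarea name="comments"></textarea>']

print([label_text(item) for item in labels])
```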

So what should I use?

The great thing is that it is totally up to you and your needs.

Now you know several ways and yes there are several more.

This regex syntax is good for the “re” package too.

I used the third-party “regex” package, which aims to be a backwards-compatible alternative to “re” with extra features.

Just “pip install regex” to get it.

As for the Xpath, I will be doing a separate tutorial for this as it is more complex.

What to do now?

The obvious use is to look at the fields and write the post code manually. Otherwise, think outside the box. 😉

Think about how you can automate this for most pages…
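As one possible starting point, here is a minimal sketch that turns a scraped list of field names into a payload skeleton. The helper `build_payload` and the answers are hypothetical; the names are a shortened copy of the lists above.

```python
from collections import OrderedDict

def build_payload(field_names, answers):
    """Map each unique field name to an answer, or '' when none is given."""
    payload = OrderedDict()
    for name in field_names:
        if name not in payload:  # 'topping' appears once per checkbox
            payload[name] = answers.get(name, "")
    return payload

# Assumption: names as scraped above, answers invented for the example.
all_names = ['custname', 'custtel', 'topping', 'topping', 'comments']
answers = {'custname': 'Ada', 'topping': ['bacon', 'onion']}

payload = build_payload(all_names, answers)
print(dict(payload))
# The payload could then be passed as data= to requests.post(...)
# aimed at the form's action URL.
```

A list value, like the toppings here, is sent by requests as a repeated field.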

 
