Usage
>>> from parsel import Selector
>>> html = b"""<form><input type="hidden" name="foo" value="bar" /></form>"""
>>> selector = Selector(body=html, base_url="https://example.com")
>>> form = selector.css("form")
You can use form2request() to generate form submission request data:
>>> from form2request import form2request
>>> request_data = form2request(form)
>>> request_data
Request(url='https://example.com?foo=bar', method='GET', headers=[], body=b'')
form2request() does not make requests, but you can use its output to build requests with any HTTP client software. It also provides conversion methods for common use cases, e.g. for the requests library:
>>> import requests
>>> request = request_data.to_requests()
>>> session = requests.Session()
>>> session.send(request)
<Response [200]>
form2request() supports user-defined form data, choosing a specific submit button (or none), and overriding form attributes.
Getting a form
form2request() requires an HTML form object. You can get one using parsel, as seen above, or you can use lxml:
>>> from lxml.html import fromstring
>>> root = fromstring(html, base_url="https://example.com")
>>> form = root.xpath("//form")[0]
If you use a library or framework based on parsel or lxml, chances are they also let you get a form object. For example, when using a Scrapy response:
>>> from scrapy.http import TextResponse
>>> response = TextResponse("https://example.com", body=html)
>>> form = response.css("form")
Here are some examples of XPath expressions that can be useful to get a form using parsel’s Selector.xpath or lxml’s HtmlElement.xpath:
- To find a form by one of its attributes, such as id or name, use //form[@<attribute>="<value>"]. For example, to find <form id="foo" …, use //form[@id="foo"].
  When using Selector.css, #<id> (e.g. #foo) finds by id, and [<attribute>="<value>"] (e.g. [name=foo] or [name="foo bar"]) finds by any other attribute.
- To find a form by index, by order of appearance in the HTML code, use (//form)[n], where n is a 1-based index. For example, to find the 2nd form, use (//form)[2].
If you prefer, you could use the XPath of an element inside the form, and then visit parent elements until you reach the form element. For example:
element = root.xpath('//input[@name="zip_code"]')[0]
while True:
    if element.tag == "form":
        break
    element = element.getparent()
form = element
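Note that the loop above assumes the element really is inside a form: in lxml, getparent() returns None at the root, so the next check of element.tag would raise AttributeError. A minimal sketch of a more defensive variant follows. The helper name find_enclosing_form is hypothetical (it is not part of form2request or lxml), and the SimpleNamespace objects are only stand-ins that mimic the two lxml members used (tag and getparent()), so that the sketch runs without lxml installed:

```python
from types import SimpleNamespace


def find_enclosing_form(element):
    """Return the nearest <form> ancestor of element (or element itself).

    Returns None if no enclosing form exists.
    """
    while element is not None and element.tag != "form":
        element = element.getparent()
    return element


# Stand-ins for lxml elements; with lxml you would pass a real element,
# e.g. the first result of root.xpath(...).
form_node = SimpleNamespace(tag="form", getparent=lambda: None)
input_node = SimpleNamespace(tag="input", getparent=lambda: form_node)
orphan_node = SimpleNamespace(tag="input", getparent=lambda: None)

found = find_enclosing_form(input_node)     # the enclosing form stand-in
missing = find_enclosing_form(orphan_node)  # None: no form around it
```

Returning None instead of raising lets the caller decide how to handle a field that is not inside any form.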
For some use cases, you can use Formasaurus, an ML-based solution that can automatically find a form of a specified type (e.g. a search form), its default key-value pairs, and its submit button. Its Usage documentation includes an example featuring form2request.
Setting form data
While there are forms made entirely of hidden fields, like the one above, most often you will work with forms that expect user-defined data:
>>> html = b"""<form><input type="text" name="foo" /></form>"""
>>> selector = Selector(body=html, base_url="https://example.com")
>>> form = selector.css("form")
Use the data parameter of form2request() to define the corresponding data:
>>> form2request(form, {"foo": "bar"})
Request(url='https://example.com?foo=bar', method='GET', headers=[], body=b'')
You may sometimes find forms where more than one field has the same name attribute:
>>> html = b"""<form><input type="text" name="foo" /><input type="text" name="foo" /></form>"""
>>> selector = Selector(body=html, base_url="https://example.com")
>>> form = selector.css("form")
To specify values for all same-name fields, instead of a dictionary, use an iterable of key-value tuples:
>>> form2request(form, (("foo", "bar"), ("foo", "baz")))
Request(url='https://example.com?foo=bar&foo=baz', method='GET', headers=[], body=b'')
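A dictionary is not enough here because a Python dictionary keeps only one value per key. The sketch below uses the standard library's urllib.parse.urlencode, not form2request itself, purely to illustrate the difference. The variable names are illustrative:

```python
from urllib.parse import urlencode

# Key-value tuples keep both values for the repeated "foo" key.
pairs = (("foo", "bar"), ("foo", "baz"))
query_from_pairs = urlencode(pairs)

# Converting to a dictionary silently keeps only the last value per key.
query_from_dict = urlencode(dict(pairs))

print(query_from_pairs)  # foo=bar&foo=baz
print(query_from_dict)   # foo=baz
```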
Sometimes, you might want to prevent the value of a field from being included in the generated request data, for example because the field is removed or disabled through JavaScript, or because the field or a parent element has the disabled attribute (currently not supported by form2request):
>>> html = b"""<form><input name="foo" value="bar" disabled /></form>"""
>>> selector = Selector(body=html, base_url="https://example.com")
>>> form = selector.css("form")
To remove a field value, set it to None:
>>> form2request(form, {"foo": None})
Request(url='https://example.com', method='GET', headers=[], body=b'')
Overriding form attributes
You can override the method and enctype attributes of a form:
>>> form2request(form, method="POST", enctype="text/plain")
Request(url='https://example.com', method='POST', headers=[('Content-Type', 'text/plain')], body=b'foo=bar')
Using request data
The output of form2request(), Request, is a simple request data container:
>>> request_data = form2request(form)
>>> request_data
Request(url='https://example.com?foo=bar', method='GET', headers=[], body=b'')
While form2request() does not make requests, you can use its output request data to build an actual request with any HTTP client software.
Request also provides conversion methods for common use cases:
- to_scrapy(), for Scrapy 1.1.0+:
  >>> request_data.to_scrapy(callback=self.parse)
  <GET https://example.com?foo=bar>
- to_requests(), for requests 1.0.0+ (see an example above).
- to_poet(), for web-poet 0.2.0+:
  >>> request_data.to_poet()
  HttpRequest(url=RequestUrl('https://example.com?foo=bar'), method='GET', headers=<HttpRequestHeaders()>, body=b'')
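For an HTTP client without a dedicated conversion method, the request data can be mapped by hand. The following is a minimal sketch for the standard library's urllib. It assumes that the request data exposes url, method, headers (a list of name-value tuples) and body as attributes, matching the repr shown above. The helper name to_urllib is hypothetical, and a SimpleNamespace stands in for the real Request object so that the sketch runs on its own:

```python
from types import SimpleNamespace
from urllib.request import Request as UrllibRequest


def to_urllib(request_data):
    """Build a urllib.request.Request from form2request-style request data."""
    return UrllibRequest(
        request_data.url,
        # urllib expects data=None, not b"", when there is no body.
        data=request_data.body or None,
        # dict() keeps one value per header name, which is enough for
        # the single Content-Type header shown in this document.
        headers=dict(request_data.headers),
        method=request_data.method,
    )


# Stand-in with the same fields as the request data shown above (assumption).
request_data = SimpleNamespace(
    url="https://example.com?foo=bar", method="GET", headers=[], body=b""
)
urllib_request = to_urllib(request_data)
print(urllib_request.get_method(), urllib_request.full_url)
```

The resulting object could then be passed to urllib.request.urlopen() to send the request.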