I have a string that I need to split at two separate points, but all I can find is how to split a string on delimiters like "," and other punctuation.
import re
string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
hyperlink = re.split(r'(?=https)',string)
print(hyperlink[0])
In the example above, I need to extract just the URL in the string, "https://google", and then print it. However, I can only figure out how to split the string at "https", so everything after the URL comes with it.
I hope this makes sense. After a bunch of searching and testing I can't figure out how to do this.
There are many ways this can be achieved, but a simple one is using find() and then slicing.
find() returns the starting position of a substring within a string. Using that position, you can then slice the string there.
For example:
string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
# Find where the URL starts
start_word = "https"
start_index = string.find(start_word)
# For URLs, we need to find where it ends - usually at a quote mark
end_index = string.find('"', start_index)
# Extract just the URL
result = string[start_index:end_index]
print(result)
Output:
https://google
The find() method returns the index where the substring begins. Then, using these positions, we slice the string to extract just the section we want.
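One thing to watch for: find() returns -1 when the substring is absent, and slicing with -1 silently produces a wrong (often empty) result instead of raising an error. A minimal sketch of a guarded version follows; the helper name extract_from and its parameters are made up for illustration and are not part of the original answer.

```python
def extract_from(text, start_word, end_char='"'):
    """Return the text from start_word up to (not including) end_char.

    Returns None if either marker is missing (hypothetical helper).
    """
    start = text.find(start_word)
    if start == -1:
        return None  # start marker not present
    end = text.find(end_char, start)
    if end == -1:
        return None  # no terminating character after the start marker
    return text[start:end]


string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
print(extract_from(string, "https"))        # https://google
print(extract_from("no link here", "https"))  # None
```

Returning None on a miss is just one possible design; raising ValueError (as str.index() does) would work equally well if a missing URL should be treated as an error.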
There are various regular expressions and functions from the re module that will achieve your objective.
Here's one:
import re
string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
m = re.findall(r'^.*href="(.*)"\s.*$', string)
print(*m)
Output:
https://google
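If the string may contain more than one link, an unanchored pattern with a negated character class is one option, since each match then stops at the first closing quote. This is only a sketch: the names HREF_PATTERN, find_hrefs and the sample string (including the https://example.org link) are assumptions made for illustration, and it only handles double-quoted href values.

```python
import re

# Capture everything between href=" and the next double quote.
HREF_PATTERN = re.compile(r'href="([^"]*)"')


def find_hrefs(text):
    """Return every double-quoted href value found in text (hypothetical helper)."""
    return HREF_PATTERN.findall(text)


sample = (
    '<a href="https://google" target="x">one</a> and '
    '<a href="https://example.org" class="y">two</a>'
)
links = find_hrefs(sample)
print(links)  # ['https://google', 'https://example.org']
```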
If you prefer not to use re then:
kw = 'href="'
start = string.find(kw) + len(kw)
end = string.find('"', start)  # search for the closing quote from start onwards
result = string[start:end]
print(result)
...will give the same output.
As someone suggested in a comment, you could also use modules for parsing XML or HTML, like lxml or BeautifulSoup, and sometimes that is the simpler method.
from bs4 import BeautifulSoup
html = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
soup = BeautifulSoup(html, 'html.parser')
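# find() returns the first <a> tag; attrs is a dict of its attributes
hyperlink = soup.find('a').attrs['href']
print(hyperlink)  # https://google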
#target = soup.find('a').attrs['target']
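If installing a third-party package is not an option, a similar result can probably be had with the standard library's html.parser module. The sketch below is not BeautifulSoup's API; the class name LinkCollector is invented for this example, and it simply records the href of every <a> start tag it sees.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


html = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)  # ['https://google']
```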
The parser expressions that pyparsing creates to match HTML tags avoid many of the classical issues with using tools like regex to parse HTML:
handles case insensitivity (of tags and tag attribute names)
handles quoted and unquoted attribute values
detects self-closing tags (opening tags that end with '/')
ignores embedded whitespace
In this case, we just need to search for an <a> tag and let pyparsing grab the tag attributes, exposing them as attributes on the parsed result:
string = """<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>"""
import pyparsing as pp
# make_html_tags returns a pair of parser expressions, one for the opening tag
# and one for the matching closing tag - we just need the opening tag
a_tag, _ = pp.make_html_tags("a")
# search_string will return a sequence of all matches, like re.findall
anchor = a_tag.search_string(string)[0]
print(anchor.href)
# https://google
print(anchor.target)
# something