Python: using split() to split a string at 2 separate points - Stack Overflow

admin2025-04-20  0

I have a string that I need to split at two separate points, but all I can find is how to split a string on delimiters such as "," and other punctuation.

import re

string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

hyperlink = re.split(r'(?=https)',string)

print(hyperlink[0])

In the example above, I need to extract just the URL in the string ("https://google") and print it. However, I have only found how to split the string at "https", so everything after the URL comes along with it.

I hope this makes sense. After a lot of searching and testing, I still can't figure out how to do this.

Asked Mar 3 at 15:59 and edited Mar 3 at 16:11 by Justin Bertsch
  • Did you consider using an HTML parser to parse HTML data? docs.python.org/3/library/html.parser.html exists. – KamilCuk Commented Mar 3 at 16:15
  • Wait, you edited your question and it became a chameleon; I did not notice. Please restore the question to its state before the edit, since the answer below is already accepted, and ask a separate question for your new problem. See meta.stackoverflow.com/questions/266767/… – KamilCuk Commented Mar 3 at 16:22
4 Answers

There are many ways to achieve this, but a simple one is to use find() and then slice. find() returns the starting position of a substring within a string, and you can use that position as a slice boundary. For example:

string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

# Find where the URL starts
start_word = "https"
start_index = string.find(start_word)

# For URLs, we need to find where it ends - usually at a quote mark
end_index = string.find('"', start_index)

# Extract just the URL
result = string[start_index:end_index]

print(result)

Output:

https://google

The find() method returns the index where the substring begins. Then, using these positions, we slice the string to extract just the section we want.
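
One caveat: find() returns -1 when the substring is missing, and slicing with -1 quietly produces the wrong text instead of raising an error. Below is a minimal sketch of a guarded version. The helper name extract_between is hypothetical and not part of the answer above.

```python
def extract_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        # Start marker is absent, so there is nothing to extract
        return None
    # Search for the end marker only after the start position
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

print(extract_between(string, "https", '"'))          # https://google
print(extract_between("no link here", "https", '"'))  # None
```

Returning None makes the "not found" case explicit, so the caller can decide what to do instead of printing a mangled slice.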

There are various regular expressions and functions from the re module that will achieve your objective.

Here's one:

import re

string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

m = re.findall(r'^.*href="(.*)"\s.*$', string)

print(*m)

Output:

https://google
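
Note that the pattern above relies on greedy matching and on whitespace following the closing quote, so it can capture too much when several quoted attributes are each followed by a space. A slightly tighter sketch, assuming the attribute value is always wrapped in double quotes, captures everything up to the next quote:

```python
import re

string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

# [^"]* matches any run of characters that are not a double quote,
# so the capture stops at the quote that closes the href value
match = re.search(r'href="([^"]*)"', string)

# re.search returns None when the pattern does not occur at all
url = match.group(1) if match else None
print(url)  # https://google
```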

If you prefer not to use re then:

kw = 'href="'
start = string.find(kw) + len(kw)
end = string[start:].find('"')
result = string[start : end + start]
print(result)

...will give the same output.
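
Since the question asks about splitting at two points, str.partition() is another option: it splits once at the first occurrence of a separator and returns a three-part tuple. A rough sketch, again assuming the URL is enclosed in double quotes right after href=:

```python
string = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

# First split: keep everything after the opening 'href="'
_, found, rest = string.partition('href="')

# Second split: keep everything before the closing quote
url, _, _ = rest.partition('"')

# partition() returns an empty separator when nothing matched
result = url if found else None
print(result)  # https://google
```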

As someone suggested in a comment, you could also use a module for parsing XML or HTML, such as lxml or BeautifulSoup. Sometimes that is the simpler method.

from bs4 import BeautifulSoup

html = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

soup = BeautifulSoup(html, 'html.parser')

hyperlink = soup.find('a').attrs['href']

print(hyperlink)

#target = soup.find('a').attrs['target']
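
If installing a third-party package is not an option, the standard-library html.parser module mentioned in the comments can do a similar job. The following is only a sketch: the class name HrefCollector is made up for illustration, and it simply records the href of every <a> tag it sees.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


html = '<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>'

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)  # ['https://google']
```

Collecting into a list means the same code also works when the text contains more than one link.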

The parser expressions that pyparsing creates to match HTML tags avoid many of the classic problems of parsing HTML with tools like regex:

  • handles case insensitivity (of tags and tag attribute names)

  • handles quoted and unquoted attribute values

  • detects self-closing tags (opening tags that end with '/')

  • ignores embedded whitespace

In this case, we just need to search for an <a> tag and let pyparsing expose the tag's attributes as attributes of the parsed result:

string = """<p>The brown dog jumped over the... <a href="https://google" target="something">... but then splashed in the water<p>"""

import pyparsing as pp

# make_html_tags returns a pair of parser expressions, one for the opening tag 
# and one for the matching closing tag - we just need the opening tag
a_tag, _ = pp.make_html_tags("a")

# search_string will return a sequence of all matches, like re.findall
anchor = a_tag.search_string(string)[0]
print(anchor.href)
# https://google

print(anchor.target)
# something