dataframe - Remove duplicated values inside a cell in Python, across multiple rows - Stack Overflow


Hi, I'm cleaning a bunch of emails and need help removing data from each cell. My data frame looks like this.

I want to remove the duplicated values in each cell, including duplicates that appear across multiple rows.

My data

left   center
One    email1emailcom email2emailcom email3emailcom
Two    email1emailcom email15emailcom email12emailcom
Three  email15emailcom email16emailcom


Output

left   center
One    email1emailcom email2emailcom email3emailcom
Two    email15emailcom email12emailcom
Three  email16emailcom
  • Looks like what... – sudden_appearance Commented Mar 3 at 19:38
  • I added pictures because Stack Overflow thinks my table is code, but Stack Overflow is not showing the picture. – Vokult Commented Mar 3 at 19:40
  • Sorry, I'm new to Python. How do I do this? Can you give me an example? – Vokult Commented Mar 3 at 19:56

1 Answer


If you want to keep a running record of the emails you have already seen, you likely want to leverage some persistent set() construct: an actual global, a class variable, or a closure. I will use the latter here:

import pandas

df = pandas.DataFrame({
    "left": [
        "One",
        "Two",
        "Three"
    ],
    "center": [
        "email1emailcom email2emailcom email3emailcom",
        "email1emailcom email15emailcom email12emailcom",
        "email15emailcom email16emailcom"
    ]
})

## ---------------------
## Construct a closure that will use a set() to keep track of distinct emails
## and return work function that will filter out duplicates using the set
## from the emails string passed to it
## ---------------------
def make_email_filter():

    distinct_emails = set()  ## keep track of distinct emails

    ## ---------------------
    ## Work function that will filter out duplicates including in-row
    ## and cross-row duplicates
    ## ---------------------
    def _work(emails_string):
        these_emails = []
        for email in emails_string.split():
            if email in distinct_emails:
                continue
            distinct_emails.add(email)
            these_emails.append(email)
        return " ".join(these_emails)
    ## ---------------------

    ## ---------------------
    ## Return our work function back to the caller
    ## ---------------------
    return _work
    ## ---------------------
## ---------------------

## ---------------------
## Update our column based on application of the work function
## ---------------------
df["center"] = df["center"].apply(make_email_filter())
## ---------------------

print(df)

That will give us a result like:

    left                                        center
0    One  email1emailcom email2emailcom email3emailcom
1    Two               email15emailcom email12emailcom
2  Three                               email16emailcom
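
As an aside, the class-variable option mentioned above would work the same way. Here is a minimal sketch of that alternative (the EmailFilter class name is just an illustration, not part of the original answer):

import pandas

df = pandas.DataFrame({
    "left": ["One", "Two", "Three"],
    "center": [
        "email1emailcom email2emailcom email3emailcom",
        "email1emailcom email15emailcom email12emailcom",
        "email15emailcom email16emailcom"
    ]
})

class EmailFilter:
    def __init__(self):
        self.distinct_emails = set()  ## shared state lives on the instance

    def __call__(self, emails_string):
        ## Same in-row and cross-row de-duplication, just with the set
        ## stored on the instance instead of captured in a closure
        these_emails = []
        for email in emails_string.split():
            if email in self.distinct_emails:
                continue
            self.distinct_emails.add(email)
            these_emails.append(email)
        return " ".join(these_emails)

df["center"] = df["center"].apply(EmailFilter())
print(df)

A fresh EmailFilter() instance starts with an empty set, just like calling make_email_filter() again.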

If you want to ignore both emails you have already seen AND emails from a bad domain, you can just expand the test. Here I will do two tests just to be explicit, but in practice I would combine them into one test (see the sketch after this second example).

import pandas

df = pandas.DataFrame({
    "left": [
        "One",
        "Two",
        "Three"
    ],
    "center": [
        "email1emailcom email2emailcom email3emailcom",
        "email1emailcom email15emailcom email12emailcom",
        "email15emailcom email16emailcom [email protected]"
    ]
})

## ---------------------
## Construct a closure that will use a set to keep track of distinct emails
## and return work function that will filter out duplicates
## from the emails string passed to it.
## Note the addition of an init parameter to allow for filtering out
## emails from a specific domain.
## ---------------------
def make_email_filter(discard_domain=""):

    distinct_emails = set()  ## keep track of distinct emails

    ## ---------------------
    ## Work function that will filter out duplicates including in-row
    ## and cross-row duplicates
    ## ---------------------
    def _work(emails_string):
        these_emails = []
        for email in emails_string.split():

            ## ---------------------
            ## Ignore this email if it is from a bad domain
            ## ---------------------
            if discard_domain and email.endswith(discard_domain):
                continue
            ## ---------------------

            ## ---------------------
            ## Ignore this email if we have already seen it
            ## ---------------------
            if email in distinct_emails:
                continue
            ## ---------------------

            distinct_emails.add(email)
            these_emails.append(email)
        return " ".join(these_emails)
    ## ---------------------

    ## ---------------------
    ## Return our work function back to the caller
    ## ---------------------
    return _work
    ## ---------------------
## ---------------------

## ---------------------
## Update our column based on application of the work function
## ---------------------
df["center"] = df["center"].apply(make_email_filter(discard_domain="@amazon"))
## ---------------------

print(df)

Again giving us:

    left                                        center
0    One  email1emailcom email2emailcom email3emailcom
1    Two               email15emailcom email12emailcom
2  Three                               email16emailcom
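
As noted above, the two checks can be collapsed into a single test. A minimal sketch of that combined version, written as a drop-in replacement for the longer make_email_filter above (and assuming the same df as in the second example), could look like:

def make_email_filter(discard_domain=""):

    distinct_emails = set()  ## cross-row memory of emails already kept

    def _work(emails_string):
        these_emails = []
        for email in emails_string.split():
            ## One combined test: from the discarded domain OR already seen
            if (discard_domain and email.endswith(discard_domain)) or email in distinct_emails:
                continue
            distinct_emails.add(email)
            these_emails.append(email)
        return " ".join(these_emails)

    return _work

df["center"] = df["center"].apply(make_email_filter(discard_domain="@amazon"))
print(df)

This produces the same output as the two-test version; the behaviour only differs in how the condition is written.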