Re: Python help (finding duplicates)

Author: Kevin Faulkner
Date: 2010-08-28 23:12 -000
To: Main PLUG discussion list
Subject: Re: Python help (finding duplicates)
On Saturday 28 August 2010 11:48:10 Joseph Sinclair wrote:
> OK,
> I've attached a complete program that works, if you want to just get it
> done, but I've also described what went wrong in your first attempt below.
>

I really appreciate what you have done, and I especially like the description
of what I did wrong. Using readlines() is a better approach, like you said:
less disk thrashing. I was using /usr/bin/python3, so print() is now a function.
My next step is to take the host list and identify where each IP is located
using pygeoip.
Thank you again. :)
-Kevin
> # the i value was just for debugging, so I dropped it
> primaryfile = open('/tmp/extract','r')
> # read the primary file into a list for speed and so you aren't reading
> # more than once
> primary_lines = primaryfile.readlines()
> # you didn't specify a mode for this, so it defaulted to read-only.  Be
> # explicit for clarity
> secondaryfile = open('/tmp/unload', 'r')
> # Open a separate file for output, otherwise you would have been writing
> # and reading the same file over and over again, which usually causes errors
> outputfile = open('/tmp/result-file', 'w')
> # read the second file into a list, then you can scan through it over and
> # over without hammering disk and re-reading a file you might have modified.
> secondary_lines = secondaryfile.readlines()
> # print is a statement, not a function.
> print 'opened files'
> # loop through the list, not the file
> for line in primary_lines:
>    pcompare = line
>    # print is a statement, use the formatting operator to print variable
>    # values
>    print 'primary line = %s' % (pcompare)
>    # loop through the list, not the file
>    for row in secondary_lines:
>      scompare = row
>      if pcompare == scompare:
>        # print as a statement, not a function
>        print 'secondary line = %s' % (scompare)
>        # you were writing random # characters in a file (most likely after
>        # the line read), this writes a comment to a new file, which is usually
>        # clearer.
>        # invert the test, and add the line to a set here then write out
>        # the set at the end to get an output of lines without duplication.
>        outputfile.write('#%s' % (scompare))
> print 'Done'
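
Since the reply above mentions running under /usr/bin/python3, here is a minimal sketch of how the quoted program might look there. It is not the attached program: the helper name mark_duplicates is hypothetical, it writes every secondary entry (with duplicates prefixed by '#') rather than only the duplicates, and stripping line endings before comparing is an assumption made so that a final line without a newline still matches. A set lookup stands in for the nested loop.

```python
def mark_duplicates(primary_lines, secondary_lines):
    """Return secondary entries, prefixing '#' to any that also appear in primary.

    Hypothetical helper for illustration; not part of the original program.
    """
    # Strip line endings so a last line lacking '\n' still compares equal.
    primary = {line.strip() for line in primary_lines}
    marked = []
    for row in secondary_lines:
        entry = row.strip()
        if not entry:
            continue  # ignore blank lines
        marked.append('#' + entry if entry in primary else entry)
    return marked


def main():
    # File paths are the ones used in the thread.
    with open('/tmp/extract', 'r') as primaryfile:
        primary_lines = primaryfile.readlines()
    with open('/tmp/unload', 'r') as secondaryfile:
        secondary_lines = secondaryfile.readlines()
    # Write to a separate output file, as in the quoted program.
    with open('/tmp/result-file', 'w') as outputfile:
        for entry in mark_duplicates(primary_lines, secondary_lines):
            outputfile.write(entry + '\n')
    print('Done')  # print() is a function in Python 3
```

Calling main() runs it against the real files; mark_duplicates() can be tried on plain lists first.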

>


> Kevin Faulkner wrote:
> > Sorry about the time issue.
> >
> > On Friday 27 August 2010 23:50:00 you wrote:
> >> I hope these are small files, the algorithm you wrote is not going to
> >> run well as file size gets large (over 10,000 entries). Have you checked
> >> the space/tab situation? Python uses indentation changes to indicate
> >> the end of a block, so inconsistent use of tabs and spaces freaks it
> >> out. Here are a couple questions:
> >
> > This is not a school project, so you won't be doing my homework or
> > anything :) The space/tab issue is okay, but the script does not even
> > get to the print(i). I even tried "for line in secondaryfile:" and the
> > for loop still wouldn't execute.
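
One way to rule out the space/tab question is to let Python 3 report it directly. The sketch below is only an illustration: the sample source string and the helper name check_indentation are made up for this example. It compiles a snippet whose block mixes a tab-indented line with a space-indented one and reports the exception type.

```python
# Line 2 is indented with a tab, line 3 with eight spaces.
source = "if True:\n\tx = 1\n        y = 2\n"


def check_indentation(text):
    """Return None if text compiles, else the error's class name.

    Hypothetical helper for illustration only.
    """
    try:
        compile(text, '<example>', 'exec')
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__
    return None


print(check_indentation(source))
print(check_indentation("if True:\n    x = 1\n    y = 2\n"))
```

The first call should report a TabError, while the consistently indented snippet compiles cleanly and gives None.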
> >
> >> Are these always numbers?
> >
> > Yes, they are IPs from an Apache error log.
> >
> >> Do the files have to remain in their original order, or can you reorder
> >> them during processing? How often does this have to run?
> >
> > They are not in order, because one list has 852 entries and the other
> > has 3300 entries. This script only needs to run once.
> >
> >> Do you have to "comment" the duplicate, or can you remove it?
> >
> > The plan is to remove it, but I wanted to see if my removal method would
> > work, so I was trying to put a comment next to it.
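
For the eventual removal step, a rough sketch (assuming one entry per line, and using a hypothetical helper name remove_duplicates) could keep only the secondary entries that never appear in the primary list, preserving their original order:

```python
def remove_duplicates(primary_lines, secondary_lines):
    """Return secondary entries that are absent from primary, order preserved.

    Hypothetical helper; assumes one entry per line.
    """
    primary = {line.strip() for line in primary_lines}
    kept = []
    for row in secondary_lines:
        entry = row.strip()
        if entry and entry not in primary:
            kept.append(entry)
    return kept


# Small made-up sample, not the real log data.
print(remove_duplicates(['433.3\n', '741.1\n'], ['947.3\n', '433.3\n', '229.1\n']))
# ['947.3', '229.1']
```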
> >
> >> Are there any other requirements not obvious from the description below?
> >
> > No real requirements, if anyone would like the original files I can give
> > them to you, a lot of them are bots.
> > Thank you :)
> > -Kevin
> >
> >> Kevin Faulkner wrote:
> >>> I was trying to pull duplicates out of 2 different files. Wherever
> >>> there is a duplicate, I would place a # next to it.
> >>> Example files:
> >>> file 1:    file 2:
> >>> 433.3      947.3
> >>> 543.1      749.0
> >>> 741.1      859.2
> >>> 238.5      433.3
> >>> 839.2      229.1
> >>> 583.6      990.1
> >>> 863.4      741.1
> >>> 859.2      101.8
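
As a quick check of what the expected result looks like for the example files above, the two columns can be typed in as lists and intersected as sets. This is only a sketch on the sample values; the variable names are illustrative.

```python
# Values copied from the two example columns above.
file1 = ['433.3', '543.1', '741.1', '238.5', '839.2', '583.6', '863.4', '859.2']
file2 = ['947.3', '749.0', '859.2', '433.3', '229.1', '990.1', '741.1', '101.8']

# Entries present in both lists; sorted only to make the output stable.
duplicates = sorted(set(file1) & set(file2))
print(duplicates)  # ['433.3', '741.1', '859.2']
```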

> >>>
> >>> import string
> >>> i=1
> >>> primaryfile = open('/tmp/extract','r')
> >>> secondaryfile = open('/tmp/unload')
> >>>
> >>> for line in primaryfile:
> >>>    pcompare = line
> >>>    print(pcompare)

> >>>
> >>>    for row in secondaryfile:
> >>>      i = i + 1
> >>>      print(i)
> >>>      scompare = row

> >>>
> >>>      if pcompare == scompare:
> >>>        print(scompare)
> >>>        secondaryfile.write('#')

> >>>
> >>> With this code it should go through the files and find a duplicate and
> >>> place a '#' next to it. But for some reason it doesn't even get to
> >>> the second for statement. I don't know what else to do. Please offer
> >>> some assistance. :)
> >>> ---------------------------------------------------
> >>> PLUG-discuss mailing list -
> >>> To subscribe, unsubscribe, or to change your mail settings:
> >>> http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
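
One behaviour that may be relevant to the nested loops in the original script: a file object is consumed as it is iterated, so an inner "for row in secondaryfile:" only yields lines on its first pass unless the file is rewound. The sketch below uses io.StringIO objects as stand-ins for the two files (an assumption for the sake of a self-contained example) and is not a diagnosis of the exact failure reported. Separately, calling write() on a file opened read-only raises io.UnsupportedOperation in Python 3.

```python
import io

# In-memory stand-ins for the two files; the contents are sample values.
primaryfile = io.StringIO("433.3\n543.1\n741.1\n")
secondaryfile = io.StringIO("947.3\n433.3\n741.1\n")

matches = []
for pcompare in primaryfile:
    for scompare in secondaryfile:  # fully consumed during the first outer pass
        if pcompare == scompare:
            matches.append(scompare.strip())
print(matches)  # ['433.3'] (741.1 is missed)

# Rewinding the inner file before each pass restores the intended behaviour.
primaryfile.seek(0)
rewound = []
for pcompare in primaryfile:
    secondaryfile.seek(0)
    for scompare in secondaryfile:
        if pcompare == scompare:
            rewound.append(scompare.strip())
print(rewound)  # ['433.3', '741.1']
```

Reading each file into a list once, as the quoted reply does with readlines(), avoids the rewinding altogether.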
