|
| Author |
Message |
Dazzy Agent

Joined: 09 Jan 2004 Posts: 1731
 
|
Posted: Sat Mar 26, 2005 2:50 pm Post subject: |
|
|
This is to open a file and compare each word with every other word to find the duplicates and get rid of them, then print the result to a new file... This is probably really bad code; I can't think of any other methods or where to look, but I'm sure they exist... I'm using really long files, so the run time isn't the best at the moment, lasting up to half an hour! That is 174808 words though.
Here's the code:
Code:
    print "Ok, starting comparison!";

    open (THING, '<', "words.txt") or die "Cannot open words.txt: $!";
    my @words = <THING>;
    close THING;
    chomp @words;

    my $count = scalar(@words);
    my @second;
    my $add;
    foreach my $word (@words){
        $add = "yes";
        # Scan every word kept so far; this inner loop is what makes the run quadratic
        foreach my $sec (@second){
            if($word eq $sec){
                print "\nFound a duplicate! $word";
                $add = "no";
                last;   # no need to keep scanning once a match is found
            }
        }
        if($add eq "yes"){
            unshift(@second,$word);
        }
    }

    # cmp compares as strings; <=> would compare the words as numbers
    @second = sort {$a cmp $b} @second;
    my $count2 = scalar(@second);
    my $finished = join("\n",@second);
    # '>' opens the output file for writing
    open (THING2, '>', "words2.txt") or die "Cannot open words2.txt: $!";
    print THING2 $finished;
    close THING2;
    my $duplicates = $count - $count2;
    print "\nOk, Finished! $count2 words were written!\nWe found " . $duplicates . " duplicates...";
|
|
|
 |
Siebe God Like

Joined: 06 Jan 2004 Posts: 562 Location: Netherlands
    
|
Posted: Sat Mar 26, 2005 4:40 pm Post subject: |
|
|
Code:
    #!/usr/bin/perl -w
    use strict;

    sub removeDupes {
        my $string = shift;
        my $copy = ""; # Copy we are going to use later on

        # First append and prefix with spaces for our regular expressions
        $string = " $string ";

        # Then perform a loop and remove all duplicates
        do {
            # Replace and backup the copy
            $copy = $string; # <-- For while() clause
            # (?<=\s) stops \1 from matching the tail of a longer word
            $string =~ s/\s(\w+\s)(.*)(?<=\s)\1(.*)/ $1$2$3/g;

            # Test if the lengths are equal, if so, we performed no operations and should quit
        } while(length($copy) != length($string));

        # Remove the prefixing and appending spaces
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;

        return $string; # Done! :-)
    }

    print removeDupes("one one two two three four four five six six one two three four five six six") . "\n";
|
|
|
 |
Cer Upgraded Agent

Joined: 03 Feb 2004 Posts: 3776 Location: Michigan
  votes: 4
|
|
|
 |
Mojave Almost An Agent

Joined: 01 Nov 2003 Posts: 1434
 
|
Posted: Sat Mar 26, 2005 6:56 pm Post subject: |
|
|
I like using a hash, not sure how efficient it is though:
Code:
    print "Ok, starting comparison!";

    open (THING, '<', "words.txt") or die "Cannot open words.txt: $!";
    my @words = <THING>;
    close THING;
    chomp @words;
    my $count = scalar @words;

    # Each word becomes a hash key, so duplicates collapse automatically
    my %hash = map { $_, 1 } @words;

    my $count2 = scalar keys %hash;

    # '>' opens the output file for writing
    open (THING2, '>', "words2.txt") or die "Cannot open words2.txt: $!";
    print THING2 join( "\n", sort keys %hash );
    close THING2;
    my $duplicates = $count - $count2;
    print "\nOk, Finished! $count2 words were written!\nWe found " . $duplicates . " duplicates...";
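On the efficiency question: a hash lookup takes roughly constant time on average, so a hash-based pass touches each word once instead of comparing it against every word kept so far. The sketch below is a variation on the same idea, not the code above; it keeps the words in their original order rather than sorting them. The name `unique_in_order` and the sample list are made up for illustration.

```perl
use strict;
use warnings;

# Return the unique items of a list, keeping the first occurrence of each.
# (unique_in_order is a hypothetical helper name, not from the thread.)
sub unique_in_order {
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a word appears, so grep keeps it;
    # on later appearances it is already non-zero and the word is dropped.
    return grep { !$seen{$_}++ } @_;
}

my @words  = qw(one one two two three four four five six six one two);
my @unique = unique_in_order(@words);
print join(" ", @unique), "\n";   # prints "one two three four five six"
```

If the output has to be sorted anyway, as in the original script, `sort keys %hash` gives the same set of words and the order-preserving step is unnecessary.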
|
|
|
 |
brother Senior Member

Joined: 06 Aug 2004 Posts: 156 Location: Belgium
  
|
Posted: Sat Mar 26, 2005 9:40 pm Post subject: |
|
|
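Here is my (short) way of doing it. It is very efficient and uses only the original array holding the lines of the file and one temporary hash...
@words contains our list of words...
Code:
    # Note: assigning undef here warns under "use warnings" and leaves a stray
    # empty-string key in %done; a plain "my %done;" avoids both.
    my %done = undef;
    @done{@words} = ();          # hash slice: every word becomes a key
    @words = sort keys %done;    # unique words, sorted as strings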
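For comparison, here is a minimal sketch of the same hash-slice idea with the hash simply declared empty, which should stay quiet under `use strict` and `use warnings`. The sample word list is an assumption for illustration only.

```perl
use strict;
use warnings;

# Sample input (made up for this sketch).
my @words = qw(pear apple pear fig apple);

my %done;                  # declared empty; no initial assignment needed
@done{@words} = ();        # hash slice: every word becomes a key, values are undef
@words = sort keys %done;  # unique words, sorted as strings

print join(" ", @words), "\n";   # prints "apple fig pear"
```

The values are never read, only the keys, which is why assigning the empty list to the slice is enough.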
|
|
|
 |
Mojave Almost An Agent

Joined: 01 Nov 2003 Posts: 1434
 
|
Posted: Sat Mar 26, 2005 10:35 pm Post subject: |
|
|
Um, did you try that code out, brother? It doesn't work. But you're doing essentially what I was doing, turning the list into a hash for easy merging. And mine has the added benefit of working.
EDIT: On second try, I found your method does work, it just doesn't run cleanly under strict and warnings, which I always use. In any case, my method is shorter, one line instead of two! hehe
|
|
 |
brother Senior Member

Joined: 06 Aug 2004 Posts: 156 Location: Belgium
  
|
Posted: Sat Mar 26, 2005 10:45 pm Post subject: |
|
|
Hmm, weird, I use strict & warnings too... It never threw up on me before. I'll try the map thing though.
EDIT: It does EXACTLY the same... even sorting the keys at the end (merged into the writing code).
Again, thanks for the map suggestion. I'll see if I can update some of my code and run some benchmarks; they will probably show that using 'map' is more efficient.
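One way such a benchmark could be set up is with the core `Benchmark` module and its `cmpthese` function. This is only a sketch: the helper names `dedupe_map` and `dedupe_slice` and the synthetic word list are assumptions, and the timings will vary with the machine and the data, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical helper: build the hash with map, then sort the keys.
sub dedupe_map {
    my %hash = map { $_ => 1 } @_;
    return sort keys %hash;
}

# Hypothetical helper: fill the hash with a hash slice, then sort the keys.
sub dedupe_slice {
    my %done;
    @done{@_} = ();
    return sort keys %done;
}

# Synthetic input (an assumption): 20000 entries drawn from at most 5000 distinct words.
my @words = map { "word" . int(rand(5000)) } 1 .. 20_000;

# A negative count tells Benchmark to run each sub for at least that many CPU seconds.
cmpthese(-1, {
    map_hash   => sub { my @unique = dedupe_map(@words) },
    hash_slice => sub { my @unique = dedupe_slice(@words) },
});
```

`cmpthese` prints a small table with the rate of each variant and the percentage difference between them, which is usually enough to settle a question like this one.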
|
|
 |
Dazzy Agent

Joined: 09 Jan 2004 Posts: 1731
 
|
Posted: Sat Mar 26, 2005 11:48 pm Post subject: |
|
|
Thanks everyone, it was only a 5 min thing to sort out some files, but I thought it might spark some interest here, which it obviously has, so hey, that kills two birds with one stone, doesn't it?
|
|
 |
|